
Ran 23 Aug 2016 07:10AM UTC

Files 363

Run time 15s

Committed 23 Aug 2016 06:59AM UTC

Coverage: 94.535% (+0.003%) from 94.532%

Job # DISTRIB="ubuntu" PYTHON_VERSION="2.7" CYTHON_VERSION="0.23.4" CACHED_BUILD_DIR="$HOME/sklearn_build_ubuntu" COVERAGE=true

Build Type

push

travis-ci

Committed by

jnothman

Commit Message

[MRG] Remove heavy memory footprint in BaseBagging due to OOB scoring (#7118)

* Remove heavy memory footprint for OOB scoring from bagging.

- Remove `estimators_samples` attribute from `BaseBagging`
- Add method `_get_estimators_samples` to `BaseBagging` that
  returns a generator producing sample indices on demand.
- Slight refactor of `_parallel_build_estimators()` to
  isolate and group lines accessing random state.

* Replaced `BaseBagging.estimators_samples_` attribute with property.

- Accessing `BaseBagging.estimators_samples_` now triggers call to
  `BaseBagging._get_estimators_samples` as well as deprecation
  warning for release 0.20.
- This should make the fix for the OOB memory issue fully backwards
  compatible.
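
The idea in the two items above (keep only what is needed to re-create each estimator's sample draw, and expose the draws through a property) can be sketched roughly as follows. This is a minimal illustration, not scikit-learn's code: the class `LazyBagging`, its constructor arguments, and the use of Python's `random` module in place of NumPy are all assumptions made to keep the example self-contained.

```python
import random


class LazyBagging:
    """Toy stand-in for a bagging ensemble (hypothetical, not BaseBagging)."""

    def __init__(self, n_estimators, n_samples, max_samples, random_state=0):
        self.n_samples = n_samples
        self.max_samples = max_samples
        # Store one integer seed per estimator instead of the drawn indices.
        master = random.Random(random_state)
        self._seeds = [master.randrange(2 ** 31) for _ in range(n_estimators)]

    def _get_estimators_samples(self):
        # Re-create each estimator's bootstrap draw from its seed, on demand.
        for seed in self._seeds:
            rng = random.Random(seed)
            yield [
                rng.randrange(self.n_samples)
                for _ in range(self.max_samples)
            ]

    @property
    def estimators_samples_(self):
        # Rebuilt on every access: cheap to store, but reading costs time.
        return list(self._get_estimators_samples())
```

Because every draw is a pure function of its stored seed, repeated reads of `estimators_samples_` return identical indices, while the memory held between reads is one integer per estimator.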

* Remove print statement.

* Added test, made new method more general.
- Added test to `test_bagging.py` to ensure that indices generated
  on the fly are identical to indices generated at fit time.
- Refactored `_get_estimators_samples()` to `_get_estimators_indices()`,
  now both feature and samples indices are returned.
- Refactored `estimators_samples_()` to deal with above.

* Rename index generating functions and arguments.

* `estimators_samples_()` returns list instead of generator

* Removed `estimators_samples_` deprecation warning.

* Actually removed `estimators_samples_` deprecation warning.

* New mask generation function, new bagging test, new BaseBagging attributes.
- Added new private function in bagging.py that converts indices to
  a boolean mask.
- Added new bagging test to make sure identical OOB scores are generated
  when the same estimator is fit with a fixed random state and identical
  training data.
- Added new private attributes `BaseBagging._max_features` and
  `BaseBagging._max_samples` to store validated input values.
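
As a rough illustration of the index-to-mask conversion mentioned above, the following sketch turns a list of drawn sample indices into a boolean mask and then reads off the out-of-bag samples. The name `indices_to_mask` is the one the function receives later in this commit message; the pure-Python body and the helper `out_of_bag` are assumptions for illustration, not the library's implementation.

```python
def indices_to_mask(indices, mask_length):
    """Return a boolean list that is True at each position in `indices`."""
    if any(i < 0 or i >= mask_length for i in indices):
        raise ValueError("indices must lie in [0, mask_length)")
    mask = [False] * mask_length
    for i in indices:
        # Repeated indices (sampling with replacement) set the same slot.
        mask[i] = True
    return mask


def out_of_bag(indices, mask_length):
    """Hypothetical helper: positions never drawn for this estimator."""
    in_bag = indices_to_mask(indices, mask_length)
    return [i for i, used in enumerate(in_bag) if not used]
```

For example, `indices_to_mask([0, 2, 2], 4)` marks positions 0 and 2, so positions 1 and 3 are the out-of-bag samples available for scoring that estimator.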

* Streamlined code, improved `estimators_samples_()` documentation.
- Removed `max_samples` argument from `_parallel_build_estimators()`,
  this value is now accessed via `ensemble._max_samples`.
- Removed validation of `ensemble.max_features` and `max_samples`,
  instead use `ensemble._max_features` and `ensemble._max_samples`
  which are assumed to be already validated.
- Removed unnecessary `samples` variable from `_parallel_build_estimators()`.
- Changed the way `random_state` is generated in `_parallel_build_estimators()`
  and `BaseBagging._get_estimators_data_draws()` to direct numpy method to
  reflect that seeds created in `BaseBagging._fit()` are trustworthy.
- Due to above removed generation of new seed for each estimator in
  `_parallel_build_estimators()` and `BaseBagging._get_estimators_data_draws()`.
- Added documentation to `BaseBagging.estimators_samples_()` property
  indicating the reason it's generated dynamically and the associated performance
  penalty.
- Returned `BaggingClassifier._set_oob_score()` and `BaggingRegressor._set_oob_score()`
  to directly accessing `self.estimators_samples_`.

* PEP8, remove optional arguments, add public function to utils.
- Update indentation and line length to conform to PEP8.
- Update `estimators_samples_()` docstring to conform to PEP8 and
  PEP257.
- Remove optional arguments to `BaseBagging._get_estimators_data_draws()`,
  rename to `BaseBagging._get_estimators_indices()`.
- Remove optional arguments from `_generate_mask_from_indices()`,
  move to `utils.metaestimators`, rename to `indices_to_mask()`.

* Add indices_to_mask to __all__ in utils.metaestimators

* Move utils function, new test for `estimators_samples` correctness.
- Moved `indices_to_mask` from `utils.metaestimators` to `utils.validation`
- Added new test to `ensemble.test_bagging` to make sure refitting
  an individual estimator from the bagging ensemble using the corresponding
  samples identified in `BaseBagging.estimators_samples` returns the same model.

* Adjust/remove tests, move `indices_to_mask` to `utils.__init__`.
- Remove test_bagging.test_consistent_index_sampling
- Update formatting checks in test_bagging.test_estimators_samples
  to make sure each mask is numpy boolean array
- Move indices_to_mask from utils.validation to utils.__init__

* update class docstring for `estimators_samples_`

- Updated class docstring of `estimators_samples_` attribute for `BaggingClassifier` and `BaggingRegressor` to indicate that samples are identified with boolean masks.

* PEP8, made `max_samples` arg optional in _fit()
- Fixed PEP8 issues.
- Gave `max_samples` argument of `BaseBagging._fit()` a default
  value of `None` as indicated in the docstring. Added code to
  check for and process `None` value.
- Added a few more comments to `_fit()`.
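
One way the `None` handling described above might look is sketched below. The commit message only states that `None` is checked for and processed; the fallback to an ensemble-level default and the count-versus-fraction handling are assumptions based on the documented meaning of `max_samples` (an integer is a sample count, a float is a fraction of the training set). The function name `resolve_max_samples` is hypothetical.

```python
import numbers


def resolve_max_samples(max_samples, default, n_samples):
    """Turn a user-facing `max_samples` value into a validated sample count."""
    if max_samples is None:
        # Assumed behaviour: fall back to the value set on the ensemble.
        max_samples = default
    if isinstance(max_samples, bool) or not isinstance(
        max_samples, numbers.Real
    ):
        raise TypeError("max_samples must be an int, a float, or None")
    if not isinstance(max_samples, numbers.Integral):
        # A float is read as a fraction of the training set.
        max_samples = int(max_samples * n_samples)
    if not 0 < max_samples <= n_samples:
        raise ValueError("max_samples must be in (0, n_samples]")
    return max_samples
```

Storing the returned count once (for instance in a private attribute such as `_max_samples`) lets later code use it without repeating the validation.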

* 2 new `max_samples` consistency tests, update whats_new.rst

* Minor change

* Update whats_new.rst

Run Details

54849 of 58020 relevant lines covered (94.53%)

0.95 hits per line

Source Files on job 19879.1 (DISTRIB="ubuntu" PYTHON_VERSION="2.7" CYTHON_VERSION="0.23.4" CACHED_BUILD_DIR="$HOME/sklearn_build_ubuntu" COVERAGE=true)
