[Scikit-learn-general] How you free up memory or handle it while fitting/cross-validating model in Scikitlearn?

Discussion:

muhammad waseem

2016-02-12 16:35:05 UTC

Sebastian Raschka

2016-02-12 16:42:49 UTC

Hi, Waseem,
I think lowering the value of n_jobs would help; as far as I know, each process gets a copy of the data. I just stumbled upon spark-sklearn a few days ago; maybe that could help as well:

https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-spark.html

If I understand correctly, the data is still copied, but here each node gets a copy instead of one machine holding many copies.
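
A minimal sketch of the n_jobs suggestion, not the original poster's code: the synthetic data, the DecisionTreeRegressor and its settings are assumptions made for illustration, and the import path is the current sklearn.model_selection module rather than the older cross_validation module mentioned in this thread.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; the dataset discussed in the thread is not available.
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X[:, 0] + 0.1 * rng.rand(200)

# Hypothetical model choice; the thread only says "regression trees".
model = DecisionTreeRegressor(max_depth=3, random_state=0)

# n_jobs=2 (instead of -1) caps the number of parallel workers.
# pre_dispatch="n_jobs" limits how many tasks are queued up front.
scores = cross_val_score(model, X, y, cv=5, n_jobs=2, pre_dispatch="n_jobs")
print(scores)
```

Fewer workers means fewer folds are being fitted at the same time, which is the lever suggested above; whether it is enough depends on where the memory actually goes.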

Post by muhammad waseem
Hi,
I am trying to fit my model using regression trees, but the problem is that it consumes a lot of RAM, which makes my code unresponsive. From looking at different forums and platforms, I think this is a common problem. How do you free up memory, or what are the best ways to run the fitting process/cross-validation without running out of memory? This problem occurs with most regression trees (and, I think, with other ML algorithms as well). Should I try running without n_jobs=-1 and use some other value (e.g. n_jobs=10) in cross_validation?
Thanks
Kindest Regards
Waseem
------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Manoj Kumar

2016-02-12 17:19:21 UTC

Hi Sebastian,

This is true but only if the data is less than 1M. After that it is
memmapped to a temp folder and is shared by all processes (
https://pythonhosted.org/joblib/parallel.html#working-with-numerical-data-in-shared-memory-memmaping
)

You can try varying the "max_nbytes" parameter wherever Parallel is called in
the regression trees to trigger memmap conversion even for smaller data
and prevent duplication of data across all processes.
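
A minimal sketch of what lowering max_nbytes looks like when calling joblib directly. The array, the column_sum helper and the "1K" threshold are assumptions for illustration; scikit-learn estimators do not expose this parameter themselves, so this shows only the joblib side.

```python
import numpy as np
from joblib import Parallel, delayed


def column_sum(data, j):
    # When memmapping is triggered, `data` reaches the worker process as a
    # memory map backed by a temp file instead of a private pickled copy.
    return float(data[:, j].sum())


# Small stand-in array (about 32 KB), well below the default "1M" threshold.
X = np.arange(4000, dtype=np.float64).reshape(1000, 4)

# max_nbytes="1K" lowers the threshold so that even this array is memmapped
# when it is passed to worker processes.
sums = Parallel(n_jobs=2, max_nbytes="1K")(
    delayed(column_sum)(X, j) for j in range(X.shape[1])
)
print(sums)
```

The results are the same with or without memmapping; only the way the input array is handed to the workers changes.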


--
Manoj,
http://github.com/MechCoder

muhammad waseem

2016-02-12 17:29:10 UTC

Hi Sebastian and Manoj,
@Manoj: What should the value of the max_nbytes parameter be, and will this
affect the results and the time it takes to run cross_validation, grid_search,
etc.?

Thanks
Kindest Regards
Waseem


muhammad waseem

2016-02-12 17:30:22 UTC

Hi Sebastian and Manoj,
@Manoj: What should the value of the max_nbytes parameter be, and will this
affect the results and the time it takes to run cross_validation, grid_search,
etc.?
@Sebastian: Will the Spark implementation also improve memory use, or
just CPU use?

Thanks
Kindest Regards


--
Dr Muhammad Waseem Ahmad
Research Associate,
BRE Center for Sustainable Construction,

School of Engineering,

Cardiff University,

Cardiff, UK.

Sebastian Raschka

2016-02-12 18:40:38 UTC

Thanks for the note, Manoj, didn't know that!

@muhammad So if there's no duplication of data across processes, I guess that you would also run into trouble with n_jobs=1. But just to make sure that data duplication is not the issue, could you try running it with n_jobs=1? If that also runs out of memory, probably only a smaller data set or a machine with more memory would help. In that case, I'd probably think about using Spark's MLlib to deal with this particular dataset.


muhammad waseem

2016-02-12 19:57:30 UTC


Sebastian Raschka

2016-02-12 20:32:50 UTC

I'd suggest trying n_jobs=1 and checking whether swap memory is used (you don't have to run it to completion). If this runs fine without swap, we can work further from there.


Jacob Schreiber

2016-02-12 21:58:06 UTC

I don't think that the data is copied for tree-based classifiers. They use
the threading backend, so each thread should be sharing memory.
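
A minimal sketch of the general point about threads sharing memory. It uses the standard library's ThreadPoolExecutor rather than joblib's threading backend, and the list is only a stand-in for a large training array.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000_000))  # stand-in for a large training array


def worker(_):
    # Each thread reads the very same object; nothing is copied.
    return id(data)


with ThreadPoolExecutor(max_workers=4) as pool:
    ids = list(pool.map(worker, range(4)))

# All workers report the identity of the one shared object.
print(len(set(ids)) == 1)
```

With process-based parallelism, by contrast, each worker has its own address space, which is where copies (or memmaps) of the input come into play.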

Post by muhammad waseem
@Sebastian: I tried with n_jobs=10 (the total available is 12) and it still
caused the same problem. I could try running it with n_jobs=1, but it
would be so slow that it would take ages to complete. The machine has 32GB
of RAM, and it started using swap memory after consuming all of the RAM.
Is there a way to tackle this, or do you really think that all this k-fold
cross-validation and training should be done using Spark's MLlib?
Thanks
Regards
Waseem


muhammad waseem

2016-02-15 14:37:09 UTC


Sebastian Raschka

2016-02-15 21:25:32 UTC

Hm, unfortunately, that's what I thought; it sounds like a bug in joblib may be involved. Does anyone have any ideas how to track this down?

@Waseem Can you also try n_jobs=2? Here, I'd expect one of the following:
1) If everything is working correctly with the multi-threading, it would use maybe 2 times the 12% plus a little bit extra.
2) If you see something like ~30%, I'd say that an unnecessary copy is being made.
3) If you see something like > 30%, there would be a memory leak somewhere.
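
The three cases above can be sketched as a small helper. The function name, the tolerance of 3 percentage points and the exact cut-offs are assumptions made for illustration; the thread gives only rough figures (12% at n_jobs=1, about 30% as the warning level).

```python
def diagnose(observed_pct, baseline_pct=12.0, n_jobs=2,
             copy_pct=30.0, tol_pct=3.0):
    """Classify RAM usage (in percent) seen with n_jobs workers.

    baseline_pct is the usage measured with n_jobs=1; copy_pct and tol_pct
    are hypothetical thresholds, not values taken from any library.
    """
    expected = n_jobs * baseline_pct
    if observed_pct <= expected + tol_pct:
        return "as expected"
    if observed_pct <= copy_pct + tol_pct:
        return "possible unnecessary copy"
    return "possible memory leak"


for pct in (25.0, 30.0, 60.0):
    print(pct, diagnose(pct))
```

The point is only that usage should grow roughly linearly with n_jobs; anything far above that is worth investigating.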

I mentioned scenario 3, because I observed a very similar behavior once:
(see https://github.com/scikit-learn/scikit-learn/issues/3973)

"I made some weird observations that my GridSearches keep failing after a couple of hours and I initially couldn't figure out why. I then monitored the memory usage over time and saw that it started with a few gigabytes (~6 Gb) and kept increasing until it crashed the node when it reached the max. 128 Gb the hardware can take. I was experimenting with random forests for classification of a large number of text documents. For simplicity -- to figure out what's going on -- I went back to naive Bayes.
...
After some experimentation, I finally found out that

gc.collect()
len(gc.get_objects()) # particularly this part!

in the for loop solves the problem and the memory usage stays constantly at 6.5 Gb over the run time of ~10 hours."
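
A minimal sketch of forcing garbage collection inside a loop, as described in the quoted report. The Node class and run_iteration function are hypothetical stand-ins for one grid-search step; this shows only what gc.collect() does with cyclic garbage and does not establish the cause of the leak in that issue.

```python
import gc


class Node:
    """Object that takes part in a reference cycle."""

    def __init__(self):
        self.me = self  # self-reference: only the cycle collector frees it


def run_iteration():
    # Stand-in for one grid-search or cross-validation step that leaves
    # cyclic garbage behind.
    for _ in range(1000):
        Node()


gc.disable()  # switch off automatic collection to make the effect visible
freed_per_iteration = []
for _ in range(3):
    run_iteration()
    # Force a full collection; the return value is the number of
    # unreachable objects that were found.
    freed_per_iteration.append(gc.collect())
gc.enable()

print(freed_per_iteration)
```

Calling gc.collect() at the end of each iteration keeps such garbage from piling up between automatic collections, at the cost of some extra run time.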

Post by muhammad waseem
@Sebastian: I tried running cross_validation with n_jobs=1 and it did not use swap memory; even the RAM usage was quite low (12% at most). However, this will take longer to finish. Any idea what to try now?
Thanks
Kindest Regards
Waseem

muhammad waseem

2016-02-17 19:25:36 UTC
