Discussion:
threading error when training a RFC on a big dataset
Christian Jauvin
2012-09-22 20:10:05 UTC
Hi,

I have been doing multiple experiments using a RandomForestClassifier
(trained with the parallel code option) recently, without encountering
any particular problem. However as soon as I began using a much bigger
dataset (with the exact same code), I got this threading error:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 319, in _handle_tasks
    put(task)
SystemError: NULL result without error in PyObject_Call

I can provide additional details, of course, but first: is there
something in particular I should be aware of regarding size or memory
limits of the underlying objects in question?

Thanks,

Christian
Olivier Grisel
2012-09-22 20:18:54 UTC
It may be a memory error, as the current implementation manages
memory very poorly.

You can try replacing the joblib folder in the sklearn source tree with
the "pickling-pool" branch of my repo:

https://github.com/joblib/joblib/pull/44

That should help a lot. You can further memmap your original dataset,
as explained in the following doc, to reduce memory usage even
further:

https://github.com/ogrisel/joblib/blob/pickling-pool/doc/parallel_numpy.rst
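
As a rough illustration of what memory mapping a dataset means, here is a minimal sketch using plain NumPy rather than the API of the branch above. The helper name dump_as_memmap, the file name X.npy and the toy array are hypothetical and only stand in for a real training matrix.

```python
import os
import tempfile

import numpy as np


def dump_as_memmap(array, path):
    """Save `array` in .npy format and reopen it as a read-only memory map."""
    np.save(path, array)
    # mmap_mode="r" maps the file instead of reading it into RAM; the
    # operating system loads pages lazily when they are accessed.
    return np.load(path, mmap_mode="r")


# Toy data standing in for a large training matrix (hypothetical).
X = np.arange(12, dtype=np.float64).reshape(4, 3)

tmp_dir = tempfile.mkdtemp()
X_mm = dump_as_memmap(X, os.path.join(tmp_dir, "X.npy"))

print(type(X_mm).__name__, X_mm.shape)
```

The array returned here is backed by the file on disk, so several worker processes can open the same file without each holding a private copy in memory. Whether a given estimator then avoids copying the data internally is a separate question.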

You might also want to set the TMP environment variable to a folder on
a big partition.
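
For reference, a small sketch of how Python's standard tempfile module picks up that variable. That the branch relies on tempfile for its temporary files is an assumption here, and use_temp_folder is a hypothetical helper; normally the variable would simply be exported in the shell before launching Python.

```python
import os
import tempfile


def use_temp_folder(path):
    """Point Python's tempfile module at `path` via the TMP variable."""
    # tempfile checks TMPDIR, then TEMP, then TMP; clear the first two so
    # that TMP is the one that takes effect.
    os.environ.pop("TMPDIR", None)
    os.environ.pop("TEMP", None)
    os.environ["TMP"] = path
    # Drop the cached value so the next lookup re-reads the environment.
    tempfile.tempdir = None
    return tempfile.gettempdir()


# Stand-in for a folder on a large partition (hypothetical).
big_partition = tempfile.mkdtemp()
print(use_temp_folder(big_partition))
```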

I am very interested in any feedback while using this branch.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Christian Jauvin
2012-09-24 19:23:21 UTC
Thank you Olivier for these suggestions.

I'd try/test them with pleasure, but meanwhile I discovered that there
was just no way the dataset I was trying to use would ever fit in the
72GB of memory of the machine I'm using. So I just scaled it down, and
obviously this error is not happening anymore.
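
A back-of-the-envelope check of this kind can be sketched as follows. It assumes a dense NumPy array and ignores any copies made during training; the function name and the shape used are hypothetical, not the actual dataset.

```python
import numpy as np


def dense_size_gib(n_samples, n_features, dtype=np.float64):
    """Size in GiB of a dense (n_samples, n_features) array of `dtype`."""
    n_bytes = n_samples * n_features * np.dtype(dtype).itemsize
    return n_bytes / 2**30


# Hypothetical shape, not the dataset discussed in this thread.
size = dense_size_gib(100_000_000, 100)
print("%.1f GiB needed, 72 GiB available" % size)
```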

But I'd be curious to know if there is any mechanism I could use to
allow a Random Forest classifier to work with bigger datasets (bigger
than what simply fits in memory)?

Thanks!
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Olivier Grisel
2012-09-24 21:43:09 UTC
I think @glouppe is likely to contribute some improvements to the tree
ensemble models once he gets back from ECML 2012, where he has a paper on
those issues.
Joseph Turian
2012-09-24 20:06:58 UTC
Chris Lin, IIRC, has advocated partitioning the examples and then concatenating the individual classifiers.

You could do that and then do a second pass of learning: find the 1% of examples that are the hardest for the ensemble and learn over them.

Regardless, it will be ad hoc unless you use an out-of-core algorithm.
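
A minimal sketch of the partitioning idea, assuming scikit-learn and a toy dataset that fits in memory. "Concatenating" the classifiers is approximated here by averaging the class probabilities of one forest per partition, which is one possible reading and not necessarily what the paper does. All function names are hypothetical, and in a real setting each partition would be loaded from disk one at a time.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def fit_on_partitions(X, y, n_partitions, n_estimators=10, seed=0):
    """Train one forest per contiguous chunk of the data."""
    forests = []
    chunks = np.array_split(np.arange(len(y)), n_partitions)
    for i, idx in enumerate(chunks):
        forest = RandomForestClassifier(
            n_estimators=n_estimators, random_state=seed + i
        )
        forests.append(forest.fit(X[idx], y[idx]))
    return forests


def predict_proba_combined(forests, X):
    """Average class probabilities; assumes every chunk saw every class."""
    return np.mean([f.predict_proba(X) for f in forests], axis=0)


def hardest_examples(forests, X, y, fraction=0.01):
    """Indices of the examples whose true class gets the lowest probability."""
    proba = predict_proba_combined(forests, X)
    # Column position of each true label in the sorted classes_ array.
    cols = np.searchsorted(forests[0].classes_, y)
    true_class_proba = proba[np.arange(len(y)), cols]
    n_hard = max(1, int(fraction * len(y)))
    return np.argsort(true_class_proba)[:n_hard]


# Toy data standing in for a dataset too large for a single fit.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
forests = fit_on_partitions(X, y, n_partitions=3)
proba = predict_proba_combined(forests, X)
hard = hardest_examples(forests, X, y)
print(proba.shape, len(hard))
```

The indices in hard are the candidates for the second pass. How the model trained on them is folded back into the ensemble is left open here, since a 1% subset may not contain every class.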

Sent from my iPhone
Olivier Grisel
2012-09-25 09:28:56 UTC
Interesting, do you have a link to the paper?

Gilles' paper I was mentioning previously is here:
http://www.cs.bris.ac.uk/~flach/ECMLPKDD2012papers/1125540.pdf
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Joseph Turian
2012-09-26 06:31:29 UTC
My mistake, I meant Jimmy Lin:

MapReduce is Good Enough? If All You Have is a Hammer, Throw Away
Everything That's Not a Nail!

http://arxiv.org/abs/1209.2191

--
Joseph Turian, Ph.D. | President, MetaOptimize
"Optimize Profits. Optimize Engagement."
http://metaoptimize.com
855-ALL-DATA

The web's most active forum for data scientists: http://metaoptimize.com/qa/