Discussion:
Distributed RandomForests
Youssef Barhomi
2013-04-25 01:19:04 UTC
Permalink
Hello,

I am trying to reproduce the results of this paper:
http://research.microsoft.com/pubs/145347/BodyPartRecognition.pdf with a
different kind of data (monkey depth maps instead of humans). So I am
generating my depth features and training and classifying the data with a
random forest, using parameters quite similar to those in the paper.

I would like to use sklearn.ensemble.RandomForestClassifier with 1E8 samples
and 500 features. Since that is a large set of feature vectors, I did some
trials with smaller subsets (1E4, 1E5, 1E6 samples), and the last one seemed
to be slower than the O(n_samples*n_features*log(n_samples)) complexity
documented here: http://scikit-learn.org/stable/modules/tree.html#complexity.
Since the 1E6 samples are taking a long time and I don't know when they will
be done, I would like a better way to estimate the ETA, or a way to speed up
the training. I am also watching my memory usage, and I don't seem to be
swapping (29GB/48GB in use right now). The other thing is that I requested
n_jobs = -1 so it would use all cores of my machine (24 cores), but looking
at my CPU usage, it doesn't seem to be using any of them...

So, do you guys have any ideas on:
- would 1E8 samples be doable with your implementation of random forests
(3 trees, 20 levels deep)?
- could I run this code on a cluster using different IPython engines, or
would that require a lot of work?
- PCA for dimensionality reduction? (in the paper they didn't use any
dimensionality reduction, so I am trying to avoid it)
- other implementations that I could use for large datasets?

PS: I am very new to this library, but I am already impressed!! It's one of
the cleanest and probably most intuitive machine learning libraries out
there, with pretty impressive documentation and tutorials. Amazing work!!

Thank you very much,
Youssef


####################################
####### Here is a code snippet:
####################################

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
import time
import numpy as np

# Toy stand-in for the real dataset (the real runs go up to 1E8 samples).
n_samples = 1000
n_features = 500
X, y = make_classification(n_samples, n_features, n_redundant=0,
                           n_informative=2, random_state=1,
                           n_clusters_per_class=1)

# Same forest parameters as the real experiment: 3 trees, 20 levels deep.
clf = RandomForestClassifier(max_depth=20, n_estimators=3,
                             criterion='entropy', n_jobs=-1, verbose=10)

rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)  # unused
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)

tic = time.time()
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print 'Time taken:', time.time() - tic, 'seconds'
print 'Test accuracy:', score
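
A rough way to estimate the ETA from the complexity above is to time fits on
a few growing subsets and extrapolate the constant; a minimal sketch (the
subset sizes and the 1E8 target are placeholders, and the constant will be
different on the real depth features):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import time

n_features = 500
sizes = [10000, 30000, 100000]  # placeholder subset sizes
times = []
for n in sizes:
    X, y = make_classification(n_samples=n, n_features=n_features,
                               n_informative=2, n_redundant=0, random_state=0)
    clf = RandomForestClassifier(n_estimators=3, max_depth=20,
                                 criterion='entropy', n_jobs=1)
    tic = time.time()
    clf.fit(X, y)
    times.append(time.time() - tic)

# Least-squares fit of c in t = c * n * d * log(n), then extrapolate.
model = np.array([n * n_features * np.log(n) for n in sizes])
c = np.dot(model, np.array(times)) / np.dot(model, model)
n_target = 1e8
eta_hours = c * n_target * n_features * np.log(n_target) / 3600.0
print 'projected training time for 1E8 samples: %.1f hours' % eta_hours
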
--
Youssef Barhomi, MSc, MEng.
Research Software Engineer at the CLPS department
Brown University
T: +1 (617) 797 9929 | GMT -5:00
Brian Holt
2013-04-25 04:54:30 UTC
Permalink
Hi Youssef,

You're trying to do exactly what I did. The first thing to note is that the
Microsoft guys don't precompute the features; rather, they compute them on
the fly. That means they only need enough memory to store the depth images,
and since they have a 1000-core cluster, computing the features is much less
of a problem for them.
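
For what it's worth, a minimal sketch of what computing such a
depth-difference feature on the fly could look like, loosely following the
f = d(x + u/d(x)) - d(x + v/d(x)) feature of the paper (the offsets, the
background constant and the synthetic depth map are made-up placeholders):

import numpy as np

def depth_feature(depth, px, py, u, v, background=1e6):
    # Depth-invariant probes: offsets are scaled by the depth at (px, py).
    d = depth[py, px]
    def probe(offset):
        ox = int(px + offset[0] / d)
        oy = int(py + offset[1] / d)
        if 0 <= oy < depth.shape[0] and 0 <= ox < depth.shape[1]:
            return depth[oy, ox]
        return background  # probes falling outside the image get a large depth
    return probe(u) - probe(v)

# Toy example on a synthetic 480x640 depth map.
depth = np.random.RandomState(0).uniform(0.5, 4.0, size=(480, 640))
print depth_feature(depth, px=320, py=240, u=(60.0, 0.0), v=(-60.0, 30.0))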

If you profile your program, my guess is that you'll find that the bottleneck
as you scale up to 1M samples and beyond is the argsorting of all your data.
I did some work to argsort a feature only when required, which made it a bit
slower but more tractable. Unfortunately the code base has changed a lot
since then, so my PR is out of date. You're welcome to pick it up and update
it for your own work, although I'm not sure it would be accepted upstream.

I'm sorry I can't be of more help - it's tricky trying to replicate work when
you have vastly different tools.

Regards
Brian
Gilles Louppe
2013-04-25 06:38:03 UTC
Permalink
Hi Youssef,

Regarding memory usage, you should know that it will basically blow up if you
increase the number of jobs. With the current implementation, you'll need
O(n_jobs * |X| * 2) memory (where |X| is the size of X in bytes). That issue
stems from the use of joblib, which basically forces us to duplicate the
dataset once for every process you spawn. In the end, this also induces a
huge overhead in CPU time (because of the back-and-forth transfers of all
these huge Python objects).
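
For reference, a quick back-of-the-envelope helper based on that
O(n_jobs * |X| * 2) figure (a sketch of the estimate described above, not a
measured profile):

import numpy as np

def forest_fit_peak_memory_gb(n_samples, n_features, n_jobs, dtype=np.float32):
    # Each worker process gets its own copy of X plus roughly another copy's
    # worth of working space, hence the n_jobs * 2 * |X| estimate.
    x_bytes = n_samples * n_features * np.dtype(dtype).itemsize
    return n_jobs * 2 * x_bytes / 1024.0 ** 3

print '1E6 x 500, n_jobs=24: %5.1f GB' % forest_fit_peak_memory_gb(int(1e6), 500, 24)
print '1E6 x 500, n_jobs=1:  %5.1f GB' % forest_fit_peak_memory_gb(int(1e6), 500, 1)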

There is one PR (https://github.com/joblib/joblib/pull/44) that tries to
solve this by allowing objects to be put into shared memory segments, but it
is still a work in progress.

Gilles
Youssef Barhomi
2013-04-25 13:42:03 UTC
Permalink
Ohh, that makes total sense now!! Thank you Gilles!!
Y
--
Youssef Barhomi, MSc, MEng.
Research Software Engineer at the CLPS department
Brown University
T: +1 (617) 797 9929 | GMT -5:00
Youssef Barhomi
2013-04-25 13:50:15 UTC
Permalink
Hi Brian,

Thanks for your feedback. Were you able to reproduce their results? How big
was the largest dataset you have processed so far with an RF?

The MS people used a distributed RF, so yes, I am guessing the features were
computed in parallel on all those cores. I am still new to the RF algorithm,
though, and I wonder how they parallelised it - by sending each tree node to
its own core? Also, I think they implemented a GPU version of their RF as
well (I am guessing that is what actually runs on the Xbox itself right now),
which should probably speed things up. The other option I am considering is
an online RF - any recommendations on that?

Thanks a lot!


Y
--
Youssef Barhomi, MSc, MEng.
Research Software Engineer at the CLPS department
Brown University
T: +1 (617) 797 9929 | GMT -5:00
Peter Prettenhofer
2013-04-25 08:02:02 UTC
Permalink
Hi Youssef,

Please make sure that you use the latest version of sklearn (>= 0.13) - we
made some enhancements to the sub-sampling procedure recently.

Looking at the RandomForest code, it seems that n_jobs=-1 should not be the
issue for the parallel training of the trees, since ``n_jobs =
min(cpu_count(), self.n_estimators)``, which should be just 3 in your case.
However, it will use cpu_count() processes to sort the feature values, so the
bottleneck might be there. Please try setting the n_jobs parameter to a
smaller constant (e.g. 4) and check whether it works better.

Having said that: 1E8 samples is pretty large - the largest dataset I've used
so far was merely 1E6, but I've heard of people using it for larger datasets
too (probably not 1E8 though).

Running the code on a cluster using IPython parallel should not be too hard -
RF is a pretty simple algorithm - you could either patch the existing code to
use IPython parallel instead of joblib.Parallel (see forest.py), or simply
write your own RF code which directly uses ``DecisionTreeClassifier``. Also,
you can likely skip bootstrapping - it doesn't help much IMHO and can make
the implementation a bit more "involved" - AFAIK the MSR guys didn't use
bootstrapping for their Kinect RF system...

When it comes to other implementations, you could look at rt-rank [1], which
is a parallel implementation of both GBRT and RF, and WiseRF [2], which is
compatible with sklearn but requires a license (free trial and academic
versions are available, AFAIK).

HTH,

Peter

[1] https://sites.google.com/site/rtranking/

[2] http://about.wise.io/
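
To illustrate the "write your own RF on top of ``DecisionTreeClassifier``"
route, here is a minimal local sketch without bootstrapping; fit_one_tree is
the unit of work one would ship to IPython parallel engines, and the sizes
and parameters below are toy placeholders:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_one_tree(X, y, seed):
    # No bootstrap: randomness comes only from the per-split feature
    # subsampling (max_features) and the seed.
    tree = DecisionTreeClassifier(criterion='entropy', max_depth=20,
                                  max_features='sqrt', random_state=seed)
    return tree.fit(X, y)

def predict_forest(trees, X):
    # Average per-tree class probabilities and pick the most likely class.
    proba = np.mean([t.predict_proba(X) for t in trees], axis=0)
    return trees[0].classes_[np.argmax(proba, axis=1)]

X, y = make_classification(n_samples=10000, n_features=500, n_informative=5,
                           n_redundant=0, random_state=0)
trees = [fit_one_tree(X, y, seed) for seed in range(3)]
print 'training accuracy: %.3f' % np.mean(predict_forest(trees, X) == y)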
Youssef Barhomi
2013-04-25 13:41:14 UTC
Permalink
Thank you very much Peter,

You are right about n_jobs - something was going wrong there. With n_jobs = -1
on the larger dataset (1E6 in this case), no CPU was being used and the
process hung for a while. Setting n_jobs = 1 made everything work.

Yes, I will look into IPython parallel and see if I can do that.

I have just tried WiseRF and it worked like a charm, with almost the same
accuracy as the sklearn RF and a 6x speedup so far. I was able to run a
1E6 x 500 dataset in 45 seconds with 14GB of RAM in use. I will try rt-rank
sometime today. I am now clearly memory bound - would you recommend an online
RF library at this point?
--
Youssef Barhomi, MSc, MEng.
Research Software Engineer at the CLPS department
Brown University
T: +1 (617) 797 9929 | GMT -5:00
Peter Prettenhofer
2013-04-25 14:17:32 UTC
Permalink

1E6 in 45 seconds - that's really good.

The memory consumption seems a little bit high though - for 1E6 x 500 I'd
expect roughly 4GB (assuming you use float64). What's the memory consumption
right _before_ you call WiseRF.fit? Probably your memory consumption peaks
during the feature extraction. Make sure you free all data structures except
the data array - usually, the Python interpreter won't hand memory back to
the operating system, so the memory consumption reported by top will be
higher than the actually allocated memory. To further reduce memory
consumption, make sure that your array has dtype np.float32; sklearn assumes
float32 and will actually copy a float64 array to float32; WiseRF does not do
this AFAIK. Still, 1E8 won't fit into your 52GB box.

I don't have much experience with streaming / online RF - please drop me a
note about your progress here.
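
A small sketch of the dtype point above: building the design matrix as
float32 (and C-contiguous) from the start halves the footprint of X and
avoids an extra conversion copy at fit time (the shapes and the chunked fill
are placeholders for the real feature extraction):

import numpy as np

n_samples, n_features = int(1e6), 500
X = np.empty((n_samples, n_features), dtype=np.float32, order='C')
y = np.empty(n_samples, dtype=np.int32)
# ... fill X and y chunk by chunk from the depth maps instead of building a
# float64 array first and converting afterwards ...
print 'X alone: %.1f GB' % (X.nbytes / 1024.0 ** 3)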
Youssef Barhomi
2013-04-26 13:40:16 UTC
Permalink
Thank you Peter. I found that the feature extraction was taking a lot of
extra memory and that this was not related to WiseRF, so you were right.
Actually, from "top" it seems the training part was using only about 20% more
memory than the dataset itself, which is pretty impressive. So at this point
I am simply memory bound by the dataset size. The only other ways to deal
with this would be PCA or a distributed random forest. The WiseRF people are
working on "sequoia", an RF that should run on the cloud, so I will
definitely use that when it's ready.
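
Short of PCA, one rough way to stay within RAM is to train a small forest per
chunk of the data and average the predicted probabilities at test time; a toy
sketch (the in-memory chunking below stands in for chunks loaded from disk,
and this is only an approximation - each tree sees a single chunk - not the
paper's distributed training):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def fit_chunk(X, y):
    clf = RandomForestClassifier(n_estimators=3, max_depth=20,
                                 criterion='entropy', n_jobs=1)
    return clf.fit(X, y)

def predict_chunked(forests, X):
    # Assumes every chunk contained samples of every class.
    proba = np.mean([f.predict_proba(X) for f in forests], axis=0)
    return forests[0].classes_[np.argmax(proba, axis=1)]

X, y = make_classification(n_samples=100000, n_features=500, n_informative=5,
                           n_redundant=0, random_state=0)
X = X.astype(np.float32)
chunk_idx = np.array_split(np.arange(len(y)), 5)  # stand-in for on-disk chunks
forests = [fit_chunk(X[idx], y[idx]) for idx in chunk_idx]
print 'training accuracy: %.3f' % np.mean(predict_chunked(forests, X) == y)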




--
Youssef Barhomi, MSc, MEng.
Research Software Engineer at the CLPS department
Brown University
T: +1 (617) 797 9929 | GMT -5:00
Ronnie Ghose
2013-04-25 14:22:43 UTC
Permalink
I've tried larger datasets. It wasn't pretty - with far fewer features, though.
unknown
1970-01-01 00:00:00 UTC
Permalink
--001a11c25c5a53c2b304db30148e
Content-Type: text/plain; charset=ISO-8859-1
Post by Youssef Barhomi
thank you very much Peter,
you are right about the n_jobs, something was going wrong with that. When
n_jobs = -1, for larger dataset (1E6 for this case), no cpu was being used
and the process was hanging for a while. getting n_jobs = 1 made everything
work.
yes, I will look into the iPython parallel and see if I can do that.
I have just tried wiseRF and it worked like a charm with almost the same
accuracy as the RF on sklearn with a 6x speedup so far. I was able to run a
1E6 x 500 dataset in 45 seconds with 14GB of RAM being used. I will try
rtranking sometimes today. Now I am memory bound obviously, would you
recommend an online RF library at this point?
1E6 in 45 seconds - that's really good

The memory consumption seems a little bit high though - for 1E6 x 500 I'd
expect roughly 4GB (assuming you use float64) - what's the memory
consumption right _before_ you call WiseRF.fit? Probably your memory
consumption peaks during the feature extraction. Make sure you free all
data structure except the data array - usually, the python interpreter
won't hand memory back to the operation system thus the memory consumption
reported by top will be higher than the actually allocated memory. To
further reduce memory consumption make sure that your array has dtype
np.float32; sklearn assumes float32 and will actually copy a float64 array
to float32; wiseRF does not do this AFAIK. Still, 1E8 won't fit into your
52GB box.

I don't have much experience with streaming / online RF - please drop me a
note about your progress here.
Post by Youssef Barhomi
Post by Youssef Barhomi
Post by Youssef Barhomi
Hello,
http://research.microsoft.com/pubs/145347/BodyPartRecognition.pdf with
different kinds of data (monkey depth maps instead of humans). So I am
generating my depth features and training and classifying data with a
random forest with quite similar parameters of the paper.
I would like to use sklearn.ensemble.RandomForestClassifier with 1E8
samples with 500 features. Since it seems to be a large dataset of feature
vectors, I did some trials with smaller subsets (1E4, 1E5, 1E6 samples) and
the last one seemed to be slower than a O(n_samples*n_features*log(n_samples))
http://scikit-learn.org/stable/modules/tree.html#complexity since 1E6
samples are taking a long time and I don't know when they will be done, I
would like better ways to estimate the ETA or find a way to speed up the
processing training. Also, I am watching my memory usage and I don't
seem to be swapping (29GB/48GB being used right now). The other thing is
that I requested n_jobs = -1 so it could use all cores of my machine (24
cores) but looking to my CPU usage, it doesn't seem to be using any of
them...
- would a 1E8 samples be doable with your implementation of random
forests (3 trees , 20 levels deep)?
- running this code on a cluster using different iPython engines? or
would that require a lot of work?
- PCA for dimensionality reduction? (on the paper, they haven't used any
dim reduction, so I am trying to avoid that)
- other implementations that I could use for large datasets?
PS: I am very new to this library but I am already impressed!! It's one
of the cleanest and probably most intuitive machine learning libraries out
there with a pretty impressive documentation and tutorials. Pretty amazing
work!!
Thank you very much,
Youssef
####################################
####################################
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
import time
import numpy as np
n_samples = 1000
n_features = 500
X, y = make_classification(n_samples, n_features, n_redundant=0,
n_informative=2,
random_state=1, n_clusters_per_class=1)
clf = RandomForestClassifier(max_depth , n_estimators=3, criterion >>> 'entropy', n_jobs = -1, verbose = 10)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
tic = time.time()
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print 'Time taken:', time.time() - tic, 'seconds'
--
Youssef Barhomi, MSc, MEng.
Research Software Engineer at the CLPS department
Brown University
T: +1 (617) 797 9929 | GMT -5:00
------------------------------------------------------------------------------
Try New Relic Now & We'll Send You this Cool Shirt
New Relic is the only SaaS-based application performance monitoring
service
that delivers powerful full stack analytics. Optimize and monitor your
browser, app, & servers with just a few lines of code. Try New Relic
and get this awesome Nerd Life shirt!
http://p.sf.net/sfu/newrelic_d2d_apr
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Try New Relic Now & We'll Send You this Cool Shirt
New Relic is the only SaaS-based application performance monitoring
service
that delivers powerful full stack analytics. Optimize and monitor your
browser, app, & servers with just a few lines of code. Try New Relic
and get this awesome Nerd Life shirt!
http://p.sf.net/sfu/newrelic_d2d_apr
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
Youssef Barhomi, MSc, MEng.
Research Software Engineer at the CLPS department
Brown University
T: +1 (617) 797 9929 | GMT -5:00
------------------------------------------------------------------------------
Try New Relic Now & We'll Send You this Cool Shirt
New Relic is the only SaaS-based application performance monitoring service
that delivers powerful full stack analytics. Optimize and monitor your
browser, app, & servers with just a few lines of code. Try New Relic
and get this awesome Nerd Life shirt! http://p.sf.net/sfu/newrelic_d2d_apr
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
Peter Prettenhofer

Andreas Mueller
2013-04-27 18:03:45 UTC
Permalink
Hi Youssef.
I would strongly advise you to use an image-specific random forest
implementation.
There is a very good implementation by some other MSRC people:
http://research.microsoft.com/en-us/downloads/03e0ca05-8aa9-49f6-801f-bb23846dc147/
It implements a much more complicated model, decision tree fields, but
can also be used for plain random forests.

Cheers,
Andy
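
One reason an image-specific implementation helps at this scale is that it can compute the depth-offset features on the fly instead of materialising the full n_samples x n_features matrix. Even with plain scikit-learn, the memory pressure drops a lot if the features are computed per image and only for a random subsample of pixels, as in the rough sketch below. This is only an illustration, not the paper's method or the DTF code Andreas links to: load_depth_images() and the toy depth_difference_features() are hypothetical stand-ins for a real data loader and for the paper's offset features.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def depth_difference_features(depth, pixels, offsets):
    # Toy stand-in for the paper's depth-offset features: for each sampled
    # pixel, record the depth difference between the pixel and a set of
    # fixed probe offsets (clipped at the image border).
    h, w = depth.shape
    feats = np.empty((len(pixels), len(offsets)), dtype=np.float32)
    for j, (dy, dx) in enumerate(offsets):
        ys = np.clip(pixels[:, 0] + dy, 0, h - 1)
        xs = np.clip(pixels[:, 1] + dx, 0, w - 1)
        feats[:, j] = depth[ys, xs] - depth[pixels[:, 0], pixels[:, 1]]
    return feats

rng = np.random.RandomState(0)
offsets = rng.randint(-20, 21, size=(50, 2))      # 50 random probe offsets
X_parts, y_parts = [], []
for depth, labels in load_depth_images():         # hypothetical generator of (depth map, per-pixel labels)
    idx = rng.randint(0, depth.size, size=2000)   # a few thousand pixels per image instead of all of them
    pixels = np.column_stack(np.unravel_index(idx, depth.shape))
    X_parts.append(depth_difference_features(depth, pixels, offsets))
    y_parts.append(labels[pixels[:, 0], pixels[:, 1]])

X = np.vstack(X_parts)              # stays float32, 4 bytes per value
y = np.concatenate(y_parts)
clf = RandomForestClassifier(n_estimators=3, max_depth=20, criterion='entropy')
clf.fit(X, y)
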
Post by Youssef Barhomi
Hello,
http://research.microsoft.com/pubs/145347/BodyPartRecognition.pdf with
different kinds of data (monkey depth maps instead of humans). So I am
generating my depth features and training and classifying data with a
random forest with quite similar parameters of the paper.
I would like to use sklearn.ensemble.RandomForestClassifier with 1E8
samples with 500 features. Since it seems to be a large dataset of
feature vectors, I did some trials with smaller subsets (1E4, 1E5, 1E6
samples) and the last one seemed to be slower than a
O(n_samples*n_features*log(n_samples)) according to
this: http://scikit-learn.org/stable/modules/tree.html#complexity since
1E6 samples are taking a long time and I don't know when they will be
done, I would like better ways to estimate the ETA or find a way to
speed up the processing training. Also, I am watching my memory usage
and I don't seem to be swapping (29GB/48GB being used right now). The
other thing is that I requested n_jobs = -1 so it could use all cores
of my machine (24 cores) but looking to my CPU usage, it doesn't seem
to be using any of them...
- would a 1E8 samples be doable with your implementation of random
forests (3 trees , 20 levels deep)?
- running this code on a cluster using different iPython engines? or
would that require a lot of work?
- PCA for dimensionality reduction? (on the paper, they haven't used
any dim reduction, so I am trying to avoid that)
- other implementations that I could use for large datasets?
PS: I am very new to this library but I am already impressed!! It's
one of the cleanest and probably most intuitive machine learning
libraries out there with a pretty impressive documentation and
tutorials. Pretty amazing work!!
Thank you very much,
Youssef
####################################
####################################
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
import time
import numpy as np
n_samples = 1000
n_features = 500
X, y = make_classification(n_samples, n_features, n_redundant=0,
n_informative=2,
random_state=1, n_clusters_per_class=1)
clf = RandomForestClassifier(max_depth=20, n_estimators=3, criterion =
'entropy', n_jobs = -1, verbose = 10)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
tic = time.time()
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print 'Time taken:', time.time() - tic, 'seconds'
--
Youssef Barhomi, MSc, MEng.
Research Software Engineer at the CLPS department
Brown University
T: +1 (617) 797 9929 | GMT -5:00
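
On the memory side of the question quoted above, it is worth doing the arithmetic once: 1e8 samples with 500 features comes to roughly 200 GB even at float32 (scikit-learn's trees work on float32 internally and will copy a float64 array to get there), so the full precomputed matrix cannot fit on a 48 GB machine regardless of which forest implementation is used. A quick back-of-the-envelope check:

import numpy as np

n_samples, n_features = int(1e8), 500
gb = n_samples * n_features * np.dtype(np.float32).itemsize / 1e9
print 'feature matrix alone: %.0f GB' % gb    # ~200 GB, before the forest itself is even grown

# Casting up front at least avoids an extra in-memory copy at fit time:
# X = np.asarray(X, dtype=np.float32)

So at that scale some combination of per-image pixel subsampling and on-the-fly feature computation is unavoidable on a single 48 GB box.
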
Youssef Barhomi
2013-04-29 14:22:13 UTC
Permalink
Thank you Andreas!


On Sat, Apr 27, 2013 at 2:03 PM, Andreas Mueller wrote:
Post by Andreas Mueller
Hi Youssef.
I would strongly advise you to use an image-specific random forest
implementation.
http://research.microsoft.com/en-us/downloads/03e0ca05-8aa9-49f6-801f-bb23846dc147/
It implements a much more complicated model, decision tree fields, but can
also be used for plain random forests.
Cheers,
Andy
--
Youssef Barhomi, MSc, MEng.
Research Software Engineer at the CLPS department
Brown University
T: +1 (617) 797 9929 | GMT -5:00