Discussion:
[Scikit-learn-general] Upcoming joblib release
Gael Varoquaux
2011-02-27 23:24:45 UTC
Permalink
Hi,

I was looking at huge parallel for loops run with joblib.Parallel (to be
precise, in scikits.learn's GridSearchCV) and I realized that as joblib
was dispatching immediately to sub-processes, it could create huge
temporaries. Thus I refactored the Parallel engine to enable late
dispatching of the jobs:

>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> def producer():
...     for i in range(6):
...         print 'Produced %s' % i
...         yield i
>>> out = Parallel(n_jobs=2, verbose=1, pre_dispatch='1.5*n_jobs')(
...           delayed(sqrt)(i) for i in producer())
Produced 0
Produced 1
Produced 2
[Parallel(n_jobs=2)]: Done 1 out of 3+ |elapsed: 0.0s remaining: 0.0s
Produced 3
[Parallel(n_jobs=2)]: Done 2 out of 4+ |elapsed: 0.0s remaining: 0.0s
Produced 4
[Parallel(n_jobs=2)]: Done 3 out of 5+ |elapsed: 0.0s remaining: 0.0s
...
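The idea of late dispatching can be sketched with nothing but the standard library: a dispatcher that keeps at most a fixed number of pending items pulled from the generator, instead of materialising everything up front. This is an illustrative toy (lazy_dispatch and its default are made up here), not joblib's actual implementation:

```python
from itertools import count

def lazy_dispatch(producer, worker, pre_dispatch=3):
    """Consume `producer` lazily: keep at most `pre_dispatch`
    pending items in memory instead of materialising them all."""
    results = []
    pending = []
    for item in producer:
        pending.append(item)
        if len(pending) >= pre_dispatch:
            # "run" the oldest pending job before pulling more input
            results.append(worker(pending.pop(0)))
    # drain the jobs still pending once the producer is exhausted
    results.extend(worker(x) for x in pending)
    return results

print(lazy_dispatch(iter(range(6)), lambda x: x * x))  # [0, 1, 4, 9, 16, 25]
```

The point is that the producer is never fully unrolled: at most `pre_dispatch` temporaries exist at any time, which is what keeps memory bounded for huge input collections.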

I am planning to release joblib 0.5.0 with this feature in a few days.
The release will also contain small improvements that make joblib's
caching engine more robust when used with many processes.

The soon-to-be-released code can be found in the 0.5.X branch.

I am planning to use this in the near future to improve parallelism in
scikits.learn's GridSearchCV.

Any feedback is more than welcome.

Gael
Olivier Grisel
2011-02-28 11:22:46 UTC
Permalink
Post by Gael Varoquaux
Any feedback is more than welcome.
Good work :)
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gael Varoquaux
2011-02-28 12:52:59 UTC
Permalink
Post by Olivier Grisel
Post by Gael Varoquaux
Any feedback is more than welcome.
Good work :)
Thanks. Your consideration warms my heart (sincerely).

Now, the big deal, as far as the scikit is concerned: I have refactored
GridSearchCV to be able to distribute not only the different parameters,
but also the different folds, across the various CPUs.

The reason I did this is that I am currently fitting a SVM to largish
data with a GridSearch and 3-fold cross-validation. I have a 12 CPU box,
and most of the time, most of the CPUs were not doing anything. Indeed,
a small number of parameter sets on the grid dominate the computation
time. This is often the case in my experience.

In the branch:
https://github.com/GaelVaroquaux/scikit-learn/tree/grid_search
each fold is fitted in parallel. Thus the different folds of the costly
grid points are dispatched across CPUs. For a 3-fold CV, this almost
gives a factor of 3 speedup on my box on my specific problem (as the
number of CPUs is large compared to the number of folds) and the
computational time is really dominated by 1 point.

The danger is to blow the memory by dispatching a huge amount of jobs
with different datasets. Thus the work I did in joblib with the
pre_dispatch :).
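The flattening described above can be sketched roughly as follows; fit_and_score, param_grid and folds are hypothetical stand-ins, not the actual scikits.learn code:

```python
from itertools import product

# Hypothetical stand-in for the real estimator / CV machinery.
def fit_and_score(params, fold):
    # In the real code this would fit a clone of the estimator on the
    # training split of `fold` and return its validation score.
    return {'params': params, 'fold': fold, 'score': 0.0}

param_grid = [{'C': 1.0}, {'C': 10.0}]
folds = [0, 1, 2]

# Flatten (parameter set, fold) pairs into a single job list, so that
# the folds of a costly grid point can land on different CPUs.
jobs = list(product(param_grid, folds))
results = [fit_and_score(p, f) for p, f in jobs]
print(len(jobs))  # 2 parameter sets x 3 folds = 6 jobs
```

With one job per (parameter set, fold) pair, a single expensive grid point no longer serialises its folds on one CPU, which is where the near factor-of-3 speedup comes from.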

Now, I still need to be convinced that I haven't introduced a bug in the
way the scores are computed. I need to run this a bit more on my data. If
you want to give it a look / a try, feedback is welcome (and yes, I know,
the code for unrolling the parallel loop is hard to read :$ ).

G
Alexandre Gramfort
2011-03-01 02:10:55 UTC
Permalink
Post by Olivier Grisel
Post by Gael Varoquaux
Any feedback is more than welcome.
Good work :)
+1 :)

Looking forward to testing it!

Alex
xinfan meng
2011-03-03 03:01:09 UTC
Permalink
I saw this in
http://scikit-learn.sourceforge.net/dev/auto_examples/applications/wikipedia_principal_eigenvector.html

# disabling joblib as the pickling of large dicts seems much too slow
#@memory.cache

I wonder if it would be problematic when dealing with large datasets.

Also, I found this:
http://stackoverflow.com/questions/5082451/how-can-i-make-a-large-python-data-structure-more-efficient-to-unpickle
https://github.com/piskvorky/gensim/blob/master/src/gensim/corpora/wikicorpus.py

which might be relevant.

--
Best Wishes
--------------------------------------------
Meng Xinfan蒙新泛
Institute of Computational Linguistics
Department of Computer Science & Technology
School of Electronic Engineering & Computer Science
Peking University
Beijing, 100871
China
Gael Varoquaux
2011-03-03 06:43:35 UTC
Permalink
Interesting points that you raise. I should however note that they are
more related to the caching part of joblib than to the parallel computing
part. These parts are not yet heavily linked.
Post by xinfan meng
I saw this in
[1]http://scikit-learn.sourceforge.net/dev/auto_examples/applications/wikipedia_principal_eigenvector.html
# disabling joblib as the pickling of large dicts seems much too slow
Yes, dictionaries will be slow. What will be really fast is anything in
which the data is stored in a small number of very large arrays.
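A small standard-library illustration of the point: a dict has to be rebuilt entry by entry on unpickling, while flat buffers (here array.array standing in for numpy arrays) round-trip essentially as raw memory blocks:

```python
import pickle
import time
from array import array

n = 200_000
as_dict = {i: i + 1 for i in range(n)}   # many small objects
keys = array('q', as_dict.keys())        # one flat buffer
vals = array('q', as_dict.values())      # one flat buffer

t0 = time.perf_counter()
pickle.loads(pickle.dumps(as_dict, protocol=pickle.HIGHEST_PROTOCOL))
dict_time = time.perf_counter() - t0

t0 = time.perf_counter()
k2, v2 = pickle.loads(pickle.dumps((keys, vals),
                                   protocol=pickle.HIGHEST_PROTOCOL))
array_time = time.perf_counter() - t0

# The two buffers are copied wholesale; the dict must rebuild
# (and re-hash) every one of its entries on load.
assert dict(zip(k2, v2)) == as_dict
print('dict: %.4fs  arrays: %.4fs' % (dict_time, array_time))
```

The exact timings depend on the machine, but the structural difference (per-entry reconstruction versus bulk buffer copies) is what makes large-array-based storage fast.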
Post by xinfan meng
I wonder if it would be problematic when dealing with large datasets.
[2]http://stackoverflow.com/questions/5082451/how-can-i-make-a-large-python-data-structure-more-efficient-to-unpickle
Out of the suggestions, some of them are to modify your objects so that
they return more compact structures when pickling. This is of course a
good suggestion, but that requires fixing each object.

Another one is to store to HDF5 using pytables. I think that this is a
great idea, and would probably improve performance significantly. We
cannot do it by default, as we cannot add an HDF5 dependency. However,
I would gladly welcome a pytables pickler in joblib (
https://github.com/joblib/joblib/blob/master/joblib/numpy_pickle.py
). I think that somebody (Pauli Virtanen?) has already coded part of
an HDF5 pickler.
Post by xinfan meng
[3]https://github.com/piskvorky/gensim/blob/master/src/gensim/corpora/wikicorpus.py
which might be relevent.
I don't really think that this is very relevant to us: the trick employed
there is to build a very compact data structure that is very specific to
the task at hand. While this is a great idea and will definitely work, I
don't think that this is something that we can do in scikit-learn or
joblib. I think that it should be done by the users, who know their
usecases and their data.

Cheers,

Gaël
--
Gael Varoquaux
Research Fellow, INSERM
Associate researcher, INRIA
Laboratoire de Neuro-Imagerie Assistee par Ordinateur
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-78-35
Mobile: ++ 33-6-28-25-64-62
http://gael-varoquaux.info
Olivier Grisel
2011-03-03 10:26:31 UTC
Permalink
Post by Gael Varoquaux
Interesting points that you raise. I should however note that they are
more related to the caching part of joblib than to the parallel computing
part. These parts are not yet heavily linked.
   I saw this in
   [1]http://scikit-learn.sourceforge.net/dev/auto_examples/applications/wikipedia_principal_eigenvector.html
   # disabling joblib as the pickling of large dicts seems much too slow
Yes, dictionnaries will be slow. What will be really fast is anything in
which the data is stored in a small number of very large arrays.
To give details on this use case: this dictionary is storing the
redirection table of Wikipedia (URL to URL). It's a couple of million
entries IIRC. I wonder why python is not able to perform a raw memory
dump of the hash table rather than rehashing everything when
unpickling. Maybe they want the pickled output to be deterministic? A
better solution might be to store the dict as a btree (rather than the
hash table used by the default dict implementation in python), which
has a deterministic order of key / value pairs. Unfortunately this does
not exist as part of the standard python library as far as I know.
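The determinism point can be checked with the standard library: pickling the sorted key/value pairs, as a btree with a deterministic ordering would produce, gives byte-identical output regardless of insertion order:

```python
import pickle

# Two dicts with identical contents but different insertion order.
a = {'x': 1, 'y': 2}
b = {'y': 2, 'x': 1}

# Their direct pickles can differ, since entries are written out in
# iteration (insertion) order...
direct_differs = pickle.dumps(a) != pickle.dumps(b)

# ...whereas serialising the sorted key/value pairs, as a
# deterministically ordered container would, is always byte-identical.
canon_a = pickle.dumps(sorted(a.items()))
canon_b = pickle.dumps(sorted(b.items()))
print(direct_differs, canon_a == canon_b)
```

This only shows the determinism property, of course; it does not recover the other btree advantage (cheap reconstruction without rehashing).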
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Paolo Losi
2011-03-05 11:48:22 UTC
Permalink
Hi Olivier,
Post by Olivier Grisel
To give details on this use case: this dictionary is storing the
redirection table of Wikipedia (URL to URL). It's a couple of millions
of entries IIRC. I wonder why python is not able to perform a raw
memory dump the hash table rather that rehashing everything when
unpickling. Maybe they want the pickled output to be deterministic? A
better solution might be to store the dict as a btree (and not the
hash table used by the default dict implementation in python) that has
a deterministic order of key / value pairs. Unfortunately this does
not exist as part of the standard python library as far as I know.
for big dictionaries that are much more read than written, we've had
very good experience with the cdb [1] format and mmap.

We are using pure-cdb [2] with the C extension disabled (it is somewhat
buggy); we then use marshal (which is faster than cPickle) for
[de]serializing the values.

The load time is almost O(1) and the lookups are very fast.

mmap can also be used by numpy when mmap_mode is set.

[1] http://en.wikipedia.org/wiki/Cdb_(software)
[2] http://code.google.com/p/python-pure-cdb/
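A minimal sketch of the marshal part only (the cdb/mmap machinery is omitted, and the value is a made-up example record). Note that marshal handles only core built-in types and its on-disk format is Python-version specific, so it suits a local read-mostly store rather than an interchange format:

```python
import marshal
import pickle

# A made-up value record, of the plain built-in types marshal supports.
value = {'title': 'Main Page', 'redirects': ['MainPage', 'Home']}

# marshal is a simpler (often faster) codec than pickle for such data.
blob = marshal.dumps(value)
assert marshal.loads(blob) == value

# Round-trip through pickle for comparison; both recover the value.
assert pickle.loads(pickle.dumps(value)) == value
print(len(blob), len(pickle.dumps(value)))
```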


Paolo
