Discussion: [Scikit-learn-general] Joblib dump memory error
Ak
2012-11-17 06:25:37 UTC
Hello,
I am dumping the dataset vectorized with TfidfVectorizer, the target array, and
the classifier OneVsRestClassifier(SGDClassifier(loss='log', n_iter=50,
alpha=0.00001)), since I want to add it to a package. I use the joblib library
from sklearn.externals to dump the vectors. The maximum memory used while
training the classifier is 12 GB; however, when the program starts dumping the
classifier, the usage jumps to 38 GB (which I assume is due to some internal
copy?). I have about 32 GB of RAM, so is there a better way to store the
classifier than joblib.dump(compress=9)? [I tried compress=3, 5, 7, and 9, and
always get a memory error.] If I do not compress, the vectors total about 11 GB.
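For reference, a minimal sketch of the setup described above; the corpus,
targets, and output file names are placeholders, and recent scikit-learn
releases renamed some of these parameters (n_iter -> max_iter, loss='log' ->
'log_loss') and ship joblib as a separate package:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.linear_model import SGDClassifier
    from sklearn.externals import joblib  # plain `import joblib` on recent versions

    docs = ["some text ...", "more text ..."]  # placeholder corpus
    y = [0, 1]                                 # placeholder targets

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)

    # n_iter / loss='log' follow the thread; recent releases use max_iter / 'log_loss'
    clf = OneVsRestClassifier(SGDClassifier(loss='log', n_iter=50, alpha=0.00001))
    clf.fit(X, y)

    joblib.dump(vectorizer, 'vectorizer.joblib', compress=9)
    joblib.dump(y, 'target.joblib', compress=9)
    joblib.dump(clf, 'classifier.joblib', compress=9)  # the step that runs out of memory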
Thanks
Olivier Grisel
2012-11-17 13:45:20 UTC
The problem is likely the `vocabulary_` Python dict of the
CountVectorizer. It's pickled using the default Python pickler, which
is probably not very efficient.

Anyway, for large text data, using a hashing vectorizer would be a much
better solution.

You can follow progress on this branch, which should soon be merged into
master: https://github.com/scikit-learn/scikit-learn/pull/909

And maybe later there will be a HashingTextVectorizer that takes text
data directly as input and applies tokenization plus token / char n-gram
vectorization, using a hash function instead of a Python dict to handle
the feature-name-to-feature-index mapping.
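For illustration, a sketch of the stateless hashing approach, assuming the
HashingVectorizer API that later landed in sklearn.feature_extraction.text
(the exact class name and parameters below are not from the thread). Because
the hashing trick replaces the vocabulary_ dict with a hash function, the
vectorizer holds no large fitted state and is cheap to pickle:

    from sklearn.feature_extraction.text import HashingVectorizer

    docs = ["some text ...", "more text ..."]  # placeholder corpus

    # No vocabulary_ dict is built: the token -> column mapping is the hash
    # itself, so there is no fit step and nothing large to serialize.
    vectorizer = HashingVectorizer(n_features=2 ** 20, ngram_range=(1, 2))
    X = vectorizer.transform(docs)

Note that the hashing trick alone provides no IDF weighting; a TfidfTransformer
can be chained after it if IDF weighting is needed.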
Ak
2012-11-19 20:53:05 UTC
Post by Olivier Grisel
The problem is likely the `vocabulary_` Python dict of the
CountVectorizer. It's pickled using the default Python pickler, which
is probably not very efficient.
Anyway, for large text data, using a hashing vectorizer would be a much
better solution.
You can follow progress on this branch, which should soon be merged into
master: https://github.com/scikit-learn/scikit-learn/pull/909
And maybe later there will be a HashingTextVectorizer that takes text
data directly as input and applies tokenization plus token / char n-gram
vectorization, using a hash function instead of a Python dict to handle
the feature-name-to-feature-index mapping.
That sounds like an efficient way to vectorize the input. However, I hit the
memory error when dumping the classifier object with compression on. [I already
dump the vectorizer and target array with joblib.dump(compress=9), and that
seems to go fine.]
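One possible workaround, not suggested in the thread: dump the classifier
without in-memory compression (joblib then streams the numpy arrays to separate
files instead of building one large compressed buffer in RAM) and compress the
resulting files on disk afterwards. The file name below is hypothetical:

    import gzip
    import shutil
    from sklearn.externals import joblib  # plain `import joblib` on recent versions

    # `clf` is the trained OneVsRestClassifier; compress=0 (the default) avoids
    # the large in-memory compression buffer. dump() returns the list of files
    # it wrote.
    filenames = joblib.dump(clf, 'classifier.joblib')

    # gzip each dumped file on disk afterwards
    for name in filenames:
        with open(name, 'rb') as src, gzip.open(name + '.gz', 'wb') as dst:
            shutil.copyfileobj(src, dst)

joblib.load expects the uncompressed files, so they would have to be
decompressed again before loading.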
Olivier Grisel
2012-11-19 21:01:51 UTC
Can you please open a GitHub issue with a standalone, minimal
script (possibly with randomly generated data or a small excerpt of
your data)?

You can use http://gist.github.com as a temporary GitHub repo to put
both the script and the data in one place.
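A sketch of the kind of standalone reproduction script requested, using
randomly generated data; the shapes and parameters below are made up and would
need to be scaled up to actually trigger the memory error:

    import numpy as np
    from scipy import sparse
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.linear_model import SGDClassifier
    from sklearn.externals import joblib  # plain `import joblib` on recent versions

    rng = np.random.RandomState(42)
    # random sparse feature matrix standing in for the TF-IDF vectors
    X = sparse.random(10000, 100000, density=0.001, format='csr', random_state=rng)
    y = rng.randint(0, 20, size=10000)  # random multiclass targets

    # n_iter / loss='log' follow the thread; recent releases use max_iter / 'log_loss'
    clf = OneVsRestClassifier(SGDClassifier(loss='log', n_iter=5, alpha=1e-5))
    clf.fit(X, y)

    joblib.dump(clf, 'classifier.joblib', compress=9)  # the step that triggers the error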
