Discussion:
[Scikit-learn-general] pickled random forest file size, by design?
Dmitry Chichkov
2012-06-11 19:24:32 UTC
I'm pickling a random forest model (128 estimators, trained on 50k
examples) and the resulting .pkl size is on the order of 200MB.
Is that expected? The whole dataset size is only 400k...

Here's the code that reproduces it:

import pickle
import sklearn.ensemble

clf = sklearn.ensemble.RandomForestClassifier(n_estimators=128)
clf.fit(X=[[i % 6, i % 7, i % 8] for i in range(50000)],
        y=[i % 5 > 0 for i in range(50000)])
with open("test.pkl", "wb") as f:
    pickle.dump(clf, f)

Regards,
Dmitry
Gilles Louppe
2012-06-11 19:34:33 UTC
Hi Dmitry,

If you want to reduce the size of the pickled forest, I would advise the
following:

1) Use HIGHEST_PROTOCOL with pickle:
pickle.dump(clf, open("test.pkl", 'wb'), protocol=pickle.HIGHEST_PROTOCOL)

2) Increase the value of min_samples_split in order to reduce the total
number of leaves, and hence the size of the trees.

This should bring the file size down to a few dozen MB.
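
Putting the two together on your snippet would look something like this
(min_samples_split=10 is only an illustrative value to tune, not a
recommendation):

import pickle
from sklearn.ensemble import RandomForestClassifier

X = [[i % 6, i % 7, i % 8] for i in range(50000)]
y = [i % 5 > 0 for i in range(50000)]

# A larger min_samples_split means fewer splits, hence fewer leaves and
# smaller trees on disk.
clf = RandomForestClassifier(n_estimators=128, min_samples_split=10)
clf.fit(X, y)

# HIGHEST_PROTOCOL stores the underlying numpy arrays in binary form,
# which is far more compact than the text-based default protocol.
with open("test.pkl", "wb") as f:
    pickle.dump(clf, f, protocol=pickle.HIGHEST_PROTOCOL)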

Best,

Gilles
Peter Prettenhofer
2012-06-11 19:40:11 UTC
Hi Dmitry,

Gilles' estimate was pretty close: if you use joblib for serialization,
the resulting file size is 18 MB:

from sklearn.externals import joblib
joblib.dump(clf, "test.pkl", 9)

best,
Peter
--
Peter Prettenhofer
Mathieu Blondel
2012-06-12 04:46:02 UTC
Post by Peter Prettenhofer
joblib.dump(clf, "test.pkl", 9)
Note that the 3rd argument is the compression level. You can also write
it more explicitly:

joblib.dump(clf, "test.pkl", compress=9)

Mathieu
Emanuele Olivetti
2012-06-13 08:20:01 UTC
Hi,

You can use gzip.open() instead of open() to add compression, which can
decrease the file size a lot - at least it did for me in a similar example:

import gzip
import pickle

with gzip.open("test.pkl", "wb") as f:
    pickle.dump(clf, f, protocol=pickle.HIGHEST_PROTOCOL)

# To retrieve:
with gzip.open("test.pkl", "rb") as f:
    clf = pickle.load(f)

Best,

Emanuele

PS: If you try this, could you send feedback on the resulting file size? Thanks!
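
A small script along these lines should make the comparison easy (file
names are arbitrary, and clf is assumed to be the fitted forest from the
first message):

import gzip
import os
import pickle

from sklearn.externals import joblib

# clf: the fitted RandomForestClassifier from the snippet above.
with open("plain.pkl", "wb") as f:
    pickle.dump(clf, f, protocol=pickle.HIGHEST_PROTOCOL)
with gzip.open("gzipped.pkl", "wb") as f:
    pickle.dump(clf, f, protocol=pickle.HIGHEST_PROTOCOL)
joblib.dump(clf, "joblib.pkl", compress=9)

for name in ("plain.pkl", "gzipped.pkl", "joblib.pkl"):
    print("%s: %.1f MB" % (name, os.path.getsize(name) / 1e6))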
Olivier Grisel
2012-06-13 08:52:31 UTC
Post by Emanuele Olivetti
You can use gzip.open() instead of open() to add compression, which can
decrease the file size a lot.
Note that joblib can do this by passing a compression level to `dump`,
as explained by @pprett and @mblondel. The joblib pickler is also
smarter (faster) than the default Python pickler at serializing large
numerical arrays.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Emanuele Olivetti
2012-06-15 09:51:38 UTC
Post by Olivier Grisel
Note that joblib can do this by passing a compression level to `dump`,
as explained by @pprett and @mblondel. The joblib pickler is also
smarter (faster) than the default Python pickler at serializing large
numerical arrays.
I ran some preliminary tests with a 5000x5000 random matrix and observed
more or less the same results. I see that joblib uses pickle + zlib +
pickle.HIGHEST_PROTOCOL, so that is not a big surprise. Are there settings
in which joblib.dump() is expected to provide larger gains?
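
For a single array, my understanding is that the compressed dump amounts
to roughly the following (a rough reconstruction of the idea, not
joblib's actual code path):

import pickle
import zlib

import numpy as np

# Random floats compress poorly, so the gain here is modest by design.
X = np.random.rand(5000, 5000)

raw = pickle.dumps(X, protocol=pickle.HIGHEST_PROTOCOL)
compressed = zlib.compress(raw, 9)
print("raw: %.1f MB, compressed: %.1f MB"
      % (len(raw) / 1e6, len(compressed) / 1e6))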

Of course, the joblib.dump solution has a much more concise syntax than
pickle+gzip, which is a welcome plus.

Best,

Emanuele
