Discussion:
[Scikit-learn-general] pickled random forest file size, by design?
Dmitry Chichkov
2012-06-11 19:24:32 UTC
I'm pickling a random forest model (128 estimators, trained on 50k
examples) and the resulting .pkl size is on the order of 200MB.
Is that expected? The whole dataset size is only 400k...

Here's the code that reproduces it:

import pickle
import sklearn.ensemble

clf = sklearn.ensemble.RandomForestClassifier(n_estimators=128)
clf.fit(X=[[i % 6, i % 7, i % 8] for i in range(50000)],
        y=[i % 5 > 0 for i in range(50000)])
with open("test.pkl", "wb") as f:
    pickle.dump(clf, f)

Regards,
Dmitry
Gilles Louppe
2012-06-11 19:34:33 UTC
Hi Dmitry,

If you want to reduce the size of the pickled forest, I would advise the
following:

1) Use HIGHEST_PROTOCOL with pickle:
pickle.dump(clf, open("test.pkl", 'wb'), protocol=pickle.HIGHEST_PROTOCOL)

2) Increase the value of min_samples_split in order to reduce the total
number of leaves, and hence the size of the trees.

This should bring the file size down to a few dozen MB.
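
Putting the two together on your snippet would look something like this
(min_samples_split=10 is only an illustrative value to tune, not a
recommendation):

import pickle
from sklearn.ensemble import RandomForestClassifier

X = [[i % 6, i % 7, i % 8] for i in range(50000)]
y = [i % 5 > 0 for i in range(50000)]

# A larger min_samples_split means fewer splits, hence fewer leaves and
# smaller trees on disk.
clf = RandomForestClassifier(n_estimators=128, min_samples_split=10)
clf.fit(X, y)

# HIGHEST_PROTOCOL stores the underlying numpy arrays in binary form,
# which is far more compact than the text-based default protocol.
with open("test.pkl", "wb") as f:
    pickle.dump(clf, f, protocol=pickle.HIGHEST_PROTOCOL)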

Best,

Gilles
Peter Prettenhofer
2012-06-11 19:40:11 UTC
Hi Dmitry,

Gilles' estimate was pretty close: if you use joblib for serialization,
the resulting file size is 18 MB:

from sklearn.externals import joblib
joblib.dump(clf, "test.pkl", 9)

best,
Peter
--
Peter Prettenhofer
Mathieu Blondel
2012-06-12 04:46:02 UTC
Post by Peter Prettenhofer
joblib.dump(clf, "test.pkl", 9)
Note that the 3rd argument is the compression level. You can also write
it more explicitly:

joblib.dump(clf, "test.pkl", compress=9)

Mathieu
Emanuele Olivetti
2012-06-13 08:20:01 UTC
Hi,

You can use gzip.open() instead of open() to add compression, which can
decrease the file size a lot - at least it did for me in a similar example:

import gzip
import pickle

with gzip.open("test.pkl", "wb") as f:
    pickle.dump(clf, f, protocol=pickle.HIGHEST_PROTOCOL)

# To retrieve:
with gzip.open("test.pkl", "rb") as f:
    clf = pickle.load(f)

Best,

Emanuele

PS: If you try this, could you send feedback on the resulting file size? Thanks!
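
A small script along these lines should make the comparison easy (file
names are arbitrary, and clf is assumed to be the fitted forest from the
first message):

import gzip
import os
import pickle

from sklearn.externals import joblib

# clf: the fitted RandomForestClassifier from the snippet above.
with open("plain.pkl", "wb") as f:
    pickle.dump(clf, f, protocol=pickle.HIGHEST_PROTOCOL)
with gzip.open("gzipped.pkl", "wb") as f:
    pickle.dump(clf, f, protocol=pickle.HIGHEST_PROTOCOL)
joblib.dump(clf, "joblib.pkl", compress=9)

for name in ("plain.pkl", "gzipped.pkl", "joblib.pkl"):
    print("%s: %.1f MB" % (name, os.path.getsize(name) / 1e6))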
Olivier Grisel
2012-06-13 08:52:31 UTC
Post by Emanuele Olivetti
You can use gzip.open() instead of open() to add compression, which can
decrease the file size a lot.
Note that joblib can do this by passing a compression level to `dump`,
as explained by @pprett and @mblondel. The joblib pickler is also
smarter (faster) than the default Python pickler at serializing large
numerical arrays.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Emanuele Olivetti
2012-06-15 09:51:38 UTC
Post by Olivier Grisel
Note that joblib can do this by passing a compression level to `dump`,
as explained by @pprett and @mblondel. The joblib pickler is also
smarter (faster) than the default Python pickler at serializing large
numerical arrays.
I ran some preliminary tests with a 5000x5000 random matrix and observed
more or less the same results. I see that joblib uses pickle + zlib +
pickle.HIGHEST_PROTOCOL, so that is not a big surprise. Are there settings
in which joblib.dump() is expected to provide larger gains?
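
For a single array, my understanding is that the compressed dump amounts
to roughly the following (a rough reconstruction of the idea, not
joblib's actual code path):

import pickle
import zlib

import numpy as np

# Random floats compress poorly, so the gain here is modest by design.
X = np.random.rand(5000, 5000)

raw = pickle.dumps(X, protocol=pickle.HIGHEST_PROTOCOL)
compressed = zlib.compress(raw, 9)
print("raw: %.1f MB, compressed: %.1f MB"
      % (len(raw) / 1e6, len(compressed) / 1e6))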

Of course, the joblib.dump solution has a much more concise syntax than
pickle+gzip, which is a welcome plus.

Best,

Emanuele
