Discussion:
[Scikit-learn-general] [scikit-learn-general] Why sklearn RandomForest model take a lot of disk space after save?
Piotr Płoński
2016-04-10 00:13:03 UTC
Hi All,

I am saving a RandomForestClassifier model from the sklearn library with the
code below:

with open('/tmp/rf.model', 'wb') as f: cPickle.dump(RF_model, f)

It takes a lot of space on my hard drive. There are only 50 trees in the
model, yet it takes over 50 MB on disk (the analyzed dataset is ~20 MB,
with 21 features). Does anybody have an idea why? I observe similar behavior
for ExtraTreesClassifier.

Best,

Piotr
Joel Nothman
2016-04-10 12:25:40 UTC
If you're running a random forest with default parameters (max_depth=None,
min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0, max_leaf_nodes=None), the
size of the tree will tend towards the size of the dataset. Change some of
these parameters to reduce overfitting and model size.
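A rough way to see why the defaults lead to large trees: a binary tree with L leaves has 2L - 1 nodes, and with min_samples_leaf=1 a fully grown tree can have up to one leaf per training sample. The sketch below computes that upper bound only; the function name and the sample count of 100,000 are hypothetical, and a real tree can be smaller when samples become pure before reaching single-sample leaves.

```python
def max_nodes_per_tree(n_samples, min_samples_leaf=1,
                       max_leaf_nodes=None, max_depth=None):
    """Upper bound on the node count of one binary decision tree."""
    # Every leaf must hold at least min_samples_leaf samples.
    max_leaves = max(1, n_samples // min_samples_leaf)
    if max_leaf_nodes is not None:
        max_leaves = min(max_leaves, max_leaf_nodes)
    if max_depth is not None:
        # A tree of depth d has at most 2**d leaves.
        max_leaves = min(max_leaves, 2 ** max_depth)
    # Each internal node has exactly two children, so L leaves imply 2L - 1 nodes.
    return 2 * max_leaves - 1


n = 100_000  # hypothetical number of training samples
print(max_nodes_per_tree(n))                       # 199999
print(max_nodes_per_tree(n, min_samples_leaf=10))  # 19999
print(max_nodes_per_tree(n, max_depth=10))         # 2047
```

Under these assumptions, raising min_samples_leaf or capping max_depth shrinks the bound by one to two orders of magnitude, which is the effect described above.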
------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications
Manager
Applications Manager provides deep performance insights into multiple
tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial! http://pubads.g.doubleclick.net/
gampad/clk?id=1444514301&iu=/ca-pub-7940484522588532
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Mathieu Blondel
2016-04-10 13:24:21 UTC
You may also want to save your model using joblib (possibly with
compression enabled) instead of cPickle.

Mathieu
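To illustrate the effect of compression without depending on joblib, here is a minimal sketch that uses pickle plus gzip from the standard library as a stand-in. The "forest" is a hypothetical structure of plain lists, not a fitted sklearn model, so the ratio it shows is only indicative. With joblib itself, the corresponding call would be joblib.dump(model, path, compress=3).

```python
import gzip
import pickle
import random

random.seed(0)

# Hypothetical stand-in for a fitted forest: per-tree node arrays as plain lists.
n_trees, n_nodes = 5, 2000
forest = [
    {
        "children_left": list(range(1, n_nodes + 1)),
        "feature": [random.randrange(21) for _ in range(n_nodes)],
        "threshold": [random.random() for _ in range(n_nodes)],
    }
    for _ in range(n_trees)
]

# Uncompressed pickle versus the same bytes passed through gzip.
raw = pickle.dumps(forest, protocol=pickle.HIGHEST_PROTOCOL)
packed = gzip.compress(raw, compresslevel=3)

# Loading reverses the two steps.
restored = pickle.loads(gzip.decompress(packed))

print(len(raw), len(packed), restored == forest)
```

Integer fields such as child and feature indices tend to compress well, while random-looking float thresholds do not, so compression helps but does not remove the cost of storing many nodes.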
Piotr Płoński
2016-04-10 13:28:05 UTC
Thanks for the comments! I put more details of my problem here:
http://stackoverflow.com/questions/36523989/why-sklearn-randomforest-model-take-a-lot-of-disk-space-after-save

Indeed, saving with joblib takes less space, but the model still uses a lot
of space on disk.

Best,
Piotr
Andreas Mueller
2016-04-11 16:11:17 UTC
Which version of scikit-learn are you using?
We recently (0.17) removed storing of data point indices in trees which
greatly reduced the size in some cases.
Piotr Płoński
2016-04-11 16:17:53 UTC
I am using 0.17.1. Did you consider writing custom save methods for this
classifier?
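As an illustration of what a custom save method might look like, here is a minimal sketch that packs the split structure of one tree into fixed-width binary fields with the struct module. The format, the function names, and the convention that -2 marks a leaf are all hypothetical and not part of scikit-learn. The sketch also leaves out the per-leaf class counts that prediction needs, and storing thresholds as float32 is lossy compared with float64.

```python
import struct


def pack_tree(features, thresholds, left, right):
    """Serialize parallel node arrays into a compact little-endian byte string."""
    n = len(features)
    return b"".join([
        struct.pack("<I", n),                # node count
        struct.pack(f"<{n}h", *features),    # int16 feature index per node
        struct.pack(f"<{n}f", *thresholds),  # float32 split threshold per node
        struct.pack(f"<{n}i", *left),        # int32 index of left child
        struct.pack(f"<{n}i", *right),       # int32 index of right child
    ])


def unpack_tree(blob):
    """Inverse of pack_tree; returns four lists."""
    (n,) = struct.unpack_from("<I", blob, 0)
    offset = 4
    features = list(struct.unpack_from(f"<{n}h", blob, offset))
    offset += 2 * n
    thresholds = list(struct.unpack_from(f"<{n}f", blob, offset))
    offset += 4 * n
    left = list(struct.unpack_from(f"<{n}i", blob, offset))
    offset += 4 * n
    right = list(struct.unpack_from(f"<{n}i", blob, offset))
    return features, thresholds, left, right


# A three-node tree: one split on feature 0 at 0.5, with two leaves.
tree = ([0, -2, -2], [0.5, -2.0, -2.0], [1, -1, -1], [2, -1, -1])
blob = pack_tree(*tree)
print(len(blob))  # 4 + 14 * 3 = 46 bytes
```

This layout costs 14 bytes per node plus a 4-byte header, so any saving comes from narrower types rather than from storing fewer nodes.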
Sebastian Raschka
2016-04-11 16:47:12 UTC
Just curious how it could be made more efficient. ~14.9 MB for 50 trees on a 20 MB dataset doesn't sound too bad, actually, since we are not pruning the trees in random forests. One thing I could think of would be to summarize similar trees in buckets, or to build a "fragment" library of shared decision rules. However, I am not sure how much effort it would take to implement such a thing, and the computational efficiency may suffer. Hm, I am curious: how large would a single, fully grown decision tree be on your dataset?
Joel Nothman
2016-04-12 01:28:45 UTC
Yes, there are no doubt more efficient ways to store forests, but it
seems unlikely to be a worthwhile investment.

I think this is a documentation rather than an engineering issue. We
frequently get issues raised that relate to "size": runtime, memory
consumption, model size on disk, (in)effectiveness of parallelism.

We could provide methods on models that estimate these costs (analytically
or, indeed, via a pre-fit GP regressor!), but merely documenting them more
clearly up front in the general case (even just "parameters can affect
model size drastically") would be worthwhile.
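An analytic estimate of this kind could be as simple as multiplying the node count by a per-node cost. The sketch below is a back-of-the-envelope version, not an existing scikit-learn method: the function name is hypothetical, and the 56-byte node record and 8-byte value entries are assumptions about the in-memory layout rather than figures from this thread.

```python
def estimate_forest_bytes(n_trees, nodes_per_tree, n_classes, n_outputs=1,
                          node_bytes=56, value_bytes=8):
    """Rough size of a forest's node and value arrays, in bytes."""
    # Assumed fixed-size record per node (child indices, feature, threshold, ...).
    # Each node also stores one value per class and output.
    per_node = node_bytes + n_outputs * n_classes * value_bytes
    return n_trees * nodes_per_tree * per_node


# Hypothetical forest: 50 trees of 10,000 nodes each, binary classification.
size = estimate_forest_bytes(n_trees=50, nodes_per_tree=10_000, n_classes=2)
print(size / 1e6, "MB")  # 36.0 MB
```

Under these assumptions the estimate scales linearly with the number of trees and with nodes per tree, so the parameters that limit tree growth dominate the model size.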