Lam Dang
2016-03-21 20:24:29 UTC
Hello scikit-learners,
Here is an idea to accelerate parameter tuning for Random Forest and Extra
Trees. I am very interested to hear if anyone knows whether the idea is
already exploited somewhere, or whether it makes sense.
Let's say we have a data set split into train and validation sets
(cross-validation also works).
The process today for tuning a Random Forest is to try different sets of
parameters, check validation performance, reiterate, and in the end take the
model with the best validation score.
The idea to improve this process is:
- Fit the model once while growing all the trees to their maximum size, and
save this model as a baseline.
- For any set of parameters, the new model can be produced by pruning the
trees of the baseline model according to those parameters. For example, for
max_depth=5, one can just remove all the nodes at a depth greater than 5
(see the sketch after this list). This process should be much faster than
regrowing the trees since it doesn't need to refit the model.
- Use validation (or cross-validation) performance to choose the best model
as usual.
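
Here is a minimal sketch of the pruning step, assuming a fitted scikit-learn
tree (e.g. an element of forest.estimators_) and a single-output regression
target; the helper name truncated_predict is hypothetical. Instead of
physically deleting nodes, it stops the traversal at the depth cap and
returns the node's stored mean, which is the prediction the tree would make
if it had been grown only to that depth:

    import numpy as np

    def truncated_predict(tree, X, max_depth):
        # Predict with a fitted sklearn tree as if it had been grown only
        # to `max_depth`: stop the traversal early and use the internal
        # node's stored mean as the leaf prediction.
        t = tree.tree_                  # low-level Tree of a fitted estimator
        out = np.empty(len(X))
        for i, x in enumerate(X):
            node, depth = 0, 0
            # descend until a real leaf (children_left == -1) or the depth cap
            while t.children_left[node] != -1 and depth < max_depth:
                if x[t.feature[node]] <= t.threshold[node]:
                    node = t.children_left[node]
                else:
                    node = t.children_right[node]
                depth += 1
            out[i] = t.value[node][0][0]  # mean of training targets at this node
        return out

For a classifier, t.value[node] holds class counts instead of a mean, so the
last line would take an argmax over classes instead.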
This works (theoretically) because:
- For any parameters, the fitted trees will just be a part of the baseline
trees grown to maximum size (except for the criterion, but that probably
matters less).
- Trees are grown independently of each other, so this idea will not work
for GBM, where each tree is fit on the residuals of the previous ones.
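
To make the whole loop concrete, here is a sketch of tuning max_depth this
way, assuming train/validation splits X_train, y_train, X_val, y_val already
exist and reusing the (hypothetical) truncated_predict helper above:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error

    # Step 1: fit the baseline once, letting every tree grow to its maximum.
    forest = RandomForestRegressor(n_estimators=100, max_depth=None,
                                   random_state=0).fit(X_train, y_train)

    # Step 2: evaluate candidate depths without refitting; each candidate
    # only costs cheap truncated traversals of the already-grown trees.
    scores = {}
    for depth in (3, 5, 10, 20):
        preds = np.mean([truncated_predict(est, X_val, depth)
                         for est in forest.estimators_], axis=0)
        scores[depth] = mean_squared_error(y_val, preds)

    best_depth = min(scores, key=scores.get)  # step 3: pick the best depth

Because the trees are independent, averaging the truncated per-tree
predictions matches what a forest fitted directly with that max_depth would
produce, given the same random draws.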
That's it. I am very interested in any feedback: whether it makes sense,
whether it was done somewhere else already, or whether it will work.
Best regards,
Lam Dang