Lam Dang
2016-03-21 20:24:29 UTC
Hello scikit-learners,
Here is an idea to accelerate parameter tuning for Random Forest and Extra
Trees. I am very interested to hear if anyone knows whether the idea is
already exploited somewhere, or whether it makes sense.
Let's say we have a data set split into train and validation sets
(cross-validation also works).
The process today for tuning a Random Forest is to try different sets of
parameters, check validation performance, reiterate, and in the end take the
model with the best validation score.
The idea to improve this process is:
- Fit the model once while growing all the trees to their maximum size, and
save this model as a baseline.
- For any set of parameters, the new model can be produced by pruning the
trees of the baseline model according to those parameters. For example, for
max_depth=5, one can just remove all the nodes at a depth greater than 5
(see the sketch after this list). This process should be much faster than
regrowing the trees since it doesn't need to refit the model.
- Use validation (or cross-validation) performance to choose the best model
as usual.
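
Here is a minimal sketch of the pruning step, assuming a fitted scikit-learn
tree (e.g. an element of forest.estimators_) and a single-output regression
target; the helper name truncated_predict is hypothetical. Instead of
physically deleting nodes, it stops the traversal at the depth cap and
returns the node's stored mean, which is the prediction the tree would make
if it had been grown only to that depth:

    import numpy as np

    def truncated_predict(tree, X, max_depth):
        # Predict with a fitted sklearn tree as if it had been grown only
        # to `max_depth`: stop the traversal early and use the internal
        # node's stored mean as the leaf prediction.
        t = tree.tree_                  # low-level Tree of a fitted estimator
        out = np.empty(len(X))
        for i, x in enumerate(X):
            node, depth = 0, 0
            # descend until a real leaf (children_left == -1) or the depth cap
            while t.children_left[node] != -1 and depth < max_depth:
                if x[t.feature[node]] <= t.threshold[node]:
                    node = t.children_left[node]
                else:
                    node = t.children_right[node]
                depth += 1
            out[i] = t.value[node][0][0]  # mean of training targets at this node
        return out

For a classifier, t.value[node] holds class counts instead of a mean, so the
last line would take an argmax over classes instead.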
This works (theoretically) because:
- For any parameters, the fitted trees will just be a part of the baseline
trees grown to maximum size (except for the criterion, but that probably
matters less).
- Trees are grown independently of each other, so this idea will not work
for GBM, where each tree is fit on the residuals of the previous ones.
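
To make the whole loop concrete, here is a sketch of tuning max_depth this
way, assuming train/validation splits X_train, y_train, X_val, y_val already
exist and reusing the (hypothetical) truncated_predict helper above:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error

    # Step 1: fit the baseline once, letting every tree grow to its maximum.
    forest = RandomForestRegressor(n_estimators=100, max_depth=None,
                                   random_state=0).fit(X_train, y_train)

    # Step 2: evaluate candidate depths without refitting; each candidate
    # only costs cheap truncated traversals of the already-grown trees.
    scores = {}
    for depth in (3, 5, 10, 20):
        preds = np.mean([truncated_predict(est, X_val, depth)
                         for est in forest.estimators_], axis=0)
        scores[depth] = mean_squared_error(y_val, preds)

    best_depth = min(scores, key=scores.get)  # step 3: pick the best depth

Because the trees are independent, averaging the truncated per-tree
predictions matches what a forest fitted directly with that max_depth would
produce, given the same random draws.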
That's it. I am very interested in any feedback: whether it makes sense,
whether it was done somewhere else already, or whether it will work.
Best regards,
Lam Dang