Lam Dang

2016-03-21 20:24:29 UTC

Hello scikit-learners,

Here is an idea to accelerate parameter tuning for Random Forest and Extra Trees. I would be very interested to know whether this idea is already used somewhere, or whether it makes sense.

Let's say we have a data set split into train and validation sets (cross-validation also works).

The usual process for tuning a Random Forest today is to try different sets of parameters, check validation performance, iterate, and finally keep the model with the best validation score.

The idea to improve this process is:

- Fit the model once, growing all the trees to maximum depth, and save this model as a baseline.
- For any set of parameters, a new model can be produced by pruning the trees of the baseline model according to those parameters. For example, for max_depth=5, one can simply remove all nodes deeper than 5. This should be much faster than regrowing the trees, since the model does not need to be refit.
- Use validation (or cross-validation) performance to choose the best model as usual.
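The pruning step can be sketched against scikit-learn's fitted tree arrays. This is a hypothetical illustration of the idea, not an existing scikit-learn feature: instead of physically removing nodes, it emulates a smaller max_depth at prediction time by stopping tree traversal at the depth limit, which is equivalent to pruning all nodes below it. The helper names (tree_predict_truncated, forest_predict_truncated) are mine, not part of any API.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def tree_predict_truncated(tree, X, max_depth):
    """Predict with a fitted sklearn tree as if it had been grown only
    to max_depth, by cutting traversal early at that depth."""
    t = tree.tree_
    out = np.empty(X.shape[0])
    for i, x in enumerate(X):
        node, depth = 0, 0
        # children_left == -1 marks a leaf in sklearn's tree arrays
        while t.children_left[node] != -1 and depth < max_depth:
            if x[t.feature[node]] <= t.threshold[node]:
                node = t.children_left[node]
            else:
                node = t.children_right[node]
            depth += 1
        out[i] = t.value[node][0][0]
    return out

def forest_predict_truncated(forest, X, max_depth):
    """Average the depth-truncated predictions over all trees."""
    preds = [tree_predict_truncated(est, X, max_depth)
             for est in forest.estimators_]
    return np.mean(preds, axis=0)

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Baseline: grow all trees to maximum depth, once.
baseline = RandomForestRegressor(n_estimators=20, random_state=0)
baseline.fit(X_tr, y_tr)

# Score several candidate max_depth values without refitting.
for d in (2, 5, 100):
    pred = forest_predict_truncated(baseline, X_val, d)
    print(d, np.mean((pred - y_val) ** 2))
```

With a depth limit larger than any tree's actual depth, the truncated prediction coincides with the baseline forest's own prediction, which is one way to sanity-check the traversal.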

This works (theoretically) because:

- For any set of parameters, the fitted trees will simply be a subset of the baseline trees grown to maximum depth (except for the criterion parameter, but that probably matters less).
- Trees are grown independently of each other (so this idea will not work for GBM).

That's it. I would be very interested in any feedback: whether it makes sense, whether it has already been done elsewhere, or whether it will work.

Best regards,

Lam Dang