Discussion:
GradientBoosting and GridSearchCV: how?
Emanuele Olivetti
2012-06-21 15:24:51 UTC
Dear All,

I am interested in doing model selection with GridSearchCV() on
GradientBoostingRegressor(). I am quite new to boosting, but I see from
the nice examples in the sklearn documentation [0] that once n_estimators
is fixed, it is possible to evaluate the model at each boosting
iteration through GradientBoostingRegressor.staged_decision_function()
and related tools (oob_score_, staged_predict).

As the figures in those examples show, the score on the test set (e.g.
deviance) sometimes has a minimum, and it would be nice to obtain that
minimum during model selection in order to score a given set of parameter
values. How can I do that within GridSearchCV?

What I would like to do is define sets of GradientBoosting parameter
values, e.g.
{'learn_rate': [0.05, 0.01, 0.001], 'subsample': [0.25, 0.5, 0.75], etc.}
and then run a grid search to decide which set of values gives the lowest
score (e.g. MSE) at the minimum of the corresponding "score vs. boosting
iteration" curve. Moreover, it would be great to keep track of the boosting
iteration at which this minimum occurs.

I have been reading the documentation but I cannot figure out how to do
this. Could you help me?

Best,

Emanuele


[0]:
http://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regression.html#example-ensemble-plot-gradient-boosting-regression-py
http://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regularization.html#example-ensemble-plot-gradient-boosting-regularization-py
Peter Prettenhofer
2012-06-21 16:04:38 UTC
Permalink
Post by Emanuele Olivetti
As the figures in those examples show, the score on the test set (e.g.
deviance) sometimes has a minimum, and it would be nice to obtain that
minimum during model selection in order to score a given set of parameter
values. How can I do that within GridSearchCV?
Hi Emanuele,

there is no straightforward solution to this yet, but it can be
accomplished by overwriting/monkey-patching
``GradientBoostingRegressor.score``.
Within ``score`` you call ``self.staged_predict``, select the best
boosting iteration, and return the result of the evaluation metric of
your choice.

Here is a quick example (not tested; assumes ``mean_squared_error`` from
``sklearn.metrics`` as the MSE metric)::

    import numpy as np
    from sklearn.metrics import mean_squared_error as mse

    def custom_score(self, X, y_true):
        # evaluate every boosting stage on the held-out data
        scores = [mse(y_true, y_pred) for y_pred in self.staged_predict(X)]
        best_iter = np.argmin(scores)
        best_score = scores[best_iter]

        # truncate the model to the best iteration
        self.n_estimators = best_iter + 1
        self.estimators_ = self.estimators_[:self.n_estimators]

        # GridSearchCV picks the parameters with the *highest* score,
        # so return the negated error
        return -best_score

    GradientBoostingRegressor.score = custom_score
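
For what it's worth, plugging this into a grid search would then look
roughly as follows (hypothetical usage sketch, not from the thread;
assumes training arrays ``X``, ``y`` and the import paths / parameter
names of the scikit-learn version discussed here)::

    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.grid_search import GridSearchCV

    param_grid = {'learn_rate': [0.05, 0.01, 0.001],
                  'subsample': [0.25, 0.5, 0.75]}
    grid = GridSearchCV(GradientBoostingRegressor(n_estimators=1000),
                        param_grid)
    grid.fit(X, y)
    # best_score_ is the (negated) MSE returned by custom_score
    print(grid.best_score_)
    print(grid.best_estimator_)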

The drawback of this approach is that you cannot use the ``score_func``
and ``loss_func`` arguments of ``GridSearchCV``, because if they are
set, ``GridSearchCV`` will use them instead of ``estimator.score``.

I'm currently working on a PR which extends the functionality of the
gradient boosting module, including some convenience methods for
finding the "optimal" number of estimators (= iterations). I'll keep
you posted.

best,
Peter
--
Peter Prettenhofer
Andreas Mueller
2012-06-21 16:08:01 UTC
Hi Emanuele, hi Peter.
@Emanuele: You could also try to use IterGrid instead of GridSearchCV.
This might mean doing some things by hand, but it should work.
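
For example (untested sketch; assumes IterGrid is imported from
``sklearn.grid_search``), IterGrid simply enumerates all parameter
combinations as dicts::

    from sklearn.grid_search import IterGrid

    param_grid = {'learn_rate': [0.05, 0.01], 'subsample': [0.5, 0.75]}
    for params in IterGrid(param_grid):
        # each ``params`` is a dict, e.g. {'learn_rate': 0.05, 'subsample': 0.5}
        print(params)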

@Peter: Could your improvements also be applied to RandomForests
and the OOB score? Having such a method there would also be quite nice.

Cheers,
Andy
Peter Prettenhofer
2012-06-21 16:16:38 UTC
Post by Andreas Mueller
@Peter: Could your improvements also be applied to RandomForests
and the OOB score? Having such a method there would also be quite nice.
Hi Andy,

the convenience method** could also be applied to RandomForest. It
should be possible to use either the OOB score, the CV error, or a
held-out error as the criterion.

** similar to GBM's "optimize" or "optimal" routine... I don't know the exact name.
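
As a rough illustration of the OOB criterion for a random forest,
something like this can already be done by hand today (untested sketch,
not part of any planned PR; assumes training arrays ``X``, ``y`` and uses
``oob_score_``, which for regressors is the R^2 on the out-of-bag
samples)::

    from sklearn.ensemble import RandomForestRegressor

    best_n, best_oob = None, None
    for n in [10, 50, 100, 200]:
        rf = RandomForestRegressor(n_estimators=n, oob_score=True,
                                   random_state=0)
        rf.fit(X, y)
        # keep the forest size with the highest OOB R^2
        if best_oob is None or rf.oob_score_ > best_oob:
            best_n, best_oob = n, rf.oob_score_
    print(best_n, best_oob)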

best,
Peter
--
Peter Prettenhofer
Emanuele Olivetti
2012-06-21 22:17:09 UTC
Post by Andreas Mueller
@Emanuele: You could also try to use IterGrid instead of GridSearchCV.
This might mean doing some things by hand, but it should work.
Hi Andy,

I'll have a look at IterGrid - I have never tried it before.
Thanks for pointing it out!

Emanuele
Peter Prettenhofer
2012-06-24 08:03:33 UTC
Emanuele,

I just realized that the approach above might not be what you actually
want: it will select the best value for ``n_estimators`` for _each_
fold. What we actually should do is average the staged scores over
all folds and then select the ``n_estimators`` with the best average
score.

It is difficult, if not impossible, to accomplish this with GridSearchCV,
so you might have to resort to IterGrid for the time being.
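
For instance, something along these lines (untested sketch; assumes
squared error as the metric, plain ``KFold`` splits, training arrays
``X``, ``y``, and the parameter names used earlier in the thread)::

    import numpy as np
    from sklearn.cross_validation import KFold
    from sklearn.grid_search import IterGrid
    from sklearn.ensemble import GradientBoostingRegressor

    param_grid = {'learn_rate': [0.05, 0.01], 'subsample': [0.5, 0.75]}
    results = []
    for params in IterGrid(param_grid):
        # one staged-MSE curve per fold
        curves = []
        for train, test in KFold(len(y), 5):
            est = GradientBoostingRegressor(n_estimators=1000, **params)
            est.fit(X[train], y[train])
            curves.append([np.mean((y[test] - y_pred) ** 2)
                           for y_pred in est.staged_predict(X[test])])
        # average the staged scores over all folds
        mean_curve = np.mean(curves, axis=0)
        best_iter = int(np.argmin(mean_curve))
        results.append((mean_curve[best_iter], best_iter + 1, params))

    # parameter set (and n_estimators) with the best average score
    best_score, best_n_estimators, best_params = min(results,
                                                     key=lambda r: r[0])
    print(best_params, best_n_estimators, best_score)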

what do you think?

best,
Peter
--
Peter Prettenhofer
Emanuele Olivetti
2012-06-21 22:15:33 UTC
Post by Peter Prettenhofer
there is no straightforward solution to this yet, but it can be
accomplished by overwriting/monkey-patching
``GradientBoostingRegressor.score`` [...]
I'm currently working on a PR which extends the functionality of the
gradient boosting module, including some convenience methods for
finding the "optimal" number of estimators (= iterations). I'll keep
you posted.
Hi Peter,

Indeed a neat solution! Thanks a lot, it works very well :-)

Looking forward to your PR for improved model selection in
gradient boosting,

Emanuele