Discussion: Nested cross-validation
Sebastian Raschka
2015-05-11 13:30:14 UTC
Hi,
I stumbled upon the brief note about nested cross-validation in the online documentation at http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html#grid-search
=====================
Nested cross-validation
cross_validation.cross_val_score(clf, X_digits, y_digits)
...
array([ 0.938..., 0.963..., 0.944...])
Two cross-validation loops are performed in parallel: one by the GridSearchCV estimator to set gamma and the other one by cross_val_score to measure the prediction performance of the estimator. The resulting scores are unbiased estimates of the prediction score on new data.
=====================

I am wondering how to "use" or "interpret" those scores. For example, if the gamma parameter is set differently in the different inner loops, we accumulate test scores from the outer loop that correspond to different models, so wouldn't calculating the average performance from those scores be a bad idea? And if the estimated parameters are different for the different inner folds, I would say that my model is not "stable" and varies a lot with respect to the chosen training fold.

In general, what would speak against an approach where I just split the initial dataset into train/test (70/30), perform the grid search (via k-fold CV) on the training set only, and then evaluate the model's performance on the test dataset?

Best,
Sebastian
Michael Eickenberg
2015-05-11 13:37:03 UTC
Post by Sebastian Raschka
In general, what would speak against an approach where I just split the initial dataset into train/test (70/30), perform the grid search (via k-fold CV) on the training set only, and then evaluate the model's performance on the test dataset?
Nothing, except that you are probably evaluating several parameter values.
Choosing the best one and reporting that one is overfitting because it uses
the test data to evaluate which parameter is best.

In the inner CV loop you do basically that: select the best model based on
evaluation on a test set. In order to evaluate the model's performance "at
best selected gamma" you then need to evaluate again on previously unseen
data.

This is automated in the mentioned cross_val_score + GridSearchCV loop, but
you can also do it by hand by splitting your data in 3 instead of 2.
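
For concreteness, a minimal sketch of that automated nested loop, assuming the SVC-on-digits setup from the tutorial; the gamma grid and fold counts are placeholders, and the module paths are those of scikit-learn at the time of this thread (the same classes later moved to sklearn.model_selection):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV           # sklearn.model_selection in later releases
from sklearn.cross_validation import cross_val_score   # likewise

digits = load_digits()
X_digits, y_digits = digits.data, digits.target

# Inner loop: GridSearchCV selects gamma by 3-fold CV on whatever data it is fit on.
param_grid = {'gamma': np.logspace(-6, -1, 6)}          # placeholder grid
clf = GridSearchCV(SVC(C=1), param_grid, cv=3)

# Outer loop: cross_val_score refits the whole GridSearchCV object on each outer
# training fold and scores the selected model on the held-out outer fold.
outer_scores = cross_val_score(clf, X_digits, y_digits, cv=3)
print(outer_scores)
print(outer_scores.mean())

The object handed to cross_val_score is the GridSearchCV itself, so each outer score reflects "SVC plus its tuning procedure", not one fixed gamma.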
Michael Eickenberg
2015-05-11 13:41:10 UTC
Sorry, I misread what you wrote. Your suggested approach is perfectly fine
and corresponds exactly to what would happen if you did the mentioned
cross_val_score + GridSearchCV on a train-test split of one 70-30 fold.
Doing it several times using e.g. an outer KFold just gives you several
scores to do some stats on.

Sebastian Raschka
2015-05-11 13:58:47 UTC
Thanks. However, the nested grid search may be very expensive, considering that the selected parameters may not even change across the different outer folds. On the other hand, if they change, one cannot really calculate the average performance from the outer KFold scores.
Michael Eickenberg
2015-05-11 14:32:39 UTC
Post by Sebastian Raschka
On the other hand, if they change, one cannot really calculate the average
performance from the outer KFold scores.
Why not? If one sees the GridSearchCV(simple_estimator) as "the best that
simple_estimator can do if we let it try several parameters", then
everything becomes consistent. You are basically testing how good
simple_estimator can be if you give it the chance to choose hyperparameters
using data. You are testing the validity of simple_estimator vs the
validity of simple_estimator(one_specific_parameter) in the face of the data at
hand.

But that is theoretical. In practice, selecting e.g. a best penalty can be
a very noisy operation across folds, which is why some resort to model
averaging etc.
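
To make that fold-to-fold noise visible, one can write the outer loop by hand and record which gamma wins on each outer training fold; a rough sketch, assuming the same digits/SVC setup and placeholder grid as above (old-style module paths again):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV     # sklearn.model_selection in later releases
from sklearn.cross_validation import KFold       # likewise

digits = load_digits()
X, y = digits.data, digits.target

# Outer loop unrolled by hand so the selected parameter is visible per fold.
outer_cv = KFold(len(y), n_folds=5, shuffle=True, random_state=0)
chosen_gammas, outer_scores = [], []
for train_idx, test_idx in outer_cv:
    gs = GridSearchCV(SVC(C=1), {'gamma': np.logspace(-6, -1, 6)}, cv=3)  # inner loop
    gs.fit(X[train_idx], y[train_idx])
    chosen_gammas.append(gs.best_params_['gamma'])   # may well differ from fold to fold
    outer_scores.append(gs.score(X[test_idx], y[test_idx]))

# A wide spread in chosen_gammas means the selection step itself is unstable;
# the mean of outer_scores still estimates how "SVC + grid search" does overall.
print(chosen_gammas)
print(np.mean(outer_scores))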
Satrajit Ghosh
2015-05-11 13:43:16 UTC
hi sebastian,

Post by Sebastian Raschka
I am wondering how to "use" or "interpret" those scores. For example, if the gamma parameter is set differently in the different inner loops, we accumulate test scores from the outer loop that correspond to different models, so wouldn't calculating the average performance from those scores be a bad idea? And if the estimated parameters are different for the different inner folds, I would say that my model is not "stable" and varies a lot with respect to the chosen training fold.
i think this speaks to the nature of the data more than the nature of
cross-validation. for cross-validation (or validation) the general
assumption is that samples are similarly distributed, such that models
built on a subset can generalize or extrapolate to out of sample data.

we could take an extreme artificial situation, where i have training data
from a group of individuals measured with one instrument, and my test data
are from a similar but different instrument that has a consistent bias. no
amount of model building on the training data is going to prepare it for bias in
the second instrument.

thus, if the histograms of your model parameters from nested
cross-validation are quite different, i believe the key issue is that the
data being fed as training and test are quite different from each other.
with larger samples, this tends to even itself out.

Post by Sebastian Raschka
In general, what would speak against an approach where I just split the initial dataset into train/test (70/30), perform the grid search (via k-fold CV) on the training set only, and then evaluate the model's performance on the test dataset?
isn't this what cross_val_score really does? it just keeps repeating that procedure for several different outer splits. the reason outer splits are important is again to account for distributional characteristics in smallish samples.

cheers,

satra
Sebastian Raschka
2015-05-11 14:04:04 UTC
Hi, Satrajit,
Post by Satrajit Ghosh
isn't this what cross_val_score really does? it just keeps repeating that procedure for several different outer splits. the reason outer splits are important is again to account for distributional characteristics in smallish samples.
Sorry, maybe I was a little bit unclear; what I meant was scenario 2) in contrast to 1) below:

1) perform k-fold cross-validation on the complete dataset for model selection and then report the score as an estimate of the model's performance (not a good idea!)

2) split the dataset, do cross-validation only on the training set (which is then further subdivided into training and test folds), select the model based on those results, and then use the separate test set, which the model hasn't seen before, to estimate its ability to generalize (a sketch follows below)
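
A rough sketch of scenario 2, again assuming the digits/SVC setup from the tutorial, with a placeholder split ratio, grid, and fold count (module paths as of the scikit-learn version current at the time, later sklearn.model_selection):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV             # sklearn.model_selection in later releases
from sklearn.cross_validation import train_test_split    # likewise

digits = load_digits()
X, y = digits.data, digits.target

# 70/30 split; the 30% test set is never touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Model selection via k-fold CV on the training set only.
gs = GridSearchCV(SVC(C=1), {'gamma': np.logspace(-6, -1, 6)}, cv=5)
gs.fit(X_train, y_train)

# Single performance estimate of the selected model on data it has never seen.
print(gs.best_params_)
print(gs.score(X_test, y_test))   # uses best_estimator_, refit on all of X_train

The nested cross_val_score + GridSearchCV construction is exactly this, just repeated over several outer splits.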


Best,
Sebastian
Satrajit Ghosh
2015-05-11 14:45:03 UTC
hi sebastian,

Post by Sebastian Raschka
1) perform k-fold cross-validation on the complete dataset for model selection and then report the score as an estimate of the model's performance (not a good idea!)
if you mean that you choose the parameters with the best cross-validated score and report that score as the model's performance, then that's not a good idea.

Post by Sebastian Raschka
2) split the dataset, do cross-validation only on the training set (which is then further subdivided into training and test folds), select the model based on those results, and then use the separate test set, which the model hasn't seen before, to estimate its ability to generalize
this is ok.

but cross_val_score with a GridSearchCV as clf will in fact do 2), just repeated for each of the cv outer train/test splits. and as michael points out, what you are effectively validating is a model that also selects its own parameters (the candidate parameters given to the grid search).

the model in this latter case is really gridsearchcv, not the specific clf
used inside it.

cheers,

satra
