Post by Olivier Grisel
Post by Sebastian Raschka
since it would make the "estimate" and "error" calculation more convenient, right?
I don't understand what you mean by "estimate" and "error". The model
parameters, its individual predictions, and its cross-validation scores
or errors can all be called "estimates": anything that is derived from
sampled data points is an estimate.
Post by Sebastian Raschka
For example, the calculation of the mean accuracy from all iterations, and
the calculation of the standard deviation/error of the mean
Well this is not what sklearn.cross_validation.Bootstrap is doing.
It's doing some weird cross-validation splits that I made up a couple
of years ago (and that I now regret deeply) and that nobody uses in
the literature. Again, read its docstring and have a look at the source
code:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_validation.py#L718
Nowhere will you see an estimate of the standard deviation of the
validation score, nor of the standard error of the mean validation
score across folds.
Post by Olivier Grisel
(just like in regular KFold cross-validation).
The KFold cross-validation iterator in sklearn does not compute the
standard error of the mean score itself. The cross_val_score function
with cv=KFold(5) returns the score computed on each validation fold.
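For illustration, a minimal sketch of what that means in practice
(using the modern sklearn.model_selection spelling of the imports; the
dataset and estimator are arbitrary choices, not part of this thread):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             cv=KFold(n_splits=5, shuffle=True,
                                      random_state=0))

    print(scores)         # one score per validation fold, nothing more
    print(scores.mean())  # the mean is left to the caller...
    # ...and so is any (naive) standard error of the mean across folds:
    print(scores.std(ddof=1) / np.sqrt(len(scores)))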
It would be interesting to estimate the standard deviation of the
validation score (or better, a 95% confidence interval of it), but:
- this is not what sklearn.cross_validation.Bootstrap is doing: it
just computes CV folds like all the other iterators in the
sklearn.cross_validation module;
- estimating the standard error of the mean of 5 points (for 5-fold
CV, for instance) with a bootstrapping procedure is likely to give
bad results.
Empirically, I have found that bootstrapping works fine to estimate
confidence intervals with *at least* 50 samples (and thousands of
bootstrap iterations).
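To make that concrete, here is a toy simulation of my own (not from any
scikit-learn code) contrasting the stability of a naive percentile
bootstrap interval built from 5 points versus 50 points; the normally
distributed "fold scores" are an arbitrary assumption for illustration:

    import numpy as np

    rng = np.random.RandomState(42)

    def percentile_ci_width(n_points, n_boot=2000):
        # Draw n_points hypothetical fold scores, bootstrap their mean,
        # and return the width of the 95% percentile interval.
        scores = rng.normal(loc=0.90, scale=0.02, size=n_points)
        means = np.array([
            rng.choice(scores, size=n_points, replace=True).mean()
            for _ in range(n_boot)
        ])
        low, high = np.percentile(means, [2.5, 97.5])
        return high - low

    # Repeating the whole experiment shows that with 5 points the
    # interval width swings wildly, while with 50 points it stabilizes.
    for n in (5, 50):
        widths = [percentile_ci_width(n) for _ in range(20)]
        print("n=%2d: width min=%.4f max=%.4f"
              % (n, min(widths), max(widths)))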
Therefore, to obtain good confidence intervals on CV scores, the right
approach (in my opinion) would be to:
1- have some kind of cross_val_predictions function that would return
individual predictions for each sample in any of the validation folds
of a CV procedure, instead of the score on each fold as our
cross_val_score function does;
2- use a bootstrapping procedure that re-samples many times with
replacement out of those predictions, so as to compute a bootstrapped
distribution of the validation score;
3- take a confidence interval on that bootstrapped distribution of the
validation score (see the sketch below).
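A minimal sketch of those three steps, assuming that
sklearn.model_selection.cross_val_predict (which was added to
scikit-learn later) can play the role of the hypothetical
cross_val_predictions function, and using an arbitrary dataset and
estimator:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import cross_val_predict

    X, y = load_breast_cancer(return_X_y=True)

    # Step 1: one out-of-fold prediction per sample.
    y_pred = cross_val_predict(LogisticRegression(max_iter=5000),
                               X, y, cv=5)

    # Step 2: resample the (truth, prediction) pairs with replacement
    # many times to build a bootstrapped distribution of the score.
    rng = np.random.RandomState(0)
    n_iter = 10000
    boot_scores = np.empty(n_iter)
    for i in range(n_iter):
        idx = rng.randint(0, len(y), len(y))
        boot_scores[i] = accuracy_score(y[idx], y_pred[idx])

    # Step 3: naive percentile confidence interval on that
    # distribution (but see the skewness caveat below).
    low, high = np.percentile(boot_scores, [2.5, 97.5])
    print("95%% CI: [%.3f, %.3f]" % (low, high))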
Furthermore, as typical scoring functions are bounded (for instance the
accuracy score is bounded by 0 and 1), it is very likely that the
bootstrapped distribution of the validation score is going to be
skewed (for instance a validation accuracy score distribution could
have a 95% confidence interval between 0.94 and 1.00 with a mean at
0.99). For skewed distributions a naive percentile interval is
typically wrong because of the bias introduced by the skewness. In
that case the bias can be corrected by using the bias-corrected and
accelerated (BCa) non-parametric bootstrap procedure as implemented in
scikits.bootstrap:
https://github.com/cgevans/scikits-bootstrap/blob/master/scikits/bootstrap/bootstrap.py#L70
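For completeness, here is one way step 3 of the sketch above could use
a BCa interval instead of the naive percentile interval, assuming the
third-party scikits.bootstrap package linked above is installed (its ci
function defaults to the BCa method):

    import numpy as np
    import scikits.bootstrap as boot
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    X, y = load_breast_cancer(return_X_y=True)
    y_pred = cross_val_predict(LogisticRegression(max_iter=5000),
                               X, y, cv=5)

    # Per-sample correctness indicators; their mean is the accuracy,
    # so a CI on the mean is a CI on the validation accuracy.
    correct = (y == y_pred).astype(float)

    # BCa confidence interval on the mean ('bca' is the default method).
    low, high = boot.ci(correct, statfunction=np.mean, alpha=0.05,
                        n_samples=10000, method='bca')
    print("BCa 95%% CI: [%.3f, %.3f]" % (low, high))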
Having BCa bootstrap confidence intervals in scipy.stats would
certainly make it simpler to implement this kind of feature in
scikit-learn. But again, what I just described here is completely
different from what we have in the sklearn.cross_validation.Bootstrap
class. The sklearn.cross_validation.Bootstrap class cannot be changed
to implement this as it does not even have the right API to do so. It
would have to be an entirely new function or class.
Post by Olivier Grisel
I have to agree that there are probably better approaches and
techniques, as you mentioned, but I wouldn't remove it just because
very few people use it in practice.
We are not removing the sklearn.cross_validation.Bootstrap class
because few people are using it, but because too many people are using
something that is non-standard (I made it up) and very likely not what
they expect if they just go by its name. At best it causes confusion
when our users read the docstring and/or the source code. At worst it
causes silent modeling errors in our users' code bases.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel