Post by Olivier Grisel
Post by Sebastian Raschka
since it would make the "estimate" and "error" calculation more convenient, right?
I don't understand what you mean by "estimate" and "error". The model
parameters, its individual predictions, and its cross-validation scores
or errors can all be called "estimates": anything that is derived from
sampled data points is an estimate.
Post by Sebastian Raschka
For example, the calculation of the mean accuracy from all iterations, and
the calculation of the standard deviation/error of the mean
Well this is not what sklearn.cross_validation.Bootstrap is doing.
It's doing some weird cross-validation splits that I made up a couple
of years ago (and that I now regret deeply) and that nobody uses in
the literature. Again, read its docstring and have a look at the source
code:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_validation.py#L718
Nowhere will you see an estimate of the standard deviation of the
validation score, nor of the standard error of the mean validation
score across folds.
Post by Olivier Grisel
(just like in regular KFold cross-validation).
The KFold cross-validation iterator in sklearn does not compute the
standard error of the mean score itself. The cross_val_score function
with cv=KFold(5) returns the score computed on each validation fold.
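For illustration, a minimal sketch of what that means in practice
(using the modern sklearn.model_selection spelling of the imports; the
dataset and estimator are arbitrary choices, not part of this thread):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             cv=KFold(n_splits=5, shuffle=True,
                                      random_state=0))

    print(scores)         # one score per validation fold, nothing more
    print(scores.mean())  # the mean is left to the caller...
    # ...and so is any (naive) standard error of the mean across folds:
    print(scores.std(ddof=1) / np.sqrt(len(scores)))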
It would be interesting to estimate the standard deviation of the
validation score (or better, a 95% confidence interval of it), but:
- this is not what sklearn.cross_validation.Bootstrap is doing: it
just computes CV folds like all the other iterators in the
sklearn.cross_validation module;
- estimating the standard error of the mean of 5 points (for 5-fold
CV, for instance) with a bootstrapping procedure is likely to give
bad results.
Empirically, I have found that bootstrapping works fine to estimate
confidence intervals with *at least* 50 samples (and thousands of
bootstrap iterations).
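To make that concrete, here is a toy simulation of my own (not from any
scikit-learn code) contrasting the stability of a naive percentile
bootstrap interval built from 5 points versus 50 points; the normally
distributed "fold scores" are an arbitrary assumption for illustration:

    import numpy as np

    rng = np.random.RandomState(42)

    def percentile_ci_width(n_points, n_boot=2000):
        # Draw n_points hypothetical fold scores, bootstrap their mean,
        # and return the width of the 95% percentile interval.
        scores = rng.normal(loc=0.90, scale=0.02, size=n_points)
        means = np.array([
            rng.choice(scores, size=n_points, replace=True).mean()
            for _ in range(n_boot)
        ])
        low, high = np.percentile(means, [2.5, 97.5])
        return high - low

    # Repeating the whole experiment shows that with 5 points the
    # interval width swings wildly, while with 50 points it stabilizes.
    for n in (5, 50):
        widths = [percentile_ci_width(n) for _ in range(20)]
        print("n=%2d: width min=%.4f max=%.4f"
              % (n, min(widths), max(widths)))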
Therefore, to obtain good confidence intervals on CV scores, the right
approach (in my opinion) would be to:
1- have some kind of cross_val_predictions function that would return
individual predictions for each sample in any of the validation folds
of a CV procedure, instead of the score on each fold as our
cross_val_score function does;
2- use a bootstrapping procedure that re-samples many times with
replacement out of those predictions, so as to compute a bootstrapped
distribution of the validation score;
3- take a confidence interval on that bootstrapped distribution of the
validation score (see the sketch below).
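A minimal sketch of those three steps, assuming that
sklearn.model_selection.cross_val_predict (which was added to
scikit-learn later) can play the role of the hypothetical
cross_val_predictions function, and using an arbitrary dataset and
estimator:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import cross_val_predict

    X, y = load_breast_cancer(return_X_y=True)

    # Step 1: one out-of-fold prediction per sample.
    y_pred = cross_val_predict(LogisticRegression(max_iter=5000),
                               X, y, cv=5)

    # Step 2: resample the (truth, prediction) pairs with replacement
    # many times to build a bootstrapped distribution of the score.
    rng = np.random.RandomState(0)
    n_iter = 10000
    boot_scores = np.empty(n_iter)
    for i in range(n_iter):
        idx = rng.randint(0, len(y), len(y))
        boot_scores[i] = accuracy_score(y[idx], y_pred[idx])

    # Step 3: naive percentile confidence interval on that
    # distribution (but see the skewness caveat below).
    low, high = np.percentile(boot_scores, [2.5, 97.5])
    print("95%% CI: [%.3f, %.3f]" % (low, high))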
Furthermore, as typical scoring functions are bounded (for instance the
accuracy score is bounded by 0 and 1), it is very likely that the
bootstrapped distribution of the validation score is going to be
skewed (for instance a validation accuracy score distribution could
have a 95% confidence interval between 0.94 and 1.00 with a mean at
0.99). For skewed distributions a naive percentile interval is
typically wrong because of the bias introduced by the skewness. In
that case the bias can be corrected by using the bias-corrected and
accelerated (BCa) non-parametric bootstrap procedure as implemented in
scikits.bootstrap:
https://github.com/cgevans/scikits-bootstrap/blob/master/scikits/bootstrap/bootstrap.py#L70
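For completeness, here is one way step 3 of the sketch above could use
a BCa interval instead of the naive percentile interval, assuming the
third-party scikits.bootstrap package linked above is installed (its ci
function defaults to the BCa method):

    import numpy as np
    import scikits.bootstrap as boot
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    X, y = load_breast_cancer(return_X_y=True)
    y_pred = cross_val_predict(LogisticRegression(max_iter=5000),
                               X, y, cv=5)

    # Per-sample correctness indicators; their mean is the accuracy,
    # so a CI on the mean is a CI on the validation accuracy.
    correct = (y == y_pred).astype(float)

    # BCa confidence interval on the mean ('bca' is the default method).
    low, high = boot.ci(correct, statfunction=np.mean, alpha=0.05,
                        n_samples=10000, method='bca')
    print("BCa 95%% CI: [%.3f, %.3f]" % (low, high))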
Having BCa bootstrap confidence intervals in scipy.stats would
certainly make it simpler to implement this kind of feature in
scikit-learn. But again, what I just described here is completely
different from what we have in the sklearn.cross_validation.Bootstrap
class. The sklearn.cross_validation.Bootstrap class cannot be changed
to implement this as it does not even have the right API to do so. It
would have to be an entirely new function or class.
Post by Olivier Grisel
I have to agree that there are probably better approaches and
techniques, as you mentioned, but I wouldn't remove it just because
very few people use it in practice.
We are not removing the sklearn.cross_validation.Bootstrap class
because few people are using it, but because too many people are using
something that is non-standard (I made it up) and very likely not what
they expect if they just go by its name. At best it causes confusion
when our users read the docstring and/or the source code. At worst it
causes silent modeling errors in our users' code bases.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel