Discussion:
[Scikit-learn-general] Real simple covariate shift correction using logistic regression
Olivier Grisel
2011-03-27 13:26:45 UTC
For the Twitter-impaired lurking around the mailing list, here is a
very interesting post by Alexander Smola on correcting distribution
discrepancies between the training set and the test set using a simple
logistic regression model to re-weight the training
samples:

http://blog.smola.org/post/4110255196/real-simple-covariate-shift-correction

This means that this approach could be implemented straightforwardly
using the SVC and SGD models, which now both support sample
re-weighting. Does anyone have an idea of a good dataset to
demonstrate this on?

One could use an artificial dataset, but it does not feel right :)
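The trick from the blog post boils down to three steps: train a classifier to distinguish training from test inputs, turn its probabilities into importance weights, and pass those as sample weights to the task model. Here is a minimal sketch using the current scikit-learn API; the shifted Gaussian toy data and all constants are purely illustrative, not from the post:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier

rng = np.random.RandomState(0)

# Toy stand-in for the train/test discrepancy: the test inputs are
# drawn from a shifted Gaussian.
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(loc=1.0, scale=1.0, size=(500, 2))

# Step 1: train a "domain" classifier to tell training samples (label 0)
# from test samples (label 1).
X_domain = np.vstack([X_train, X_test])
y_domain = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
domain_clf = LogisticRegression().fit(X_domain, y_domain)

# Step 2: importance weight of each training sample is the odds
# P(test | x) / P(train | x) estimated by the domain classifier.
proba = domain_clf.predict_proba(X_train)
weights = proba[:, 1] / proba[:, 0]

# Step 3: fit the actual task model with the per-sample weights.
clf = SGDClassifier(loss="hinge", random_state=0)
clf.fit(X_train, y_train, sample_weight=weights)
```

Training samples that look like the test distribution get up-weighted; samples from regions the test set never visits get weights near zero.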
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Alexandre Passos
2011-03-27 13:33:21 UTC
Post by Olivier Grisel
This means that this approach could be implemented straightforwardly
using the SVC and SGD models, which now both support sample
re-weighting. Does anyone have an idea of a good dataset to
demonstrate this on?
One could use an artificial dataset, but it does not feel right :)
People in opinion mining use the bitterlemons dataset for text
classification a lot (
https://sites.google.com/site/weihaolinatcmu/data ). There is a "hard
mode" often used to test the adaptability of algorithms, where you train a
classifier on the texts from the editors of the website to predict whether
the opinion is from the Israeli or Palestinian side, and test on
guests (who have different writing styles and make for a more
heterogeneous collection overall). Usually Bayesian or otherwise
generative methods beat SVMs and logreg in this setting, but maybe
this could change with appropriate importance weights.
--
 - Alexandre
xinfan meng
2011-03-27 14:02:33 UTC
Post by Alexandre Passos
People in opinion mining use the bitterlemons dataset for text
classification a lot (
https://sites.google.com/site/weihaolinatcmu/data ). There is a "hard
mode" often used to test the adaptability of algorithms, where you train a
classifier on the texts from the editors of the website to predict whether
the opinion is from the Israeli or Palestinian side, and test on
guests (who have different writing styles and make for a more
heterogeneous collection overall). Usually Bayesian or otherwise
generative methods beat SVMs and logreg in this setting, but maybe
this could change with appropriate importance weights.
Really? That sounds interesting. Are there any papers comparing these
algorithms?
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
Best Wishes
--------------------------------------------
Meng Xinfan蒙新泛
Institute of Computational Linguistics
Department of Computer Science & Technology
School of Electronic Engineering & Computer Science
Peking University
Beijing, 100871
China
j***@gmail.com
2011-03-27 14:17:59 UTC
Post by xinfan meng
Really? That sounds interesting. Are there any papers comparing these
algorithms?
That's interesting: machine learning is (re)discovering econometrics problems.

Jim Heckman received the Nobel Prize, among other things, for the two-step
model. The general model has a binary selection model in the
first stage to correct for the selection bias in the sample, and a
corrected regression in the second stage.
(The problems are a bit different because the second stage usually
estimates treatment effects instead of doing classification.)

Josef
Gael Varoquaux
2011-03-27 21:25:26 UTC
Post by Olivier Grisel
Does anyone have an idea of a good dataset to demonstrate this on?
We can't use one of the datasets that we already have? I am getting a bit
worried about the size of the datasets that are starting to be required to
have a full set of running examples. Seems a bit Laloudouana et al. to me
[*].

G

[*] http://rakaposhi.eas.asu.edu/f02-cse494-mailarchive/pdf00004.pdf
Peter Prettenhofer
2011-03-28 08:26:40 UTC
Hi,

IMHO the checkerboard example of [Hein09] is a great illustration of
covariate shift (see Figure 1.2).
I've made an example for my research group - it's rather lengthy, but
that's because it's interactive and lets you specify P(X) [1].

In the covariate shift setting, only P(X) is assumed to differ between
the training and testing phases; P(Y|X) is assumed to be identical. In this
example, P(Y|X) is deterministic - it's the checkerboard pattern -
negative on the diagonal, positive off the diagonal. P(X) is given by the
probability that an example is drawn from a specific cell of the
checkerboard. Thus, P(Y|X) is non-linear; if you choose a discriminative
model and your model is underspecified (e.g. linear, in our case), you can
make the error arbitrarily bad simply by changing P(X). It's a nice
illustration that P(X) matters even if you just want to model P(Y|X).
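A stripped-down, non-interactive version of that setup might look like the sketch below (the `sample_checkerboard` helper and the particular cell probabilities are made up for illustration; see the gist for the full interactive example). P(Y|X) is the same deterministic checkerboard in both samples; only the cell probabilities, i.e. P(X), change:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

def sample_checkerboard(n, cell_probs, grid=2):
    """Draw n points from a grid x grid checkerboard; cell_probs is P(X)
    over the cells. The label is deterministic, i.e. P(Y|X) is fixed:
    -1 on the diagonal cells, +1 off the diagonal."""
    cell_probs = np.asarray(cell_probs, dtype=float)
    cells = rng.choice(grid * grid, size=n, p=cell_probs / cell_probs.sum())
    rows, cols = cells // grid, cells % grid
    X = np.column_stack([cols + rng.rand(n), rows + rng.rand(n)])
    y = np.where(rows == cols, -1, 1)
    return X, y

# Uniform P(X) for training vs. a shifted P(X) that concentrates mass
# on the two diagonal cells for testing.
X_tr, y_tr = sample_checkerboard(1000, [0.25, 0.25, 0.25, 0.25])
X_te, y_te = sample_checkerboard(1000, [0.45, 0.05, 0.05, 0.45])

# An underspecified (linear) model: its error can swing wildly between
# the two distributions even though P(Y|X) never changed.
clf = LogisticRegression().fit(X_tr, y_tr)
acc_uniform = clf.score(X_tr, y_tr)
acc_shifted = clf.score(X_te, y_te)
```

Since the checkerboard is not linearly separable, whichever half-space the linear model settles on can be made to land mostly on the wrong cells by moving P(X), which is the point of the illustration.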

best,
Peter

[1] https://gist.github.com/890145

M. Hein (2009). Binary Classification under Sample Selection Bias, In
Dataset Shift in Machine Learning, chap. 3, pp. 41-64. The MIT Press.
http://www.ml.uni-saarland.de/Publications/Hein%20-%20Binary%20Classification%20under%20Sample%20Selection%20Bias(2008).pdf
Post by Gael Varoquaux
Does anyone have an idea of a good dataset to demonstrate this on?
We can't use one of the datasets that we already have? I am getting a bit
worried about the size of the datasets that are starting to be required to
have a full set of running examples. Seems a bit Laloudouana et al. to me
[*].
G
[*] http://rakaposhi.eas.asu.edu/f02-cse494-mailarchive/pdf00004.pdf
--
Peter Prettenhofer
Olivier Grisel
2011-03-28 09:37:01 UTC
Post by Peter Prettenhofer
IMHO the checkerboard example of [Hein09] is a great illustration of
covariate shift (see Figure 1.2).
I've made an example for my research group - it's rather lengthy, but
that's because it's interactive and lets you specify P(X) [1].
Very interesting example; however, the logistic regression based sample
weights will probably not help much here, since the data is highly
non-linearly separable. One could try the same trick using
weights based on the predict_proba of a Gaussian RBF SVM, though.
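That variant could be sketched like this: the same re-weighting trick, but with an RBF SVM as the domain classifier (`probability=True` enables Platt-scaled predict_proba). The ring-shaped toy data is invented for the example:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Non-linearly separable toy data: two noisy rings with different radii
# stand in for the training and test input distributions.
def ring(n, radius, noise=0.1):
    theta = rng.uniform(0, 2 * np.pi, n)
    r = radius + noise * rng.normal(size=n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

X_train = ring(300, 1.0)
X_test = ring(300, 1.5)

# Domain classifier: RBF SVM trained to separate train (0) from test (1).
X_dom = np.vstack([X_train, X_test])
y_dom = np.r_[np.zeros(300), np.ones(300)]
dom = SVC(kernel="rbf", probability=True, random_state=0).fit(X_dom, y_dom)

# Importance weights from the non-linear domain model, with the
# denominator clipped to avoid division by zero.
proba = dom.predict_proba(X_train)
weights = proba[:, 1] / np.clip(proba[:, 0], 1e-10, None)
```

The weights can then be fed to any estimator accepting `sample_weight`, exactly as in the logistic regression version.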

Alexandre's suggestion of the Bitterlemons corpus seems both real-life
and suitable for linear models, hence it should be a straightforward
application of Alexander Smola's blog post.

Gael: this can be an example without the plot_ prefix, and in this case
the data does not have to be downloaded to build the documentation (at
least I hope). I don't think any dataset of the scikit And indeed this
is a typical application of the Laloudouana principled theory for
dataset selection :)

Anyway, I don't plan to implement this example soon, but it could be a
nice real-life illustration of the sample weights feature.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Olivier Grisel
2011-03-28 09:41:10 UTC
Post by Olivier Grisel
Gael: this can be an example without the plot_ prefix and in this case
the data does not have to be downloaded to build the documentation (at
least I hope). I don't think any dataset of the scikit
I got interrupted and forgot to finish that sentence: I meant that I don't
think any dataset currently in the scikit showcases a natural sample
selection discrepancy between two subsets.
Post by Olivier Grisel
And indeed this
is a typical application of the Laloudouana principled theory for
dataset selection :)
BTW, the Bitterlemons corpus is just 3 MB...
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gael Varoquaux
2011-03-28 12:55:29 UTC
Post by Olivier Grisel
Post by Olivier Grisel
And indeed this
is a typical application of the Laloudouana principled theory for
dataset selection :)
BTW, the Bitterlemons is just 3 MB...
Will it be useful elsewhere?

For pedagogical purposes, it's also good to limit the number of datasets.
That way, the examples make it easier to compare the
approaches.

G

Gael Varoquaux
2011-03-28 12:54:26 UTC
Post by Olivier Grisel
Gael: this can be an example without the plot_ prefix and in this case
the data does not have to be downloaded to build the documentation (at
least I hope).
First of all, that's clearly sacrificing the quality of the
documentation, in my eyes. Second, it's not only a question of building
the documentation; it's a question of defining a clear-cut set of
files with which the scikit is complete.
Post by Olivier Grisel
I don't think any dataset of the scikit And indeed this
is a typical application of the Laloudouana principled theory for
dataset selection :)
OK, fair enough. So we need to think about an additional dataset, one
that is not too big and has other uses.

Gael