Discussion:
Shrunken Centroid Classifier
Robert Layton
2012-03-12 03:35:01 UTC
Hi All,

On reading some research, it appears that the shrunken centroid
classifier <http://www-stat.stanford.edu/%7Etibs/PAM/Rdist/howwork.html>
is one of the better methods for authorship analysis.
Therefore, I'm going to implement it and see if it really is, and I was
planning to add it to scikits.learn.

Before I start, I wanted to make sure it wasn't already in scikits.learn
under a different name (I don't do much classification, so I'm not sure).
The method is basically like k-means clustering:
training: each class is represented by its centroid
testing: instances are assigned to the nearest centroid.

That is nearest centroid classification; the "shrunken" part is basically
a feature selection step.
Each centroid is moved towards the dataset centroid (taken as the origin)
by a threshold value. If a feature crosses zero, it is set to zero,
effectively eliminating that feature from the classification.

In my short research on the subject, I've seen two types of threshold. The
first is the absolute amount by which to move each centroid towards the
dataset centroid (e.g. 2.0 units), while the second is the number of
features to reduce each centroid to.
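
To make this concrete, here is a rough NumPy sketch of the shrinking step as
I currently understand it (the function names are my own, and I've left out
PAM's standardisation by within-class standard deviations):

import numpy as np

def shrink_centroids(centroids, dataset_centroid, threshold):
    """Soft-threshold each class centroid towards the dataset centroid."""
    # Work in coordinates where the dataset centroid is the origin.
    deviation = centroids - dataset_centroid
    # Shrink every feature towards zero by `threshold`; features whose
    # deviation is smaller than the threshold land exactly on zero and
    # so stop contributing to the classification.
    shrunken = np.sign(deviation) * np.maximum(np.abs(deviation) - threshold, 0.0)
    return dataset_centroid + shrunken

def predict(X, centroids):
    """Assign each sample to the class of the nearest (shrunken) centroid."""
    # Squared Euclidean distances, shape (n_samples, n_classes).
    dists = ((X[:, np.newaxis, :] - centroids[np.newaxis, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

The second kind of threshold could then be implemented by searching for the
smallest threshold value that leaves the desired number of non-zero features.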

My question is: does scikits.learn have anything already? If not, I'll
start working on it soon.

Thanks,

Robert
--
Public key at: http://pgp.mit.edu/ Search for this email address and select
the key from "2011-08-19" (key id: 54BA8735)
Andreas
2012-03-12 08:30:11 UTC
Hi Robert.
To me, this sounds somewhat like Linear Discriminant Analysis, or rather
Quadratic Discriminant Analysis (without the shrinking part).

In these methods, a Gaussian is fitted to each class and classification
is done by finding the Gaussian that most likely created a data point.

This is basically the same as finding the mean of each class and
classifying to the nearest using Mahalanobis distance.

I didn't look at the paper but that sounded quite related.

There is no probabilistic way to get the feature-selection shrinking in
this framework,
I guess, but of course you can always just set entries of the mean to zero.
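
To make the comparison concrete, here is a small NumPy sketch (my own, not
anything in scikits.learn) of the mean-plus-pooled-covariance view; it
ignores the class prior term that full LDA would add:

import numpy as np

def fit_lda_like(X, y):
    """Fit one mean per class and a single pooled within-class covariance."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    centered = np.vstack([X[y == c] - means[i] for i, c in enumerate(classes)])
    # Pooled within-class covariance; assumed non-singular for simplicity.
    cov_inv = np.linalg.inv(np.cov(centered, rowvar=False))
    return classes, means, cov_inv

def predict_mahalanobis(X, classes, means, cov_inv):
    """Assign each sample to the class mean nearest in Mahalanobis distance."""
    diffs = X[:, np.newaxis, :] - means[np.newaxis, :, :]  # (n, k, d)
    d2 = np.einsum('nkd,de,nke->nk', diffs, cov_inv, diffs)
    return classes[d2.argmin(axis=1)]

With the covariance fixed to the identity, this collapses to plain
nearest-centroid classification, which is where the two methods should meet.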


Maybe you can take a closer look at these methods and work out
what the differences are.

Hope that helps,
Andy
Robert Layton
2012-03-12 09:11:07 UTC
Hi Andy,

That sounds pretty correct. My guess is that they are different but highly
related, as you said.
I'll do some investigation.

Thanks,

Robert
--
Public key at: http://pgp.mit.edu/ Search for this email address and select
the key from "2011-08-19" (key id: 54BA8735)
Olivier Grisel
2012-03-12 22:42:29 UTC
Post by Robert Layton
training: each class is represented by its centroid
testing: instances are assigned to the nearest centroid.
I have it in a branch:

https://github.com/ogrisel/scikit-learn/tree/nearest-centroid

There are no tests and no docs yet. It works quite well on the Olivetti
faces but very badly on the 20 newsgroups text data, which is kind of
unexpected, as k-means is able to cluster that text data quite well.
Investigating why it does badly on high-dimensional sparse data may help
us understand the nature of text data better.
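
If anyone wants to reproduce the text result, something along these lines
should do it (the import path for the estimator is a guess; so far the class
only exists in my branch):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid  # import path may differ

train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

# Standard tf-idf features for the text classification task.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

clf = NearestCentroid().fit(X_train, train.target)
print("20 newsgroups accuracy:", clf.score(X_test, test.target))

One thing to check is whether plain Euclidean distance to a dense mean is
the right metric for sparse tf-idf vectors, but that's just speculation at
this point.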
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Robert Layton
2012-03-13 00:49:10 UTC
Thanks Olivier,

I'll work off that template, and when I work out the details of the
shrinking parameters (specifically which one is more in use), I'll branch
and submit a PR.

- Robert
--
Public key at: http://pgp.mit.edu/ Search for this email address and select
the key from "2011-08-19" (key id: 54BA8735)
Olivier Grisel
2012-03-13 06:49:49 UTC
Post by Robert Layton
I'll work off that template, and when I work out the details of the
shrinking parameters (specifically which one is more in use), I'll branch
and submit a PR.
Great. I think nearest centroid is a very nice baseline classifier
for sanity checks: fast to fit, fast to predict, zero hyper-parameters,
and yet it makes reasonable assumptions for many classification datasets
(a good example of high bias, low variance; the opposite of deep decision
trees or RBF kernel SVMs).
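
For instance, the kind of sanity check I have in mind would look roughly
like this (the class name and module are assumptions until the code is
merged):

from sklearn.datasets import load_iris
from sklearn.neighbors import NearestCentroid  # assumed location
from sklearn.svm import SVC

iris = load_iris()
# High-bias, low-variance baseline versus a low-bias, high-variance model.
baseline = NearestCentroid().fit(iris.data, iris.target)
model = SVC(kernel='rbf').fit(iris.data, iris.target)
# Training accuracy only, for brevity; a real check would use a held-out split.
print("nearest centroid:", baseline.score(iris.data, iris.target))
print("RBF SVM:", model.score(iris.data, iris.target))

If the fancy model can't clearly beat the baseline, that's a red flag.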
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2012-03-13 07:04:38 UTC
Post by Olivier Grisel
Great. I think nearest centroid is a very nice baseline classifier
for sanity checks: fast to fit, fast to predict, zero hyper-parameters,
and yet it makes reasonable assumptions for many classification datasets
(a good example of high bias, low variance; the opposite of deep decision
trees or RBF kernel SVMs).
It would be a nice addition to the neighbors module!

Mathieu
Andreas Mueller
2012-03-13 07:16:35 UTC
Post by Olivier Grisel
Great. I think nearest centroid is a very nice baseline classifier
for sanity checks: fast to fit, fast to predict, zero hyper-parameters,
and yet it makes reasonable assumptions for many classification datasets
(a good example of high bias, low variance; the opposite of deep decision
trees or RBF kernel SVMs).
Have you ever compared it to LDA?
I would think the results are quite similar.
Robert Layton
2012-03-13 10:53:39 UTC
Post by Andreas Mueller
Have you ever compared it to LDA?
I would think the results are quite similar.