Discussion:
feature selection documentation: improvements?
Eustache DIEMERT
2013-07-16 15:09:09 UTC
Hi Sklearners,

I was trying out several of sklearn's feature selection methods on the Arcene
dataset [1], and it occurred to me that despite the numerous examples [2] in
the docs, most of them just plot/print the most relevant features.

What is missing, IMHO, is a simple example of how to actually transform the
dataset after the initial feature selection!

I'm thinking of something really simple that I couldn't find anywhere, like:

"""
clf = GradientBoostingClassifier()
clf.fit(X, y)

# keep only the features whose importance exceeds a threshold
feats_mask = [i > 1e-3 for i in clf.feature_importances_]
X = X.compress(feats_mask, axis=1)
clf.fit(X, y)  # again, since we now operate only on selected features
"""

I think such numpy array techniques can be a bit of a pain for new users
to find, and being a user-friendly project, we should document such simple
techniques.

If you agree, what should I do:
- make a PR with a new example, perhaps more practically oriented?
- or append sample code (like X = X.compress(mask)) to the feature
selection narrative docs?

I personally prefer the latter option.

NB: the document classification example has some sample code, but just for
chi2/SelectKBest:
http://scikit-learn.org/dev/auto_examples/document_classification_20newsgroups.html

NB2: I don't know whether X.compress works on any kind of sparse matrix?

Eustache

[1] http://archive.ics.uci.edu/ml/datasets/Arcene
[2] http://scikit-learn.org/dev/modules/feature_selection.html
Olivier Grisel
2013-07-16 17:49:56 UTC
Feature selectors should implement the `Transformer` API so that they
can be used in a Pipeline and make it possible to cross validate them.

The univariate feature selectors already implement the transformer API:

http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
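
For concreteness, here is a minimal sketch of that pattern; SelectKBest,
LogisticRegression and the iris data are only illustrative placeholders,
and the import paths follow current releases (cross_val_score used to live
in sklearn.cross_validation):

"""
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

data = load_iris()

# The selector is just another pipeline step, so it is re-fit on each
# training fold: the feature selection itself gets cross-validated.
pipe = Pipeline([
    ('sel', SelectKBest(f_classif, k=2)),  # keep the 2 best features
    ('clf', LogisticRegression()),
])
print(cross_val_score(pipe, data.data, data.target, cv=5).mean())
"""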

--
Olivier
Joel Nothman
2013-07-16 21:11:58 UTC
For your example, Eustache, the following would work (with a dense or
sparse X):

"""
clf = GradientBoostingClassifier()
clf.fit(X, y)
clf.fit(clf.transform(X, threshold=1e-3), y)
"""

Alternatively, use a Pipeline:
"""
clf = Pipeline([
    ('sel', GradientBoostingClassifier()),
    ('clf', GradientBoostingClassifier())
])
clf.fit(X, y)
"""
This will apply the default threshold (1e-5); currently the threshold can't
be set for use in a pipeline, pending an issue that I can't currently
locate, which would move the threshold onto the estimator, as with the
randomized l1 models' selection_threshold parameter.

The Pipeline examples include feature selectors, if only univariate ones.
Is there somewhere in the documentation where you think these could be
clearer? If so, submit a PR.

- Joel
Joel Nothman
2013-07-16 21:42:20 UTC
Sorry, I made a mistake: unless the classifier has penalty=l1, its default
feature selection threshold (as used in a pipeline currently) is the mean
feature importance score.
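
In other words, the pipeline's implicit selection amounts to roughly this
manual masking (a sketch, assuming X and a fitted clf exposing
feature_importances_; exact boundary handling may differ):

"""
import numpy as np

importances = clf.feature_importances_
mask = importances >= np.mean(importances)  # the mean-importance default
X_selected = X[:, mask]
"""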


Gael Varoquaux
2013-07-17 05:38:54 UTC
Hey Joel,

I am afraid that the GradientBoostingClassifier does not implement the
transform method.

Gaël
--
Gael Varoquaux
Researcher, INRIA Parietal
Laboratoire de Neuro-Imagerie Assistee par Ordinateur
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
Joel Nothman
2013-07-17 06:15:43 UTC
Oh, well that's sad! Given that it assigns feature_importances_, is there
any reason it should not incorporate the mixin to provide it with
transform()? (I assumed that transform was available wherever
feature_importances_ was.)


Eustache DIEMERT
2013-07-17 07:15:52 UTC
Mmm

Maybe just including the simple pipeline you provided in the feature
selection doc [1] would suffice to point to the recommended way of doing this?

Like a sub-sub-section dubbed "Including feature selection in a prediction
pipeline"?

What do you think?

Would it be too detailed? Should we let users figure this out themselves?

[1] http://scikit-learn.org/dev/modules/feature_selection.html
Olivier Grisel
2013-07-17 09:10:25 UTC
I agree that the narrative feature selection documentation should
include an inline toy example demonstrating how to combine a selector
transformer in a pipeline, as this is the canonical way to use
feature selection, especially if you want to cross-validate the impact
of the feature selection hyperparameters on the final performance
metrics of the whole pipeline.
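
Something along these lines could serve as that toy example (a sketch only:
the estimators, the grid and the import paths are placeholders following
current releases, with GridSearchCV tuning the number of selected features
jointly with the classifier):

"""
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

data = load_iris()
pipe = Pipeline([
    ('sel', SelectKBest(f_classif)),
    ('clf', LinearSVC()),
])
# Nested step__param syntax lets the search reach inside the pipeline.
param_grid = {'sel__k': [1, 2, 3, 4], 'clf__C': [0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(data.data, data.target)
print(search.best_params_, search.best_score_)
"""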

--
Olivier
Eustache DIEMERT
2013-07-17 07:06:58 UTC
Post by Olivier Grisel
Feature selectors should implement the `Transformer` API so that they
can be used in a Pipeline and make it possible to cross validate them.
That's what I thought too. Do we have an example of cross-validating
feature selection + learning?
Post by Olivier Grisel
The univariate feature selectors already implement the transformer API:
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
Right, somehow I missed this part :(
Gael Varoquaux
2013-07-16 18:32:53 UTC
Post by Eustache DIEMERT
What is missing, IMHO, is a simple example of how to actually transform the
dataset after the initial feature selection!
I beg to disagree. We have a huge number of examples, probably too many.
We need to move people away from copy-pasting examples and have them
actually learn the API of the package, and numpy: teaching people to fish
rather than giving them a fish.
Post by Eustache DIEMERT
"""
clf = GradientBoostingClassifier()
clf.fit(X, y)
feats_mask = [i > 1e-3 for i in clf.feature_importances_]
X = X.compress(feats_mask, axis=1)
"""
Yes. Learn numpy. Seriously, this may sound provocative, but it's the
biggest favor you can do yourself. It is a vast library and does require
some learning indeed. And by the way, if you knew numpy, you would know
that what you have written above is very inefficient; you could write
something like "X = X[:, clf.feature_importances_ > 1e-3]" instead.
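
To spell that out, a self-contained sketch (with random data standing in
for a real problem):

"""
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 1000)
importances = rng.rand(1000)  # stand-in for clf.feature_importances_

# Python-level loop: builds a list of bools one element at a time
X_slow = X.compress([i > 1e-3 for i in importances], axis=1)

# Vectorized: a single numpy comparison, then boolean indexing
X_fast = X[:, importances > 1e-3]

assert np.array_equal(X_slow, X_fast)
"""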

That said, as Olivier mentioned, the GradientBoostingClassifier could
implement a "transform", and that might be a good idea.

Cheers,

Gaël

PS: Sorry if I come across as a bit harsh; I had a pretty bad day fighting
with administration.
Eustache DIEMERT
2013-07-17 07:09:02 UTC
Post by Gael Varoquaux
Yes. Learn numpy. Seriously, this may sound provocative but it's the
biggest favor you can do yourself.
Ok, then for folks like me who come to numpy because of (thanks to) sklearn,
why not point to a (few) good tutorials somewhere in the docs?

I mean, if it's an implicit requirement, then let's make it explicit and
encourage people to learn it :)
Post by Gael Varoquaux
And by the way, if you knew numpy, you would know that what you have
written above is very inefficient; you could write something like
"X = X[:, clf.feature_importances_ > 1e-3]" instead.
How did you know ;D
Post by Gael Varoquaux
That said, as Olivier mentioned, the GradientBoostingClassifier could
implement a "transform", and that might be a good idea.
Ok, then maybe that's something I can tackle if it's not too hairy?
Post by Gael Varoquaux
PS: Sorry if I come out a bit harsh, I had a pretty bad day fighting with
administration.
No problem, since you acknowledge it!
Gael Varoquaux
2013-07-17 09:35:49 UTC
Post by Eustache DIEMERT
Ok, then for folks like me who come to numpy because of (thanks to) sklearn,
why not point to a (few) good tutorials somewhere in the docs?
Indeed. What would people think of pointing to the scipy-lectures
(http://scipy-lectures.github.io)?
Post by Eustache DIEMERT
I mean, if it's an implicit requirement, then let's make it explicit and
encourage people to learn it :)
Good point!

G
Nelle Varoquaux
2013-07-17 11:47:19 UTC
Post by Gael Varoquaux
Post by Eustache DIEMERT
Ok, then for folks like me who come to numpy because of (thanks to) sklearn,
why not point to a (few) good tutorials somewhere in the docs?
Indeed. What would people think of pointing to the scipy-lectures
(http://scipy-lectures.github.io)?
+1
Post by Gael Varoquaux
Post by Eustache DIEMERT
I mean, if it's an implicit requirement, then let's make it explicit and
encourage people to learn it :)
Good point!
G
Eustache DIEMERT
2013-07-18 15:54:13 UTC
Post by Gael Varoquaux
That said, as Olivier mentioned, the GradientBoostingClassifier could
implement a "transform", and that might be a good idea.
Post by Eustache DIEMERT
Ok, then maybe that's something I can tackle if it's not too hairy?
I tried something really dumb, but it seems to work in my case:

"""
class ExtGradientBoostingClassifier(GradientBoostingClassifier,
_LearntSelectorMixin):
pass

clf = ExtGradientBoostingClassifier()
clf.fit(X,y)

X = clf.transform(X)
X_valid = clf.transform(X_valid)

clf = GradientBoostingClassifier(n_estimators=100, max_depth=10)
clf.fit(X,y)
y_pred = clf.predict(X_valid)
...
"""

So I created a PR for this:
https://github.com/scikit-learn/scikit-learn/pull/2167

It seems that someone already added the pointer to the scipy tutorials
though :)

E/
