Discussion:
[Scikit-learn-general] Feature selection != feature elimination?
Philip Tully
2016-03-14 21:05:12 UTC
Hi,

I'm trying to optimize the time it takes to make a prediction with my
model(s). I realized that when I perform feature selection during the
model fit(), the features that were not selected are likely still
computed when I go to predict() or predict_proba(). An optimization
would then involve actually eliminating those unselected features from
my Pipeline altogether, instead of just filtering them out afterwards.

Does sklearn already do this automatically? Or does this readjustment
need to be done manually before serialization?

thanks,
Philip
Joel Nothman
2016-03-14 23:20:34 UTC
Currently there is no automatic mechanism for eliminating the generation of
features that are not selected downstream. It needs to be achieved manually.
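
One way the manual route might look for a vocabulary-based text pipeline is sketched below. This is an illustration under assumptions, not an established scikit-learn recipe: the toy documents, the choice of CountVectorizer, SelectKBest and LogisticRegression, and the helper name predict_pruned are all made up for the example. The idea is to read the selected columns from the fitted selector, rebuild the vectorizer with only those terms as a fixed vocabulary, and feed its output straight to the already fitted classifier.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only.
docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "monday project meeting notes",
    "cheap offer buy today",
    "notes from the project meeting",
]
labels = [1, 1, 0, 0, 1, 0]

# The usual pipeline: every term is counted, then most columns are discarded.
full = make_pipeline(CountVectorizer(), SelectKBest(chi2, k=4), LogisticRegression())
full.fit(docs, labels)
vectorizer, selector, classifier = (step for _, step in full.steps)

# Map column index -> term, then keep only the selected terms, in column order
# so that the classifier's coefficients still line up with the columns.
terms_by_column = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
kept_terms = [t for t, keep in zip(terms_by_column, selector.get_support()) if keep]

# A vectorizer with a fixed vocabulary only ever produces the kept columns,
# so the selector step is no longer needed at prediction time.
pruned_vectorizer = CountVectorizer(vocabulary=kept_terms)


def predict_pruned(raw_docs):
    """Hypothetical helper: predict without generating unselected features."""
    return classifier.predict(pruned_vectorizer.transform(raw_docs))


new_docs = ["buy cheap pills", "project meeting on monday"]
assert np.array_equal(full.predict(new_docs), predict_pruned(new_docs))
```

Tokenization still runs on every document here; only the counting of unselected terms is avoided. Whether that saves meaningful time depends on the data, and the same trick does not carry over directly to transformers that have no vocabulary to restrict.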
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Philip Tully
2016-05-02 15:06:39 UTC
Cool, thanks for feedback!

Are there any outstanding PRs addressing something like this, or has
anyone on this list been thinking about or working on solutions?
I imagine it might be implemented as a step in a pipeline (e.g.
FeatureRemover()) and be generally applicable, potentially benefiting many
sklearn users. I'm not sure it could be compatible with HashingVectorizer,
though.
Sebastian Raschka
2016-05-02 15:20:45 UTC
A little question regarding how it’s currently handled ...
Say I have one of scikit-learn’s feature selectors in a pipeline, and it selected, e.g., the features idx=[1, 12, 23] after “.fit”. Then, if I use “.predict” on that pipeline, wouldn’t the feature selector’s transform method only pass X[:, idx] (where X is the input array) to the next object in the pipeline, e.g., the estimator? That’s how I do it with my custom feature selection objects/algorithms. I never looked under the hood of how scikit-learn’s feature selection implementations do it, so I am curious.

Best,
Sebastian
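
The behaviour asked about above can be checked with a small sketch. The synthetic data and the choice of SelectKBest with f_classif are assumptions made for the illustration; get_support(indices=True) and transform() are existing selector methods. The check shows that a fitted selector's transform returns the same array as slicing X[:, idx], which also means the selector can only slice columns that an upstream step has already computed.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data, invented for illustration: 30 features, 3 made informative.
rng = np.random.RandomState(0)
X = rng.rand(40, 30)
y = np.tile([0, 1], 20)
for column in (1, 12, 23):
    X[:, column] += y  # shift these columns by the class label

selector = SelectKBest(f_classif, k=3).fit(X, y)

# Indices of the columns the selector decided to keep.
idx = selector.get_support(indices=True)

# transform() hands the next pipeline step exactly these columns of X.
X_selected = selector.transform(X)
same_as_slicing = np.array_equal(X_selected, X[:, idx])
print(idx, X_selected.shape, same_as_slicing)
```

So the estimator does receive only the selected columns, but the full X still has to exist before the selector can slice it, which is the cost the original question is about.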