[Scikit-learn-general] Combine functionality for text feature/image feature pipeline

Discussion:

michael kneier

2014-02-27 07:33:38 UTC

Hi all,

I would like to add a "combiner" class which would work with Pipeline to allow users to augment the output of scikit's text feature extraction process (or other feature extraction processes). For example, after applying CountVectorizer, it is sometimes desirable to augment the resulting dataset with additional features. Unless I am missing something, this is not easily done if the count vectorization is being used in a pipeline, especially if CountVectorizer parameters such as min_df are being optimized along with downstream model parameters.

After I have written code for this class, what is the easiest way to get it reviewed/incorporated into scikit?

Thanks,
Mike Kneier

Alexandre Gramfort

2014-02-27 07:37:32 UTC

Hi,

Do you know about FeatureUnion?

http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html

It might already do what you want.

A
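
For readers unfamiliar with FeatureUnion, here is a minimal sketch of what it does: it fits several transformers on the same input and stacks their outputs column-wise. The toy documents and the step names ("word_counts", "char_tfidf") are assumptions made up for illustration; they do not come from this thread.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Toy corpus, invented for this example.
docs = ["the cat sat", "the dog barked", "a cat and a dog"]

# Every transformer receives the same input; the outputs are
# concatenated side by side into one feature matrix.
union = FeatureUnion([
    ("word_counts", CountVectorizer()),
    ("char_tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),
])

X = union.fit_transform(docs)

# Number of columns contributed by each fitted transformer.
n_word = len(union.transformer_list[0][1].vocabulary_)
n_char = len(union.transformer_list[1][1].vocabulary_)
print(X.shape, n_word, n_char)
```

Nested parameters are addressed as `<step name>__<parameter>`, for example `word_counts__min_df`, so a vectorizer setting such as min_df can be tuned in the same grid search as downstream model parameters.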

_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Michael Kneier

2014-02-27 15:30:20 UTC

The problem with FeatureUnion is that it can only combine the output of two transformers. I think it would be great to have a simple method of combining the result of a transformer with external/untransformed data within a pipeline.


Lars Buitinck

2014-02-27 16:10:04 UTC

Post by michael kneier
I would like to add a "combiner" class which would work with Pipeline to allow users to augment the output of scikit's text feature extraction process (or other feature extraction processes). For example, after applying CountVectorizer, it is sometimes desirable to augment the resulting dataset with additional features. Unless I am missing something, this is not easily done if the count vectorization is being used in a pipeline, especially if CountVectorizer parameters such as min_df are being optimized along with downstream model parameters.

CountVectorizer is very customizable. You can give a custom analyzer
that extracts the features you want:

CountVectorizer(analyzer=features)

where features is some custom function that gets either a filename or
a file's content (as a string) and returns whatever features you want.
The only downside is that all the features are going to be counted, so
things like timestamps aren't going to be handled nicely.
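
A minimal sketch of the custom-analyzer approach follows. The `features` function here is hypothetical: it lowercases and splits on whitespace, then appends one made-up categorical token describing document length. The sample documents are also invented.

```python
from sklearn.feature_extraction.text import CountVectorizer


def features(doc):
    """Hypothetical analyzer: word tokens plus one coarse length feature."""
    tokens = doc.lower().split()
    # Extra feature encoded as a token, so it gets counted like a word.
    extra = ["LEN=long" if len(tokens) > 3 else "LEN=short"]
    return tokens + extra


docs = ["Hello world", "hello hello big wide world"]

# With a callable analyzer, CountVectorizer skips its own preprocessing
# and tokenization and simply counts whatever the callable returns.
vec = CountVectorizer(analyzer=features)
X = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))
```

Because every returned item is counted, this works for categorical extras like the length flag but not for continuous values such as timestamps, which is the limitation noted above.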

If that doesn't do the trick, have a look at DictVectorizer. That's
even more flexible: you give it dicts mapping feature names to
(numeric or string) values. It will build a matrix representation
using booleans in place of string values, but it will leave the
numeric values untouched.
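
As a small illustration of the DictVectorizer behaviour described here, the sketch below uses two made-up records with one string-valued and one numeric feature. The field names and values are assumptions chosen for the example.

```python
from sklearn.feature_extraction import DictVectorizer

# Invented records: "city" is a string feature, "temperature" is numeric.
records = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
]

# sparse=False returns a dense array, which is easier to inspect.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)
names = vec.feature_names_

print(names)
print(X)
```

The string feature is expanded into one indicator column per observed value ("city=Dubai", "city=London"), while the numeric feature keeps its value in a single column.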

Joel Nothman

2014-02-27 22:37:33 UTC

Post by Michael Kneier
The problem with FeatureUnion is that it can only combine the output of two transformers. I think it would be great to have a simple method of combining the result of a transformer with external/untransformed data within a pipeline.

I think it would be nice if FeatureUnion made it easy to extract
only certain parts of the input for each transformer.
https://github.com/scikit-learn/scikit-learn/issues/2034 is intended to
cover this, but we haven't settled on a clean API.

Suggestions are welcome!

- Joel
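
The API question was still open when this thread was written. As an assumption that goes beyond the thread: later scikit-learn releases (0.20 and newer) ship sklearn.compose.ColumnTransformer, which routes selected columns to each transformer. The sketch below relies on that class being available; the data is invented.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Invented data: column 0 is text, column 1 is an already-numeric feature.
X = np.array(
    [
        ["red apple", 1.5],
        ["green apple", 2.0],
        ["red pear", 0.5],
    ],
    dtype=object,
)

ct = ColumnTransformer(
    [
        # A scalar column index hands the vectorizer a 1-D array of strings.
        ("text", CountVectorizer(), 0),
        # A list index keeps a 2-D slice; "passthrough" leaves it untouched.
        ("numeric", "passthrough", [1]),
    ],
    sparse_threshold=0,  # always return a dense array in this small example
)

Xt = ct.fit_transform(X)
print(Xt.shape)
```

The output columns follow the order of the transformer list, so the word counts come first and the untransformed numeric column last.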


Lars Buitinck

2014-02-27 23:11:09 UTC

Post by Joel Nothman
I think it would be nice if the FeatureUnion makes it easy to extract
only certain parts of the input for each transformer.
https://github.com/scikit-learn/scikit-learn/issues/2034 intends to
cover this issue, but we haven't resolved a clean API.
Suggestions are welcome!

Michael Kneier

2014-02-28 01:37:44 UTC

Thanks for the great replies. As Lars rightly points out, I could define a
custom transformer to accomplish the combining.

I do think that this could be implemented more intuitively (or at least
built in to FeatureUnion), and I'd like to pitch in on
https://github.com/scikit-learn/scikit-learn/issues/2034. I will take a
closer look this weekend.

Thanks,
Mike

Post by Lars Buitinck

I hope you don't mind me replying here: I think this can be resolved
by custom transformers that pass through a user-specified set of
columns. My preferred way of implementing that would be a generic,
stateless transformer class that just runs a function on X in
transform and returns the result. If this transformer doesn't do input
validation, you could make a union

    make_pipeline(FunctionTransformer(extract_description_terms),
                  TfidfTransformer())
    ∪
    make_pipeline(FunctionTransformer(extract_portrait_pixels), PCA())

and feed this filenames, or dicts, or whatever. The original problem
of letting through only some columns is then

    def even_columns(X):
        X = np.asarray(X)
        return X[:, ::2]

    FunctionTransformer(even_columns)

And of course, these things are more generally useful for inserting a
simple function in the middle of a pipeline.
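
The union sketched above could look roughly like the following with the FunctionTransformer that current scikit-learn releases provide in sklearn.preprocessing. This is an assumption: the class was only a proposal when this message was written. The `odd_columns` helper, the step names, and the random data are hypothetical additions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer


def even_columns(X):
    """Select columns 0, 2, 4, ... of X."""
    X = np.asarray(X)
    return X[:, ::2]


def odd_columns(X):
    """Hypothetical counterpart: select columns 1, 3, 5, ... of X."""
    X = np.asarray(X)
    return X[:, 1::2]


rng = np.random.RandomState(0)
X = rng.rand(6, 4)  # made-up data: 6 samples, 4 features

union = FeatureUnion([
    # Even-numbered columns pass through unchanged.
    ("raw", FunctionTransformer(even_columns)),
    # Odd-numbered columns are reduced to a single PCA component.
    ("reduced", make_pipeline(FunctionTransformer(odd_columns),
                              PCA(n_components=1))),
])

Xt = union.fit_transform(X)
print(Xt.shape)
```

The first two output columns are the untouched even columns and the third is the PCA component, which is one way to combine transformed and untransformed data inside a single pipeline step.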

Joel Nothman

2014-02-28 01:54:43 UTC

Post by Lars Buitinck
My preferred way of implementing that would be a generic,
stateless transformer class that just runs a function on X in
transform and returns the result.

I think this is useful anyway, and an effective but not ideal solution
for this use case: it adds a lot of overhead for what is really a
straightforward application.
