Discussion:
Non-standard arguments to transform(...)
Joel Nothman
2013-05-16 13:51:18 UTC
Permalink
Behind some of the discussion at
https://github.com/scikit-learn/scikit-learn/issues/1963 is an underlying
issue of how to deal with transform methods that require a parameter other
than X.

I notice that there are some remnants of a former scikit-learn that allowed
parameters to fit(...) other than X and y, but these have been removed in
favour of object-level parameters.

Should the same be the case with transform? In particular, Pipeline and
FeatureUnion only support 1-ary transform, and any parameter that can only
be set as an additional argument to transform can't be varied in a [grid]
search. That is, >1-ary transform seems a poor fit for the API.
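For concreteness, a rough sketch (hypothetical parameter values; the point is
just that the step__param naming only reaches object-level parameters):

    from sklearn.pipeline import Pipeline
    from sklearn.grid_search import GridSearchCV
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import SGDClassifier

    pipe = Pipeline([('counts', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier())])

    # Object-level parameters are reachable by name in a grid search...
    grid = GridSearchCV(pipe, param_grid={'tfidf__use_idf': [True, False],
                                          'clf__alpha': [1e-4, 1e-3]})
    # ...but there is no analogous way to say "pass threshold=0.1 to some
    # step's transform(...)" through Pipeline or GridSearchCV.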

So what non-X arguments do transform methods take?


- copy=True in various, such as feature_extraction.text.TfidfTransformer
- y=None in various
- ridge_alpha=None in decomposition.sparse_pca.SparsePCA
- threshold=None in feature_selection.selector_mixin.SelectorMixin
- pooling_func=np.mean in cluster._feature_agglomeration.AgglomerationTransform
- Y=None in pls._PLS, pls.PLSSVD

copy is not really a problem. It's not something you want to vary in a
parameter search, and although perhaps the pipeline could take advantage of
it, it does no harm.

y is the subject of the first issue I mentioned.

In the case of ridge_alpha, it looks like this argument should be
deprecated (or left in as harmless?) as the code already backs off to an
object property.

For threshold, this additional parameter also makes it harder to adopt the
feature selection mixin. I propose that instead such estimators should have
a to_selector/to_transformer method that returns a feature selector where a
threshold parameter (or a lower threshold, an upper threshold and a limit; see
https://github.com/scikit-learn/scikit-learn/pull/1939) can be played with
directly through the common scikit-learn parameters API (and therefore
[grid] search).

I haven't tried to understand pooling_func or Y, so I'm not sure whether
they're necessary.

Thoughts?

- Joel
Gael Varoquaux
2013-05-16 14:01:14 UTC
Permalink
Post by Joel Nothman
I notice that there are some remnants of a former scikit-learn that allowed
parameters to fit(...) other than X and y, but these have been removed in
favour of object-level parameters.
Any parameter that is not strongly data-dependent should be an object-level
parameter. Ideally, I would have liked all these parameters to be length-n
vectors, in order to be able to apply cross-validation data splitting to
them, but this is maybe an over-simplification.
Post by Joel Nothman
copy is not really a problem. It's not something you want to vary in parameter
search,
But it's not something that should depend on the data.
Post by Joel Nothman
y is the subject of the first issue I mentioned.
y seems legit. Our framework does not deal properly with it, but that's
arguably our problem.
Post by Joel Nothman
In the case of ridge_alpha, it looks like this argument should be deprecated
(or left in as harmless?) as the code already backs off to an object property.
Yes, it should be deprecated.
Post by Joel Nothman
For threshold, this additional parameter also makes it harder to adopt the
feature selection mixin. I propose that instead such estimators should have a
to_selector/to_transformer method that returns a feature selector where a
threshold parameter (or a lower and upper and limit threshold; see https://
github.com/scikit-learn/scikit-learn/pull/1939) can be played with directly
through the common scikit-learn parameters API (and therefore [grid] search).
I haven't tried to understand pooling_func or Y so I don't get whether they're
necessary.
Thoughts?
The above should be object-level params, I believe.

G
Joel Nothman
2013-05-16 22:01:43 UTC
Permalink
On Fri, May 17, 2013 at 12:01 AM, Gael Varoquaux <
Post by Joel Nothman
For threshold, this additional parameter also makes it harder to adopt the
feature selection mixin.
...
The above should be object-level params, I believe.
So the Forest, LibLinear, Perceptron and SGD implementations should all
have an additional `transform_threshold` object-level parameter?

I am not firmly devoted to the following alternative, but I think it is
worth considering:

Given one of these estimators, a perceptron, say, we can use:

perceptron.to_selector()

which returns an estimator whose parameters are generic things for feature
selection by score: a minimum score, a maximum score, a limit on the number
of features returned from the maximum down... And it also has the underlying
estimator as a parameter, so its parameters may be set in a grid search (see
an untested implementation at
https://github.com/jnothman/scikit-learn/blob/feat_sel_overhaul/sklearn/feature_selection/selector_mixin.py#L11 ).

What I like about this is that it doesn't clutter what are generally used
as classifiers or regressors with their secondary purpose as feature
selectors. Instead of automatically considering a Perceptron a Transformer,
it allows the Perceptron to be explicitly reinterpreted as one: a Pipeline
including a Perceptron among its transformers looks strange; a Pipeline
including a Perceptron().to_selector() makes more sense.
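
As a usage sketch (to_selector is of course hypothetical here, and the
parameter names are only illustrative):

    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import Perceptron, LogisticRegression
    from sklearn.grid_search import GridSearchCV

    pipe = Pipeline([('select', Perceptron(penalty='l1').to_selector()),  # hypothetical
                     ('clf', LogisticRegression())])

    # Both the selector's generic parameters and the wrapped estimator's own
    # parameters would then be reachable through the usual naming scheme:
    grid = GridSearchCV(pipe, param_grid={
        'select__limit': [100, 1000],        # illustrative parameter names
        'select__minimum': [1e-3, 1e-2],
        'select__estimator__alpha': [1e-4, 1e-3],
    })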

- Joel
Gael Varoquaux
2013-05-17 05:37:49 UTC
Permalink
Post by Joel Nothman
perceptron.to_selector()
which returns an estimator whose parameters are generic things for feature
selection by score: a minimum score, maximum score,
I think that this is software sophistication that makes it harder to use
for people who are not used to complex software constructs (think the
matlab 101 user), and for this reason I am -1.
Vlad Niculae
2013-05-17 05:52:42 UTC
Permalink
Slightly off topic spilling of a jar of cents:

How about a metaestimator that is initialized with a (fitted)
estimator and has an API identical to pure feature selectors? That
would be more in the spirit of our API, I think, but it's not much less
complicated than the approach Gael thinks is too sophisticated.
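
Something along these lines, say (a rough, untested sketch; the class name,
threshold handling and dense-only indexing are just placeholders):

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin

    class SelectFromEstimator(BaseEstimator, TransformerMixin):
        """Wrap a fitted estimator, exposing only its feature-selection side."""

        def __init__(self, estimator, threshold=None):
            self.estimator = estimator  # assumed already fitted
            self.threshold = threshold

        def _scores(self):
            if hasattr(self.estimator, 'feature_importances_'):
                return self.estimator.feature_importances_
            # fall back to coefficients, collapsing any multiclass axis
            return np.abs(np.atleast_2d(self.estimator.coef_)).sum(axis=0)

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            scores = self._scores()
            threshold = np.mean(scores) if self.threshold is None else self.threshold
            return X[:, scores >= threshold]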

On the other hand, we all know that users don't read docs, and having
awesome tools such as IPython makes it easy to guess what to do based
on an API. Do you think it's (intuitive/natural/expected) to find
.transform on a LinearSVC or on a Tree? Could you guess what it does?
Is that where you'd look for such a feature?

Rambling even further, once you transform with such a fitted model,
you (obviously?) can't predict the transformed data, because it has a
different dimension. So when you do feature selection with
trees/forests, the sequence is slightly strange (sketched in code below):

1) fit on full-width data
2) transform training and test set
3) re-fit on transformed training data
4) predict the test set
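
In code, with the transform method as it exists on forests today (a minimal
sketch):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.cross_validation import train_test_split

    X, y = make_classification(n_samples=200, n_features=100, n_informative=5)
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    forest = ExtraTreesClassifier(n_estimators=50).fit(X_train, y_train)  # 1) fit full-width
    X_train_sel = forest.transform(X_train)   # 2) transform training set...
    X_test_sel = forest.transform(X_test)     #    ...and test set
    clf = LogisticRegression().fit(X_train_sel, y_train)  # 3) re-fit on selected features
    print(clf.score(X_test_sel, y_test))      # 4) predict/score on the test set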

Of course it all makes perfect sense but it's one of those things that
set off API alarm bells in my head when I encountered it.

Cheers,
Vlad



Gael Varoquaux
2013-05-17 05:55:20 UTC
Permalink
Post by Vlad Niculae
How about a metaestimator that is initialized with a (fitted)
estimator and has an API identical to pure feature selectors? That
would be more in the spirit of our API I think, but it's not much less
complicated than the way Gael thinks is too sophisticated.
I think that it is a good idea, because it is a pattern that we already
use, so in terms of learnability of the package, it is good.
Post by Vlad Niculae
Do you think it's (intuitive/natural/expected) to find .transform on a
LinearSVC or on a Tree? Could you guess what it does?
No, but I'd read the docstring.
Joel Nothman
2013-05-18 08:41:54 UTC
Permalink
On Fri, May 17, 2013 at 3:37 PM, Gael Varoquaux <
Post by Joel Nothman
perceptron.to_selector()
which returns an estimator whose parameters are generic things for feature
selection by score: a minimum score, maximum score,
I think that this is software sophistication that makes it harder to use
for people who are not used to complex software construct (think the
matlab 101 user), and I for this reason, am -1.
I understand that. That's one reason I'm not wedded to it, but I thought it
was worth proposing.

So you'd +1 the transform_threshold object parameter?

(And Vlad, my suggestion produces your meta-estimator, but ensures that
only compliant estimators can do so via a mixin to_selector method.)
Gael Varoquaux
2013-05-18 08:51:47 UTC
Permalink
Post by Joel Nothman
So you'd +1 the transform_threshold object parameter?
Yes
Lars Buitinck
2013-05-18 12:44:01 UTC
Permalink
Post by Joel Nothman
Post by Gael Varoquaux
I think that this is software sophistication that makes it harder to use
for people who are not used to complex software construct (think the
matlab 101 user), and I for this reason, am -1.
Agree...
Post by Joel Nothman
So you'd +1 the transform_threshold object parameter?
There's one issue with this, which is that grid searching
transform_threshold would re-train the estimator many times in the
loop to change a parameter that does not actually affect fit.

As regards methods vs. meta-estimators, I'm not too fond of extra
methods on classifiers to overload them with feature selection. If we
use a meta-estimator, then we can add an option to make it "forget"
the underlying estimator and keep only the mask. I'm currently doing
this manually with linear SVMs, because in multiclass classification,
the coef_ is n_classes × n_features × sizeof(np.float64), while
n_features × sizeof(np.bool) suffices to do feature selection. At 3e6
features × 6 classes, this greatly reduces the stored model's size.
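
For the numbers above, roughly:

    import numpy as np

    n_classes, n_features = 6, int(3e6)
    coef_bytes = n_classes * n_features * np.dtype(np.float64).itemsize  # ~144 MB
    mask_bytes = n_features * np.dtype(np.bool_).itemsize                # ~3 MB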

(Also, would it be an idea to extend SelectKBest and SelectPercentile to
work with estimators that have feature_importances_?)
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Joel Nothman
2013-05-18 13:07:07 UTC
Permalink
Post by Lars Buitinck
Post by Joel Nothman
Post by Gael Varoquaux
I think that this is software sophistication that makes it harder to use
for people who are not used to complex software construct (think the
matlab 101 user), and I for this reason, am -1.
Agree...
Post by Joel Nothman
So you'd +1 the transform_threshold object parameter?
There's one issue with this, which is that grid searching
transform_threshold would re-train the estimator many times in the
loop to change a parameter that does not actually affect fit.
This is a more general problem in the current search implementation. But at
least such a parameter would suffice for Pipeline and FeatureUnion.

Post by Lars Buitinck
As regards methods vs. meta-estimators, I'm not too fond of extra
methods on classifiers to overload them with feature selection. If we
use a meta-estimator, then we can add an option to make it "forget"
the underlying estimator and keep only the mask.
Or perhaps the importances so that the mask can be calculated for different
thresholds, k-bests, etc.

Post by Lars Buitinck
I'm currently doing this manually with linear SVMs, because in multiclass
classification,
the coef_ is n_classes × n_features × sizeof(np.float64), while
n_features × sizeof(np.bool) suffices to do feature selection. At 3e6
features × 6 classes, this greatly reduces the stored model's size.
(Also, is it an idea to extend SelectKBest and SelectPercentile to
work with estimators that have feature_importances_?)
One of the problems with extending SelectKBest and SelectPercentile to new
cases is that they expect a score_func that returns both scores and
pvalues. Otherwise your suggestion just involves exposing a
get_feature_importances method...?
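
(For what it's worth, importances can be shoehorned into the existing
interface with a score_func that fabricates p-values, though that refits the
estimator inside the selector; a rough sketch, not a design proposal:)

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.feature_selection import SelectKBest

    def importance_scores(X, y):
        # score_func must return (scores, pvalues); the p-values are meaningless here
        forest = ExtraTreesClassifier(n_estimators=50).fit(X, y)
        return forest.feature_importances_, np.ones(X.shape[1]) * np.nan

    selector = SelectKBest(score_func=importance_scores, k=100)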

I have implemented a more generic function mask_by_score (in
https://github.com/jnothman/scikit-learn/commits/feat_sel_overhaul) that
takes the following arguments (and others I'm less certain about; sketched below):

* scores
* minimum bound
* maximum bound
* limit (a maximum number or fraction of features)

and have reimplemented Select* and the mixin currently under discussion in
terms of it (and also a SelectByScore wrapper).
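
Roughly, the idea is something like this (a simplified sketch; the actual
code in the branch differs):

    import numpy as np

    def mask_by_score(scores, minimum=None, maximum=None, limit=None):
        """Boolean mask over features selected by score.

        Keeps features whose score lies within [minimum, maximum]; if limit
        is given, keeps at most that many (or that fraction of all features),
        counting from the maximum score down.
        """
        scores = np.asarray(scores)
        mask = np.ones(scores.shape, dtype=bool)
        if minimum is not None:
            mask &= scores >= minimum
        if maximum is not None:
            mask &= scores <= maximum
        if limit is not None:
            if 0 < limit < 1:  # interpret as a fraction of all features
                limit = int(limit * scores.size)
            kept = np.flatnonzero(mask)
            top = kept[np.argsort(scores[kept])[::-1][:int(limit)]]
            mask = np.zeros_like(mask)
            mask[top] = True
        return mask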

One then also needs some no-op selectors, like SelectByIndices and
SelectByMask that wrap transformers around support masks extracted by
whatever means.
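
For instance, a bare-bones SelectByMask (dense X only) might look like:

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin

    class SelectByMask(BaseEstimator, TransformerMixin):
        """No-op selector: apply a precomputed boolean support mask."""

        def __init__(self, mask):
            self.mask = mask

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            return X[:, np.asarray(self.mask)]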

- Joel

Lars Buitinck
2013-05-16 14:03:01 UTC
Permalink
Post by Joel Nothman
I notice that there are some remnants of a former scikit-learn that allowed
parameters to fit(...) other than X and y, but these have been removed in
favour of object-level parameters.
That's right. The old convention was that fit would call _set_params,
then do actual training. The problem with this was that the method
could fail halfway through, without signaling to the caller what
exactly had failed, so we made set_params public and deprecated
passing estimator parameters to fit.
Post by Joel Nothman
Should the same be the case with transform? In particular, Pipeline and
FeatureUnion only support 1-ary transform, and any parameter that can only
be set as an additional argument to transform can't be varied in a [grid]
search. That is, >1-ary transform seems misfit to the API.
Yes, preferably the same should be the case for transform.
Post by Joel Nothman
So what non-X arguments to transform methods take?
copy=True in various, such as feature_extraction.text.TfidfTransformer
y=None in various
ridge_alpha=None in decomposition.sparse_pca.SparsePCA
threshold=None in feature_selection.selector_mixin.SelectorMixin
pooling_func=np.mean in
cluster._feature_agglomeration.AgglomerationTransform
Y=None in pls._PLS, pls.PLSSVD
copy is not really a problem. It's not something you want to vary in
parameter search, and although perhaps the pipeline could take advantage of
it, it's no problem.
[Aside: a consistent way of telling transformers that they own
whatever is passed to them could benefit pipeline performance, though
I never took the time to design an API for that. Putting fake copy=
arguments on all constructors or transform methods is ugly; this
should be handled in TransformerMixin.]
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Vlad Niculae
2013-05-16 14:25:06 UTC
Permalink
Post by Joel Nothman
Y=None in pls._PLS, pls.PLSSVD
I might be wrong, but this is actually just a normal `y=None`. It's
capitalized because that's standard notation for PLS and it emphasises
the multi-target nature of the algorithm. I tried changing it to
lowercase `y` once but I think we decided not to break the api for no
reason. Maybe it's a good time to discuss it again. Or maybe I
forget what the reason not to change it was.

Cheers,
Vlad
Gael Varoquaux
2013-05-16 15:57:52 UTC
Permalink
Post by Vlad Niculae
I tried changing it to lowercase `y` once but I think we decided not to
break the api for no reason. Maybe it's a good time to discuss it
again.
I am now thinking that it should be 'y' to avoid having a different API.
We could go through a deprecation cycle.

G
Joel Nothman
2013-05-16 21:31:21 UTC
Permalink
The reason I distinguished Y was both because of its case and because no
instance of transform(X, y=None) actually used y at all.
Post by Vlad Niculae
Post by Joel Nothman
Y=None in pls._PLS, pls.PLSSVD
I might be wrong, but this is actually just a normal `y=None`. It's
capitalized because that's standard notation for PLS and it emphasises
the multi-target nature of the algorithm. I tried changing it to
lowercase `y` once but I think we decided not to break the api for no
reason. Maybe it's a good time to discuss it again. Or maybe I
forget what the reason not to change it was.
Cheers,
Vlad