[Scikit-learn-general] Suggestions for the model selection module

Discussion:

Matthias Feurer

2016-05-07 09:12:36 UTC

Dear scikit-learn team,

First of all, the model selection module is really easy to use and has a
nice and clean interface; I really like that. Nevertheless, while using
it for benchmarks I found some shortcomings where I think the module
could be improved.

1. Return the fit and predict time in `grid_scores_`

BaseSearchCV relies on a function called _fit_and_score to produce the
entries in grid_scores_. This function measures the time it takes to fit
a model, predict for the (cross-)validation set and calculate the score.
It returns this time, which is then discarded:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/model_selection/_search.py#L569

I propose to store this time in grid_scores_ and make it accessible to
the user. Also, the time taken to refit the model in line 596 and
following should be measured and made accessible to the user.
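
As a rough illustration of this proposal, the sketch below times the fit
and score steps separately for each parameter setting and also times the
final refit. It is a standalone sketch and not scikit-learn's
_fit_and_score: the helper `fit_and_time`, the toy `MeanRegressor`, and
the result keys `fit_time`, `score_time` and `refit_time` are all
hypothetical names chosen for illustration.

```python
import time


class MeanRegressor:
    """Toy estimator (hypothetical): predicts the mean of y, shifted by `offset`."""

    def __init__(self, offset=0.0):
        self.offset = offset

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y) + self.offset
        return self

    def score(self, X, y):
        # Negative mean squared error, so that larger is better.
        return -sum((yi - self.mean_) ** 2 for yi in y) / len(y)


def fit_and_time(estimator, X_train, y_train, X_test, y_test):
    """Fit and score once, keeping the timings instead of discarding them."""
    start = time.perf_counter()
    estimator.fit(X_train, y_train)
    fit_time = time.perf_counter() - start

    start = time.perf_counter()
    score = estimator.score(X_test, y_test)
    score_time = time.perf_counter() - start

    return {"score": score, "fit_time": fit_time, "score_time": score_time}


X = [[i] for i in range(10)]
y = [float(i) for i in range(10)]
X_train, y_train, X_test, y_test = X[:7], y[:7], X[7:], y[7:]

# One result entry per parameter setting, timings included.
results = []
for offset in (0.0, 1.0, 5.0):
    entry = fit_and_time(MeanRegressor(offset), X_train, y_train, X_test, y_test)
    entry["params"] = {"offset": offset}
    results.append(entry)

best = max(results, key=lambda entry: entry["score"])

# Time the final refit on all data as well.
start = time.perf_counter()
best_estimator = MeanRegressor(**best["params"]).fit(X, y)
refit_time = time.perf_counter() - start
```

Keeping the timings next to the score and the parameters in each entry
means a benchmark can report cost and quality from the same record.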

2. Add distribution objects to scikit-learn which have get_params and
set_params attributes

When printing the parameter distribution proposed for the model
selection module (scipy.stats), the result is something which cannot be
parsed:

<scipy.stats._distn_infrastructure.rv_frozen object at 0x7ff59d8fd6d8>

It's also not possible to access this with scikit-learn-style methods
such as get_params() and set_params() (actually, the first of the two
should suffice). I propose to add distribution objects for commonly used
distributions:

1. Categorical variables - replace previously used lists
2. RandInt - replace scipy.stats.randint
3. Uniform - might replace scipy.stats.uniform, I'm not sure if that
would accept a lower and an upper bound at construction time
4. LogUniform - does not exist so far, useful for searching over C and
gamma in SVMs, the learning rate in NNs etc.
5. LogUniformInt - same thing, but as an Integer, useful for the
min_samples_split in RF and ET
6. MultipleUniformInt - this is a bit weird as it would return a tuple
of Integers, but I could not find any other way to tune both the number
of hidden layers and their size in the MLPClassifier
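
To make the idea concrete, here is a minimal sketch of what two of the
proposed objects could look like. The class names follow the list above,
but the constructor arguments (`low`, `high`), the shape of `get_params`
and the `__repr__` format are assumptions, not an existing scikit-learn
API. The method name `rvs` mirrors the one exposed by frozen scipy.stats
distributions; the random source here is the standard library's
`random.Random`.

```python
import math
import random


class LogUniform:
    """Hypothetical distribution object: samples log-uniformly from [low, high]."""

    def __init__(self, low, high):
        if not 0 < low <= high:
            raise ValueError("need 0 < low <= high")
        self.low = low
        self.high = high

    def get_params(self):
        # Plain dict of constructor arguments, easy to log or serialize.
        return {"low": self.low, "high": self.high}

    def rvs(self, random_state=None):
        # Draw uniformly in log space, then map back to the original scale.
        rng = random_state if random_state is not None else random
        return math.exp(rng.uniform(math.log(self.low), math.log(self.high)))

    def __repr__(self):
        args = ", ".join(f"{k}={v!r}" for k, v in self.get_params().items())
        return f"{type(self).__name__}({args})"


class LogUniformInt(LogUniform):
    """Same idea, rounded to an integer and clipped to the bounds.

    Rounding makes this only approximately log-uniform over the integers.
    """

    def rvs(self, random_state=None):
        value = int(round(super().rvs(random_state)))
        return min(max(value, int(self.low)), int(self.high))


C_dist = LogUniform(1e-3, 1e3)
split_dist = LogUniformInt(2, 20)
rng = random.Random(0)

print(C_dist)               # LogUniform(low=0.001, high=1000.0)
print(C_dist.get_params())  # {'low': 0.001, 'high': 1000.0}
samples = [split_dist.rvs(rng) for _ in range(5)]
```

Unlike the frozen scipy.stats object shown earlier, printing such an
object shows its bounds, and get_params() returns them as a plain dict.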

3. Add get_params and set_params to CV objects

Currently, the CV objects like StratifiedKFold look nice when printed,
but it is not possible to access their parameters programmatically in
order to serialize them (without pickle). Since they are part of the
BaseSearchCV and returned by a call to BaseSearchCV.get_params(), I
propose to add parameter setter and getter to the CV objects as well to
maintain a consistent interface.
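
A minimal sketch of how this could work, assuming the parameters are
exactly the constructor arguments and are stored under the same names.
`ParamsMixin` and `KFoldLike` are hypothetical stand-ins written for this
example; they are not scikit-learn classes.

```python
import inspect
import json


class ParamsMixin:
    """Hypothetical mixin: get_params/set_params driven by the __init__ signature."""

    def get_params(self):
        names = [
            name
            for name in inspect.signature(type(self).__init__).parameters
            if name != "self"
        ]
        return {name: getattr(self, name) for name in names}

    def set_params(self, **params):
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"invalid parameter {name!r}")
            setattr(self, name, value)
        return self


class KFoldLike(ParamsMixin):
    """Stand-in for a CV splitter; it only stores its constructor arguments."""

    def __init__(self, n_splits=3, shuffle=False, random_state=None):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state


cv = KFoldLike(n_splits=5, shuffle=True, random_state=0)

# Serialize without pickle, then rebuild an equivalent object.
serialized = json.dumps({"class": type(cv).__name__, "params": cv.get_params()})
restored = KFoldLike().set_params(**json.loads(serialized)["params"])
```

With such a getter, the settings of the splitter can be written to JSON
alongside the other parameters of the search object.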

I think these changes are not too hard to implement and I am willing to
do so if you approve these suggestions.

Best regards,
Matthias

Joel Nothman

2016-05-07 12:41:53 UTC

Post by Matthias Feurer
1. Return the fit and predict time in `grid_scores_`

This has been proposed for many years as part of an overhaul of
grid_scores_. The latest attempt is currently underway at
https://github.com/scikit-learn/scikit-learn/pull/6697, and has a good
chance of being merged.

Post by Matthias Feurer
2. Add distribution objects to scikit-learn which have get_params and
set_params attributes

Your use of get_params to perform serialisation is certainly not what
get_params is designed for, though I understand your use of it that way...
as long as all your parameters are either primitives or objects supporting
get_params. However, this is not by design. Further, param_distributions is
a dict whose values are scipy.stats rvs; get_params currently does not
traverse dicts, so this is already unfamiliar territory requiring a lot of
design, even once we were convinced that this is a valuable use case,
which I am not certain of.

Post by Matthias Feurer
3. Add get_params and set_params to CV objects

Matthias Feurer

2016-05-07 12:56:31 UTC

Dear Joel,

Thank you for taking the time to answer my email. I hadn't seen the PR on
this topic, thanks for pointing me to it. I can see your points with
regard to the get_params() method, and it might be better if I write
more serialization code on my side (although, for example,
RandomizedSearchCV also returns a lot of parameters one would not
consider searching over).
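
For illustration, user-side serialization code of this kind could look
like the sketch below. The helper `to_plain` and the stand-in classes
`Uniform` and `SearchLike` are hypothetical; the helper walks dicts and
lists itself, which get_params() does not do, and falls back to repr()
for objects without get_params().

```python
def to_plain(obj):
    """Recursively convert obj into JSON-friendly primitives (hypothetical helper)."""
    if obj is None or isinstance(obj, (bool, int, float, str)):
        return obj
    if isinstance(obj, dict):
        return {str(key): to_plain(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain(item) for item in obj]
    if hasattr(obj, "get_params"):
        return {"class": type(obj).__name__, "params": to_plain(obj.get_params())}
    # Anything else (for example a frozen scipy.stats distribution) has no
    # agreed-upon representation here, so fall back to repr().
    return {"repr": repr(obj)}


class Uniform:
    """Stand-in distribution exposing get_params (not a scipy.stats object)."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def get_params(self):
        return {"low": self.low, "high": self.high}


class SearchLike:
    """Stand-in for a search object whose parameters include a dict."""

    def __init__(self, param_distributions, n_iter=10):
        self.param_distributions = param_distributions
        self.n_iter = n_iter

    def get_params(self):
        return {
            "param_distributions": self.param_distributions,
            "n_iter": self.n_iter,
        }


search = SearchLike({"C": Uniform(0.1, 10.0), "kernel": ["rbf", "linear"]}, n_iter=5)
plain = to_plain(search)
```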

Nevertheless, I still think it would be a good idea to have distribution
objects in scikit-learn since some common use cases cannot be easily
handled with scipy.stats (see my last email for examples).

Best regards,
Matthias

Post by Joel Nothman
3. Add get_params and set_params to CV objects
get_params and set_params are intended to allow programmatic search
over those parameter settings. This is not often what one does with
the parameters of CV splitting methods, but I acknowledge that
supporting this would not be difficult. Still, if serialisation is the
purpose of this, it's not really the point.
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Andreas Mueller

2016-05-08 21:49:18 UTC

Hi Matthias.
Can you explain this point again?
Is it about the bad __repr__ ?

Thanks,
Andy

Bharat Didwania 4-Yr B.Tech. Electrical Engg.

2016-05-09 03:32:18 UTC

Respected sir,
Which project should I work on to increase my chances for GSoC 2017?
Please, I need some guidance.

Matthias Feurer

2016-05-09 07:47:59 UTC

Hi Andy,

Having distribution objects would be useful for several reasons:

1. Having a uniform way to programmatically access the parameters of all
kinds of distribution objects. Currently, I could parse the 'args' item
in 'distribution.__dict__'. I don't know how important this is for
others, though.
2. Having a helpful __repr__. Currently, printing a distribution does
not even [...]

import scipy.stats
uniform = scipy.stats.uniform(3, 5)
print(uniform)

<scipy.stats._distn_infrastructure.rv_frozen object at 0x7f1a61657898>

3. Some useful distributions aren't easily possible with scipy.stats.
Can you please give me examples for:
* tuning the number of layers and the number of hidden neurons of
the MLPClassifier?
* tuning C and gamma of SVC on a log scale between 2^12 and 2^12?
I couldn't find appropriate objects in scipy.stats and ended up defining
my own.
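
A minimal sketch of what such self-defined objects might look like,
using only the standard library. The class `MultipleUniformInt`, its
constructor arguments and the helper `sample_log2` are hypothetical, and
the exponent bounds -12 and 12 are an assumption made for illustration.
Note that `random.Random.randint` includes both endpoints.

```python
import random


class MultipleUniformInt:
    """Hypothetical distribution: a tuple of layer sizes with random length."""

    def __init__(self, min_layers, max_layers, min_units, max_units):
        self.min_layers = min_layers
        self.max_layers = max_layers
        self.min_units = min_units
        self.max_units = max_units

    def get_params(self):
        return {
            "min_layers": self.min_layers,
            "max_layers": self.max_layers,
            "min_units": self.min_units,
            "max_units": self.max_units,
        }

    def rvs(self, random_state=None):
        rng = random_state if random_state is not None else random
        # First draw the number of layers, then one size per layer.
        n_layers = rng.randint(self.min_layers, self.max_layers)
        return tuple(
            rng.randint(self.min_units, self.max_units) for _ in range(n_layers)
        )


def sample_log2(rng, low_exp, high_exp):
    """Draw 2**e with e uniform in [low_exp, high_exp] (log-scale sampling)."""
    return 2.0 ** rng.uniform(low_exp, high_exp)


rng = random.Random(42)
layers = MultipleUniformInt(min_layers=1, max_layers=3, min_units=16, max_units=256)

# Candidate settings for an MLP: tuples such as (120,) or (64, 200, 31).
mlp_candidates = [{"hidden_layer_sizes": layers.rvs(rng)} for _ in range(5)]

# Candidate settings for an SVC, sampled on a log2 scale (assumed bounds).
svc_candidates = [
    {"C": sample_log2(rng, -12, 12), "gamma": sample_log2(rng, -12, 12)}
    for _ in range(5)
]
```

Each candidate dict could then be passed to an estimator's set_params()
in a hand-written search loop.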

Best,
Matthias

Gael Varoquaux

2016-05-09 08:04:40 UTC

Shouldn't these be improvements to scipy, rather than live in
scikit-learn?

Gaël

--
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux