Matthias Feurer
2016-05-07 09:12:36 UTC
Dear scikit-learn team,
First of all, the model selection module is really easy to use and has a
nice and clean interface, I really like that. Nevertheless, while using
it for benchmarks I found some shortcomings where I think the module
could be improved.
1. Return the fit and predict time in `grid_scores_`
BaseSearchCV relies on a function called _fit_and_score to produce the
entries in grid_scores_. This function measures the time it takes to fit
a model, predict for the (cross-)validation set and calculate the score.
It returns this time, which is then discarded:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/model_selection/_search.py#L569
I propose to store this time in grid_scores_ and make it accessible to
the user. Also, the time taken to refit the model in line 596 and
following should be measured and made accessible to the user.
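To make the idea concrete, here is a minimal sketch of the timing that
_fit_and_score already performs internally; the proposal is simply to
surface these numbers instead of discarding them. The record layout at
the end is illustrative, not an existing API.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

est = DecisionTreeClassifier(random_state=0)

# Time the fit, as _fit_and_score does.
start = time.time()
est.fit(X_train, y_train)
fit_time = time.time() - start

# Time prediction plus scoring on the validation split.
start = time.time()
score = est.score(X_test, y_test)
score_time = time.time() - start

# A grid_scores_-style entry could then carry the timings alongside the score:
record = {"score": score, "fit_time": fit_time, "score_time": score_time}
```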
2. Add distribution objects to scikit-learn which have get_params and
set_params methods
When printing the parameter distribution proposed for the model
selection module (scipy.stats), the result is something which cannot be
parsed:
<scipy.stats._distn_infrastructure.rv_frozen object at 0x7ff59d8fd6d8>
It is also not possible to access its parameters with the
scikit-learn-style methods get_params() and set_params() (actually, the
first of the two should suffice). I propose to add distribution objects
for commonly used
distributions:
1. Categorical variables - replace previously used lists
2. RandInt - replace scipy.stats.randint
3. Uniform - might replace scipy.stats.uniform; I'm not sure whether that
would accept a lower and an upper bound at construction time
4. LogUniform - does not exist so far; useful for searching over C and
gamma in SVMs, the learning rate in NNs, etc.
5. LogUniformInt - the same thing, but returning an integer; useful for
min_samples_split in RF and ET
6. MultipleUniformInt - this is a bit odd as it would return a tuple of
integers, but I could not find any other way to tune both the number of
hidden layers and their size in the MLPClassifier
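As a sketch of what such a distribution object could look like, here is a
hypothetical LogUniform with the proposed get_params/set_params interface;
the class name and parameter names are illustrative, and the rvs() method
mirrors the way RandomizedSearchCV draws values from scipy.stats frozen
distributions.

```python
import numpy as np


class LogUniform(object):
    """Sample uniformly on a log scale between lower and upper (sketch)."""

    def __init__(self, lower, upper):
        self.lower = lower
        self.upper = upper

    def rvs(self, random_state=None):
        # RandomizedSearchCV draws candidate values by calling rvs().
        rng = np.random.RandomState(random_state)
        return np.exp(rng.uniform(np.log(self.lower), np.log(self.upper)))

    def get_params(self, deep=True):
        # Unlike a scipy.stats frozen distribution, the parameters
        # are introspectable in the scikit-learn style.
        return {"lower": self.lower, "upper": self.upper}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def __repr__(self):
        # Prints as something parseable, unlike rv_frozen.
        return "LogUniform(lower=%r, upper=%r)" % (self.lower, self.upper)
```

Such an object could be passed directly in a param_distributions dict,
e.g. {"C": LogUniform(1e-3, 1e3)}.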
3. Add get_params and set_params to CV objects
Currently, the CV objects like StratifiedKFold look nice when printed,
but it is not possible to access their parameters programmatically in
order to serialize them (without pickle). Since they are part of the
BaseSearchCV objects and returned by a call to BaseSearchCV.get_params(),
I propose to add parameter setters and getters to the CV objects as well
to maintain a consistent interface.
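The following sketch illustrates the desired behaviour: get_params() does
not exist on the CV objects, so the dict below is built by inspecting the
constructor signature, which is exactly what the proposal would let one
write as a plain cv.get_params() call.

```python
import inspect

from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Work around the missing get_params() by reading the constructor
# arguments back from the instance attributes.
params = {name: getattr(cv, name)
          for name in inspect.signature(StratifiedKFold.__init__).parameters
          if name != "self"}
# -> {'n_splits': 5, 'shuffle': True, 'random_state': 0}
```

With get_params() on the CV objects themselves, this dict could be
serialized (e.g. to JSON) and the splitter reconstructed later without
pickle.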
I think these changes are not too hard to implement and I am willing to
do so if you approve these suggestions.
Best regards,
Matthias