Discussion:
Multiple metrics in grid search, etc. (again!)
Joel Nothman
2014-01-09 02:09:38 UTC
Hi all,

I've had enough frustration at having to patch in things from a code fork
in order to merely get back precision and recall while optimising F1 in
grid search. This is something I need to do really frequently, as I'm sure
do others.

When I wrote and submitted PRs about this problem nine months ago, I
proposed relatively sophisticated solutions. Perhaps a simple, flexible
solution is appropriate:

GridSearchCV, RandomizedSearchCV, cross_val_score, and perhaps anything
else supporting 'scoring', should take an additional parameter, e.g.
'diagnostics', which is a callable with interface:
(estimator, X_train, y_train, X_test, y_test) -> object

The results of CV will include a params x folds array (or list of arrays)
to store each of these returned objects, whose dtype is automatically
detected, so that it may be compactly stored and easily accessed if desired.

So when scoring=f1, a diagnostic function can be passed to calculate
precision, recall, etc. This means a bit of duplicated scoring work, but it
avoids complicating the existing scoring interface.
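
For concreteness, a rough sketch of what such a callable might look like
(the metric functions are existing scikit-learn API; the 'diagnostics'
parameter itself is the proposal and does not exist yet):

from sklearn.metrics import precision_score, recall_score, f1_score

def prf_diagnostics(estimator, X_train, y_train, X_test, y_test):
    # Called once per (parameter setting, fold); the returned object would
    # be stored alongside the objective score.
    y_pred = estimator.predict(X_test)
    return {'precision': precision_score(y_test, y_pred),
            'recall': recall_score(y_test, y_pred),
            'f1': f1_score(y_test, y_pred)}

# Proposed usage (hypothetical parameter):
# grid = GridSearchCV(clf, param_grid, scoring='f1', diagnostics=prf_diagnostics)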

Scikit-learn may indeed provide ready-made diagnostic functions for certain
types of tasks. For example:

- a binary classification diagnostic might return P, R, F, AUC, AvgPrec;
- multiclass might add per-class performances, different averages and a
confusion matrix;
- a linear model diagnostic might measure model sparsity. (Perhaps the
parameter can take a sequence of callables to return a tuple of diagnostic
results per fold.)


As opposed to some of my more intricate proposals, this approach leaves it
to the user to do any averaging over folds etc.

*SearchCV should also store best_index_ (arguably more important than
best_params_) so that this data can be cross-referenced. If the diagnostic
output is a list of arrays rather than a single array, the user can manually
delete information from the non-best trials before saving the model to disk.
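
For illustration, the cross-referencing might look something like this,
where grid is a fitted search as sketched above and both best_index_ and
diagnostics_ are part of the proposal rather than existing attributes:

# Pick out the diagnostics of the winning parameter setting:
best_diagnostics = grid.diagnostics_[grid.best_index_]

# Optionally drop the non-best entries before pickling the fitted search:
grid.diagnostics_ = [d if i == grid.best_index_ else None
                     for i, d in enumerate(grid.diagnostics_)]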

This also implies some refactoring of cross_val_score and fit_grid_point
that is overdue.

Does this seem the right level of complexity/flexibility? Please help me
and the many others who have requested it resolve this issue sooner rather
than later. I'd like to submit a PR towards this that actually gets
accepted, so some feedback is really welcome.

Cheers,

- Joel
Eustache DIEMERT
2014-01-09 07:48:44 UTC
+1 for the "diagnostics" attribute

I've struggled with this in the past and the workaround I found was to
subclass my estimator to hook up the computation of additional metrics and
store the results into a new attribute like diagnostics.
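
Roughly, the workaround looks like this (names here are illustrative; the
point is that score() returns the objective the search optimises while
stashing the extra metrics on the estimator):

from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score, precision_score, recall_score

class DiagnosticSVC(LinearSVC):
    def score(self, X, y):
        y_pred = self.predict(X)
        # Record extra metrics as a side effect of scoring.
        self.diagnostics_ = {'precision': precision_score(y, y_pred),
                             'recall': recall_score(y, y_pred)}
        # This return value is what the search actually optimises when no
        # explicit scoring is given.
        return f1_score(y, y_pred)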

Also, having a default set of diagnostics for different tasks is a must for
a practitioner-friendly library.

my 2c :)

Eustache
Mathieu Blondel
2014-01-14 03:59:23 UTC
I'd definitely like to have support for multiple metrics. My use case is
that I have several methods that I want to evaluate against different
metrics, and I want the hyper-parameters to be tuned against each metric. In
addition, I don't have a test set, so I need to use cross-validation both
for evaluation and for hyper-parameter tuning.

A first change would be for cross_val_score to accept a list of scorers and
to return an n_folds x n_scorers array. This would only support a fixed set
of hyper-parameters, but the change seems rather straightforward and
non-controversial. It would hopefully also serve as a basis for multiple
metrics grid search (can't fit_grid_point be replaced with
cross_val_score?).
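
To make the proposal concrete (the list form of 'scoring' is the suggestion
here, not something cross_val_score accepts today; imports follow the
current module layout):

import numpy as np
from sklearn.cross_validation import cross_val_score
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(random_state=0)

# Proposed: scores = cross_val_score(LinearSVC(), X, y,
#                                    scoring=['roc_auc', 'f1'], cv=5)
# with scores.shape == (n_folds, n_scorers) == (5, 2).

# What one has to do today: one full cross-validation pass per scorer,
# refitting the models each time.
scores = np.column_stack([cross_val_score(LinearSVC(), X, y, scoring=name, cv=5)
                          for name in ['roc_auc', 'f1']])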

When using multiple metrics, a major limitation of the current scorer API
is that it will recompute the predictions for each scorer. Unfortunately,
for kernel methods or random forests, computing the predictions is really
expensive.

I will study your solution more carefully when I have more time. Could you
also give a pointer to your previously proposed solutions, for comparison?

Mathieu
Joel Nothman
2014-01-14 07:16:55 UTC
Firstly, yes, fit_grid_point is being replaced by cross_val_score. The PR is
waiting for your review: https://github.com/scikit-learn/scikit-learn/pull/2736

Secondly, my prior proposals include:

- On the mailing list in March 2013 I suggested a minimal-code-change,
though not very user-friendly, approach: allow a scorer to return an
arbitrary object, which would be stored as long as it has __float__ to
convert it to a single objective for averaging and argmax.
- Similarly, a scorer could return a tuple or array whose first element is
the objective;
- Or a scorer could return a dict, where the entry under a particular key is
the objective.
- #1768 <https://github.com/scikit-learn/scikit-learn/pull/1768> takes this
approach. Scorers may return a list of (name, score) tuples, and the entry
named 'score' is the objective. Before storing, it prefixes the names with
'test_', and does the same with 'train_' given
`compute_training_score=True`.
[In that PR and in #1787, the data is then stored in a structured array,
which acts somewhat like a dict of arrays or an array of dicts and can be
sliced and reshaped, which is useful for parameter grids. Structured arrays
have their issues (#1787 <https://github.com/scikit-learn/scikit-learn/pull/1787>),
so #2079 <https://github.com/scikit-learn/scikit-learn/pull/2079> goes for
returning and storing a dict.]

In all of the above, it is possible to get multiple scores without
duplicating prediction work (and it would make sense in general to provide
a PRFScorer instead of individually calculating F1, P and R). In my present
proposal, duplicate prediction must be done, but the API is arguably
simpler.
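
For instance, such a scorer under the #1768 convention might look like this
(the list of (name, score) pairs is the proposed return convention, not the
current scorer protocol):

from sklearn.metrics import precision_recall_fscore_support

def prf_scorer(estimator, X, y):
    # A single predict() yields all three metrics; the entry named 'score'
    # would be treated as the search objective and the rest simply stored.
    y_pred = estimator.predict(X)
    p, r, f, _ = precision_recall_fscore_support(y, y_pred, average='weighted')
    return [('score', f), ('precision', p), ('recall', r)]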

Thirdly, to summarise my latest proposal:

- Provide a way for the user to retrieve arbitrary data calculated from
the estimator at one "grid point".
- Don't make it exclusively/explicitly about scoring, so use a separate
parameter and more expansive callback args.
- This duplicates work done in scoring, does not presuppose any
particular use-case, and leaves the search with a single objective.
- As with scoring, useful measures can be provided by name.

Finally, on your proposal:

- I like some ideas of your solution, in which you can have multiple
objectives and hence multiple best models, i.e. est.best_index_ could be an
array, with est.best_params_ corresponding. Yet I think there are many cases
where you don't actually want to find the best parameters for each metric
(e.g. P and R are only there to explain the F1 objective; multiclass
per-class vs. average). Where there are multiple metrics, you also cannot
sensibly refit a best_estimator_ to which the search delegates its predict.
- Passing a list of scorers doesn't take advantage of a single function
already returning multiple metrics efficiently (e.g. P, R, F; per-class F1),
quite apart from the extra prediction work that you already point out. If
each scorer were passed individually, you'd need a custom scorer for each
class in the per-class F1 case; alternatively, the outputs from each scorer
could be flattened and hstacked.
- Using a list of scorer names means this *can* be optimised to do
prediction as few times as possible, by grouping together those that
require thresholds and those that don't. This of course requires a rewrite
of scorer.py and is quite a complex solution.
- Having multiple objectives won't work with a more clever CV search
that is guided by the objective in selecting the next parameters to try.

- Joel
Mathieu Blondel
2014-01-14 10:39:30 UTC
Post by Joel Nothman
- I like some ideas of your solution, in which you can have multiple
objectives and hence best models, i.e. est.best_index_ could be an array,
and the corresponding est.best_params_. Yet I think there are many cases
where you don't actually want to find the best parameters for each metric
(e.g. P and R are only there to explain the F1 objective; multiclass
per-class vs average).
So it seems that we have different use cases: I want to find the best-tuned
estimator for each metric, while you want to reuse computations from
GridSearchCV to make a multiple-metric evaluation report. But then I am not
completely sure why you need to frame this within GridSearchCV.

My previous proposition was mainly for cross_val_score for the time being.
I actually think that supporting multiple scorers in GridSearchCV would be
problematic because GridSearchCV needs to behave like a predictor. So, we
would need a stateful API like:

gs = GridSearchCV(LinearSVC(), param_dict, scoring=["auc", "f1"])
gs.fit(X, y)
gs.set_best_estimator(scoring="auc")
gs.predict(X)
gs.set_best_estimator(scoring="f1")
gs.predict(X) # predictions may be different

For this reason, I think that a function that outputs the best estimators
for each scorer would be better:

best_estimators = multiple_grid_search(LinearSVC(), param_dict,
                                       scoring=["auc", "f1"])
Post by Joel Nothman
- Passing a list of scorers doesn't take advantage of already having
multiple metrics returned efficiently by a function (e.g. P,R,F; per-class
F1), besides the need to do an extra prediction which you already point
out. If each scorer were passed individually, you'd need a custom scorer
for each class in the per-class F1 case; or the outputs from each scorer
can be flattened and hstacked.
I think evaluating the metric is orders of magnitude faster than computing
the predictions.
Post by Joel Nothman
- Using a list of scorer names means this *can* be optimised to do
prediction as few times as possible, by grouping together those that
require thresholds and those that don't. This of course requires a rewrite
of scorer.py and is quite a complex solution.
But I think the fact that predictions must be recomputed every time is a
serious limitation of the current scorer API, and it should be addressed.

My solution would be for scorers to take not a triplet (estimator, X,
y_true) but a pair (y_true, y_score), where y_score is a *continuous*
output (output of decision_function). For metrics which need categorical
predictions, y_score can be converted in the scorer. The conversion would
rely on the fact that predict in classifiers is defined as the argmax of
decision_function.
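
Sketching the conversion (simplified: binary labels are assumed to be 0/1,
and mapping predictions back through the estimator's classes_ is glossed
over):

import numpy as np

def to_categorical(y_score):
    # y_score is the decision_function output: shape (n_samples,) for
    # binary problems, (n_samples, n_classes) otherwise.
    if y_score.ndim == 1:
        return (y_score > 0).astype(int)
    return np.argmax(y_score, axis=1)

def evaluate(metric, y_true, y_score, needs_categories):
    # Threshold-based metrics (AUC, average precision, ...) consume the
    # continuous scores directly; the others get argmax/sign predictions.
    if needs_categories:
        return metric(y_true, to_categorical(y_score))
    return metric(y_true, y_score)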

This solution assumes that all classifiers have a decision_function. I think
that this is feasible, even for non-parametric estimators like kNN. It also
assumes that decision_function is defined as an alias of predict in
RegressorMixin. The log loss is the only metric that specifically needs
probabilities, but it can be re-implemented so as to take decision_function
outputs instead.

In any case, I can see the benefit of having a callback system in
GridSearchCV to let the user reuse some computations.

Mathieu
Joel Nothman
2014-01-14 11:13:58 UTC
Post by Mathieu Blondel
My previous proposition was mainly for cross_val_score for the time being.

I consider them almost one and the same in terms of the information users
want out of them. The fact that cross_val_score is a function, not a class,
makes it more difficult to change the return format, but changing the
dimensions of the score array seems reasonable.
Post by Mathieu Blondel
My solution would be for scorers to take not a triplet (estimator, X,
y_true) but a pair (y_true, y_score), where y_score is a *continuous*
output (output of decision_function). For metrics which need categorical
predictions, y_score can be converted in the scorer.

I like this idea, broadly. I don't especially like the thought of
deprecating the scorer interface and parameter name already, but I think
this entails doing so. And the fact that it no longer has the same
interface as estimator.score suggests it should have a different name.
Post by Mathieu Blondel
The conversion would rely on the fact that predict in classifiers is
defined as the argmax of decision_function.

Or similar for multilabel, multi-output and binary...
Post by Mathieu Blondel
This solution assumes that all classifiers have a decision_function. I
think that this is feasible, even for non-parametric estimators like kNN.
It also assumes that decision_function is defined as an alias to predict in
RegressorMixin.

It also assumes that you're not going to cross-validate some other kind of
predictor, such as a clusterer (most don't support predict, and we already
don't handle fit_predict here).
Mathieu Blondel
2014-01-14 12:58:53 UTC
Post by Joel Nothman
I like this idea, broadly. I don't especially like the thought of
deprecating the scorer interface and parameter name already, but I think
this entails doing so. And the fact that it no longer has the same
interface as estimator.score suggests it should have a different name.
I was thinking we could keep __call__(self, estimator, X, y), at least for
some time, and add a new method evaluate(self, y_true, y_score).
need_threshold=False would be deprecated in favor of need_categories=True.
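
Roughly, a scorer would then look like this (evaluate and need_categories
are the proposed additions; the conversion is the same argmax/sign trick,
with class handling simplified):

class Scorer(object):
    def __init__(self, metric, need_categories=True):
        self.metric = metric
        self.need_categories = need_categories

    def evaluate(self, y_true, y_score):
        # New entry point: work directly from continuous outputs.
        if self.need_categories:
            y_pred = (y_score.argmax(axis=1) if y_score.ndim > 1
                      else (y_score > 0).astype(int))
            return self.metric(y_true, y_pred)
        return self.metric(y_true, y_score)

    def __call__(self, estimator, X, y):
        # Old entry point, kept through a deprecation period; it recomputes
        # the continuous output and defers to evaluate().
        return self.evaluate(y, estimator.decision_function(X))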

The most important thing is not to break GridSearchCV and cross_val_score. A
few deprecation warnings in the scorer API (in case people have created
custom scorers in their user-land code) are reasonable, I think.

Mathieu
Mathieu Blondel
2014-01-14 17:43:26 UTC
For the record, I've made some preliminary changes towards supporting
multiple metrics here:
https://github.com/mblondel/scikit-learn/commit/13bc90e35cb37cc4e054413057d8d7f0b29ef8a5

See my comments at the end of the page.

Mathieu
Joel Nothman
2014-01-14 21:40:39 UTC
Post by Mathieu Blondel
I was thinking we could keep __call__(self, estimator, X, y), at least for
some time, and add a new method evaluate(self, y_true, y_score).
Yes, I guess I took a similar approach (adding a named method to scorers)
in my first implementation of multiple metric scorers, when I didn't
realise the scorer API was still fluid.
