Discussion:
score_func discussion
Gael Varoquaux
2012-10-22 21:43:14 UTC
Hi list,

Here is a long email. Hold on: we are having a hard time thinking a
design problem through. Andy and I have been discussing how to design the
score function API with regard to the cross-validation framework of the
scikit.

The discussion started from the fact that using AUC in GridSearchCV,
which is a common use case, is way too hard with our current code. However,
it reveals more general limitations of our design. Currently, the
score method of an estimator defines the way it is evaluated when
doing cross-validation. It can be overridden using the 'score_func'
argument of GridSearchCV and cross_val_score, which can be a callable
taking (y_true, y_pred) as input parameters. This immediately breaks when
trying to do e.g. AUC, as it needs non-thresholded decision values.
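
Roughly, the current contract amounts to something like this (an
illustrative sketch, not the actual scikit code; the helper name is
hypothetical):

```
def evaluate_fold(estimator, X_test, y_test, score_func):
    # the score_func contract only ever sees hard predictions from predict()
    y_pred = estimator.predict(X_test)
    return score_func(y_test, y_pred)

# A metric like AUC needs estimator.decision_function(X_test) or
# estimator.predict_proba(X_test) instead of y_pred, so it cannot be
# expressed as a score_func(y_true, y_pred) callable.
```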

Andy has proposed several pull requests that add logic to GridSearchCV to
deal with these use cases:
https://github.com/scikit-learn/scikit-learn/pull/1198
https://github.com/scikit-learn/scikit-learn/pull/1014
I haven't been convinced so far, and we are trying to see what the right
direction is.

I think that Andy's approach can be fairly summarized as looking for a
way to extend the score_func in GridSearchCV so that it can deal with
richer scores. Amongst the options proposed are:

a. having the score method of the estimator accept a score_func argument,
which would then be used inside the score method:
https://github.com/scikit-learn/scikit-learn/pull/1198/files#L0R265

b. having score_func's signature be 'estimator, X, y' (discussed
offline).
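
For instance, under option b an AUC scorer could be written as follows (a
minimal sketch, assuming a binary classifier and an auc_score(y_true,
y_score) helper as in the metrics module):

```
from sklearn.metrics import auc_score  # later renamed roc_auc_score

def auc_score_func(estimator, X, y):
    # the scorer itself decides how to obtain unthresholded decisions
    if hasattr(estimator, 'decision_function'):
        y_score = estimator.decision_function(X)
    else:
        y_score = estimator.predict_proba(X)[:, 1]
    return auc_score(y, y_score)
```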

The drawback that I see to option a is that the requirements of score
funcs are not homogeneous, and that not all of them can apply to every
estimator. We are already seeing in the PRs that we need to define a
'requires_threshold' decoration. For some of my personal use cases, I can
already see other signatures of score functions being required. I really
don't like this, because it embeds custom code in parts of the scikit that
are general purpose. This pattern, in my experience, tends to create tight
coupling and eventually leads to code that is harder to extend.

I must say that defining a meta-language to describe the capabilities of
score functions really raises warning signs as far as I am concerned. I
find that contracts based on imperative code are much easier to maintain
and extend than contracts based on declarative interfaces.


Option b seems fairly reasonable from the design point of view. I think
that it is very versatile. The main drawback that I see is that it does
not make it easy for the user to reuse the various score functions existing
in the metrics module, as their signature is 'y_true, y_pred'.

However, option b really has the look and smell of a method to me.
Combined with the fact that some score functions need estimator-specific
information (i.e. how to retrieve unthresholded decisions), it led me
to think that my favorite option would be to put as much as possible in
the estimator. The option that I am championing would be to add an
argument to estimators that makes it possible to switch the score
function. This argument could either be a string, say 'auc', or a
score_func with a given signature (estimator-specific, but we would try
to have as few of these as possible).
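
As a hypothetical sketch (the estimator and parameter names below are
invented for illustration), this could look like:

```
import numpy as np

class SomeClassifier(object):
    """Toy estimator whose scoring behaviour is chosen at construction time."""

    def __init__(self, scoring='accuracy'):
        # a string naming a built-in score, or a callable(estimator, X, y)
        self.scoring = scoring

    def fit(self, X, y):
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # trivial "model": remember the majority class
        counts = np.bincount(np.searchsorted(self.classes_, y))
        self.majority_ = self.classes_[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(np.asarray(X).shape[0], self.majority_)

    def score(self, X, y):
        if callable(self.scoring):
            return self.scoring(self, X, y)
        if self.scoring == 'accuracy':
            return np.mean(self.predict(X) == np.asarray(y))
        raise ValueError("unknown scoring %r" % (self.scoring,))
```

A CV object could then simply call clf.score(X_val, y_val) and get
whatever scoring the user configured.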

This option has the drawback of adding an argument to most, or all, of the
estimators. It also bundles together an estimator and a scoring logic,
which can be seen as a pro or a con, depending on the point of view. One
clear pro that I see is that a CV object, such as LassoCV, would be
able to use the same logic in its inner cross-validation.


An aspect that came up during the discussion with Andy is that scoring
functions are a problem that is reasonably well defined in the case of
classification and regression. However, for unsupervised models, the
choice of scores is much wider and, more importantly, these scores most
often need access to some model parameters. Thus, the only way that I
see to do scoring in these settings is as a method. For supervised
settings, the problem is usually a bit simpler, as it boils down to
comparing predictions to a ground truth. Andy has been telling me that
model selection with unsupervised estimation is a corner case that is not
officially supported by the scikit. I feel quite bad about this because
it is a significant fraction of my work, and because we originally designed
the scikit's API to cater for supervised and unsupervised models
together [1]. I think that original choice really paid off, and I
personally really enjoy combining different estimators as black boxes.
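
As an illustration (the estimator below is hypothetical), an unsupervised
score typically needs the fitted parameters, which is why it naturally
lives as a method:

```
import numpy as np

class GaussianDensity(object):
    """Toy unsupervised model: a diagonal Gaussian density estimator."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        self.var_ = X.var(axis=0) + 1e-12
        return self

    def score(self, X):
        # mean per-sample log-likelihood under the fitted model: this needs
        # self.mean_ and self.var_, so it cannot be a (y_true, y_pred) function
        X = np.asarray(X, dtype=float)
        log_density = -0.5 * (np.log(2 * np.pi * self.var_)
                              + (X - self.mean_) ** 2 / self.var_)
        return log_density.sum(axis=1).mean()
```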

For me, the estimators are really where the intelligence of the scikit
lies. I frown at the idea of putting intelligence elsewhere because it
makes things like implementing a grid-computing grid search, or providing
your own estimator, harder. In addition, in my eyes, it is perfectly
valid to cook up an estimator that solves an atypical problem, because at
the end of the day, real problems don't fall into boxes like classification
or regression.

Sitting back, I am not really convinced by any of the options that I
listed above, I must confess. Maybe a good option, which I don't know how
to implement, would be:

* Support score_func only for supervised problems (this removes the mess
that unsupervised models draw into part of the API)
* Figure out a way to have a uniform signature for score_func that would
work for all the main use cases (this removes the specification language in
the score functions that freaks me out :}). The difficulties are:
  - unthresholded decisions
  - multiclass (one possible uniform signature is sketched below)
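
A hypothetical sketch of one such uniform signature, covering both
difficulties at once (the function name is invented):

```
import numpy as np

def mean_log_proba_score(estimator, X, y):
    # unthresholded and multiclass-friendly: use the full probability matrix
    proba = estimator.predict_proba(X)       # shape (n_samples, n_classes)
    classes = list(estimator.classes_)
    true_idx = [classes.index(label) for label in y]
    return np.mean(np.log(proba[np.arange(len(true_idx)), true_idx] + 1e-12))
```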

The goal of this email is to try to have a sane discussion of what the
best choices are in terms of simplicity of code, simplicity for the user,
and versatility of the scoring API. As I am writing this email, I get to
defend my point of view, but I hope that Andy will correct any false or
incomplete picture that I have given of his.

Thanks for reading!

Gaël

[1] see http://sourceforge.net/mailarchive/message.php?msg_id=25077802
for one of the original discussion
Andreas Mueller
2012-10-22 22:26:46 UTC
Thanks for the summary, Gael.
I think it takes everything into account.
I'd really like to move forward with this, so any opinions are welcome!

Your last suggestion also came to my mind after our discussion.
It would be possible to *only* support un-thresholded scores
as an interface. But that is somewhat cumbersome for the user,
and I think my first PR would be a better, albeit ugly, solution.

Cheers,
Andy
Mathieu Blondel
2012-10-23 07:21:06 UTC
On Tue, Oct 23, 2012 at 6:43 AM, Gael Varoquaux
Post by Gael Varoquaux
a. having the score method of the estimator accept a score_func argument,
which would then be used inside the score method:
https://github.com/scikit-learn/scikit-learn/pull/1198/files#L0R265
b. having score_func's signature be 'estimator, X, y' (discussed
offline).
I think there are two different problems here:

1) Some metrics require more complicated input than just y_true and y_pred.

2) Some people have expressed the need to be able to call
estimator.score(X, y) with metrics other than the default one.

Solving 2) doesn't solve the function signature problem of 1).

A question is how a solution to these problems would fit into our grid
search framework.
Post by Gael Varoquaux
The drawback that I see to option a is that the requirements of score
funcs are not homogeneous, and that all cannot apply to every estimator.
We are already seeing in the PRs that we need to define a
'requires_threshold' decoration. For some of my personnal usecases, I can
To be more general, I would rather call it `requires_prediction_score` (by
prediction score, I mean the output of predict_proba or decision_function).
Post by Gael Varoquaux
already see other signatures of score functions required. I really don't
like this, because it embeds custom code in parts of the scikit that are
general purpose. This pattern, in my experience, tend to create tight
coupling and to eventually lead to code that is harder to extend.
I must say that defining a meta-language defining capabilities of score
functions really raises warning signs as far as I am concerned. I find
that contracts based on imperative code are much easier to maintain and
extend than contracts based in declarative interfaces.
Option b seems fairly reasonnable from the design point of view. I think
that it is very versatile. The main drawback that I see, is that it does
not make the user's life easy to use various score functions existing in
the metrics module, as their signature is 'y_true, y_pred'.
Can you elaborate on option b and how it would solve the function signature
problem?
Post by Gael Varoquaux
However, option b really has the look and smell of a method to me.
Combine with the fact that some score functions need estimator-specific
information (i.e.: how to retrieve unthresholded decisions), it led me
to think that my favorite option would be to put as much as possible in
the estimator. The option that I am championing would be to add an
argument to estimators to be able to switch the score function. This
argument could either be a string, say 'auc', or a score_func with a
given signature (estimator specific, but we would try to have as little
of these as possible).
I would rather avoid adding too many parameters to the constructor that
don't affect `fit`.
If I fit a model, I would expect to be able to evaluate it with different
metrics without creating a new estimator object.
Being able to just do estimator.score(X, y, "auc") would be very convenient.
Post by Gael Varoquaux
The goal of this email is to try to have a sane discussion of what are
the best choices in terms of simplicity of code, simplicity for the user,
and versatility for the scoring API. As I am writing this email, I get to
defend my point of view, but I hope that Andy will correct any false or
incomplete vision that I gave of his point of view.
I personally like the decorator approach, as it allows us to declare what a
metric expects, although I understand your aversion to adding too much
framework code.
It would also allow us to declare whether a metric is a score (bigger is
better) or a loss (smaller is better) and thus get rid of loss_func in
GridSearchCV.
Andreas Mueller
2012-10-23 08:40:08 UTC
Post by Gael Varoquaux
However, option b really has the look and smell of a method to me.
Combine with the fact that some score functions need
estimator-specific
information (i.e.: how to retrieve unthresholded decisions), it led me
to think that my favorite option would be to put as much as possible in
the estimator. The option that I am championing would be to add an
argument to estimators to be able to switch the score function. This
argument could either be a string, say 'auc', or a score_func with a
given signature (estimator specific, but we would try to have as little
of these as possible).
I would rather avoid adding too many parameters to the constructor
that don't affect `fit`.
If I fit a model, I would expect to be able to evaluate it with
different metrics without creating a new estimator object.
Being able to just do estimator.score(X, y, "auc") would be very convenient.
This was the idea behind

https://github.com/scikit-learn/scikit-learn/pull/1198

Except that I used a function instead of a keyword - though having a keyword
(and maybe optionally a function?) would also be fine.

Gael had two arguments against that (iirc):
- stronger API requirements on the score function (i.e. that it has to
exist and that it accepts a score_func keyword)
- when doing a loop over grid searches with several estimators, not all
of which support the same score_func, the loop needs to iterate over pairs
of estimators and score functions.
Olivier Grisel
2012-10-23 08:40:56 UTC
I think we could keep the existing simple score / loss functions for
day-to-day manual validation of analysis output in an interactive session,
for instance, while introducing a richer object-oriented API when working
with model selection tools such as cross-validation and grid search.

For instance we could have:

```
import numpy as np
from sklearn.metrics import auc_score, fbeta_score  # auc_score was later renamed roc_auc_score


class ROCAreaUnderCurveScore(object):

    higher_is_better = True

    def from_estimator(self, clf, X, y_expected):
        if hasattr(clf, 'decision_function'):
            y_predicted_thresholds = clf.decision_function(X)
        elif hasattr(clf, 'predict_proba'):
            y_predicted_thresholds = clf.predict_proba(X)
        else:
            raise TypeError("%r does not support thresholded predictions"
                            % clf)
        # TODO: check binary classification shape or raise ValueError
        return self.from_decision_thresholds(y_expected,
                                             y_predicted_thresholds)

    def from_decision_thresholds(self, expected, predicted_thresholds):
        return auc_score(expected, predicted_thresholds)


class FScore(object):

    higher_is_better = True

    def __init__(self, beta):
        self.beta = beta

    def from_estimator(self, clf, X, y_expected):
        # TODO: check input to provide a meaningful ValueError or
        # TypeError to the caller
        return fbeta_score(y_expected, clf.predict(X), beta=self.beta)

    def from_multiclass_prediction(self, y_expected, y_predicted):
        return fbeta_score(y_expected, y_predicted, beta=self.beta)


class RMSELoss(object):

    higher_is_better = False

    def from_estimator(self, clf, X, y_expected):
        # sketch filling the original TODO: delegate to the
        # prediction-based method using the regressor's predict output
        return self.from_regression_prediction(y_expected, clf.predict(X))

    def from_regression_prediction(self, y_expected, y_predicted):
        # sketch filling the original TODO: root mean squared error
        y_expected = np.asarray(y_expected, dtype=float)
        y_predicted = np.asarray(y_predicted, dtype=float)
        return np.sqrt(np.mean((y_expected - y_predicted) ** 2))


# Then later, to address common use cases in a flat manner:

COMMON_SCORES = {
    'roc_auc': ROCAreaUnderCurveScore(),
    'f1': FScore(1.0),
    'pr_auc': PRAreaUnderCurveScore(beta=1.0),  # PRAreaUnderCurveScore not defined in this sketch
    'rmse': RMSELoss(),
}
```

Then in GridSearchCV we can have a flat and convenient API for common use
cases:

>>> GridSearchCV(clf, score='roc_auc').fit(X, y)

and for custom score objects:

>>> class MyCustomScore(object):
...     def __init__(self, some_param=1.0):
...         self.some_param = some_param
...     def from_decision_thresholds(self, expected, predicted_threshold):
...         # do something with self.some_param, expected and predicted_threshold
...         return score_value
...
>>> the_forty_two_custom_score = MyCustomScore(some_param=42.)
>>> GridSearchCV(clf, score=the_forty_two_custom_score).fit(X, y)
This way we still have a flat API for 99% of the common use cases, while
allowing richer semantics to be expressed when needed (for instance
plugging in a domain-specific evaluation metric for your own research, to
reproduce a domain-specific benchmark, or to compete in a kaggle
challenge).

We further use plain vanilla duck typing instead of framework-specific
helpers (decorators / DSL) to handle the complex case.
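
A hypothetical sketch of how model selection code could consume such score
objects through duck typing (reusing the COMMON_SCORES mapping above):

```
def evaluate(score, estimator, X_test, y_test):
    if isinstance(score, str):
        score = COMMON_SCORES[score]   # flat string API for the common cases
    value = score.from_estimator(estimator, X_test, y_test)
    # flip losses so that the grid search can always maximize
    return value if score.higher_is_better else -value
```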

We can also provide a bunch of scorer mixin types / ABCs to factor out
redundant code.

Finally, we might later want to extend such a scoring API to wrap
a validation set / OOB samples to do early stopping on a configurable
score for various scikit-learn models such as SGDClassifier, GBRT... I
have not really thought about that part yet, but having score objects
rather than simple funcs / callables should probably make that much
easier.

WDYT?
--
Olivier
Andreas Mueller
2012-10-23 08:53:18 UTC
WDYT?
Ok, so the tldr version is: option b), but use objects instead of functions.

Using objects instead of decorators seems more readable.
Currently, the only tag we need is "higher is better", right?

Why does the object need any other function apart from
"from_estimator", if it is only used in grid-search and
cross-validation?
Olivier Grisel
2012-10-23 09:18:49 UTC
Post by Andreas Mueller
WDYT?
Ok, so the tldr version is: option b), but use objects instead of functions.
Using objects instead of decorators seems more readable.
Currently, the only tag we need is "higher is better", right?
Yes, the other capabilities / requirements are encoded as the presence
or absence of the other methods.

Using classes instead of functions also makes it easier to reuse code for
checking inputs using mixin base classes.
Post by Andreas Mueller
Why does the object need any other function apart from
"from_estimator", if it is only used in grid-search and
cross-validation?
Those additional methods make it possible to declare the capabilities /
requirements of the score function explicitly and programmatically
using duck typing. This is useful as "documentation" IMHO: it will
be easy to find all the available regression scoring tools by having a
look at the class reference documentation, for instance.

If people don't like the additional methods, we could introduce
additional boolean or categorical flags.

The `from_estimator` method could be renamed `__call__(self, estimator,
X, y=None)`, as this is the main entry point for grid searching / cross-
validation, if people prefer.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Andreas Mueller
2012-10-23 09:33:10 UTC
Post by Olivier Grisel
Post by Andreas Mueller
WDYT?
Ok, so the tldr version is: option b), but use objects instead of functions.
Using objects instead of decorators seems more readable.
Currently, the only tag we need is "higher is better", right?
Yes the other capabilities / requirements are encoded as the presence
or absence of the other methods.
Using classes instead of function also makes it possible to do easier
code reuse for checking inputs using mixin base classes.
Post by Andreas Mueller
Why does the object need any other function apart from
"from_estimator", if it is only used in grid-search and
cross-validation?
Those additional methods make it possible to declare capabilities /
requirements of the score function explicitly and programmatically
using duck typing. This is useful as a "documentation" IMHO: it will
be easy to find all the available regression scoring tools by having a
look at the class reference documentation for instance.
If people don't like the additional methods, we could introduce
additional boolean or categorical flags.
The `from_estimator` method could be renamed `__call__(self, estimator,
X, y=None)`, as this is the main entry point for grid searching / cross-
validation, if people prefer.
Ok, +1 on this design.
Mathieu Blondel
2012-10-23 09:41:03 UTC
Having a single method (I like the name `from_estimator`) does seem cleaner
and easier to follow.
Post by Olivier Grisel
If people don't like the additional methods, we could introduce
additional boolean or categorical flags.
Another solution is to use mixins (ClassificationMetric, RegressionMetric,
MulticlassMetric, ...) and use isinstance to inspect the capabilities.

Mathieu
Olivier Grisel
2012-10-23 09:43:58 UTC
Post by Mathieu Blondel
Having a single method (I like the name `from_estimator`) does seem cleaner
and easier to follow.
Post by Olivier Grisel
If people don't like the additional methods, we could introduce
additional boolean or categorical flags.
Another solution is to use mixins (ClassificationMetric, RegressionMetric,
MulticlassMetric, ...) and use isinstance to inspect the capabilities.
I would rather avoid `isinstance` checks to keep our API ducktype
friendly for 3rd party integration that would not like to use our
mixins for one reason or another.

But +1 for mixins for code reuse.
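
For example (a hypothetical sketch of a mixin used purely for code reuse,
not for isinstance-based dispatch):

```
import numpy as np

class BinaryTargetCheckMixin(object):
    """Shared input checking for scorers that only support binary targets."""

    def _check_binary(self, y_expected, y_score):
        y_expected = np.asarray(y_expected)
        labels = np.unique(y_expected)
        if labels.shape[0] != 2:
            raise ValueError("expected a binary target, got labels %r" % labels)
        return y_expected, np.asarray(y_score)
```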
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2012-10-23 10:38:20 UTC
Post by Olivier Grisel
I would rather avoid `isinstance` checks to keep our API ducktype
friendly for 3rd party integration that would not like to use our
mixins for one reason or another.
But GridSearchCV will use the generic `from_estimator` method and users
should directly use the usual metric functions already defined in
sklearn.metrics. So, I don't see the point of adding a third way to call
metrics (and all that extra code) just for the sake of duck-typing.

Mathieu
Andreas Mueller
2012-10-24 12:18:41 UTC
On Tue, Oct 23, 2012 at 6:43 PM, Olivier Grisel
I would rather avoid `isinstance` checks to keep our API ducktype
friendly for 3rd party integration that would not like to use our
mixins for one reason or another.
But GridSearchCV will use the generic `from_estimator` method and
users should directly use the usual metric functions already defined
in sklearn.metrics. So, I don't see the point of adding a third way to
call metrics (and all that extra code) just for the sake of duck-typing.
Can anyone explain to me again the use case we are talking about?
What do we want to find out using isinstance or duck typing?
Mathieu Blondel
2012-10-24 12:33:33 UTC
On Wed, Oct 24, 2012 at 9:18 PM, Andreas Mueller
Post by Andreas Mueller
Can anyone explain to me again the use case we are talking about?
What do we want to find out using isinstance or duck typing?
For example, Olivier gave this example:

class ROCAreaUnderCurveScore(object):
[...]

def from_decision_thresholds(self, expected, predicted_thresholds):
return auc_score(expected, predicted_thresholds)

By checking hasattr(metric, "from_decision_thresholds"), you would know
that this metric supports computing a score from decision thresholds.
Personally, I would rather not have three different ways to call the same
metric so I would prefer the solution based on tags or based on isinstance.

Mathieu
Mathieu Blondel
2012-10-24 13:26:34 UTC
On Wed, Oct 24, 2012 at 10:00 PM, Andreas Mueller
Maybe the question was more "why does the user want to know that?"
If it is called as
RocAreaUnderCurveScore()(est, y_true, y_pred)
why does the user need to discover how this is computed?
One use case would be error-handling. For example:

[in a classifier]
def score(self, X, y, metric=None):
    metric = get_metric(metric)  # handle None / string
    if not isinstance(metric, ClassificationMetric):
        raise ValueError

or

    if not metric.classification_metric:
        raise ValueError

But some metrics may be used in different categories. For example, MSE is a
regression metric, but you can use it to evaluate binary classification if
you really want to.
So I don't know if such error handling would be a good idea.

BTW, the way to call a metric object would be:
RocAreaUnderCurveScore()(est, X, y)

Mathieu
Olivier Grisel
2012-10-24 13:56:31 UTC
Post by Mathieu Blondel
Maybe the question was more "why does the user want to know that?"
If it is called as
RocAreaUnderCurveScore()(est, y_true, y_pred)
why does the user need to discover how this is computed?
One use case would be error-handling. For example:

[in a classifier]
def score(self, X, y, metric=None):
    metric = get_metric(metric)  # handle None / string
    if not isinstance(metric, ClassificationMetric):
        raise ValueError

or

    if not metric.classification_metric:
        raise ValueError
Let's do more permissive duck / flag typing using getattr instead:

if not getattr(metric, 'classification_metric', False):
    raise ValueError('%s does not support classification' % metric)
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Alexandre Gramfort
2012-10-24 21:17:41 UTC
hi folks,

I apologize if this was already a discarded option above but I read quickly.

How about putting the logic in cross_val_score and GridSearchCV and just
"inspecting" the score_func to see if it expects a parameter called "estimator"?
That would then mean calling the score func with (estimator, X) for unsupervised
settings and (estimator, X, y) for supervised ones.

The existing code would still work, and for advanced use cases it could fly too.
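
Something along these lines (a sketch using the standard library's inspect
module; the wrapper name is hypothetical):

```
import inspect

def call_score_func(score_func, estimator, X_test, y_test=None):
    args = inspect.getargspec(score_func).args
    if 'estimator' in args:
        if y_test is None:
            return score_func(estimator, X_test)        # unsupervised
        return score_func(estimator, X_test, y_test)    # supervised
    # legacy (y_true, y_pred) signature keeps working
    return score_func(y_test, estimator.predict(X_test))
```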

my 0.2 cents

Alex
Olivier Grisel
2012-10-25 01:00:23 UTC
Post by Alexandre Gramfort
hi folks,
I apologize if this was already a discarded option above but I read quickly.
How about putting the logic in cross_val_score and GridSearchCV and just
"inspecting" the score_func to see if it expects a parameter called "estimator"?
That would then mean calling the score func with (estimator, X) for unsupervised
settings and (estimator, X, y) for supervised ones.
The existing code would still work, and for advanced use cases it could fly too.
Argument inspection would indeed be a good way to deal with backward
compat. However, I would still mark the non-estimator API as deprecated,
with a warning and maybe a +4 releases (1 year) deprecation period, to get
rid of argument introspection later and always require the
__call__(estimator, X, y=None) API in the long run.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel