Discussion:
[Scikit-learn-general] Average Per-Class Accuracy metric
Sebastian Raschka
2016-03-08 00:57:10 UTC
Permalink
Hi,

I was just wondering why there’s no support for the average per-class accuracy in the scorer functions (if I am not overlooking something).
E.g., we have 'f1_macro', 'f1_micro', 'f1_samples', ‘f1_weighted’ but I didn’t see an ‘accuracy_macro’, i.e.,
(acc.class_1 + acc.class_2 + … + acc.class_n) / n

Would you discourage its usage (in favor of other metrics for imbalanced class problems), or was it simply not implemented yet?
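
For concreteness, one plausible reading of the formula above is the one-vs-rest (binarized) accuracy of each class, macro-averaged; a minimal NumPy sketch of that reading follows, with an illustrative function name rather than an existing scikit-learn scorer:

import numpy as np
from sklearn.metrics import confusion_matrix, make_scorer

def average_per_class_accuracy(y_true, y_pred):
    # One reading of (acc.class_1 + ... + acc.class_n) / n:
    # per-class one-vs-rest accuracy, macro-averaged.
    cm = confusion_matrix(y_true, y_pred)  # rows: true labels, columns: predictions
    n = cm.sum()
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = n - tp - fp - fn
    return np.mean((tp + tn) / n)

# A custom metric like this can be wrapped for model selection:
# scorer = make_scorer(average_per_class_accuracy)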

Best,
Sebastian
Joel Nothman
2016-03-08 23:40:30 UTC
Permalink
I've not seen this metric used (references?). Am I right in thinking that
in the binary case, this is identical to accuracy? If I predict all
elements to be the majority class, then adding more minority classes into
the problem increases my score. I'm not sure what this metric is getting at.
Sebastian Raschka
2016-03-09 00:15:42 UTC
Permalink
I haven’t seen this in practice yet, either. A colleague was looking for this in scikit-learn recently, and he asked me whether I knew if it is implemented or not. I couldn’t find anything in the docs and was just curious about your opinion. However, I just found this entry on Wikipedia:

https://en.wikipedia.org/wiki/Accuracy_and_precision

"Another useful performance measure is the balanced accuracy [10], which avoids inflated performance estimates on imbalanced datasets. It is defined as the arithmetic mean of sensitivity and specificity, or the average …"
Am I right in thinking that in the binary case, this is identical to accuracy?
I think it would only be equal to the “accuracy” if the class labels are uniformly distributed.
I'm not sure what this metric is getting at.
I have to think about this more, but I think it may be useful for imbalanced datasets where you want to emphasize the minority class. E.g., let’s say we have a dataset of 120 samples and three class labels 1, 2, 3. And the classes are distributed like this
10 x 1
50 x 2
60 x 3

Now, let’s assume we have a model that makes the following predictions

- it gets 0 out of 10 from class 1 right
- 45 out of 50 from class 2
- 55 out of 60 from class 3

So, the accuracy would then be computed as

(0 + 45 + 55) / 120 = 0.833

But the “balanced accuracy” would be much lower, because the model did really badly on class 1, i.e.,

(0/10 + 45/50 + 55/60) / 3 = 0.61
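
In scikit-learn terms this average of per-class hit rates is exactly macro-averaged recall, so both numbers above can be checked with recall_score; a small sketch (the placement of the wrong predictions is an arbitrary assumption, only the per-class counts from the example matter):

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 10 + [2] * 50 + [3] * 60)
y_pred = np.array([2] * 10 +              # class 1: 0 of 10 correct
                  [2] * 45 + [3] * 5 +    # class 2: 45 of 50 correct
                  [3] * 55 + [2] * 5)     # class 3: 55 of 60 correct

accuracy_score(y_true, y_pred)                 # 0.833...
recall_score(y_true, y_pred, average='macro')  # 0.605... (the "balanced accuracy" above)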

Hm, if I see this correctly, this is actually very similar to the F1 score. But instead of computing the harmonic mean between “precision and the true positive rate”, we compute the harmonic mean between “precision and true negative rate”.
Joel Nothman
2016-03-09 01:29:29 UTC
Permalink
Firstly, balanced accuracy is a different thing, and yes, it should be
supported.

Secondly, I am correct in thinking you're talking about multiclass (not
multilabel).

However, what you're describing isn't accuracy. It's actually
micro-averaged recall, except that your dataset is impossible because
you're allowing there to be fewer predictions than instances. If we assume
that we're allowed to predict some negative class, that's fine; we can
nowadays exclude it from micro-averaged recall with the labels parameter to
recall_score. (If all labels are included in a multiclass problem,
micro-averaged recall = precision = fscore = accuracy.)
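
A small illustration of that point, with a made-up toy example in which label 0 plays the role of the negative class:

from sklearn.metrics import accuracy_score, recall_score

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

# With all labels included, micro-averaged recall equals plain accuracy.
recall_score(y_true, y_pred, average='micro')                 # 0.714...
accuracy_score(y_true, y_pred)                                # 0.714...

# Excluding the designated negative class from the micro average:
recall_score(y_true, y_pred, labels=[1, 2], average='micro')  # 0.8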

I had assumed you meant binarised accuracy, which would add together both
true positives and true negatives for each class.

Either way, if there's no literature on this, I think we'd really best not
support it.
Joel Nothman
2016-03-09 01:29:52 UTC
Permalink
(Although multioutput accuracy is reasonable to support.)
Sebastian Raschka
2016-03-09 02:36:50 UTC
Permalink
Firstly, balanced accuracy is a different thing, and yes, it should be supported.
Secondly, I am correct in thinking you're talking about multiclass (not multilabel).
Sorry for the confusion, and yes, you are right. I think I have mixed up the terms “average per-class accuracy” and “balanced accuracy” then.

Maybe to clarify, a corrected example to describe what I meant. Given the confusion matrix

              predicted label
true label  [  3,  0,  0 ]
            [  7, 50, 12 ]
            [  0,  0, 18 ]


I’d compute the accuracy as TP / TN = (3 + 50 + 18) / 90 = 0.79

and the “average per-class accuracy” as

(83/90 + 71/90 + 78/90) / 3 = (83 + 71 + 78) / (3 * 90) = 0.86

(I hope I got it right this time!)
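
For what it's worth, a quick NumPy check of those two numbers from the confusion matrix above:

import numpy as np

cm = np.array([[ 3,  0,  0],   # rows: true labels
               [ 7, 50, 12],   # columns: predicted labels
               [ 0,  0, 18]])
n = cm.sum()                   # 90 samples

np.trace(cm) / n               # accuracy: (3 + 50 + 18) / 90 ≈ 0.79

tp = np.diag(cm)
fp = cm.sum(axis=0) - tp
fn = cm.sum(axis=1) - tp
tn = n - tp - fp - fn
np.mean((tp + tn) / n)         # average per-class accuracy: (83 + 71 + 78) / 270 ≈ 0.86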

In any case, I am not finding any literature describing this, and I am also not proposing to add it to scikit-learn; I just wanted to get some info on whether this is implemented or not. Thanks! :)
Joel Nothman
2016-03-09 03:03:17 UTC
Permalink
You mean TP / N, not TP / TN.

And I think the average per-class accuracy does some weird things. Like:

true = [1, 1, 1, 0, 0]
pred = [1, 1, 1, 1, 1]
a.p.c.a = (3 + 3) / 5 / 2

true = [1, 1, 1, 0, 2]
pred = [1, 1, 1, 1, 1]
a.p.c.a = (4 + 4 + 3) / 5 / 3

I don't think that's very useful.
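
Both cases can be reproduced with the one-vs-rest per-class accuracy sketched earlier in the thread (apca below is an illustrative helper, not a scikit-learn function):

import numpy as np
from sklearn.metrics import confusion_matrix

def apca(y_true, y_pred):
    # Average per-class accuracy: one-vs-rest (TP + TN) / N per class, macro-averaged.
    labels = sorted(set(y_true) | set(y_pred))
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    n = cm.sum()
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = n - tp - fp - fn
    return np.mean((tp + tn) / n)

apca([1, 1, 1, 0, 0], [1, 1, 1, 1, 1])  # (3 + 3) / 5 / 2 = 0.6
apca([1, 1, 1, 0, 2], [1, 1, 1, 1, 1])  # (4 + 4 + 3) / 5 / 3 ≈ 0.73

Adding a second minority class indeed raises the score even though the predictions did not change.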