Discussion:
[Scikit-learn-general] Average Per-Class Accuracy metric
Sebastian Raschka
2016-03-08 00:57:10 UTC
Permalink
Hi,

I was just wondering why there’s no support for the average per-class accuracy in the scorer functions (if I am not overlooking something).
E.g., we have 'f1_macro', 'f1_micro', 'f1_samples', ‘f1_weighted’ but I didn’t see an ‘accuracy_macro’, i.e.,
(acc.class_1 + acc.class_2 + … + acc.class_n) / n

Would you discourage its usage (in favor of other metrics for imbalanced class problems), or was it simply not implemented yet?
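
For concreteness, one plausible reading of the formula above is the one-vs-rest (binarized) accuracy of each class, macro-averaged; a minimal NumPy sketch of that reading follows, with an illustrative function name rather than an existing scikit-learn scorer:

import numpy as np
from sklearn.metrics import confusion_matrix, make_scorer

def average_per_class_accuracy(y_true, y_pred):
    # One reading of (acc.class_1 + ... + acc.class_n) / n:
    # per-class one-vs-rest accuracy, macro-averaged.
    cm = confusion_matrix(y_true, y_pred)  # rows: true labels, columns: predictions
    n = cm.sum()
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = n - tp - fp - fn
    return np.mean((tp + tn) / n)

# A custom metric like this can be wrapped for model selection:
# scorer = make_scorer(average_per_class_accuracy)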

Best,
Sebastian
Joel Nothman
2016-03-08 23:40:30 UTC
Permalink
I've not seen this metric used (references?). Am I right in thinking that
in the binary case, this is identical to accuracy? If I predict all
elements to be the majority class, then adding more minority classes into
the problem increases my score. I'm not sure what this metric is getting at.
Sebastian Raschka
2016-03-09 00:15:42 UTC
Permalink
I haven’t seen this in practice yet, either. A colleague was looking for this in scikit-learn recently, and he asked me whether I knew if it is implemented or not. I couldn’t find anything in the docs and was just curious about your opinion. However, I just found this entry on Wikipedia:

https://en.wikipedia.org/wiki/Accuracy_and_precision

"Another useful performance measure is the balanced accuracy [10], which avoids inflated performance estimates on imbalanced datasets. It is defined as the arithmetic mean of sensitivity and specificity, or the average …"
Am I right in thinking that in the binary case, this is identical to accuracy?
I think it would only be equal to the “accuracy” if the class labels are uniformly distributed.
I'm not sure what this metric is getting at.
I have to think about this more, but I think it may be useful for imbalanced datasets where you want to emphasize the minority class. E.g., let’s say we have a dataset of 120 samples and three class labels 1, 2, 3. And the classes are distributed like this
10 x 1
50 x 2
60 x 3

Now, let’s assume we have a model that makes the following predictions

- it gets 0 out of 10 from class 1 right
- 45 out of 50 from class 2
- 55 out of 60 from class 3

So, the accuracy would then be computed as

(0 + 45 + 55) / 120 = 0.833

But the “balanced accuracy” would be much lower, because the model did really badly on class 1, i.e.,

(0/10 + 45/50 + 55/60) / 3 = 0.61
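
In scikit-learn terms this average of per-class hit rates is exactly macro-averaged recall, so both numbers above can be checked with recall_score; a small sketch (the placement of the wrong predictions is an arbitrary assumption, only the per-class counts from the example matter):

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 10 + [2] * 50 + [3] * 60)
y_pred = np.array([2] * 10 +              # class 1: 0 of 10 correct
                  [2] * 45 + [3] * 5 +    # class 2: 45 of 50 correct
                  [3] * 55 + [2] * 5)     # class 3: 55 of 60 correct

accuracy_score(y_true, y_pred)                 # 0.833...
recall_score(y_true, y_pred, average='macro')  # 0.605... (the "balanced accuracy" above)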

Hm, if I see this correctly, this is actually very similar to the F1 score. But instead of computing the harmonic mean between “precision and the true positive rate”, we compute the harmonic mean between “precision and true negative rate”.
Joel Nothman
2016-03-09 01:29:29 UTC
Permalink
Firstly, balanced accuracy is a different thing, and yes, it should be
supported.

Secondly, I am correct in thinking you're talking about multiclass (not
multilabel).

However, what you're describing isn't accuracy. It's actually
micro-averaged recall, except that your dataset is impossible because
you're allowing there to be fewer predictions than instances. If we assume
that we're allowed to predict some negative class, that's fine; we can
nowadays exclude it from micro-averaged recall with the labels parameter to
recall_score. (If all labels are included in a multiclass problem,
micro-averaged recall = precision = fscore = accuracy.)
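
A small illustration of that point, with a made-up toy example in which label 0 plays the role of the negative class:

from sklearn.metrics import accuracy_score, recall_score

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

# With all labels included, micro-averaged recall equals plain accuracy.
recall_score(y_true, y_pred, average='micro')                 # 0.714...
accuracy_score(y_true, y_pred)                                # 0.714...

# Excluding the designated negative class from the micro average:
recall_score(y_true, y_pred, labels=[1, 2], average='micro')  # 0.8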

I had assumed you meant binarised accuracy, which would add together both
true positives and true negatives for each class.

Either way, if there's no literature on this, I think we'd really best not
support it.
Joel Nothman
2016-03-09 01:29:52 UTC
Permalink
(Although multioutput accuracy is reasonable to support.)
Sebastian Raschka
2016-03-09 02:36:50 UTC
Permalink
Firstly, balanced accuracy is a different thing, and yes, it should be supported.
Secondly, I am correct in thinking you're talking about multiclass (not multilabel).
Sorry for the confusion, and yes, you are right. I think I have mixed up the terms “average per-class accuracy” and “balanced accuracy” then.

Maybe to clarify, a corrected example to describe what I meant. Given the confusion matrix

              predicted label
true label  [  3,  0,  0 ]
            [  7, 50, 12 ]
            [  0,  0, 18 ]


I’d compute the accuracy as TP / TN = (3 + 50 + 18) / 90 = 0.79

and the “average per-class accuracy” as

(83/90 + 71/90 + 78/90) / 3 = (83 + 71 + 78) / (3 * 90) = 0.86

(I hope I got it right this time!)
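
For what it's worth, a quick NumPy check of those two numbers from the confusion matrix above:

import numpy as np

cm = np.array([[ 3,  0,  0],   # rows: true labels
               [ 7, 50, 12],   # columns: predicted labels
               [ 0,  0, 18]])
n = cm.sum()                   # 90 samples

np.trace(cm) / n               # accuracy: (3 + 50 + 18) / 90 ≈ 0.79

tp = np.diag(cm)
fp = cm.sum(axis=0) - tp
fn = cm.sum(axis=1) - tp
tn = n - tp - fp - fn
np.mean((tp + tn) / n)         # average per-class accuracy: (83 + 71 + 78) / 270 ≈ 0.86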

In any case, I am not finding any literature describing this, and I am also not proposing to add it to scikit-learn; I just wanted to get some info on whether this is implemented or not. Thanks! :)
Joel Nothman
2016-03-09 03:03:17 UTC
Permalink
You mean TP / N, not TP / TN.

And I think the average per-class accuracy does some weird things. Like:

true = [1, 1, 1, 0, 0]
pred = [1, 1, 1, 1, 1]
a.p.c.a = (3 + 3) / 5 / 2

true = [1, 1, 1, 0, 2]
pred = [1, 1, 1, 1, 1]
a.p.c.a = (4 + 4 + 3) / 5 / 3

I don't think that's very useful.
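
Both cases can be reproduced with the one-vs-rest per-class accuracy sketched earlier in the thread (apca below is an illustrative helper, not a scikit-learn function):

import numpy as np
from sklearn.metrics import confusion_matrix

def apca(y_true, y_pred):
    # Average per-class accuracy: one-vs-rest (TP + TN) / N per class, macro-averaged.
    labels = sorted(set(y_true) | set(y_pred))
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    n = cm.sum()
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = n - tp - fp - fn
    return np.mean((tp + tn) / n)

apca([1, 1, 1, 0, 0], [1, 1, 1, 1, 1])  # (3 + 3) / 5 / 2 = 0.6
apca([1, 1, 1, 0, 2], [1, 1, 1, 1, 1])  # (4 + 4 + 3) / 5 / 3 ≈ 0.73

Adding a second minority class indeed raises the score even though the predictions did not change.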