Discussion:
[Scikit-learn-general] Jaccard Index
Shishir Pandey
2016-05-09 11:05:28 UTC
Permalink
I am a bit confused regarding the Jaccard similarity score. The example given on:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html#sklearn.metrics.jaccard_similarity_score
>>> import numpy as np
>>> from sklearn.metrics import jaccard_similarity_score
>>> y_pred = [0, 2, 1, 3]
I am assuming that here each dimension is a label and the entry represents
how many times that label appears. Also I am assuming that each entry has a
weight of 1.
>>> y_true = [0, 1, 2, 3]
Then the intersection of A and B (y_pred and y_true) would be 1 + 1 + 3 = 5,
and the union of A and B would be 3 + 3 + 3 = 9.

How is the Jaccard similarity 0.5?
>>> jaccard_similarity_score(y_true, y_pred)
0.5
>>> jaccard_similarity_score(y_true, y_pred, normalize=False)
2
--
sp
Maniteja Nandana
2016-05-09 11:21:45 UTC
Permalink
Hi,

If I understand it correctly, the Jaccard similarity is the ratio of the
number of matching outputs to the total number of outputs in the case of
binary and multiclass classification. Here, the first and the last of the
four outputs match, hence the Jaccard score is 2/4 = 0.5.
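
In code, reusing the y_true and y_pred from your post (the explicit
positional comparison below is just my own illustration, not the library's
internals):

>>> import numpy as np
>>> from sklearn.metrics import jaccard_similarity_score
>>> y_true = [0, 1, 2, 3]
>>> y_pred = [0, 2, 1, 3]
>>> np.mean(np.array(y_true) == np.array(y_pred))  # positions 0 and 3 match
0.5
>>> jaccard_similarity_score(y_true, y_pred)
0.5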

I hope it is right and helps.

Regards,
Maniteja.

Bharat Didwania 4-Yr B.Tech. Electrical Engg.
2016-05-09 11:41:26 UTC
Permalink
Hi,
The Jaccard similarity coefficient (or score) is the ratio of the size of
the intersection to the size of the union of the two label sets.
In this case the size of the union is 4 and that of the intersection is 2,
hence the Jaccard similarity score is 2/4 = 0.5.

I hope this will help.

Regards,
Bharat.
Shishir Pandey
2016-05-09 11:53:07 UTC
Permalink
This is what I am having trouble understanding. What does each dimension of
the vector represent? I am thinking of it as follows:

[label_1, label_2, ..., label_N]

A characteristic vector would then be something like [1, 1, 0, ..., 1, 0, 0].

This represents whether label_i is present in the set or not. In that case
the answer would be different: a 0 in the two sets would mean that the label
is not present in either of them, and hence the union would be smaller than
the dimension of the vector.
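
For example (with made-up indicator vectors, not the ones from the docs),
under this interpretation a shared 0 would not enlarge the union:

>>> import numpy as np
>>> a = np.array([1, 1, 0, 1, 0, 0])
>>> b = np.array([1, 0, 0, 1, 1, 0])
>>> np.logical_and(a, b).sum() / np.logical_or(a, b).sum()  # 2 / 4
0.5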

--
sp

Alan Isaac
2016-05-09 13:35:04 UTC
Permalink
A 0 [in both of] the two sets would represent that the label is not present in either of the sets and hence the union would be smaller than the dimension of the vector.
Yes I agree; that would constitute a standard definition.
Alan Isaac
Maniteja Nandana
2016-05-09 14:14:13 UTC
Permalink
Sorry if I am misunderstanding here, but in case you are referring to
multi-label classification: something like [[0, 0], [0, 1]] would be a
multi-label prediction, whereas an array like [0, 1, 3, 2] represents
multiclass output, and that is what the example under discussion uses.
Regarding the size of the union: in the multiclass and binary cases the
score is the number of outputs where the predicted class matches, divided by
the total number of outputs, so a 0 need not mean the absence of a class.
In the multi-label case, however, a 0 does mean the absence of a label, and
the Jaccard similarity is computed for each output and then a weighted mean
is taken.
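
(As a side note: one way to see which of the two interpretations
scikit-learn picks for a given y is type_of_target.)

>>> import numpy as np
>>> from sklearn.utils.multiclass import type_of_target
>>> type_of_target(np.array([0, 1, 3, 2]))
'multiclass'
>>> type_of_target(np.array([[0, 0], [0, 1]]))
'multilabel-indicator'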

In the following, the first output has only one label in the ground truth
but two in the prediction, while in the second example the first output has
only one label in both the ground truth and the prediction.

>>> import numpy as np
>>> from sklearn.metrics import jaccard_similarity_score
>>> y_true = np.array([[0, 1], [1, 1]])
>>> y_pred = np.array([[1, 1], [1, 1]])
>>> jaccard_similarity_score(y_true, y_pred)
0.75  # (0.5 + 1) / 2

>>> y_true = np.array([[0, 1], [1, 1]])
>>> y_pred = np.array([[0, 1], [1, 1]])
>>> jaccard_similarity_score(y_true, y_pred)
1.0  # (1 + 1) / 2
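
For reference, the per-row scores of the first example can be reproduced by
hand (a rough sketch with plain numpy rather than the metric itself):

>>> import numpy as np
>>> yt = np.array([[0, 1], [1, 1]])
>>> yp = np.array([[1, 1], [1, 1]])
>>> inter = np.logical_and(yt, yp).sum(axis=1)
>>> union = np.logical_or(yt, yp).sum(axis=1)
>>> inter / union          # per-row scores: 0.5 and 1.0
array([ 0.5,  1. ])
>>> (inter / union).mean()
0.75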

Hope it helps.

Regards,
Maniteja.
Shishir Pandey
2016-05-09 16:15:33 UTC
Permalink
From what you are saying, isn't the Jaccard score for the multiclass case
equivalent to (1 - Hamming loss), where the Hamming loss is the fraction of
positions where the two vectors differ?

I also want to understand what your examples represent. Could you give an
example where the dimension of y is 2 x 3? I am getting confused about what
the 2 represents: is it the number of columns or the number of rows?


--
sp

Maniteja Nandana
2016-05-09 16:41:20 UTC
Permalink
From what you are saying, isn't the Jaccard score for the multiclass case
equivalent to (1 - Hamming loss), where the Hamming loss is the fraction of
positions where the two vectors differ?
Yeah, from what I can understand you are right: in the multiclass case the
Jaccard score equals the accuracy score, which in turn equals (1 - zero-one
loss) and (1 - Hamming loss).
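
As a quick sanity check on the multiclass example from earlier in the
thread (using the corresponding functions from sklearn.metrics):

>>> from sklearn.metrics import (accuracy_score, hamming_loss,
...                              jaccard_similarity_score, zero_one_loss)
>>> y_true = [0, 1, 2, 3]
>>> y_pred = [0, 2, 1, 3]
>>> jaccard_similarity_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred)
0.5
>>> 1 - hamming_loss(y_true, y_pred)
0.5
>>> 1 - zero_one_loss(y_true, y_pred)
0.5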
I also want to understand what your examples represent. Could you give an
example where the dimension of y is 2 x 3? I am getting confused about what
the 2 represents: is it the number of columns or the number of rows?
In multi-label classification the prediction is a 2D array of 0s and 1s.
Its shape is (n_outputs, n_labels), so a 2 x 3 array represents 2 outputs
with 3 possible labels for each of them. A 1 means the label is present for
that output and a 0 means it is absent.

For the Jaccard score, each output (row) contributes the number of labels
(columns) set in both y_true and y_pred, divided by the number of labels set
in at least one of them. The weighted average is then taken across all
outputs (rows).

So for the first example above, the first output has [0, 1] and [1, 1] as
its labels, hence it scores 1/2 = 0.5, while the second output has [1, 1] in
both, so it scores 1. Averaged, that gives 0.75.
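
To make the requested 2 x 3 case concrete, here is a made-up example (my
own numbers, assuming the per-row behaviour described above):

>>> import numpy as np
>>> from sklearn.metrics import jaccard_similarity_score
>>> y_true = np.array([[1, 0, 1],    # 2 outputs (rows), 3 labels (columns)
...                    [0, 1, 0]])
>>> y_pred = np.array([[1, 1, 1],
...                    [0, 1, 0]])
>>> jaccard_similarity_score(y_true, y_pred)
0.8333...  # (2/3 + 1) / 2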

PS: I am not aware of the exact reason, but in case both y_true and y_pred
are all zeros ([0, 0]) for an output, the jaccard score is taken as 1 in
the implementation.

Hope it helps.

Regards,
Maniteja.
Shishir Pandey
2016-05-11 07:12:08 UTC
Permalink
Thanks for your reply, I get it now.
The all-zeros case means that the two sets are empty, which is a 0/0
situation; hence it is taken to be 1.
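
A quick check of that convention (the single all-zero row below is just a
toy input of mine):

>>> import numpy as np
>>> from sklearn.metrics import jaccard_similarity_score
>>> jaccard_similarity_score(np.array([[0, 0]]), np.array([[0, 0]]))
1.0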

--
sp
