Discussion:
[Scikit-learn-general] pairwise metrics / distances functions
Alexandre Gramfort
2010-12-12 19:30:32 UTC
Hi,

the question has been raised many times and I think it's time to
address it in a common way.

We regularly need to compute the pairwise distances / metrics
between two sets of samples in the same space. See, for example,
affinity_propagation, the manifold module, the Gaussian
process module, etc.

Shall we create:

scikits.learn.pairwise_metrics

and then

from scikits.learn.pairwise_metrics import euclidian_distances
D = euclidian_distances(X, Y)

where D is the distance matrix with D[i, j] = linalg.norm(X[i] - Y[j])
(symmetric when Y is X)
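
As a rough illustration of the proposal above, here is a minimal NumPy sketch of what such a function could compute. The name `euclidean_distances_sketch` and the optional `Y=None` default are assumptions made for this example, not the code that ended up in the scikit. It expands ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2 so that no explicit loop over pairs is needed.

```python
import numpy as np


def euclidean_distances_sketch(X, Y=None):
    """Return D with D[i, j] = ||X[i] - Y[j]|| (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # Assumption: when Y is omitted, distances are computed within X.
    Y = X if Y is None else np.asarray(Y, dtype=float)

    # Squared norms of every row, shaped for broadcasting.
    XX = np.sum(X * X, axis=1)[:, np.newaxis]
    YY = np.sum(Y * Y, axis=1)[np.newaxis, :]

    # ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2
    sq = XX - 2.0 * np.dot(X, Y.T) + YY

    # Rounding can produce tiny negative values; clip before the sqrt.
    np.maximum(sq, 0.0, out=sq)
    return np.sqrt(sq)


X = [[0, 0], [3, 4]]
Y = [[0, 0], [0, 4], [3, 0]]
D = euclidean_distances_sketch(X, Y)  # shape (2, 3); row 1 holds 5, 3, 4
```

Note that D has shape (len(X), len(Y)), so it is square and symmetric only when Y is X.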

what do you think?

Alex
Matthieu Brucher
2010-12-12 19:35:26 UTC
+1

I would suggest putting it inside utils, though.

Matthieu
------------------------------------------------------------------------------
Oracle to DB2 Conversion Guide: Learn learn about native support for PL/SQL,
new data types, scalar functions, improved concurrency, built-in packages,
OCI, SQL*Plus, data movement tools, best practices and more.
http://p.sf.net/sfu/oracle-sfdev2dev
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
Gael Varoquaux
2010-12-12 20:00:02 UTC
Post by Matthieu Brucher
+1
I would suggest putting it inside utils, though.
I'd rather not have it under utils: I'd like utils to be only things that
are of no interest to the end user, but are used inside the scikit.

I'd like metrics to be a sub-package, and pairwise a module in this
subpackage.

My 2 cents,

Gaël
Alexandre Gramfort
2010-12-12 20:02:55 UTC
Post by Gael Varoquaux
Post by Matthieu Brucher
I would suggest putting it inside utils, though.
I'd rather not have it under utils: I'd like utils to be only things that
are of no interest to the end user, but are used inside the scikit.
I'd like metrics to be a sub-package, and pairwise a module in this
subpackage.
My 2 cents,
+1 good idea

Alex
Vincent Michel
2010-12-12 20:04:38 UTC
+1

Talking about distances and measures, do you think that this module
could be the place for functions creating affinity matrices (following
the example of the kNN-based affinity matrix created by Alexandre)?

Vincent
Alexandre Gramfort
2010-12-12 20:08:44 UTC
my kNN-based affinity matrix will definitely use the pairwise module
but I would leave it in the neighbors module
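
To make that dependency concrete, a kNN-based affinity matrix can be sketched on top of a pairwise distance matrix as below. This is only an illustration: `knn_affinity_sketch`, its `n_neighbors` parameter, the binary weights, and the symmetrization step are all assumptions made for this example, not the implementation referred to in this message.

```python
import numpy as np


def knn_affinity_sketch(X, n_neighbors=2):
    """Binary, symmetrized kNN affinity matrix (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]

    # Dense pairwise Euclidean distances via broadcasting.
    diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
    D = np.sqrt(np.sum(diff ** 2, axis=-1))

    A = np.zeros((n_samples, n_samples))
    for i in range(n_samples):
        # Sort candidates by distance and skip the sample itself.
        order = np.argsort(D[i])
        neighbors = [j for j in order if j != i][:n_neighbors]
        A[i, neighbors] = 1.0

    # Assumption: i and j are connected if either one lists the other.
    return np.maximum(A, A.T)


A = knn_affinity_sketch([[0.0], [1.0], [3.0], [10.0]], n_neighbors=1)
```

The distance computation is the only part that would come from a shared pairwise module; the neighbor selection stays specific to the affinity code.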

my 2 Bostonian cents,

Alex
Gael Varoquaux
2010-12-12 20:10:38 UTC
Post by Vincent Michel
+1
Talking about distances and measures, do you think that this module
could be the place for functions creating affinity matrices (following
the example of the kNN-based affinity matrix created by Alexandre)?
I don't see an affinity matrix as a 'metric'. <pedantic>A metric is, for me, the
generalized notion of a distance in a non-Euclidean space</pedantic>. In
other words, I expect 'metrics' to contain functions that return
real-valued numbers.

Now, I agree that the concepts are related, so if you can find a good
name for a subpackage grouping them, I am +1 on your proposal.

Gaël
Vincent Michel
2010-12-12 20:33:54 UTC
Post by Gael Varoquaux
I don't see an affinity matrix as a 'metric'. <pedantic>A metric is, for me, the
generalized notion of a distance in a non-Euclidean space</pedantic>. In
other words, I expect 'metrics' to contain functions that return
real-valued numbers.
I agree with this. I was just proposing that, as the two subjects are
clearly related.
Post by Gael Varoquaux
Now, I agree that the concepts are related, so if you can find a good
name for a subpackage grouping them, I am +1 on your proposal.
will think about it
Matthieu Brucher
2010-12-12 20:33:39 UTC
Post by Gael Varoquaux
I don't see an affinity matrix as a 'metric'. <pedantic>A metric is, for me, the
generalized notion of a distance in a non-Euclidean space</pedantic>. In
other words, I expect 'metrics' to contain functions that return
real-valued numbers.
Now, I agree that the concepts are related, so if you can find a good
name for a subpackage grouping them, I am +1 on your proposal.
In the manifold module, the affinity/similarity is the inverse of the
distance/metrics, so they are pretty much related, yes ;)
This is why I can understand Vincent's proposal as well. And it is also
why I suggested the utils package, as everything graph-related is in
this package (and as for metrics, they can be of use for the end user
as well).

Matthieu
Gael Varoquaux
2010-12-12 20:58:13 UTC
Post by Matthieu Brucher
In the manifold module, the affinity/similarity is the inverse of the
distance/metrics, so they are pretty much related, yes ;)
This is why I can understand Vincent's proposal as well. And it is also
why I suggested the utils package, as everything graph-related is in
this package (and as for metrics, they can be of use for the end user
as well).
My issue with 'utils' is that it is a non-descriptive name. Nobody ever
looks in utils when searching for something, or only after looking
everywhere else.

Naming things is hard, but we should not give up on it.

G
Alexandre Gramfort
2010-12-12 21:18:45 UTC
just pushed scikits.learn.metrics.pairwise

nothing is definitive, but I think it's a rather good first solution.

Alex

Olivier Grisel
2010-12-12 21:35:46 UTC
Post by Alexandre Gramfort
just pushed scikits.learn.metrics.pairwise
nothing is definitive, but I think it's a rather good first solution.
On a related subject here is a bunch of code to evaluate
(qualitatively, or quantitatively) the quality of an embedding:

The first is the accuracy at rank k using knn queries:

https://github.com/ogrisel/codemaker/blob/master/src/codemaker/evaluation.py#L31

The other is a scatter plot of the pairwise distances between randomly
sampled points in the original space and the mapped points according
to the embedding:

https://github.com/ogrisel/codemaker/blob/master/src/codemaker/evaluation.py#L50

It also computes the correlation of the sampled pairwise distances.
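
The second check described above can be sketched roughly as follows. This is not the code at the linked URLs: `sampled_distance_correlation`, its parameters, and the choice of Euclidean distance with a Pearson correlation are assumptions made for illustration, and the scatter plot itself is left out.

```python
import numpy as np


def sampled_distance_correlation(X, X_embedded, n_pairs=1000, seed=0):
    """Correlate distances of random pairs before and after embedding.

    Hypothetical helper: samples pairs of points, measures each pair in
    the original space and in the embedded space, and returns the
    Pearson correlation between the two sets of distances.
    """
    X = np.asarray(X, dtype=float)
    X_embedded = np.asarray(X_embedded, dtype=float)
    rng = np.random.RandomState(seed)

    # Draw random index pairs and drop pairs of a point with itself.
    n_samples = X.shape[0]
    i = rng.randint(0, n_samples, size=n_pairs)
    j = rng.randint(0, n_samples, size=n_pairs)
    keep = i != j
    i, j = i[keep], j[keep]

    # Distances for the same pairs in both spaces.
    d_original = np.linalg.norm(X[i] - X[j], axis=1)
    d_embedded = np.linalg.norm(X_embedded[i] - X_embedded[j], axis=1)

    return np.corrcoef(d_original, d_embedded)[0, 1]
```

Plotting `d_original` against `d_embedded` would give the scatter plot; a correlation close to 1 suggests the embedding preserves the relative distances of the sampled pairs.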

I wonder if such utilities could be part of the new
scikits.learn.metrics.pairwise module (or maybe another one).
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Matthieu Brucher
2010-12-12 21:42:23 UTC
Post by Olivier Grisel
On a related subject here is a bunch of code to evaluate
https://github.com/ogrisel/codemaker/blob/master/src/codemaker/evaluation.py#L31
Good one, I use something like this for my manifold tests
Post by Olivier Grisel
The other is a scatter plot of the pairwise distances between randomly
sampled points in the original space and the mapped points according
https://github.com/ogrisel/codemaker/blob/master/src/codemaker/evaluation.py#L50
It also computes the correlation of the sampled pairwise distances.
I wonder if such utilities could be part of the new
scikits.learn.metrics.pairwise module (or maybe another one).
Indeed, they could be of interest.

On a side note, I didn't have time to finish the manifold-light
branch; I hope to have more time during the holidays (which I don't
have, but that's another story :|). As all these topics are related to
manifold learning, it's an update to the current situation ;)

Matthieu
Mathieu Blondel
2010-12-13 07:18:50 UTC
Hello,

Sorry I don't have time to read the thread but I just want to point
out that scipy has some support for distance metrics in:
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html

This is wrapped C-code and it supports outputting the results in
square form or packed form (without the diagonal elements though).
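
For reference, here is a short example of the scipy.spatial.distance functions in question: `cdist` for two sets of samples, `pdist` for the packed (condensed) form within a single set, and `squareform` to expand the packed form into a square matrix. The sample arrays are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

# Made-up sample data: 3 points in X, 2 points in Y, both in 2-D.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
Y = np.array([[0.0, 0.0], [3.0, 0.0]])

# Distances between two different sets: shape (len(X), len(Y)).
D_xy = cdist(X, Y, metric="euclidean")

# Packed form within one set: one entry per unordered pair,
# ordered (0, 1), (0, 2), (1, 2); the diagonal is not stored.
d_packed = pdist(X, metric="euclidean")

# Square form: symmetric matrix with a zero diagonal.
D_xx = squareform(d_packed)
```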

I think it supports dense matrices only though (I'm mostly interested
in sparse matrices).

On a related note, it would be nice to have reusable dense and sparse
modules to compute kernel (Gram) matrices for well-known kernels. This
is useful for estimators that have a kernel="precomputed" option.
I have code for the sparse case, but it is C code wrapped in Cython
(I know you guys prefer all Cython).
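
As an illustration of the dense case only, a Gram matrix for the RBF (Gaussian) kernel could be sketched with NumPy as below. The function name `rbf_gram_sketch` and the `gamma` parameter are assumptions made for this example; this is not the sparse C/Cython code mentioned above, and it does not handle sparse input.

```python
import numpy as np


def rbf_gram_sketch(X, Y=None, gamma=1.0):
    """Dense Gram matrix K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2).

    Hypothetical helper for illustration; dense arrays only.
    """
    X = np.asarray(X, dtype=float)
    Y = X if Y is None else np.asarray(Y, dtype=float)

    # Squared Euclidean distances via ||x||^2 - 2 x.y + ||y||^2.
    XX = np.sum(X * X, axis=1)[:, np.newaxis]
    YY = np.sum(Y * Y, axis=1)[np.newaxis, :]
    sq = XX - 2.0 * np.dot(X, Y.T) + YY

    # Guard against tiny negative values caused by rounding.
    np.maximum(sq, 0.0, out=sq)
    return np.exp(-gamma * sq)


X_train = np.array([[0.0, 0.0], [3.0, 4.0]])
K_train = rbf_gram_sketch(X_train, gamma=0.1)  # shape (2, 2)
```

A matrix such as `K_train` is the kind of input an estimator with kernel="precomputed" would expect at fit time; at prediction time the matrix would instead be computed between the new samples and the training samples.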

Mathieu