Discussion:
pairwise metrics / distances functions
Alexandre Gramfort
2010-12-12 19:30:32 UTC
Hi,

the question has been raised many time and I think it's time to
address it in a common way.

We regularly need to compute the pairwise distances / metrics
between two sets of samples in the same space. See for example
in the affinity_propagation, in the manifold module, in the gaussian
process module etc...

Shall we create :

scikits.learn.pairwise_metrics

and then

from scikits.learn.pairwise_metrics import euclidian_distances
D = euclidian_distances(X, Y)

where D is a symmetric matrix D[i,j] = linalg.norm(X[i] - Y[j])

what do you think?

Alex
Matthieu Brucher
2010-12-12 19:35:26 UTC
+1

I may suggest to put it inside utils though.

Matthieu
Post by Alexandre Gramfort
Hi,
the question has been raised many time and I think it's time to
address it in a common way.
We regularly need to compute the pairwise distances / metrics
between two sets of samples in the same space. See for example
in the affinity_propagation, in the manifold module, in the gaussian
process module etc...
scikits.learn.pairwise_metrics
and then
from scikits.learn.pairwise_metrics import euclidian_distances
D = euclidian_distances(X, Y)
where D is a symmetric matrix D[i,j] = linalg.norm(X[i] - Y[j])
what do you think?
Alex
------------------------------------------------------------------------------
Oracle to DB2 Conversion Guide: Learn learn about native support for PL/SQL,
new data types, scalar functions, improved concurrency, built-in packages,
OCI, SQL*Plus, data movement tools, best practices and more.
http://p.sf.net/sfu/oracle-sfdev2dev
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
Gael Varoquaux
2010-12-12 20:00:02 UTC
Post by Matthieu Brucher
+1
I may suggest to put it inside utils though.
I'd rather not have it under utils: I'd like utils to be only things that
are off no interest to the end user, but are used inside the scikit.

I'd like metrics to be a sub-package, and pairwise a module in this
subpackage.

My 2 cents,

Gaël
Alexandre Gramfort
2010-12-12 20:02:55 UTC
Post by Gael Varoquaux
Post by Matthieu Brucher
I may suggest to put it inside utils though.
I'd rather not have it under utils: I'd like utils to be only things that
are off no interest to the end user, but are used inside the scikit.
I'd like metrics to be a sub-package, and pairwise a module in this
subpackage.
My 2 cents,
+1 good idea

Alex
Vincent Michel
2010-12-12 20:04:38 UTC
+1

Talking about distances and measures, do you think that this module
could be the place
for functions creating affinity matrices (following the example of
Knn-based affinity
matrix created by Alexandre) ?

Vincent
Post by Gael Varoquaux
Post by Matthieu Brucher
+1
I may suggest to put it inside utils though.
I'd rather not have it under utils: I'd like utils to be only things that
are off no interest to the end user, but are used inside the scikit.
I'd like metrics to be a sub-package, and pairwise a module in this
subpackage.
My 2 cents,
Gaël
------------------------------------------------------------------------------
Oracle to DB2 Conversion Guide: Learn learn about native support for PL/SQL,
new data types, scalar functions, improved concurrency, built-in packages,
OCI, SQL*Plus, data movement tools, best practices and more.
http://p.sf.net/sfu/oracle-sfdev2dev
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Alexandre Gramfort
2010-12-12 20:08:44 UTC
my Knn-based affinity matrix will definitely use the pairwise module
but I would leave it the neighbors module

my 2 Bostonian cents,

Alex
Post by Vincent Michel
+1
Talking about distances and measures, do you think that this module
could be the place
for functions creating affinity matrices (following the example of
Knn-based affinity
matrix created by Alexandre) ?
Vincent
Post by Gael Varoquaux
Post by Matthieu Brucher
+1
I may suggest to put it inside utils though.
I'd rather not have it under utils: I'd like utils to be only things that
are off no interest to the end user, but are used inside the scikit.
I'd like metrics to be a sub-package, and pairwise a module in this
subpackage.
My 2 cents,
Gaël
------------------------------------------------------------------------------
Oracle to DB2 Conversion Guide: Learn learn about native support for PL/SQL,
new data types, scalar functions, improved concurrency, built-in packages,
OCI, SQL*Plus, data movement tools, best practices and more.
http://p.sf.net/sfu/oracle-sfdev2dev
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Oracle to DB2 Conversion Guide: Learn learn about native support for PL/SQL,
new data types, scalar functions, improved concurrency, built-in packages,
OCI, SQL*Plus, data movement tools, best practices and more.
http://p.sf.net/sfu/oracle-sfdev2dev
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Gael Varoquaux
2010-12-12 20:10:38 UTC
Post by Vincent Michel
+1
Talking about distances and measures, do you think that this module
could be the place
for functions creating affinity matrices (following the example of
Knn-based affinity
matrix created by Alexandre) ?
I don't see an affinity matrix as a 'metric'. <pedantic>A metric is for me the
generalized notion of a distance in a non Euclidian space</pedantic>. In
other words, I except to have in 'metrics' function that return real
valued numbers.

Now, I agree that the concepts are related, so if you can find a good
name for a subpackage grouping them, I am +1 on your proposal.

Gaël
Vincent Michel
2010-12-12 20:33:54 UTC
Post by Gael Varoquaux
Post by Vincent Michel
+1
Talking about distances and measures, do you think that this module
could be the place
for functions creating affinity matrices (following the example of
Knn-based affinity
matrix created by Alexandre) ?
I don't see an affinity matrix as a 'metric'. <pedantic>A metric is for me the
generalized notion of a distance in a non Euclidian space</pedantic>. In
other words, I except to have in 'metrics' function that return real
valued numbers
I agree with this. I was just proposing that, as the two subjects are
clearly related.
Post by Gael Varoquaux
Now, I agree that the concepts are related, so if you can find a good
name for a subpackage grouping them, I am +1 on your proposal.
Post by Gael Varoquaux
Gaël
------------------------------------------------------------------------------
Oracle to DB2 Conversion Guide: Learn learn about native support for PL/SQL,
new data types, scalar functions, improved concurrency, built-in packages,
OCI, SQL*Plus, data movement tools, best practices and more.
http://p.sf.net/sfu/oracle-sfdev2dev
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Matthieu Brucher
2010-12-12 20:33:39 UTC
Post by Gael Varoquaux
Post by Vincent Michel
+1
Talking about distances and measures, do you think that this module
could be the place
for functions creating affinity matrices (following the example of
Knn-based affinity
matrix created by Alexandre) ?
I don't see an affinity matrix as a 'metric'. <pedantic>A metric is for me the
generalized notion of a distance in a non Euclidian space</pedantic>. In
other words, I except to have in 'metrics' function that return real
valued numbers.
Now, I agree that the concepts are related, so if you can find a good
name for a subpackage grouping them, I am +1 on your proposal.
In the manifold module, the affinity/similarity is the inverse of the
distance/metrics, so they are pretty much related, yes ;)
Ths is why I can understand Vincent's proposal as well. And it is also
why I offered the utils package, as everything graph related is in
this package (and as for metrics, they can be of use for the end user
as well).

Matthieu
--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
Gael Varoquaux
2010-12-12 20:58:13 UTC
Post by Matthieu Brucher
In the manifold module, the affinity/similarity is the inverse of the
distance/metrics, so they are pretty much related, yes ;)
Ths is why I can understand Vincent's proposal as well. And it is also
why I offered the utils package, as everything graph related is in
this package (and as for metrics, they can be of use for the end user
as well).
My issue with 'utils' is that it is a non-descriptive name. Nobody ever
looks in utils when searching for something, or only after looking
everywhere else.

Naming things is hard, but we should not give up on it.

G
Alexandre Gramfort
2010-12-12 21:18:45 UTC
just pushed scikits.learn.metrics.pairwise

nothing is definitive, but I think it's a first rather good solution.

Alex

On Sun, Dec 12, 2010 at 3:58 PM, Gael Varoquaux
Post by Gael Varoquaux
Post by Matthieu Brucher
In the manifold module, the affinity/similarity is the inverse of the
distance/metrics, so they are pretty much related, yes ;)
Ths is why I can understand Vincent's proposal as well. And it is also
why I offered the utils package, as everything graph related is in
this package (and as for metrics, they can be of use for the end user
as well).
My issue with 'utils' is that it is a non-descriptive name. Nobody ever
looks in utils when searching for something, or only after looking
everywhere else.
Naming things is hard, but we should not give up on it.
G
------------------------------------------------------------------------------
Oracle to DB2 Conversion Guide: Learn learn about native support for PL/SQL,
new data types, scalar functions, improved concurrency, built-in packages,
OCI, SQL*Plus, data movement tools, best practices and more.
http://p.sf.net/sfu/oracle-sfdev2dev
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Olivier Grisel
2010-12-12 21:35:46 UTC
Post by Alexandre Gramfort
just pushed scikits.learn.metrics.pairwise
nothing is definitive, but I think it's a first rather good solution.
On a related subject here is a bunch of code to evaluate
(qualitatively, or quantitatively) the quality of an embedding:

The first is the accuracy at rank k using knn queries:

https://github.com/ogrisel/codemaker/blob/master/src/codemaker/evaluation.py#L31

The other is a scatter plot of the pairwise distances between randomly
sampled points in the original space and the mapped points according
to the embedding:

https://github.com/ogrisel/codemaker/blob/master/src/codemaker/evaluation.py#L50

It also computes the correlation of the sampled pairwise correlations.

I wonder if those similar utilities could be part of the new
scikits.learn.metrics.pairwise module (or maybe in another).
--
Olivier
Matthieu Brucher
2010-12-12 21:42:23 UTC
Post by Olivier Grisel
Post by Alexandre Gramfort
just pushed scikits.learn.metrics.pairwise
nothing is definitive, but I think it's a first rather good solution.
On a related subject here is a bunch of code to evaluate
https://github.com/ogrisel/codemaker/blob/master/src/codemaker/evaluation.py#L31
Good one, I use something like this for my manifold tests
Post by Olivier Grisel
The other is a scatter plot of the pairwise distances between randomly
sampled points in the original space and the mapped points according
https://github.com/ogrisel/codemaker/blob/master/src/codemaker/evaluation.py#L50
It also computes the correlation of the sampled pairwise correlations.
I wonder if those similar utilities could be part of the new
scikits.learn.metrics.pairwise module (or maybe in another).
Indeed, they could be of interest.

On a side note, I didn't have time to finish the manifold-light
branch, I hope to have more time during the holidays (which I don't
have, but that's another story :|). As all these topics are related to
manifold learning, it's an update to the current situation ;)

Matthieu
--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
Mathieu Blondel
2010-12-13 07:18:50 UTC
Hello,

Sorry I don't have time to read the thread but I just want to point
out that scipy has some support for distance metrics in:
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html

This is wrapped C-code and it supports outputting the results in
square form or packed form (without the diagonal elements though).

I think it supports dense matrices only though (I'm mostly interested
in sparse matrices).

On a related note, it would be nice to have reusable dense and sparse
modules to compute kernel (gram) matrices for well-known kernels. This
is useful for using in estimators which have an option
kernel="precomputed". I have code for the sparse case but this is C
code wrapped in Cython (I know you guys prefer all Cython).

Mathieu
Olivier Grisel
2010-12-13 10:46:34 UTC
Post by Mathieu Blondel
Hello,
Sorry I don't have time to read the thread but I just want to point
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
This is wrapped C-code and it supports outputting the results in
square form or packed form (without the diagonal elements though).
Good point. I overlooked that module so far.
Post by Mathieu Blondel
I think it supports dense matrices only though (I'm mostly interested
in sparse matrices).
Indeed.
Post by Mathieu Blondel
On a related note, it would be nice to have reusable dense and sparse
modules to compute kernel (gram) matrices for well-known kernels. This
is useful for using in estimators which have an option
kernel="precomputed". I have code for the sparse case but this is C
code wrapped in Cython (I know you guys prefer all Cython).
What "well-known" kernels do you have in mind? As for C vs Cython, you
can put it in a branch and invite others to help you cythonify it
progressively.
--
Olivier
Mathieu Blondel
2010-12-13 11:23:33 UTC
On Mon, Dec 13, 2010 at 7:46 PM, Olivier Grisel
Post by Olivier Grisel
What "well-known" kernels do you have in mind?
I was thinking of linear, polynomial, rbf/gaussian, sigmoid etc. It
would be nice to factor out this code so that it can be reused in
different estimators. It may be the opportunity to use SIMD
instructions and/or BLAS goodness for the dense case. Also since
distance and kernel matrices are usually symmetric, having an option
to output them in packed format would result in substantial memory
saving.

Mathieu
Gael Varoquaux
2010-12-13 11:24:43 UTC
Post by Mathieu Blondel
I was thinking of linear, polynomial, rbf/gaussian, sigmoid etc. It
would be nice to factor out this code so that it can be reused in
different estimators. It may be the opportunity to use SIMD
instructions and/or BLAS goodness for the dense case. Also since
distance and kernel matrices are usually symmetric, having an option
to output them in packed format would result in substantial memory
saving.
+1. After the release however :) (that fine, we have branches for that).

Gaël
Matthieu Brucher
2010-12-13 11:30:54 UTC
This would be indeed great (the kernels and the release :D)

Matthieu
Post by Gael Varoquaux
Post by Mathieu Blondel
I was thinking of linear, polynomial, rbf/gaussian, sigmoid etc. It
would be nice to factor out this code so that it can be reused in
different estimators. It may be the opportunity to use SIMD
instructions and/or BLAS goodness for the dense case. Also since
distance and kernel matrices are usually symmetric, having an option
to output them in packed format would result in substantial memory
saving.
+1. After the release however :) (that fine, we have branches for that).
Gaël
------------------------------------------------------------------------------
Oracle to DB2 Conversion Guide: Learn learn about native support for PL/SQL,
new data types, scalar functions, improved concurrency, built-in packages,
OCI, SQL*Plus, data movement tools, best practices and more.
http://p.sf.net/sfu/oracle-sfdev2dev
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
Mathieu Blondel
2010-12-13 11:49:17 UTC
Post by Alexandre Gramfort
scikits.learn.metrics
scikits.learn.metrics.pairwise
scikits.learn.metrics.pairwise.sparse
scikits.learn.metrics.kernel
scikits.learn.metrics.kernel.sparse
Hum, I've just realized that putting kernels in the pairwise module is
also an option!

Mathieu
Alexandre Gramfort
2010-12-13 14:25:03 UTC
looks like we have a cool project for the next sprint after the release :)

Alex
Post by Mathieu Blondel
Post by Alexandre Gramfort
scikits.learn.metrics
scikits.learn.metrics.pairwise
scikits.learn.metrics.pairwise.sparse
scikits.learn.metrics.kernel
scikits.learn.metrics.kernel.sparse
Hum, I've just realized that putting kernels in the pairwise module is
also an option!
Mathieu
------------------------------------------------------------------------------
Oracle to DB2 Conversion Guide: Learn learn about native support for PL/SQL,
new data types, scalar functions, improved concurrency, built-in packages,
OCI, SQL*Plus, data movement tools, best practices and more.
http://p.sf.net/sfu/oracle-sfdev2dev
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Olivier Grisel
2010-12-13 15:18:16 UTC
Post by Alexandre Gramfort
looks like we have a cool project for the next sprint after the release :)
Great when is this scheduled? Would it be possible to organize it
somewhere were you don't need to climb 200m high on a cold
pseudo-mountain? :)
--
Olivier
Alexandre Gramfort
2010-12-14 02:31:44 UTC
Post by Olivier Grisel
Great when is this scheduled? Would it be possible to organize it
somewhere were you don't need to climb 200m high on a cold
pseudo-mountain? :)
would Boston work for you? :)

Alex
Mathieu Blondel
2010-12-13 11:46:25 UTC
On Mon, Dec 13, 2010 at 8:24 PM, Gael Varoquaux
Post by Gael Varoquaux
+1. After the release however :) (that fine, we have branches for that).
OK for after the release, there's no rush!

So we can have:

scikits.learn.metrics
scikits.learn.metrics.pairwise
scikits.learn.metrics.pairwise.sparse
scikits.learn.metrics.kernel
scikits.learn.metrics.kernel.sparse

Also, I was thinking of splitting the current metrics module into

scikits.learn.metrics.score
scikits.learn.metrics.loss

but I didn't give it much thought!

For both pairwise and kernel, I want pdist(X, metric) (pairwise
distance matrix of X with itself) and cdist(Xa, Xb, metric) (pairwise
distance matrix of Xa and Xb). For scikits.learn.metrics.pairwise, we
can just import the functions from scipy.spatial.distance. For
scikits.learn.metrics.pairwise.sparse, we can implement it ourselves
and maybe this can make it to scipy later on... We're reaching a point
where we need things which should rather be in scipy but aren't
available yet!

Mathieu
Zak Stone
2011-01-17 13:42:22 UTC
Hi folks,

I just ran across scikits.learn.metrics.pairwise, and I have two quick

1. An old NumPy discussion thread empirically claims that a different
algorithm would be faster than the one currently implemented, though a
lot has probably changed since them:

http://mail.scipy.org/pipermail/numpy-discussion/2007-April/027203.html
http://mail.scipy.org/pipermail/numpy-discussion/2007-April/027166.html

2. The method is called euclidian_distances, but "Euclidean" is the
more common spelling.

Also, it might be interesting to consider the cases in which one or
both of the vector sets being compared do not fit in memory. This
comes up in large-scale brute-force nearest-neighbor computations.

Zak

On Sun, Dec 12, 2010 at 4:18 PM, Alexandre Gramfort
Post by Alexandre Gramfort
just pushed scikits.learn.metrics.pairwise
nothing is definitive, but I think it's a first rather good solution.
Alex
On Sun, Dec 12, 2010 at 3:58 PM, Gael Varoquaux
Post by Gael Varoquaux
Post by Matthieu Brucher
In the manifold module, the affinity/similarity is the inverse of the
distance/metrics, so they are pretty much related, yes ;)
Ths is why I can understand Vincent's proposal as well. And it is also
why I offered the utils package, as everything graph related is in
this package (and as for metrics, they can be of use for the end user
as well).
My issue with 'utils' is that it is a non-descriptive name. Nobody ever
looks in utils when searching for something, or only after looking
everywhere else.
Naming things is hard, but we should not give up on it.
G
------------------------------------------------------------------------------
Oracle to DB2 Conversion Guide: Learn learn about native support for PL/SQL,
new data types, scalar functions, improved concurrency, built-in packages,
OCI, SQL*Plus, data movement tools, best practices and more.
http://p.sf.net/sfu/oracle-sfdev2dev
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Oracle to DB2 Conversion Guide: Learn learn about native support for PL/SQL,
new data types, scalar functions, improved concurrency, built-in packages,
OCI, SQL*Plus, data movement tools, best practices and more.
http://p.sf.net/sfu/oracle-sfdev2dev
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Alexandre Gramfort
2011-01-17 13:57:47 UTC
Hi Zak,
Post by Zak Stone
I just ran across scikits.learn.metrics.pairwise, and I have two quick
1. An old NumPy discussion thread empirically claims that a different
algorithm would be faster than the one currently implemented, though a
http://mail.scipy.org/pipermail/numpy-discussion/2007-April/027203.html
http://mail.scipy.org/pipermail/numpy-discussion/2007-April/027166.html
interesting. Could you reproduce the bench to see if they are possible
improvements in this code?
Post by Zak Stone
2. The method is called euclidian_distances, but "Euclidean" is the
more common spelling.
if you suggest to rename the function, it would be against the PEP8 coding
style. If it's in the docstring, it should be changed.
Post by Zak Stone
Also, it might be interesting to consider the cases in which one or
both of the vector sets being compared do not fit in memory. This
comes up in large-scale brute-force nearest-neighbor computations.
you mean X very big and Y small or vice versa? how would you address it?

Alex
Alexandre Passos
2011-01-17 14:00:26 UTC
On Mon, Jan 17, 2011 at 11:57, Alexandre Gramfort
Post by Alexandre Gramfort
Hi Zak,
Post by Zak Stone
I just ran across scikits.learn.metrics.pairwise, and I have two quick
1. An old NumPy discussion thread empirically claims that a different
algorithm would be faster than the one currently implemented, though a
http://mail.scipy.org/pipermail/numpy-discussion/2007-April/027203.html
http://mail.scipy.org/pipermail/numpy-discussion/2007-April/027166.html
interesting. Could you reproduce the bench to see if they are possible
improvements in this code?
Post by Zak Stone
2. The method is called euclidian_distances, but "Euclidean" is the
more common spelling.
if you suggest to rename the function, it would be against the PEP8 coding
style. If it's in the docstring, it should be changed.
I think his point is that it should be euclidEan_distance instead of
euclidIan_distance (which is more like portuguese and I assume
french).
--
- Alexandre
Alexandre Gramfort
2011-01-17 14:04:42 UTC
Post by Alexandre Passos
Post by Alexandre Gramfort
if you suggest to rename the function, it would be against the PEP8 coding
style. If it's in the docstring, it should be changed.
I think his point is that it should be euclidEan_distance instead of
euclidIan_distance (which is more like portuguese and I assume
french).
indeed it's with an "i" in french :)

I'll fix it

Alex
Matthieu Brucher
2011-01-17 14:12:55 UTC
Post by Zak Stone
Hi folks,
I just ran across scikits.learn.metrics.pairwise, and I have two quick
1. An old NumPy discussion thread empirically claims that a different
algorithm would be faster than the one currently implemented, though a
http://mail.scipy.org/pipermail/numpy-discussion/2007-April/027203.html
http://mail.scipy.org/pipermail/numpy-discussion/2007-April/027166.html
2. The method is called euclidian_distances, but "Euclidean" is the
more common spelling.
Also, it might be interesting to consider the cases in which one or
both of the vector sets being compared do not fit in memory. This
comes up in large-scale brute-force nearest-neighbor computations.
Zak
This the function that was implemented in the different manifold branches. I
was told that the scikit's function was more efficient.

Matthieu
--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
Alexandre Gramfort
2011-01-17 14:15:50 UTC
Post by Matthieu Brucher
This the function that was implemented in the different manifold branches. I
was told that the scikit's function was more efficient.
As I said feel free to send a patch to improve this function

When I implemented it I thought it was the best I could do. I can be wrong.

Alex
Zak Stone
2011-01-17 14:24:02 UTC
Thanks for updating the spelling! My guess is that the best choice
will depend heavily on the dimensionality of the features and other
aspects of the problem, so there may not be a universal "right

Zak

On Mon, Jan 17, 2011 at 9:15 AM, Alexandre Gramfort
Post by Alexandre Gramfort
Post by Matthieu Brucher
This the function that was implemented in the different manifold branches. I
was told that the scikit's function was more efficient.
As I said feel free to send a patch to improve this function
When I implemented it I thought it was the best I could do. I can be wrong.
Alex
------------------------------------------------------------------------------
Protect Your Site and Customers from Malware Attacks
Learn about various malware tactics and how to avoid them. Understand
malware threats, the impact they can have on your business, and how you
can protect your company and customers by using code signing.
http://p.sf.net/sfu/oracle-sfdevnl
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Alexandre Gramfort
2011-01-17 14:26:20 UTC
I thing we should provide both implementations then.

Alex
Post by Zak Stone
Thanks for updating the spelling! My guess is that the best choice
will depend heavily on the dimensionality of the features and other
aspects of the problem, so there may not be a universal "right
Zak
On Mon, Jan 17, 2011 at 9:15 AM, Alexandre Gramfort
Post by Alexandre Gramfort
Post by Matthieu Brucher
This the function that was implemented in the different manifold branches. I
was told that the scikit's function was more efficient.
As I said feel free to send a patch to improve this function
When I implemented it I thought it was the best I could do. I can be wrong.
Alex
------------------------------------------------------------------------------
Protect Your Site and Customers from Malware Attacks
Learn about various malware tactics and how to avoid them. Understand
malware threats, the impact they can have on your business, and how you
can protect your company and customers by using code signing.
http://p.sf.net/sfu/oracle-sfdevnl
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Protect Your Site and Customers from Malware Attacks
Learn about various malware tactics and how to avoid them. Understand
malware threats, the impact they can have on your business, and how you
can protect your company and customers by using code signing.
http://p.sf.net/sfu/oracle-sfdevnl
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
James Bergstra
2011-01-17 14:42:00 UTC
Roughy speaking in low-d the for loop implementation might be better, but
beyond a few dimensions (2? 7? 50?) the BLAS-powered version should be
faster.

I'm afraid the best choice will also depend on the various cache sizes in
the computer, and the version of BLAS that is installed. Picking an
automatic tradeoff point is probably not easy.

On Mon, Jan 17, 2011 at 9:26 AM, Alexandre Gramfort <
Post by Alexandre Gramfort
I thing we should provide both implementations then.
Alex
Post by Zak Stone
Thanks for updating the spelling! My guess is that the best choice
will depend heavily on the dimensionality of the features and other
aspects of the problem, so there may not be a universal "right
Zak
On Mon, Jan 17, 2011 at 9:15 AM, Alexandre Gramfort
Post by Alexandre Gramfort
Post by Matthieu Brucher
This the function that was implemented in the different manifold
branches. I
Post by Zak Stone
Post by Alexandre Gramfort
Post by Matthieu Brucher
was told that the scikit's function was more efficient.
As I said feel free to send a patch to improve this function
When I implemented it I thought it was the best I could do. I can be
wrong.
------------------------------------------------------------------------------
Post by Zak Stone
Post by Alexandre Gramfort
Protect Your Site and Customers from Malware Attacks
Learn about various malware tactics and how to avoid them. Understand
malware threats, the impact they can have on your business, and how you
can protect your company and customers by using code signing.
http://p.sf.net/sfu/oracle-sfdevnl
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Zak Stone
Protect Your Site and Customers from Malware Attacks
Learn about various malware tactics and how to avoid them. Understand
malware threats, the impact they can have on your business, and how you
can protect your company and customers by using code signing.
http://p.sf.net/sfu/oracle-sfdevnl
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Protect Your Site and Customers from Malware Attacks
Learn about various malware tactics and how to avoid them. Understand
malware threats, the impact they can have on your business, and how you
can protect your company and customers by using code signing.
http://p.sf.net/sfu/oracle-sfdevnl
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
http://www-etud.iro.umontreal.ca/~bergstrj
Zak Stone
2011-01-17 14:51:34 UTC
Post by James Bergstra
I'm afraid the best choice will also depend on the various cache sizes in
the computer, and the version of BLAS that is installed. Picking an
automatic tradeoff point is probably not easy.
This is my impression as well. And everything changes again when you
work with datasets that are larger than memory; in my own experiments,
disk bandwidth seemed to be the bottleneck, so I ended up comparing
chunks of the X and Y arrays that would fit in memory to do a single
pass through each. I also used Cython and pthreads to take advantage
of all available cores in a straightforward way.

Zak
Olivier Grisel
2011-01-17 14:56:05 UTC
Post by Zak Stone
Post by James Bergstra
I'm afraid the best choice will also depend on the various cache sizes in
the computer, and the version of BLAS that is installed. Picking an
automatic tradeoff point is probably not easy.
This is my impression as well. And everything changes again when you
work with datasets that are larger than memory; in my own experiments,
disk bandwidth seemed to be the bottleneck, so I ended up comparing
chunks of the X and Y arrays that would fit in memory to do a single
pass through each. I also used Cython and pthreads to take advantage
of all available cores in a straightforward way.
Any idea if cython + the right gcc flags are able to unroll and SSE
vectorize the loops with mul-adds -style instructions ?
--
Olivier
Alexandre Gramfort
2011-01-17 15:01:03 UTC
or so maybe this other implementation can be interesting for manifold
learning when the neighborhoods are small.

Alex

On Mon, Jan 17, 2011 at 9:42 AM, James Bergstra
Post by James Bergstra
Roughy speaking in low-d the for loop implementation might be better, but
beyond a few dimensions (2? 7? 50?) the BLAS-powered version should be
faster.
I'm afraid the best choice will also depend on the various cache sizes in
the computer, and the version of BLAS that is installed. Picking an
automatic tradeoff point is probably not easy.
On Mon, Jan 17, 2011 at 9:26 AM, Alexandre Gramfort
Post by Alexandre Gramfort
I thing we should provide both implementations then.
Alex
Post by Zak Stone
Thanks for updating the spelling! My guess is that the best choice
will depend heavily on the dimensionality of the features and other
aspects of the problem, so there may not be a universal "right
Zak
On Mon, Jan 17, 2011 at 9:15 AM, Alexandre Gramfort
Post by Alexandre Gramfort
Post by Matthieu Brucher
This the function that was implemented in the different manifold branches. I
was told that the scikit's function was more efficient.
As I said feel free to send a patch to improve this function
When I implemented it I thought it was the best I could do. I can be wrong.
Alex
------------------------------------------------------------------------------
Protect Your Site and Customers from Malware Attacks
Learn about various malware tactics and how to avoid them. Understand
malware threats, the impact they can have on your business, and how you
can protect your company and customers by using code signing.
http://p.sf.net/sfu/oracle-sfdevnl
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Protect Your Site and Customers from Malware Attacks
Learn about various malware tactics and how to avoid them. Understand
malware threats, the impact they can have on your business, and how you
can protect your company and customers by using code signing.
http://p.sf.net/sfu/oracle-sfdevnl
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Protect Your Site and Customers from Malware Attacks
Learn about various malware tactics and how to avoid them. Understand
malware threats, the impact they can have on your business, and how you
can protect your company and customers by using code signing.
http://p.sf.net/sfu/oracle-sfdevnl
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
http://www-etud.iro.umontreal.ca/~bergstrj
------------------------------------------------------------------------------
Protect Your Site and Customers from Malware Attacks
Learn about various malware tactics and how to avoid them. Understand
malware threats, the impact they can have on your business, and how you
can protect your company and customers by using code signing.
http://p.sf.net/sfu/oracle-sfdevnl
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general