Discussion:
[Scikit-learn-general] The confusion about similarity matrix in spectral clustering
xinfan meng
2011-01-17 07:14:36 UTC
The documentation gives an example of spectral clustering, but it is not
immediately clear what requirements the similarity matrix should
satisfy.
Gael Varoquaux
2011-01-17 08:00:48 UTC
I will need to check (right now I am on a mobile phone), but I think that the only restriction on the similarity matrix is that it is made of positive or zero coefficients. The code might not handle graphs with several connected components, but we should be able to fix that.

Anyhow, the docs need to be fixed to specify that matrix better. Could you report back on what worked for you?
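
Something like the following is the kind of check I have in mind (an
untested sketch; `sim` stands for your dense similarity array):

import numpy as np

def check_similarity(sim):
    # Sanity checks for the constraints discussed above.
    assert sim.ndim == 2 and sim.shape[0] == sim.shape[1], "must be square"
    assert np.all(np.isfinite(sim)), "no infs or NaNs"
    assert np.all(sim >= 0), "coefficients must be positive or zero"
    assert np.allclose(sim, sim.T), "similarities should be symmetric"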

Cheers,

Gael

xinfan meng
2011-01-17 08:25:17 UTC
np.exp(-sim_matrix ** 2 / (2 * delta ** 2)) seems to work. Some papers
have used this similarity function too, so I think I might just settle
on this function for the moment.

Matthieu Brucher
2011-01-17 08:34:09 UTC
Hi,

It's strange that this works, because you already have a similarity matrix.
This function turns a similarity matrix into a distance-like one and
vice versa. So once you pass your matrix through this kernel (it's the heat
kernel, FYI), you transform small similarities (near 0) into big ones
(around 1), and big similarities into small ones (near 0). This is _not_
what you want for spectral clustering.
Perhaps there is an issue with your data type as well?
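
To see it numerically (a quick sketch, with delta = 1):

import numpy as np

delta = 1.0
values = np.array([0.0, 0.1, 1.0, 3.0])  # small to large input values
print(np.round(np.exp(-values ** 2 / (2 * delta ** 2)), 3))
# -> [1.    0.995 0.607 0.011]: inputs near 0 map to ~1, large ones to ~0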

Matthieu
Gael Varoquaux
2011-01-17 08:57:46 UTC
Post by xinfan meng
np.exp(-sim_matrix ** 2 / (2 * delta ** 2)) seems to work. Some papers
have used this similarity function too.
Yes, it's called the heat kernel, and it is very often used to map values
between similarities and distances.

Also, as this function is a decreasing function, the fact that it works
better on your data makes me realize that you were probably passing in a
similarity matrix (small values = entries not alike, large values =
entries similar), whereas the code actually wants a distance matrix (0 =
same entries, large = not similar). We should also add this to the
documentation.
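
For the record, the usual direction is to build similarities from
distances with this kernel, along these lines (a sketch; the bandwidth
heuristic is only one common choice):

import numpy as np
from scipy.spatial.distance import cdist

X = np.random.rand(20, 3)                    # illustrative data
dist = cdist(X, X)                           # pairwise Euclidean distances
delta = dist.mean()                          # one common bandwidth heuristic
sim = np.exp(-dist ** 2 / (2 * delta ** 2))  # large distance -> small similarity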

Can you confirm this, so that I can make the documentation more explicit?

G
xinfan meng
2011-01-17 09:05:27 UTC
Sorry, I got it wrong: I have not actually checked the result.
By "works" I meant that it did not trigger any error.

I need to do further experiments; I am working on a large matrix, so I
cannot immediately tell whether the result is better or not.

Matthieu Brucher
2011-01-17 08:30:23 UTC
Hi,

Spectral clustering should work with several graph components. Each
component will have its own set of coordinates in the embedded space, so if
this case arises, you have to use a higher-dimensional embedded space.

Matthieu
Gael Varoquaux
2011-01-17 08:55:37 UTC
Post by Matthieu Brucher
Spectral clustering should work with several graph components. Each
component will have its own set of coordinates in the embedded space, so if
this case arises, you have to use a higher-dimensional embedded space.
It _should_, but the current implementation in the scikit won't.

As a side note, I don't think that the implementations of spectral embedding
in the manifold branch were robust to that problem either. In my
experience, the only way to be robust to that problem is to split the
graph into its connected components and apply the spectral method to each
one of them.
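
In code, the splitting I mean looks roughly like this (a sketch;
`spectral` is a placeholder for whatever spectral routine you apply to a
connected graph):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def cluster_per_component(sim, spectral):
    # Split the graph into its connected components and run the spectral
    # method on each sub-matrix separately.
    n_comp, labels = connected_components(csr_matrix(sim), directed=False)
    results = {}
    for c in range(n_comp):
        idx = np.where(labels == c)[0]
        results[c] = (idx, spectral(sim[np.ix_(idx, idx)]))
    return results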

G
xinfan meng
2011-01-17 09:11:56 UTC
As far as I can see from the source, the clustering is done with k-means.
So is it the algorithm described in "On Spectral Clustering: Analysis and
an algorithm" by Andrew Y. Ng et al.?

--
Best Wishes
--------------------------------------------
Meng Xinfan(蒙新泛)
Institute of Computational Linguistics
Department of Computer Science & Technology
School of Electronic Engineering & Computer Science
Peking University
Beijing, 100871
China
Gael Varoquaux
2011-01-17 09:29:54 UTC
Post by xinfan meng
As far as I can see from the source, the clustering is done with k-means.
So is it the algorithm described in "On Spectral Clustering: Analysis and
an algorithm" by Andrew Y. Ng et al.?
The k-means part is not the part Matthieu was referring to, but rather
whether the graph Laplacian is normalized or not. In the code you are
using, the graph Laplacian is normalized. This is often called 'normalized
cuts' in the graph-cuts framework.

I believe that the algorithm implemented is the one cited in the Ng paper
as the Meila and Shi algorithm. I am not sure what the canonical
reference for this algorithm is (Matthieu, any suggestions?). For the
2-cluster problem, it is the Shi and Malik paper on normalized cuts.
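
For reference, the normalization in question is the symmetric one (a
sketch; assumes a symmetric affinity matrix W with no isolated nodes):

import numpy as np

def normalized_laplacian(W):
    # L_sym = I - D^{-1/2} W D^{-1/2}, as used in normalized cuts.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]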

HTH,

Gaël
Matthieu Brucher
2011-01-17 09:58:01 UTC
Post by Gael Varoquaux
I believe that the algorithm implemented is the one cited in the Ng paper
as the Meila and Shi algorithm. For the 2-cluster problem, it is the Shi
and Malik paper on normalized cuts.
I think it is the same.
The normalization of the Laplacian is always used (well, at least when
solving the problem this way; the problem can be expressed differently, but
with the same solution, cf. the papers on Diffusion Maps), but you can also
normalize the input distance matrix before turning it into a similarity
matrix. In your case, you only have a similarity matrix, so you don't have
to normalize it.

As Gaël said in the new docstring, the input for the spectral_clustering
class is a similarity matrix (cf. also his example in the examples folder
that shows how he created a similarity matrix from a distance matrix for the
lena image).

Matthieu
--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
Matthieu Brucher
2011-01-17 09:46:27 UTC
Post by Gael Varoquaux
In my experience, the only way to be robust to that problem is to split the
graph into its connected components and apply the spectral method to each
one of them.
You don't have to split the graph into connected components for the manifold
branch, at least when I checked it at the beginning (I had one test, IIRC,
and the plot displayed 2 components).

One way to test for this is to use the IRIS dataset: get rid of the one
sample that makes the graph connected with a k-neighborhood of 8, create the
similarity matrix and display the plot. In 3D, it should display two classes
in a 2D plane, and the last class in the third dimension (see the sketch
below).

I agree that splitting the samples is best if you want each component to
have a specific number of dimensions, but if you only care about the most
important dimensions, the diagonalization doesn't care about the graph being
connected (which isn't the case for algorithms based on distances).
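
A rough version of that check (a sketch; it uses function names from a
recent scikit release, and which sample to drop to disconnect the graph
depends on the data and on k):

import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.datasets import load_iris
from sklearn.neighbors import kneighbors_graph

X = load_iris().data
graph = kneighbors_graph(X, n_neighbors=8, include_self=False)
graph = 0.5 * (graph + graph.T)  # symmetrize the k-NN graph
print(connected_components(graph, directed=False)[0])  # number of components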

Matthieu
Gael Varoquaux
2011-01-17 09:50:02 UTC
Maybe this works on this specific example, but in my experience it is
very unstable on large datasets. The way I understand it is that the
Laplacian has a kernel subspace of dimension 2, and the resulting
ambiguity (any vector in the subspace can be a solution of the eigenvalue
problem) makes the linear algebra solvers unstable.
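
A toy case makes the degeneracy visible (a sketch: two fully connected
blocks, hence a two-dimensional null space):

import numpy as np
from scipy.linalg import eigh

W = np.zeros((6, 6))
W[:3, :3] = 1.0                 # first component
W[3:, 3:] = 1.0                 # second component
L = np.diag(W.sum(axis=1)) - W  # unnormalized graph Laplacian
print(np.round(eigh(L)[0], 6))  # two zero eigenvalues: any rotation of the
                                # corresponding eigenvectors is a solution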

G
Matthieu Brucher
2011-01-17 10:04:58 UTC
Perhaps... What I said is what I understood from the LLE paper, so perhaps
the Laplacian kernel has some "features" that would impede this approach in
general.

Matthieu
Matthieu Brucher
2011-01-17 08:36:30 UTC
Post by xinfan meng
However, I am not satisfied with this solution. This value is too
arbitrary. Are there any guidelines on how to post-process this matrix?
Should I normalize this matrix? Thanks.
Some spectral algorithms do, others don't. Laplacian Eigenmaps uses the
similarity matrix directly, but Diffusion Maps normalizes the matrix.
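
The normalization I mean is just the row normalization that turns the
similarity matrix into a Markov transition matrix (a sketch):

import numpy as np

def diffusion_transition(W):
    # Diffusion Maps work with P = D^-1 W (each row sums to 1), whereas
    # Laplacian Eigenmaps uses W directly.
    return W / W.sum(axis=1)[:, None]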

Matthieu
Gael Varoquaux
2011-01-17 09:34:33 UTC
Post by xinfan meng
When I directly use this matrix, I get an error: "ValueError: array
must not contain infs or NaNs". I figure it is because there are zeros
in the matrix.
Could you please send a full traceback, so that I can see what the
problem is exactly, and whether I can make the code more robust?
Post by xinfan meng
So I added 1 to the matrix, but then I get: "LinAlgError: 7-th leading
minor not positive definite".
Same thing here,

Cheers,

Gaël
xinfan meng
2011-01-17 11:13:55 UTC
It is strange that I cannot reproduce the results now. I will try later.

BTW, when I tried to compute cosine distances for my matrix, I found
it too slow.

Then I tried the following code:

In [2]: import numpy as np

In [3]: data = np.random.randn(5000, 5000)

In [4]: data1 = np.random.randn(5000, 5000)

In [5]: np.dot(data, data1)

the dot operation took minutes, which is too slow for an Intel(R) Xeon(R)
CPU W3503 @ 2.40GHz with 12 GB of memory.

Do you think the problem is caused by having installed the reference
LAPACK and BLAS instead of ATLAS?

Alexandre Gramfort
2011-01-17 14:02:33 UTC
FYI, it takes 25s on my laptop with EPD linked against Intel MKL.

Indeed, I suspect your version of numpy uses numpy's own hand-written C
code and not ATLAS.
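
You can check which BLAS your numpy was built against, and time the
product, with something like this (a sketch; the output format varies
between builds):

import time
import numpy as np

np.show_config()  # shows which BLAS/LAPACK numpy is linked against

a = np.random.randn(2000, 2000)
b = np.random.randn(2000, 2000)
t0 = time.time()
np.dot(a, b)
print("dot took %.1f s" % (time.time() - t0))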

Alex
Olivier Grisel
2011-01-17 14:14:38 UTC
It takes 55s on my laptop with the Ubuntu 64-bit packaged numpy +
ATLAS (which is unfortunately not multi-threaded).
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
xinfan meng
2011-01-17 14:36:51 UTC
Thanks, the case is confirmed: after I installed ATLAS, the same code
runs within 5 seconds. I am using Arch Linux and the installation is
relatively simple. I think numpy should depend on ATLAS instead of the
reference LAPACK and BLAS; otherwise there will certainly be many people
complaining about Python's efficiency. BTW, the same code also ran very
fast in Sage, without the need to compile sources manually.

I am trying affinity propagation and spectral clustering on words now.
I hope that I can report the results very soon.

Gael Varoquaux
2011-01-17 16:00:15 UTC
File a bug report with your distribution; they are the ones who make
these decisions.

Gaël
