Discussion:
Unexpected behavior of Isomap
Alejandro Weinstein
2011-11-07 02:45:57 UTC
Hi:

I am observing an unexpected behavior of Isomap, related to the
dimensions of the transformed data. If I generate random data, say
1000 points each with dimension 10, and fit a transform using as a
parameter out_dim=3, the fitted data has dimension (1000, 3), as
expected. However, when I repeat the same steps but this time using my
data set consisting of 427 points, each of dimension 400, the fitted
data has dimension (427, 2), i.e., the output dimension is 1 less than
out_dim. Using LLE with the same data set and parameters, the fitted
data has the expected dimension (427, 3).

The following code illustrates the phenomenon:

#############################################
import numpy as np
from sklearn import manifold

n = 1000
m = 10
X = np.random.rand(n, m)
n_neighbors = 5
out_dim = 3

Y = manifold.Isomap(n_neighbors, out_dim).fit_transform(X)
print 'Using random data and Isomap'
print 'X shape:%s, out_dim:%d, Y shape: %s' % (X.shape, out_dim, Y.shape)

X = np.load('X.npy')  # the 427 x 400 LSA data set
Y = manifold.Isomap(n_neighbors, out_dim).fit_transform(X)
print
print 'Using the data X.npy and Isomap'
print 'X shape:%s, out_dim:%d, Y shape: %s' % (X.shape, out_dim, Y.shape)

Y = manifold.LocallyLinearEmbedding(n_neighbors, out_dim).fit_transform(X)
print
print 'Using the data X.npy and LLE'
print 'X shape:%s, out_dim:%d, Y shape: %s' % (X.shape, out_dim, Y.shape)
##################################################################

And this is the output:

Using random data and Isomap
X shape:(1000, 10), out_dim:3, Y shape: (1000, 3)

Using the data X.npy and Isomap
X shape:(427, 400), out_dim:3, Y shape: (427, 2)

Using the data X.npy and LLE
X shape:(427, 400), out_dim:3, Y shape: (427, 3)

The code and the data set are available at
https://github.com/aweinstein/scrapcode

In case it is relevant, the data set consists of documents represented
in the Latent Semantic Analysis space.

Is this the expected behavior of Isomap, or is there something wrong?

Alejandro.
Jacob VanderPlas
2011-11-07 03:53:46 UTC
Alejandro,
It looks like the problem can be traced back to the ARPACK eigensolver.
If you run the code with eigen_solver='dense', it works as expected.
Sometimes arpack does not converge to all the requested eigenvalues, and
I guess there's no error reported when that happens.

I tried performing the eigenvalue decomposition using the scipy
development version of arpack, and it gives 3 dimensions as expected.
It may be that we can fix this by updating the arpack wrapper from scipy.
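For reference, the workaround can be written as follows (a sketch assuming
a recent scikit-learn, where the out_dim parameter is called n_components):

```python
import numpy as np
from sklearn import manifold

# Reproducible random data; the shapes here are illustrative.
rng = np.random.RandomState(0)
X = rng.rand(200, 10)

# Forcing the dense (LAPACK) eigensolver sidesteps the ARPACK
# convergence issue: all requested components are always returned.
iso = manifold.Isomap(n_neighbors=5, n_components=3, eigen_solver='dense')
Y = iso.fit_transform(X)
print(Y.shape)  # (200, 3)
```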
Jake
Jacob VanderPlas
2011-11-07 04:22:06 UTC
I looked closer: it turns out arpack is actually up-to-date.

I think the bug is in the kernel pca code: eigsh should be called with
keyword which='LA' rather than which='LM'. The fit_transform routine
was finding three vectors, and then removing the one with a negative
eigenvalue.

Before making this change, I want to understand what's going on. Does
anybody know if kernel PCA makes any assumptions about kernel form? I
know the kernel must be symmetric, but does the algorithm assume it's
positive (semi) definite?
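The difference between the two modes can be seen on a small indefinite
matrix (a sketch using scipy.sparse.linalg.eigsh on a diagonal matrix with
a known spectrum):

```python
import numpy as np
from scipy.sparse.linalg import eigsh

# A symmetric matrix with a known spectrum, including one large
# negative eigenvalue.
A = np.diag([5.0, 3.0, 1.0, 0.5, -4.0])

# 'LM' selects by magnitude, so the negative eigenvalue -4 is among
# the top three; 'LA' selects the algebraically largest, so it is not.
lm, _ = eigsh(A, k=3, which='LM')
la, _ = eigsh(A, k=3, which='LA')
print(sorted(lm))  # approximately [-4, 3, 5]
print(sorted(la))  # approximately [1, 3, 5]
```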
Jake
Jacob VanderPlas
2011-11-07 19:32:00 UTC
I dug around a bit, and found some info about kernel form in this document:
http://people.kyb.tuebingen.mpg.de/lcayton/resexam.pdf

MDS (on which Isomap is based) assumes a Euclidean distance matrix,
which can be shown to always yield a positive semidefinite kernel. In
the case of Isomap, the distance matrix is not Euclidean in general, and
this can be fixed by ignoring any eigenvectors associated with negative
eigenvalues.
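A quick numerical check of both claims, assuming the standard
double-centering K = -0.5 * J D^2 J with J = I - (1/n) 1 1^T: squared
Euclidean distances always give a positive semidefinite K, while a
non-Euclidean metric (here a star/tree metric on four points) does not:

```python
import numpy as np

def mds_kernel(D):
    """Double-centered kernel K = -0.5 * J D**2 J used by classical MDS."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * J.dot(D ** 2).dot(J)

# Euclidean case: distances between random points -> K is PSD.
rng = np.random.RandomState(0)
X = rng.rand(20, 3)
D_euc = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
min_euc = np.linalg.eigvalsh(mds_kernel(D_euc)).min()

# Non-Euclidean case: a star (tree) metric on 4 points -- a hub at
# distance 1 from three leaves that are pairwise at distance 2.
# No Euclidean configuration realizes this, so K is indefinite.
D_star = np.array([[0., 1., 1., 1.],
                   [1., 0., 2., 2.],
                   [1., 2., 0., 2.],
                   [1., 2., 2., 0.]])
min_star = np.linalg.eigvalsh(mds_kernel(D_star)).min()

print(min_euc)   # ~0 (PSD up to roundoff)
print(min_star)  # strictly negative
```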
I think, based on this, that KernelPCA is correct as written, except
that the arpack method should use which='LA' rather than which='LM'
(thus ignoring any negative eigenvalues). This would fix Alejandro's
problem. I'll make the change in master.
Thanks for the detail & example code in your question, Alejandro - it
made it very easy to track down this bug.
Jake
Alejandro Weinstein
2011-11-10 03:00:54 UTC
On Mon, Nov 7, 2011 at 12:32 PM, Jacob VanderPlas
Post by Jacob VanderPlas
I think, based on this, that KernelPCA is correct as written, except
that the arpack method should use which='LA' rather than which='LM'
(thus ignoring any negative eigenvalues).  This would fix Alejandro's
problem.  I'll make the change in master.
Does this also affect Hessian LLE and Local Tangent Space Alignment? I
am observing some problems with these two that are solved by using
eigen_solver='dense'. The following code illustrates the problem with
Hessian LLE (I reported the problem with LTSA in a previous post):

##############################################
import numpy as np
from sklearn import manifold

n = 1000
m = 50
X = np.random.rand(n, m)
out_dim = 2
n_neighbors = 10
Y = manifold.LocallyLinearEmbedding(n_neighbors, out_dim,
                                    eigen_solver='dense',
                                    method='hessian').fit_transform(X)
print "Just computed HLLE using eigen_solver='dense'"
Y = manifold.LocallyLinearEmbedding(n_neighbors, out_dim,
                                    method='hessian').fit_transform(X)
##############################################

And this is the output:

Just computed HLLE using eigen_solver='dense'
Traceback (most recent call last):
File "hlle_test.py", line 15, in <module>
method='hessian').fit_transform(X)
File "/usr/local/lib/python2.6/dist-packages/sklearn/manifold/locally_linear.py",
line 575, in fit_transform
self._fit_transform(X)
File "/usr/local/lib/python2.6/dist-packages/sklearn/manifold/locally_linear.py",
line 546, in _fit_transform
hessian_tol=self.hessian_tol, modified_tol=self.modified_tol)
File "/usr/local/lib/python2.6/dist-packages/sklearn/manifold/locally_linear.py",
line 459, in locally_linear_embedding
tol=tol, max_iter=max_iter)
File "/usr/local/lib/python2.6/dist-packages/sklearn/manifold/locally_linear.py",
line 141, in null_space
tol=tol, maxiter=max_iter)
File "/usr/local/lib/python2.6/dist-packages/sklearn/utils/arpack.py",
line 1488, in eigsh
symmetric=True, tol=tol)
File "/usr/local/lib/python2.6/dist-packages/sklearn/utils/arpack.py",
line 1010, in get_OPinv_matvec
return get_inv_matvec(A, symmetric=symmetric, tol=tol)
File "/usr/local/lib/python2.6/dist-packages/sklearn/utils/arpack.py",
line 1003, in get_inv_matvec
return SpLuInv(M).matvec
File "/usr/local/lib/python2.6/dist-packages/sklearn/utils/arpack.py",
line 896, in __init__
self.M_lu = splu(M)
File "/usr/local/lib/python2.6/dist-packages/scipy/sparse/linalg/dsolve/linsolve.py",
line 173, in splu
ilu=False, options=_options)
RuntimeError: Factor is exactly singular

Alejandro
bthirion
2011-11-10 08:21:54 UTC
In your case, the rank of the Gram matrix is much smaller than the
number of samples, which means that we need to add some kind of
regularization to it.
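For illustration, here is a minimal sketch of that kind of regularization:
adding a small ridge eps * I (the value of eps is an arbitrary illustrative
choice) before the sparse LU factorization that fails in the traceback:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import splu

# An exactly singular sparse matrix (empty row/column), mimicking the
# rank-deficient operator that makes splu fail in the traceback.
M = sparse.csc_matrix(np.diag([1.0, 2.0, 0.0, 3.0]))

singular = False
try:
    splu(M)
except RuntimeError:          # "Factor is exactly singular"
    singular = True

# Tikhonov-style fix: shift the spectrum by a small ridge before
# factorizing, so every pivot is nonzero.
eps = 1e-8
lu = splu(M + eps * sparse.identity(4, format='csc'))
x = lu.solve(np.array([1.0, 2.0, 0.0, 3.0]))
print(singular)  # True
print(x)         # close to [1, 1, 0, 1]
```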

Bertrand
Matthieu Brucher
2011-11-10 07:31:01 UTC
Hi Jacob,

Indeed, Isomap is a metric MDS, so the same hypotheses apply. A negative
eigenvalue should not happen, but one never knows.
Since the eigenvalue only acts as a scaling factor, it is not too strange
to use a negative one in the embedding construction.
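The scaling-factor role of the eigenvalues shows up directly in the
classical MDS construction: each eigenvector is scaled by the square root
of its eigenvalue, and for genuinely Euclidean input this reproduces the
distances exactly. A minimal numpy sketch:

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.rand(15, 3)                      # points in 3-D
D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))

# Classical MDS: double-center the squared distances ...
n = len(D)
J = np.eye(n) - np.ones((n, n)) / n
K = -0.5 * J.dot(D ** 2).dot(J)

# ... then scale each eigenvector by the square root of its
# eigenvalue to get the embedding coordinates.
w, V = np.linalg.eigh(K)
top = np.argsort(w)[::-1][:3]            # the 3 largest eigenvalues
Y = V[:, top] * np.sqrt(w[top])

D_rec = np.sqrt(((Y[:, None] - Y[None]) ** 2).sum(-1))
print(np.allclose(D, D_rec))  # True: distances recovered
```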

Cheers,

Matthieu
--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com