Olivier Grisel

2010-11-29 04:09:41 UTC

Hi again,

I finally found the time to finish watching Gunnar Martinsson's NIPS

tutorial on videolectures.net and the fast_svd method is indeed able to

recover a fairly accurate variant of the singular vectors even if the

data is not low rank (as is often the case in practice) as long

as we perform a couple of power iteration steps (q = 3 sounds like a

good default parameter).
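
To make the idea concrete, here is a rough numpy sketch of a randomized
SVD with power iterations. It is only an illustration of the technique
from the tutorial, not the actual fast_svd code in the scikit; the
function name, the n_oversamples parameter and the QR
re-orthonormalization between iterations are my own assumptions.

```python
import numpy as np


def randomized_svd_sketch(M, k, q=3, n_oversamples=10, seed=0):
    """Approximate the top-k SVD of M (hypothetical helper, not fast_svd)."""
    rng = np.random.default_rng(seed)
    n_random = k + n_oversamples

    # Sample the range of M with a random Gaussian test matrix.
    Omega = rng.standard_normal((M.shape[1], n_random))
    Q, _ = np.linalg.qr(M @ Omega)

    # q power iteration steps sharpen the spectrum so that the sampled
    # subspace aligns with the leading singular vectors even when the
    # singular values of M decay slowly. QR keeps the basis well conditioned.
    for _ in range(q):
        Z, _ = np.linalg.qr(M.T @ Q)
        Q, _ = np.linalg.qr(M @ Z)

    # Exact SVD of the small projected matrix, then map back.
    B = Q.T @ M
    U_hat, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_hat
    return U[:, :k], s[:k], Vt[:k]
```

Calling randomized_svd_sketch(M, k=5) returns U, s and Vt truncated to
5 components; q=0 gives the plain random projection variant without
power iterations.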

Furthermore in scikit-learn we don't really care if the singular

values / vectors are exact to the 7th decimal. We are mostly using PCA

/ SVD as a feature extractor / normalizer. Hence I think we could make

the PCA class use the approximate randomized method by default.
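
As a hedged illustration of PCA used as a feature extractor on top of
the approximate method, the sketch below centers the data, finds an
approximate principal subspace and projects onto it. This is not the
scikit's PCA class; randomized_pca_features and its parameters are
hypothetical names chosen for this example.

```python
import numpy as np


def randomized_pca_features(X, n_components, q=3, n_oversamples=10, seed=0):
    """Project X on approximate principal axes (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)  # PCA works on centered data

    # Randomized range finder with q power iteration steps.
    n_random = n_components + n_oversamples
    Q, _ = np.linalg.qr(Xc @ rng.standard_normal((X.shape[1], n_random)))
    for _ in range(q):
        Z, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Z)

    # Right singular vectors of the small projected matrix approximate
    # the principal axes of Xc.
    _, _, Vt = np.linalg.svd(Q.T @ Xc, full_matrices=False)
    components = Vt[:n_components]

    # The extracted features are the coordinates along those axes.
    return Xc @ components.T, components
```

One way to check whether the approximation is good enough for feature
extraction is to compare the variance captured by the returned
features with the sum of the top squared singular values of the
centered data from an exact SVD.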

We need to further investigate the kmeans SVD seeding strategy that

could get a really great boost from this method too when k is small.

Also NNMF can be seeded by two SVDs:

http://www.cs.rpi.edu/~boutsc/files/nndsvd.pdf hence it might be

possible to make NNMF fast by combining both strategies (even though

it is probably not as interesting as the state of the art online

dictionary learning stuff).

Vlad, if you decide to start working on sparse PCA, NNMF and friends

be sure to familiarize yourself with this technique first:

http://videolectures.net/nips09_martinsson_mvll/ (this tutorial is a

bit long but excellent)

The manifold module might also benefit from this: if I understand

correctly some of those algorithms are based on SVDs of non linear

similarity matrices of the raw data (I am not sure though).

The spanning col / rows extraction (a.k.a. skeleton summary extraction)

mentioned in the tutorial is another really interesting unsupervised

algorithm that would be very useful to have in the scikit.

Good night (whatever your timezone is :)

---------- Forwarded message ----------

From: <***@github.com>

Date: 2010/11/29

Subject: [Scikit-learn-commits] [scikit-learn/scikit-learn] cd8c6b:

one more test for SVD

To: scikit-learn-***@lists.sourceforge.net

Branch: refs/heads/master

Home: https://github.com/scikit-learn/scikit-learn

Commit: cd8c6b00d390b61aaa7d6fd7a391c128cf132e42

https://github.com/scikit-learn/scikit-learn/commit/cd8c6b00d390b61aaa7d6fd7a391c128cf132e42

Author: Olivier Grisel <***@ensta.org>

Date: 2010-11-28 (Sun, 28 Nov 2010)

Changed paths:

M scikits/learn/utils/tests/test_svd.py

Log Message:

-----------

one more test for SVD

--

Olivier

http://twitter.com/ogrisel - http://github.com/ogrisel
