Discussion:
Variance scaling / whitening now by default in the PCA module and improvement in face recognition problem
Olivier Grisel
2010-12-07 20:33:08 UTC
Permalink
Hi all,

I noticed that the PCA module did not divide by S hence the projected
signal still had varying component-wise variances. I have added a
"whiten" boolean parameter to PCA to do just that. whiten is now True
by default. You can set it explicitly to False to get the old
behavior.
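
For reference, here is a minimal numpy sketch of the idea (not the actual
scikit-learn implementation): project the centered data onto the top
components and divide each projected component by the corresponding
singular value so that they all end up on the same scale.

import numpy as np

def pca_project(X, n_comp, whiten=True):
    # center the data and take its SVD
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # project onto the top n_comp components
    X_proj = np.dot(Xc, Vt[:n_comp].T)
    if whiten:
        # divide by the singular values so every component has the same scale
        X_proj /= S[:n_comp]
    return X_proj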

I have also updated the hyperparameters of the SVM in the face
recognition module and I now get 0.88 as F1 score on my 5-people
closed-world face recognition example (using 150 eigenfaces
without any fancy alignment)!

https://github.com/scikit-learn/scikit-learn/commit/ea494a0a275a8b5660fd479a6edab5c0127b2fab

Open questions:

- maybe "scale" would be a better name for that parameter?

- the variance of the output is not always 1.0: in the test case I
wrote, all output components have variance 0.1. Is that a result of
taking only the top n_comp singular values and ignoring the remaining
dimensions? Should we further scale the output to get 1.0 instead?

- fastica does its whitening by computing the empirical
covariance matrix and taking 1 over the square root of the eigen vectors.
Wouldn't SVD be better/faster there? Would it make sense to try to
factorize both implementations (PCA and fastica whitening), or is it
better as it is? If so, why?
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gael Varoquaux
2010-12-07 22:16:36 UTC
Permalink
Post by Olivier Grisel
I noticed that the PCA module did not divide by S hence the projected
signal still had varying component-wise variances. I have added a
"whiten" boolean parameter to PCA to do just that. whiten is now True
by default. You can set it explicitly to False to get the old
behavior.
I think that this is a very useful argument. I am a bit surprised that
you made it True by default, though. Whitening means that you throw
away a lot of the information. I would not expect a PCA to do this by
default. What would others expect?
Post by Olivier Grisel
- maybe "scale" would be a better name for that parameter?
I like 'whiten'.
Post by Olivier Grisel
- the variance of the output is not always 1.0: in the test case I
wrote, all output components have variance 0.1. Is that a result of
taking only the top n_comp singular values and ignoring the remaining
dimensions? Should we further scale the output to get 1.0 instead?
Isn't there a sqrt(n_samples) factor: the difference between the sum of
squares and the 2-norm?
Post by Olivier Grisel
- fastica does its whitening by computing the empirical
covariance matrix and taking 1 over the square root of the eigen vectors.
Wouldn't SVD be better/faster there? Would it make sense to try to
factorize both implementations (PCA and fastica whitening), or is it
better as it is? If so, why?
I kinda like to keep some whitening in fastica, so that the fastica
function is still usable without going through the OOP framework. A
little bit of redundancy is not the end of the world, IMHO.

My 2 canadian cents,

Gaël
Alexandre Gramfort
2010-12-07 22:31:09 UTC
Permalink
Post by Gael Varoquaux
I think that this is a very useful argument. I am a bit surprised that
you made it True by default, though. Whitening means that you throw
away a lot of the information. I would not expect a PCA to do this by
default. What would others expect?
+1
Post by Gael Varoquaux
Post by Olivier Grisel
- maybe "scale" would be a better name for that parameter?
I like 'whiten'.
+1

Alex
Olivier Grisel
2010-12-07 23:26:02 UTC
Permalink
Post by Gael Varoquaux
Post by Olivier Grisel
I noticed that the PCA module did not divide by S hence the projected
signal still had varying component-wise variances. I have added a
"whiten" boolean parameter to PCA to do just that. whiten is now True
by default. You can set it explicitly to False to get the old
behavior.
I think that this is a very useful argument. I am a bit surprised that
you made it True by default, though. Whitening means that you throw
away a lot of the information.
You are not throwing away that much information: just the shape of
the singular spectrum. If n_comp is fixed by the user (it can depend on
n_features and the nature of the problem, but probably not much, if at
all, on n_samples), that means throwing away a constant n_comp *
sizeof(np.double) bytes of data. That sounds ridiculously small
compared to the n_samples * n_comp * sizeof(np.double) bytes of raw
data you still handle after the PCA.

On the other hand, by scaling the output variance you get a dataset
that is better behaved w.r.t. the machine learning models you plug
downstream, e.g.:

- for SVM with Gaussian kernels you make a strong assumption that the
data is isotropic: the radius of the Gaussian ball of the SVM kernel
is the same in all directions and cannot be adapted by the SVM model
while learning

- the same remark holds for other RBF-based models such as k-means (and
sigmoid-activated neural networks, which we don't have in scikit-learn
so far / anymore)

- regularized classifiers / regressors also often make the same strong
isotropy assumption in their penalty term: l2 or l1 has the same
strength in all directions. If the regularization is strong, I guess
that the impact of this assumption is not negligible in practice.

Hence I think that we should scale the variance of the PCA output by
default since, in the context of scikit-learn, PCA is primarily meant
as a feature extractor for the downstream modules, which almost always
behave better when the data is normalized variance-wise.

The accuracy boost on the faces example seems to demonstrate this.
Other examples would probably benefit from it too. I will try to cluster
whitened TF-IDF documents at some point.
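
Concretely, the downstream setup I have in mind looks like the sketch
below, written against the current scikit-learn API (make_pipeline and
the hyperparameter values are illustrative, not the actual example code):

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# PCA with whitening as a feature extractor feeding an RBF SVM: the single
# gamma of the kernel assumes roughly isotropic features, which whitening
# helps provide.
model = make_pipeline(
    PCA(n_components=150, whiten=True),
    SVC(kernel="rbf", C=10.0, gamma=0.001),  # illustrative values
)
# model.fit(X_train, y_train) and model.score(X_test, y_test) as usual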

And even if you use PCA for visualization (2D or 3D projection),
scaled axes also make sense as a reasonable default (even though
matplotlib will do it for you if that's not the case).
Post by Gael Varoquaux
I would not expect a PCA to do this by
default. What would others expect?
Any second thoughts after the above remarks?

If you and Alex or others add or maintain your +1 on False by
default I will change it, of course. Maybe we need more empirical
evidence to make an informed decision.
Post by Gael Varoquaux
Post by Olivier Grisel
- maybe "scale" would be a better name for that parameter?
I like 'whiten'.
Ok let's keep it like this.
Post by Gael Varoquaux
Post by Olivier Grisel
- the variance of the output is not always 1.0: in the test case I
wrote, all output components have variance 0.1. Is that a result of
taking only the top n_comp singular values and ignoring the remaining
dimensions? Should we further scale the output to get 1.0 instead?
Isn't there a sqrt(n_samples) factor: the difference between the sum of
squares and the 2-norm?
That seems consistent with the test values. I need to check the math
accordingly.
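
A quick numerical sanity check of that factor (a sketch, not the actual
test from the commit): dividing the projection by S leaves the columns of
U, which have zero mean and unit 2-norm, so their variance is
1 / n_samples; multiplying by sqrt(n_samples) brings it to 1.0.

import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 10, 5
X = rng.randn(n_samples, n_features)
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_white = np.dot(Xc, Vt.T) / S                     # this is exactly U
print(X_white.var(axis=0))                         # ~0.1 each, i.e. 1 / n_samples
print((X_white * np.sqrt(n_samples)).var(axis=0))  # ~1.0 each
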
Post by Gael Varoquaux
Post by Olivier Grisel
- fastica does its whitening by computing the empirical
covariance matrix and taking 1 over the square root of the eigen vectors.
I meant values, not vectors.
Post by Gael Varoquaux
Post by Olivier Grisel
Wouldn't SVD be better/faster there? Would it make sense to try to
factorize both implementations (PCA and fastica whitening), or is it
better as it is? If so, why?
I kinda like to keep some whitening in fastica, so that the fastica
function is still usable without going through the OOP framework. A
little bit of redundancy is not the end of the world, IMHO.
Alright, let us keep it as it is. Any thoughts on the
linalg.eigh(np.dot(X, X.T)) vs linalg.svd(X) issue? Which one is
supposed to be faster / more robust? (Just out of curiosity, I am
too lazy to google it :)
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gael Varoquaux
2010-12-08 05:33:13 UTC
Permalink
Post by Olivier Grisel
Post by Gael Varoquaux
Post by Olivier Grisel
I noticed that the PCA module did not divide by S hence the projected
signal still had varying component-wise variances. I have added a
"whiten" boolean parameter to PCA to do just that. whiten is now True
by default. You can set it explicitly to False to get the old
behavior.
I think that this is a very useful argument. I am a bit surprised that
you made it True by default, though. Whitening means that you throw
away a lot of the information.
You are not throwing away that much information: just the shape of
the singular spectrum. If n_comp is fixed by the user (it can depend on
n_features and the nature of the problem, but probably not much, if at
all, on n_samples), that means throwing away a constant n_comp *
sizeof(np.double) bytes of data. That sounds ridiculously small
compared to the n_samples * n_comp * sizeof(np.double) bytes of raw
data you still handle after the PCA.
You're throwing out explained variance, which is very dangerous from a
statistical point of view: you are giving the same importance to
components that contribute a lot to the signal and to components that
contribute nothing at all.
Post by Olivier Grisel
On the other hand, by scaling the output variance you get a dataset
that is better behaved w.r.t. the machine learning models you plug
Post by Gael Varoquaux
I would not expect a PCA to do this by
default. What would others expect?
Any second thoughts after the above remarks?
No, I don't agree with you. It doesn't match my experience. This probably
means that on the kind of data that you work with, the variance is not
important, whereas on the kind of data I work with, it is. As often, we
judge what is 'natural' by what we most often do :).
Post by Olivier Grisel
Alright, let us keep it as it is. Any thoughts on the
linalg.eigh(np.dot(X, X.T)) vs linalg.svd(X) issue? Which one is
supposed to be faster / more robust? (Just out of curiosity, I am
too lazy to google it :)
In my experience, there is no value in doing the eigh: the svd is as
fast, and the eigh obfuscates the code.
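
For the record, a small sketch comparing whitening via the
eigendecomposition of the empirical covariance matrix (the fastica route
mentioned above) with whitening via an SVD of the centered data; the
scaling conventions here are illustrative, not the exact fastica code:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# route 1: eigendecomposition of the empirical covariance matrix
d, W = np.linalg.eigh(np.dot(Xc.T, Xc) / n)
white_eigh = np.dot(Xc, W) / np.sqrt(d)

# route 2: SVD of the centered data, no covariance matrix is ever formed
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
white_svd = np.dot(Xc, Vt.T) / S * np.sqrt(n)

# both outputs are decorrelated with unit variance (identity covariance)
print(np.allclose(np.dot(white_eigh.T, white_eigh) / n, np.eye(5)))  # True
print(np.allclose(np.dot(white_svd.T, white_svd) / n, np.eye(5)))    # True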

Gaël
Bertrand Thirion
2010-12-08 08:34:20 UTC
Permalink
Hi Olivier,

I tend to disagree: in terms of information compression, the bits corresponding to the spectrum are essential: if you assume that your data is Gaussian, the Kullback-Leibler divergence between two models with different spectra can be made arbitrarily large, while making some approximation on the eigenvectors is not that important (provided that their direction is more or less preserved, as measured by correlation).

In terms of signal representation, I think that the right view of PCA is that it provides a reduced-rank representation of the data, i.e. it performs the projection onto an optimal subspace, which is different from whitening: PCA minimizes the reconstruction error, while whitening can induce arbitrarily large distortion. As you correctly point out, whitening or spherical projection is useful in many applications, but I do not think that this should be the default behaviour for PCA.
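
To make the reduced-rank statement precise (the standard Eckart--Young
result, written as a LaTeX sketch): with the centered data $X = U S V^\top$,
the rank-$k$ PCA reconstruction is the best rank-$k$ approximation,
$$X_k = U_k S_k V_k^\top = \arg\min_{\operatorname{rank}(B) \le k} \lVert X - B \rVert_F,
\qquad \lVert X - X_k \rVert_F^2 = \sum_{i > k} s_i^2,$$
whereas whitening rescales direction $i$ by $1/s_i$, so the distortion it
induces grows as the spectrum decays.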

2 eurocents,

Bertrand
Post by Olivier Grisel
You are not throwing away that much information: just the shape of
the singular spectrum. If n_comp is fixed by the user (it can depend on
n_features and the nature of the problem, but probably not much, if at
all, on n_samples), that means throwing away a constant n_comp *
sizeof(np.double) bytes of data. That sounds ridiculously small
compared to the n_samples * n_comp * sizeof(np.double) bytes of raw
data you still handle after the PCA.
[...]
Hence I think that we should scale the variance of the PCA output by
default since, in the context of scikit-learn, PCA is primarily meant
as a feature extractor for the downstream modules, which almost always
behave better when the data is normalized variance-wise.
[...]
Olivier Grisel
2010-12-08 10:27:41 UTC
Permalink
Post by Bertrand Thirion
Hi Olivier,
I tend to disagree: in terms of information compression, the bits corresponding to the spectrum are essential: if you assume that your data is Gaussian, the Kullback-Leibler divergence between two models with different spectra can be made arbitrarily large, while making some approximation on the eigenvectors is not that important (provided that their direction is more or less preserved, as measured by correlation).
In terms of signal representation, I think that the right view of PCA is that it provides a reduced-rank representation of the data, i.e. it performs the projection onto an optimal subspace, which is different from whitening: PCA minimizes the reconstruction error, while whitening can induce arbitrarily large distortion. As you correctly point out, whitening or spherical projection is useful in many applications, but I do not think that this should be the default behaviour for PCA.
Ok, I agree that the PCA class should do by default what PCA does
in the literature. I have pushed a fix to set whiten=False by default
and improved the docstring to give some hints on why/when it might be
useful to set it to True in practice.

I have also made the face recognition example use a grid search for
the hyperparameters of the SVM, using only information from the training
set, to be able to compare whiten=True to whiten=False in a more
principled way (without setting the hyperparameters by hand by looking
at the test set :). And it confirms my intuition (on this example at
least):

- without whitening, the RBF SVM has an F1 score of 0.80
- with whitening, the RBF SVM has an F1 score of 0.85

I suspect those numbers to be stable within +/- 0.01 when shuffling the
train / test sets with different random seeds, but I am too lazy to
update the example to compute this confidence interval explicitly.
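
A sketch of that grid-search setup with the current scikit-learn API (the
estimator names, the parameter grid and the scoring choice are
illustrative, not the actual example code):

from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# grid search the SVM hyperparameters on the training set only, with the
# whitened eigenface projection as the feature extraction step
pipe = Pipeline([
    ("pca", PCA(n_components=150, whiten=True)),
    ("svm", SVC(kernel="rbf")),
])
param_grid = {
    "svm__C": [1.0, 10.0, 100.0, 1000.0],
    "svm__gamma": [1e-4, 1e-3, 1e-2],
}
search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=5)
# search.fit(X_train, y_train), then evaluate search.best_estimator_ on
# the held-out test set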

It would be interesting to evaluate the impact of whitening on a text
classification task with an elastic-net regularized linear
classifier.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel