[Scikit-learn-general] Kernel PCA .fit() Failing Silently

Discussion:

Stephen O'Neill

2015-03-23 18:14:38 UTC

Hi Sklearn,

I'm using Kernel PCA with the rbf kernel for projecting data into 3
dimensions for viewing alongside normal PCA and a stereographic projection
class that I wrote myself. Both the PCA and SGP classes seem to be
functioning correctly on this data set, but when I get to the .fit() method
for the KPCA class it fails silently and raises no exception and I have no
idea why.

My code looks something like this:
from sklearn.decomposition import PCA, KernelPCA
transformer = KernelPCA(n_components=3, kernel='rbf')
print(transformer)
transformer.fit(data)  # data is the (n_samples, n_features) array
print("DONE")  # never reached

Obviously it never outputs "DONE", but the transformer output is:
KernelPCA(alpha=1.0, coef0=1, degree=3, eigen_solver='auto',
fit_inverse_transform=False, gamma=None, kernel='rbf', kernel_params=None,
max_iter=None, n_components=3, remove_zero_eig=False, tol=0)

Any ideas?

Best,
Steve O'Neill

Andreas Mueller

2015-03-23 18:26:15 UTC

Hi Steve.
So by failing, you mean it never finishes?
Or the python process dies?

What is the shape of your data?

Andy

------------------------------------------------------------------------------
Dive into the World of Parallel Programming The Go Parallel Website, sponsored
by Intel and developed in partnership with Slashdot Media, is your hub for all
things parallel software development, from weekly thought leadership blogs to
news, videos, case studies, tutorials and more. Take a look and join the
conversation now. http://goparallel.sourceforge.net/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Stephen O'Neill

2015-03-25 17:22:38 UTC

Hey Andy,

Sorry, yes, by failing I mean it never finishes, and the python process
dies without raising any exceptions.

The shape of the data is (46196,114).
Also numpy.all(numpy.isfinite(my_data)) returns True before I call
transformer.fit()

I'm running Python 2.7.8, numpy 1.9.1, and sklearn 0.15.2 on Mac OS X 10.9.5.

Best,
Steve O'Neill

Andreas Mueller

2015-03-25 18:45:34 UTC

Hi Steve.
Can you monitor the RAM usage before it fails?
Because of the complexity of the algorithm, and as we don't truncate the
rbf kernel, the full n_samples x n_samples kernel matrix gets computed.
For your 46196 samples this will take about 16GB of RAM.
If the process starts swapping, your OS might just kill it. There is
nothing much we can do about that.
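
The 16GB figure can be checked with a little arithmetic. The sketch below assumes a dense float64 kernel matrix and counts only one copy of it; the helper name kernel_matrix_bytes is made up for illustration, and peak usage during fitting is likely higher because of temporary copies.

```python
import numpy as np


def kernel_matrix_bytes(n_samples, dtype=np.float64):
    """Bytes needed to hold one dense n_samples x n_samples kernel matrix."""
    return n_samples ** 2 * np.dtype(dtype).itemsize


n_samples = 46196  # from the data shape (46196, 114) reported in the thread
size = kernel_matrix_bytes(n_samples)
print("%.1f GB (%.1f GiB)" % (size / 1e9, size / 2.0 ** 30))
# prints: 17.1 GB (15.9 GiB)
```

On a machine with 8GB of RAM, a single matrix of this size already exceeds physical memory.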

A solution to running KPCA on your data would be to use either the
Nystroem or the RBFSampler kernel approximation, followed by a normal PCA.
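
A minimal sketch of that suggestion, assuming the data fits in memory as an ordinary array. The random data is a stand-in for the real data set, and the choice of 300 Nystroem components is arbitrary and would need tuning. This approximates kernel PCA; it is not guaranteed to reproduce KernelPCA's output exactly.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the real data (hypothetical; only the 114
# columns match the shape reported in the thread).
rng = np.random.RandomState(0)
X = rng.randn(2000, 114)

# gamma = 1 / n_features mirrors what KernelPCA uses when gamma=None.
# The approximate feature map has shape (n_samples, 300), so memory grows
# linearly with n_samples instead of quadratically.
approx_kpca = make_pipeline(
    Nystroem(kernel="rbf", gamma=1.0 / X.shape[1], n_components=300,
             random_state=0),
    PCA(n_components=3),
)
X_3d = approx_kpca.fit_transform(X)
print(X_3d.shape)  # (2000, 3)
```

RBFSampler from the same module can be swapped in for Nystroem; it takes gamma, n_components and random_state as well, but is specific to the rbf kernel.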

Thanks,
Andy

Stephen O'Neill

2015-03-25 19:38:59 UTC

Hey Andy,

Hmmm, that might be it. My machine only has 8GB of RAM - why didn't I
think of that? Indeed, the RAM usage of the process fluctuates quite a
lot, and when I re-run now, instead of just silently dying it's choking
up my whole computer - indicative of a RAM issue.

Thank you very much - and sorry for the false alarm.

Best,
Steve O'Neill

Andreas Mueller

2015-03-25 19:56:18 UTC

It would be nice to do something else instead of crash and burn, but for
the moment that's on the user.

Well, the kernel approximation should make it work. If you are after
visualization, I'd also recommend the t-SNE implementation
from this branch:
https://github.com/scikit-learn/scikit-learn/pull/4025
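
For readers who want to try t-SNE for a 3-dimensional view, here is a minimal sketch. It assumes a scikit-learn release that ships sklearn.manifold.TSNE and is not necessarily the code from the pull request linked above. The random data and the parameter values are placeholders.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical small stand-in data; real data would replace this.
rng = np.random.RandomState(0)
X = rng.randn(200, 20)

# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=3, perplexity=30.0, init="pca", random_state=0)
X_3d = tsne.fit_transform(X)
print(X_3d.shape)  # (200, 3)
```

Unlike PCA or the kernel approximation pipeline, t-SNE has no transform method for new points, so the embedding has to be recomputed when the data changes.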

Sebastian Raschka

2015-03-25 20:40:25 UTC

Hi,

I think some memory monitoring/warning stuff would be very helpful in general. As far as I know, memory monitoring via, e.g., psutil is not supported on every OS or machine, but we could add an optional "monitor_memory" parameter to estimators/transformers like

SomeEstimator(..., monitor_memory=True)

and inside initialize something like

if self.monitor_memory:
    import psutil
    self.process = psutil.Process()

that gets updated and checked every few iterations

cpu_total = self.process.cpu_percent()
mem_total = self.process.memory_percent()

if mem_total > 90:  # memory_percent() returns a percentage (0-100)
    print('warning, high memory usage')

What do you think?

Also, for the "fit" of estimators, an optional progress bar iterator could be printed, e.g., each update step could be an epoch in an SVM or so. E.g., something like this: https://github.com/rasbt/pyprind

Best,
Sebastian

Andreas Mueller

2015-03-25 20:51:21 UTC

Implementing this directly in the estimators seems very messy.
If we had decent logging, we could try that. Unfortunately we don't.
Pretty printing could also be achieved via a logging mechanism, so that
people could define it themselves.
I don't think it is something we necessarily want to provide.
You don't want to pass a callback to a Python function into the inner loop
of libsvm, I think (there is a priority queue of pairs to pick, no
epochs, right?)
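
To illustrate what "people could define it themselves" might look like, here is a rough sketch using only the standard logging module. The logger name, the toy_fit loop and the ListHandler class are all hypothetical; none of them is part of scikit-learn.

```python
import logging

# Hypothetical logger name; not one that scikit-learn itself defines.
logger = logging.getLogger("toy_estimator")


def toy_fit(n_iter=5):
    """Stand-in for an iterative fit loop that reports progress via logging."""
    for i in range(n_iter):
        logger.info("iteration %d/%d", i + 1, n_iter)


class ListHandler(logging.Handler):
    """User-defined handler: collects formatted messages in a list."""

    def __init__(self):
        logging.Handler.__init__(self)
        self.messages = []

    def emit(self, record):
        self.messages.append(self.format(record))


handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.INFO)
toy_fit(n_iter=3)
print(handler.messages)
```

The library only emits records; whether they end up in a progress bar, a file or nowhere is decided by whichever handler the user attaches.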

Gael Varoquaux

2015-03-26 06:56:35 UTC

Post by Andreas Mueller
It would be nice to do something else instead of crash and burn, but for the
moment that's on the user.

I think that in recent Python versions a segfault can be captured.
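
One way to do this is the standard-library faulthandler module, available since Python 3.3. The sketch below writes tracebacks to a log file; the file name and location are arbitrary. It reports fatal signals such as SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL, so it would probably not help when the OS kills the process for running out of memory, since SIGKILL cannot be intercepted.

```python
import faulthandler
import os
import tempfile

# Arbitrary log location chosen for this sketch.
log_path = os.path.join(tempfile.mkdtemp(), "crash.log")

# The file object must stay open for as long as faulthandler is enabled.
crash_log = open(log_path, "w")

# On a fatal signal, dump the Python tracebacks of all threads to the file.
faulthandler.enable(file=crash_log, all_threads=True)
enabled = faulthandler.is_enabled()
print(enabled, log_path)
```

The same behaviour can be switched on without code changes by running the interpreter with the -X faulthandler option, in which case tracebacks go to stderr.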