Discussion:
[Scikit-learn-general] Kernel PCA .fit() Failing Silently
Stephen O'Neill
2015-03-23 18:14:38 UTC
Hi Sklearn,

I'm using Kernel PCA with the rbf kernel for projecting data into 3
dimensions for viewing alongside normal PCA and a stereographic projection
class that I wrote myself. Both the PCA and SGP classes seem to be
functioning correctly on this data set, but when I get to the .fit() method
for the KPCA class it fails silently and raises no exception and I have no
idea why.

My code looks something like this:
from sklearn.decomposition import PCA, KernelPCA
transformer = KernelPCA(n_components=3, kernel='rbf')
print(transformer)
transformer.fit(data)  # never returns
print("DONE")

Obviously it never outputs "DONE", but the transformer output is:
KernelPCA(alpha=1.0, coef0=1, degree=3, eigen_solver='auto',
fit_inverse_transform=False, gamma=None, kernel='rbf', kernel_params=None,
max_iter=None, n_components=3, remove_zero_eig=False, tol=0)

Any ideas?

Best,
Steve O'Neill
Andreas Mueller
2015-03-23 18:26:15 UTC
Hi Steve.
So by failing, you mean it never finishes?
Or the python process dies?

What is the shape of your data?

Andy
------------------------------------------------------------------------------
Dive into the World of Parallel Programming The Go Parallel Website, sponsored
by Intel and developed in partnership with Slashdot Media, is your hub for all
things parallel software development, from weekly thought leadership blogs to
news, videos, case studies, tutorials and more. Take a look and join the
conversation now. http://goparallel.sourceforge.net/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Stephen O'Neill
2015-03-25 17:22:38 UTC
Hey Andy,

Sorry, yes, by failing I mean it never finishes, and the python process
dies without raising any exceptions.

The shape of the data is (46196,114).
Also numpy.all(numpy.isfinite(my_data)) returns True before I call
transformer.fit()

I'm running Python 2.7.8, numpy 1.9.1 and sklearn 0.15.2 on Mac OS X 10.9.5.

Best,
Steve O'Neill
Andreas Mueller
2015-03-25 18:45:34 UTC
Hi Steve.
Can you monitor the RAM usage before it fails?
KernelPCA computes the full n_samples x n_samples kernel matrix, and as we
don't truncate the rbf kernel that matrix is dense: for 46196 samples in
float64 this will take about 16GB of RAM.
If the process starts swapping, your OS might just kill it. There is
nothing much we can do about that.
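
The 16GB figure can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes the dominant cost is one dense float64 kernel matrix of shape (n_samples, n_samples), and the helper name kernel_matrix_bytes is made up for illustration. Intermediate copies made during centering and the eigendecomposition may push the real peak higher.

```python
def kernel_matrix_bytes(n_samples, dtype_bytes=8):
    """Size in bytes of one dense (n_samples, n_samples) matrix."""
    return n_samples * n_samples * dtype_bytes


n_samples = 46196  # number of rows reported in this thread
size = kernel_matrix_bytes(n_samples)
print("%.1f GiB" % (size / 2.0**30))  # prints 15.9 GiB
```

Note that the number of features (114) does not enter the estimate at all; only the number of samples matters.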

A solution to running KPCA on your data would be to use either the
Nystroem or the RBFSampler kernel approximation followed by a normal PCA.
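
A minimal sketch of that suggestion, assuming a reasonably recent scikit-learn. The random array stands in for the real (46196, 114) data, and n_components=300 for the Nystroem step is an arbitrary illustrative choice, not a recommended value. The result only approximates what KernelPCA would give.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
data = rng.normal(size=(2000, 114))  # stand-in for the real data set

# Map the data to an explicit approximate rbf feature space, then run a
# normal PCA there. Memory scales with n_samples * 300 rather than
# n_samples ** 2.
approx_kpca = make_pipeline(
    Nystroem(kernel="rbf", n_components=300, random_state=0),
    PCA(n_components=3),
)
embedding = approx_kpca.fit_transform(data)
print(embedding.shape)
```

RBFSampler from the same module could be swapped in for Nystroem; which of the two approximates the kernel better depends on the data.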

Thanks,
Andy
Stephen O'Neill
2015-03-25 19:38:59 UTC
Hey Andy,

Hmmm, that might be it. My machine only has 8GB of RAM - why didn't I
think of that? Indeed the RAM usage for the process seems to fluctuate
quite a lot, and when I re-run now, instead of just silently dying, it's
choking up my whole computer - indicative of a RAM issue.

Thank you very much - and sorry for the false alarm.

Best,
Steve O'Neill
Andreas Mueller
2015-03-25 19:56:18 UTC
It would be nice to do something else instead of crash and burn, but for
the moment that's on the user.

Well the kernel approximation should make it work. If you are after
visualization I'd also recommend the t-SNE implementation
from this branch:
https://github.com/scikit-learn/scikit-learn/pull/4025
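
For reference, a small sketch of a 3-dimensional t-SNE embedding. It uses the TSNE class as found in sklearn.manifold in released scikit-learn versions, which is an assumption on my part and not necessarily identical to the code in the branch linked above. The random data is a small stand-in for the real set.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
data = rng.normal(size=(200, 114))  # small random stand-in

# Embed into 3 dimensions for viewing, as with the KPCA projection.
tsne = TSNE(n_components=3, random_state=0)
embedding = tsne.fit_transform(data)
print(embedding.shape)
```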
Sebastian Raschka
2015-03-25 20:40:25 UTC
Hi,

I think some memory monitoring/warning functionality would be very helpful in general. As far as I know, reading memory usage via, e.g., psutil is not supported on every OS or machine, but we could add an optional "monitor_memory" parameter to estimators/transformers like

SomeEstimator(..., monitor_memory=True)

and inside initialize something like

if self.monitor_memory:
    import psutil
    self.process = psutil.Process()

that gets updated and checked every few iterations

# psutil >= 2.0 method names; memory_percent() returns a value in [0, 100]
cpu_total = self.process.cpu_percent()
mem_total = self.process.memory_percent()

if mem_total > 90:
    print('warning, high memory usage')

What do you think?

Also, for the "fit" method of estimators, an optional progress bar could be printed, where each update step could be, e.g., an epoch in an SVM. Something like this: https://github.com/rasbt/pyprind

Best,
Sebastian
Andreas Mueller
2015-03-25 20:51:21 UTC
Implementing this directly in the estimators seems very messy.
If we had decent logging, we could try that. Unfortunately we don't.
Pretty printing could also be achieved via a logging mechanism, so that
people could define it themselves.
I don't think it is something we necessarily want to provide.
You don't want to pass a callback to a Python function into the inner loop
of libsvm, I think (there is a priority queue of pairs to pick, no
epochs, right?)
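
One way to keep monitoring out of the estimators, in the spirit of the logging idea above, is a user-side watchdog wrapped around the fit call. The following is only a sketch under stated assumptions: MemoryWatchdog, its parameters and the logger name are hypothetical and not part of scikit-learn, and the probe is any user-supplied callable returning the used memory fraction in [0, 1].

```python
import logging
import threading

logger = logging.getLogger("memory_watchdog")  # hypothetical logger name


class MemoryWatchdog:
    """Poll a memory probe in a background thread and log when it is high."""

    def __init__(self, probe, threshold=0.9, interval=0.5):
        self.probe = probe          # callable returning a fraction in [0, 1]
        self.threshold = threshold
        self.interval = interval    # seconds between checks
        self.warnings = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _check(self):
        used = self.probe()
        if used > self.threshold:
            self.warnings += 1
            logger.warning("high memory usage: %.0f%%", used * 100)

    def _run(self):
        # wait() returns False on timeout, True once stop has been requested.
        while not self._stop.wait(self.interval):
            self._check()

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join()
        return False
```

With psutil installed, a probe such as `lambda: psutil.Process().memory_percent() / 100.0` could be passed in, and the fit call placed inside `with MemoryWatchdog(probe): ...`. The warning only helps if the OS has not yet killed the process, so it is a diagnostic aid and not a safeguard.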
Gael Varoquaux
2015-03-26 06:56:35 UTC
Post by Andreas Mueller
It would be nice to do something else instead of crash and burn, but for the
moment that's on the user.

I think that in recent Python versions segfaults can be captured.
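
This presumably refers to the standard-library faulthandler module (Python 3.3 and later), which dumps the Python traceback when the interpreter receives a fatal signal such as SIGSEGV. A minimal sketch follows; the log file name is a hypothetical choice. It would not have helped in the case discussed in this thread if the OS killed the process with SIGKILL, because that signal cannot be handled.

```python
import faulthandler
import os
import tempfile

# Hypothetical log location; the file must stay open while the handler
# is enabled, so it is deliberately not closed here.
log_path = os.path.join(tempfile.gettempdir(), "kpca_fault.log")
log_file = open(log_path, "w")

# Dump tracebacks of all threads to the log on SIGSEGV, SIGFPE, SIGABRT,
# SIGBUS and SIGILL.
faulthandler.enable(file=log_file, all_threads=True)
print(faulthandler.is_enabled())
```

The same effect is available without code changes by running the script with `python -X faulthandler`.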
