Discussion: TSNE Memory Error
Jason Wolosonovich
2015-04-17 23:48:33 UTC
Hello All,

My dataset has 93 features and just under 62,000 observations (61,878 to be exact). I'm running out of memory right after the mean sigma value is computed/displayed. I've tried reducing dimensionality via TruncatedSVD with n_components set at different levels (78, 50, and 2, respectively) before passing the data to TSNE, but I still run out of memory. For TSNE, n_components=2 and perplexity=40 (I've also tried 20). I have 24 GB of RAM on my 64-bit Windows 7 machine. Should I try a subsample of the dataset, and if so, does anyone have a recommendation on the size? Thanks!
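
In case it helps, here is roughly what I'm running (a minimal sketch with placeholder data; the real X is my 61,878 x 93 feature matrix, and the SVD n_components shown is one of the values I tried):

    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.manifold import TSNE

    X = np.random.rand(61878, 93)  # placeholder for the real data

    # Dimensionality reduction before t-SNE (also tried 78 and 2)
    svd = TruncatedSVD(n_components=50)
    X_reduced = svd.fit_transform(X)

    # t-SNE embedding; memory blows up right after the mean sigma
    # value is printed
    tsne = TSNE(n_components=2, perplexity=40, verbose=2)
    X_embedded = tsne.fit_transform(X_reduced)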



-Jason
afabisch
2015-04-18 16:15:25 UTC
Hi Jason,

Memory is a problem in our implementation of MNIST. I sent a detailed
list of the required memory to this mailing list some months ago. You
can find it here:

http://sourceforge.net/p/scikit-learn/mailman/message/33090573/

The number of features is irrelevant; only the number of samples
matters. You have too many samples because the algorithm requires
O(n^2) space (in your case, probably about 30 GB). I would not use the
original t-SNE algorithm for this dataset anyway, because its time
complexity is O(n^2) as well, which means you would have to wait days
or weeks for the result.
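
For concreteness, a back-of-the-envelope estimate, assuming a single
dense n x n float64 distance matrix (which is what the current
implementation materializes):

    n = 61878
    bytes_needed = n * n * 8       # one float64 pairwise-distance matrix
    print(bytes_needed / 2**30)    # ~28.5 GiB, i.e. roughly 30 GB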

There is a new pull request that implements Barnes-Hut t-SNE here:

https://github.com/scikit-learn/scikit-learn/pull/4025

The advantage of Barnes-Hut t-SNE over exact t-SNE is that its
complexity is O(n log n). However, at the moment the full distance
matrix is still computed, so it would not fix your original problem,
but I think the memory problem will be solved soon.
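
Once that is merged, usage should look something like the sketch below
(the method parameter is the interface proposed in the pull request,
so treat it as an illustration that may still change):

    from sklearn.manifold import TSNE

    # Barnes-Hut approximation: O(n log n) gradient computation
    # instead of the exact O(n^2) version
    tsne = TSNE(n_components=2, perplexity=40, method='barnes_hut')
    X_embedded = tsne.fit_transform(X)  # X is your data array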

In your case you could simply take half of the dataset (see the sketch
below). The number of features is not critical at all; you can keep
all 93 features without any dimensionality reduction.
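
A minimal sketch of the subsampling, assuming X holds the full
61,878 x 93 array (the fixed seed is arbitrary, just for
reproducibility):

    import numpy as np

    X = np.random.rand(61878, 93)  # placeholder for the real data
    rng = np.random.RandomState(0)
    half = rng.choice(X.shape[0], size=X.shape[0] // 2, replace=False)
    X_half = X[half]  # 30,939 randomly chosen rows, all 93 features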

Best regards,

Alexander
Jason Wolosonovich
2015-04-20 06:55:29 UTC
Oh wow, very cool. Thank you very much for the assistance and info, Alexander!

-----Original Message-----
From: afabisch [mailto:***@mailhost.informatik.uni-bremen.de]
Sent: Saturday, April 18, 2015 9:15 AM
To: scikit-learn-***@lists.sourceforge.net
Subject: Re: [Scikit-learn-general] TSNE Memory Error

Alexander Fabisch
2015-04-20 07:04:50 UTC
Oh, I meant that it is a problem of the t-SNE implementation, not a
problem of the "MNIST implementation". I don't know how that could
happen. :D
Jason Wolosonovich
2015-04-20 21:03:49 UTC
No worries, I knew what you meant. :D Thanks again, though; I'm running it now with no memory issues after cutting the sample size in half. I must have misplaced a decimal point when I was trying to calculate how much memory I would need (I had calculated something like 3.24 GB). :D

-----Original Message-----
From: Alexander Fabisch [mailto:***@informatik.uni-bremen.de]
Sent: Monday, April 20, 2015 12:05 AM
To: scikit-learn-***@lists.sourceforge.net
Subject: Re: [Scikit-learn-general] TSNE Memory Error
