scikit-learn: t-SNE results in errors when reducing dim to default 2
Hi everyone, does anyone know why t-SNE runs into memory errors when reducing the dimensionality of the data to the default number of components, which is 2? I have the MNIST dataset and I apply a transformation to it, resulting in a reduced MNIST dataset. Original: 60000x784. After transformation: 60000x200.
When I apply t-SNE to the transformed MNIST data (60000x200), I always get a memory error. Also, I was wondering why there is no multicore option such as n_jobs=-1.
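Not from the thread, but a common workaround for this kind of memory error is to compress the data with PCA and/or fit t-SNE on a subsample first. A minimal sketch (synthetic data stands in for the 60000x200 matrix; all sizes and parameters here are illustrative, not from the report):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X = rng.rand(2000, 200)  # stand-in for the 60000x200 transformed MNIST

# PCA down to ~50 components first: a cheap, commonly recommended
# preprocessing step before t-SNE.
X_pca = PCA(n_components=50, random_state=0).fit_transform(X)

# t-SNE to the default 2 components on a subsample keeps memory manageable.
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_pca[:500])
print(X_2d.shape)  # (500, 2)
```

Whether the full 60000-sample dataset fits depends on the sklearn version and available RAM; subsampling is the safe first experiment.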
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 32 (16 by maintainers)
@glemaitre I found the Barnes-Hut implementation on GitHub and I've been able to run it on my data using Python 2.7 (but not Python 3; there's a bug). It was complicated to get my Docker image to have both Anaconda 2 and 3, but it finally worked. So it would be good if sklearn had the Barnes-Hut implementation. The existing one is so limited that it is surprising it is in the library at all; I wish it hadn't been there to waste my time. It should be replaced with BH-TSNE.
As for my data, the file was not large (a few megabytes), so it is surprising that sklearn's TSNE could eat nearly 300GB of RAM. I really think this should be fixed to use BH-TSNE.
The bh-tsne project is https://github.com/lvdmaaten/bhtsne
On Wed, Mar 22, 2017 at 3:16 PM, Joel Nothman notifications@github.com wrote:
@lesteve thanks for the reply; here's the output of a simple example.
The above example was run on a PC with 24 cores and 64GB of RAM, and it still gets memory errors.
Although I'm not very familiar with this code, I think it's quite clear that our implementation still has O(N^2) memory requirements, though it may not be hard to reduce that to O(N log N). Without looking at the paper, I don't understand how one can perform exact nearest-neighbour calculations without that memory cost.
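The quadratic memory requirement alone plausibly accounts for the blow-up reported above: a single dense all-pairs float64 matrix for 60000 samples already needs tens of gigabytes. A quick back-of-the-envelope check:

```python
n = 60000  # MNIST sample count from the report above
bytes_per_float64 = 8

# One dense n x n distance (or affinity) matrix:
dense_bytes = n * n * bytes_per_float64
print(dense_bytes / 1e9)  # 28.8 GB for a single matrix

# An O(N^2) t-SNE keeps several such arrays (distances, joint
# probabilities, ...), so peak usage of a few hundred GB is plausible.
```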
On 23 Mar 2017 6:07 am, “Arthur Goldberg” notifications@github.com wrote:
@kirk86 (Not related to this issue but to solve your personal problem I reckon you could implement t-SNE in tensorflow pretty easily which would allow you to use multiple cores/GPUs/whatever … Might be a fun project)
@jnothman Barnes-Hut t-SNE should run in O(N log N) time and O(N) memory (see Section 1 of the paper https://arxiv.org/abs/1301.3342).
I am a bit surprised to see that barnes_hut computes all-pairs distances, when it could be using kneighbors_graph to sparsely calculate distances where a binary tree is appropriate.
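A sketch of that sparse alternative (not the actual implementation; the sizes below are illustrative): kneighbors_graph returns a CSR matrix holding only the k nearest distances per sample, so memory grows as O(N*k) rather than O(N^2).

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.RandomState(0)
X = rng.rand(1000, 50)  # illustrative: 1000 samples, 50 features

# Barnes-Hut t-SNE only needs roughly 3*perplexity neighbours per point,
# so a sparse k-NN distance graph can replace the dense all-pairs matrix.
k = 90  # e.g. 3 * the default perplexity of 30
D = kneighbors_graph(X, n_neighbors=k, mode='distance')

print(D.shape)  # (1000, 1000) logically, but only 1000*90 stored entries
print(D.nnz)    # 90000
```

For 60000 samples this stores 60000*90 distances instead of 60000^2, which is the difference between a few tens of megabytes and tens of gigabytes.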