umap: Stuck at constructing embedding?

I currently have a dataset with more than 10 million rows and 384 dimensions. I use PCA to reduce the 384 dimensions to 10 and then apply UMAP via the BERTopic library.

To avoid running into memory issues, I am using a machine with 1TB of RAM and 128 cores. However, the process seems to hang at “Construct embedding”, and only about 500GB of RAM is being used (so it is not a memory issue).

Here are the code and the verbose output:


import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

# Load precomputed 384-dimensional embeddings and reduce them to 10 dimensions with PCA
embeddings = np.load('embeddings.npy')

pca = PCA(n_components=10)
embeddings_pca = pca.fit_transform(embeddings)

vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")

# UMAP model that reduces the 10 PCA dimensions to 5
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', low_memory=True, verbose=True)

# Setting HDBSCAN model
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# docs and seed_topic_list are defined elsewhere
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, verbose=True, seed_topic_list=seed_topic_list, low_memory=True, calculate_probabilities=True, vectorizer_model=vectorizer_model)

#topics, probs = topic_model.fit_transform(docs)

topic_model = topic_model.fit(docs, embeddings_pca)
UMAP(angular_rp_forest=True, dens_frac=0.0, dens_lambda=0.0, metric='cosine',
     min_dist=0.0, n_components=5, verbose=True)
Construct fuzzy simplicial set
Tue Sep 28 11:33:15 2021 Finding Nearest Neighbors
Tue Sep 28 11:33:15 2021 Building RP forest with 64 trees
Tue Sep 28 11:34:42 2021 NN descent for 23 iterations
	 1  /  23
	 2  /  23
	Stopping threshold met -- exiting after 2 iterations
Tue Sep 28 11:49:29 2021 Finished Nearest Neighbor Search
Tue Sep 28 11:50:33 2021 Construct embedding

If I understand correctly, the most memory-consuming step should be the nearest-neighbour search, which completed without issue? Why does it get stuck at constructing the embedding?

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 24 (1 by maintainers)

Most upvoted comments

@jlmelville What is a reasonable number for n_neighbors when I have 10 million data points?

Really hard to say. I would start by seeing if n_neighbors=30 works and take it from there. Obviously with such a large dataset, doubling parameters isn’t something to do lightly, but parameters for experimenting with the spectral initialization directly aren’t exposed through the UMAP interface, so it’s difficult to do anything else.

Or can I simply change the initialisation method to random here?

init="random" will work but it’s hard for UMAP (or any dimensionality reduction method that works in a similar way) to recover the global structure from a random start. If you have access to an efficient PCA package, then extracting the first two principal components (suitably scaled) and passing that as the init parameter would be a better starting point.

It’s also possible that there is something in your dataset that is making the initialization take so long: are there lots of duplicates or close duplicates or all-zero rows? Bad behavior of the spectral initialization does seem to be related to the conditioning of the graph Laplacian matrix.
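
If you want to check that quickly (a rough sketch; np.unique over 10 million 384-dimensional rows is slow but should fit in 1TB of RAM, and you can run the same check on the smaller PCA-reduced array instead):

import numpy as np

embeddings = np.load('embeddings.npy')

# Count rows that are exactly all-zero.
n_zero = int((~embeddings.any(axis=1)).sum())

# Count exact duplicate rows (np.unique sorts the rows, which is the slow part).
n_unique = np.unique(embeddings, axis=0).shape[0]

print(f"all-zero rows: {n_zero}, exact duplicates: {embeddings.shape[0] - n_unique}")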

Your stack trace from the interrupt indicates that the problem is occurring at the spectral initialization stage. Where this has happened to me it seems to be when the graph is very nearly disconnected, but there are a few low-affinity edges that mean the disconnection detection routine still sees it as one connected graph.
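
One way to get a feel for that is to look at the connectivity of a k-nearest-neighbour graph on a random subsample (a sketch only; the subsample size is arbitrary, and building the graph for all 10 million points just to inspect it would be expensive):

import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(0)
idx = rng.choice(embeddings_pca.shape[0], size=50_000, replace=False)
sample = embeddings_pca[idx]

# 15-nearest-neighbour connectivity graph using the same metric as the UMAP model.
knn = kneighbors_graph(sample, n_neighbors=15, mode='connectivity', metric='cosine')
n_components, labels = connected_components(knn, directed=False)

# One giant component plus many tiny ones suggests the graph is only barely connected.
print(n_components, np.bincount(labels))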

If you are able to, try increasing n_neighbors.
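
Concretely, that only means swapping out the UMAP model in the original setup and refitting (sketch; everything else stays exactly as in the question):

from umap import UMAP
from bertopic import BERTopic

umap_model = UMAP(n_neighbors=30, n_components=5, min_dist=0.0, metric='cosine',
                  low_memory=True, verbose=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model,
                       verbose=True, seed_topic_list=seed_topic_list,
                       low_memory=True, calculate_probabilities=True,
                       vectorizer_model=vectorizer_model)
topic_model = topic_model.fit(docs, embeddings_pca)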