BERTopic: Inconsistent results on different machines

Hey!

I recently found that my code gives different results depending on where it runs: locally (M1 MacBook), in a local Docker container, or in Docker containers on k8s. I do set a random seed for UMAP, but even 20newsgroups behaves differently and does not return exactly the same results. For my own code the difference is quite big: locally the script generated ~150 topics and in the cloud just around 40 (even the initial sets of topics were similar but not identical). I double-checked that the same data goes in and tried setting different numpy random seeds, but nothing changed. I tested this with python:3.7 and python:3.9, and with bertopic versions 0.10.0 and 0.9.4.

Any ideas or experience on how to make the results the same across different platforms?
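
To narrow down where the runs diverge, it may help to diff the environments themselves before looking at seeds. The following is a minimal sketch, not part of BERTopic: `environment_report` and `PACKAGES` are hypothetical names, the distribution names in the list are assumptions about what is installed, and `importlib.metadata` requires Python 3.8+ (so it would not run as-is in the python:3.7 image).

```python
import json
import platform
import sys
from importlib import metadata  # standard library from Python 3.8 onwards

# Assumed PyPI distribution names; adjust to whatever the project installs.
PACKAGES = ["bertopic", "umap-learn", "hdbscan", "numba", "numpy", "scikit-learn"]


def environment_report(packages=PACKAGES):
    """Collect interpreter, platform, and package versions as a plain dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "system": platform.system(),
        "machine": platform.machine(),  # CPU architecture as reported by the OS
        "packages": versions,
    }


if __name__ == "__main__":
    # Run this on each machine/container and diff the JSON output.
    print(json.dumps(environment_report(), indent=2, sort_keys=True))
```

Running it on the laptop, in the local container, and in the k8s pod, then diffing the three JSON outputs, would show whether the CPU architecture or any package version differs between them.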

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 20 (9 by maintainers)

Most upvoted comments

I have run into the same issue, on the same machine (iMac running Mojave), running the same dataset at different times:

  • Python version: 3.9.12 (v3.9.12:b28265d7e6, Mar 23 2022, 18:17:11)
  • numpy version: 1.23.5
  • scikit-learn version: 1.2.0
  • numba version: 0.56.4
  • umap version: 0.5.3

This happens even when using random_state=np.random.RandomState(42), as has been suggested elsewhere. Attached is an example output using the same input data, run at two different times. This is a shame, because in my limited testing UMAP outperforms tSNE, but if we can’t get the same results from session to session, it limits the usefulness. Is there a solution?

Peter.

UMAP.pdf
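
To check whether two sessions really produced different results, rather than comparing plots by eye, one option is to store a small fingerprint of each run and compare those. This is only a sketch under stated assumptions: `run_fingerprint` is a hypothetical helper, and it assumes each run yields one integer label per document, with -1 marking outliers as BERTopic does.

```python
import hashlib
from collections import Counter


def run_fingerprint(labels):
    """Summarize one run: topic count, outlier count, and a hash of the assignment."""
    counts = Counter(labels)
    # The hash depends on label order, so documents must be in the same order in every run.
    joined = ",".join(str(label) for label in labels)
    return {
        "n_docs": len(labels),
        "n_topics": sum(1 for label in counts if label != -1),  # assumes -1 = outlier
        "n_outliers": counts.get(-1, 0),
        "sha256": hashlib.sha256(joined.encode("utf-8")).hexdigest(),
    }


# Toy labels standing in for two sessions on the same input data.
session_a = [0, 0, 1, -1, 2]
session_b = [0, 1, 1, -1, 2]

fp_a = run_fingerprint(session_a)
fp_b = run_fingerprint(session_b)
print("identical runs:", fp_a["sha256"] == fp_b["sha256"])
print("topics:", fp_a["n_topics"], "vs", fp_b["n_topics"])
```

Matching hashes mean the assignments are identical label for label. A mismatch with equal topic counts may only mean the topics were numbered differently, so the hash is a strict check rather than a similarity measure.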