scikit-learn: Segmentation Fault in KMeans on OSX
Describe the bug
Hi, when I run this code:
import numpy as np
from sklearn.cluster import KMeans
X_train = np.random.RandomState(0).random((10, 2))
kmeans = KMeans(n_clusters=3, random_state=0).fit(X_train)
EDIT by @ogrisel: inserted .RandomState(0) in the above snippet to make the reproducer deterministic.
I get this error
zsh: segmentation fault python debugging.py
This can’t be a memory error, it’s only a 10x2 matrix, any idea what’s wrong?
Python 3.10.4, Anaconda installation
sklearn.__version__ '1.0.2', numpy.__version__ '1.22.3'
Steps/Code to Reproduce
import numpy as np
from sklearn.cluster import KMeans
X_train = np.random.random((10, 2))
kmeans = KMeans(n_clusters=3, random_state=0).fit(X_train)
Expected Results
KMeans clusters
Actual Results
zsh: segmentation fault python debugging.py
This can’t be a memory error, it’s only a 10x2 matrix, any idea what’s wrong?
Versions
Python 3.10.4
Anaconda installation
>>> sklearn.__version__
'1.0.2'
>>> numpy.__version__
'1.22.3'
About this issue
- State: open
- Created 2 years ago
- Reactions: 2
- Comments: 31 (20 by maintainers)
Could you provide the full output of:
You can also quickly try to upgrade to the latest release (1.1.1) to check if the problem is resolved.
Thanks @ogrisel and @jjerphan. I can confirm that using the conda-forge channel for numpy, scipy and scikit-learn resolves the issue. Should this be explicitly mentioned in the documentation? Currently, the installation instructions say:
You could add something like this:
When using conda on macOS (only macOS and only on Intel Macs?), NumPy and SciPy must be installed from the same channel as scikit-learn (conda-forge) to avoid compatibility issues between Intel OpenMP and LLVM OpenMP.
Solutions
Here are two options for those facing this issue:
Option 1: Use conda-forge as the only channel (not recommended)
Note that any other packages you add to the environment will have to be available in the conda-forge channel, unless explicitly specified otherwise. This is why this solution is probably not recommended for most scenarios.
Option 2: Use conda-forge for scikit-learn and its dependencies (recommended)
Note that, unlike in Option 1, other packages can be installed from different sources you specify under channels (assuming they are all compatible with one another). So this is probably the recommended solution.
Dependencies
The above solutions produce different dependencies (at least on an Intel Mac), but both have been tested to work with KMeans as expected. Below is the output of sklearn.show_versions().
Option 1
Option 2
Thanks for the report. Indeed, this configuration is known to crash. The llvm libomp and the intel libiomp are known to cause a crash when installed together: either install all numpy, scipy, scikit-learn from the defaults channel (and you should only get libiomp) or install everything from the conda-forge channel (and you should get libomp alone).
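One way to check whether both runtimes end up loaded in the same process is to inspect what threadpoolctl reports. The following is a minimal sketch, assuming threadpoolctl is installed and that each entry returned by threadpool_info() carries "user_api" and "prefix" keys. The helper names openmp_runtimes and has_mixed_openmp are made up for illustration and are not part of any library.

```python
def openmp_runtimes(info=None):
    """Return the sorted prefixes of the OpenMP runtimes loaded in this process.

    `info` is a list of dicts shaped like the output of
    threadpoolctl.threadpool_info(); when omitted, it is queried live.
    """
    if info is None:
        # Imported lazily so the helper can also be fed previously saved output.
        from threadpoolctl import threadpool_info

        info = threadpool_info()
    # Keep only OpenMP entries and de-duplicate by runtime prefix
    # (for example "libomp" for LLVM or "libiomp" for Intel).
    return sorted(
        {lib["prefix"] for lib in info if lib.get("user_api") == "openmp"}
    )


def has_mixed_openmp(info=None):
    """Return True when more than one distinct OpenMP runtime is loaded."""
    return len(openmp_runtimes(info)) > 1
```

To use it, import numpy, scipy and sklearn.cluster first so that their native libraries are loaded, then call has_mixed_openmp(). Seeing both libiomp and libomp in openmp_runtimes() would be consistent with the configuration described above, although it does not by itself prove that this is the cause of a given crash.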
@matthewnour and others who can reproduce the problem, could you please post the output of sklearn.show_versions() in the environment that can trigger the problem? See @jacktang’s report here: https://github.com/scikit-learn/scikit-learn/issues/23574#issuecomment-1248889181
In particular I am interested if you also have torch’s openmp in the output and, if so, how you installed torch on your machine: do you use pip or conda? Ideally please include a set of pip or conda commands that make it possible to create a new venv or conda env that can trigger the problem.
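sklearn.show_versions() remains the report being asked for here. As a complement, the sketch below reads installed package metadata with the standard-library importlib.metadata module, which does not import any compiled extension and so should not itself load an OpenMP runtime. The default package list and the helper name installed_versions are assumptions chosen for illustration.

```python
from importlib import metadata


def installed_versions(
    packages=("scikit-learn", "numpy", "scipy", "threadpoolctl", "torch"),
):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions
```

Printing installed_versions() next to the output of sklearn.show_versions() gives a quick view of which of these packages are present. It does not say which channel or installer each one came from.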
Personally I do not experience this problem on my M1 laptop and I install most dependencies from conda-forge (using mambaforge): https://github.com/conda-forge/miniforge#mambaforge
Same issue for me. M1 Mac. In case anyone has found a solution.
@jjerphan, yes, occasionally I use other conda channels for packages that instruct so. For example, plotly::plotly, plotly::python-kaleido and pytorch::pytorch.
I think adding a short sentence to the install docs would be good. I’d make it something like the advice from https://github.com/scikit-learn/scikit-learn/issues/23574#issuecomment-1782261722 - “make sure to install numpy, scipy and scikit-learn from the same conda channel”.
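The "same conda channel" advice can be checked mechanically. The sketch below is one possible approach, assuming that conda list --json returns a list of records with "name" and "channel" keys. The helper names (channels_by_package, same_channel, conda_records) and the WATCHED tuple are hypothetical.

```python
import json
import subprocess

# Packages that the advice says should come from one channel.
WATCHED = ("numpy", "scipy", "scikit-learn")


def channels_by_package(records, watched=WATCHED):
    """Map each watched package found in `records` to its reported channel."""
    return {
        rec["name"]: rec.get("channel", "unknown")
        for rec in records
        if rec.get("name") in watched
    }


def same_channel(records, watched=WATCHED):
    """Return True when all watched packages found share a single channel."""
    return len(set(channels_by_package(records, watched).values())) <= 1


def conda_records():
    """Parse `conda list --json` for the active environment (needs conda on PATH)."""
    out = subprocess.run(
        ["conda", "list", "--json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)
```

Calling same_channel(conda_records()) in the affected environment returns False when, for example, numpy comes from the defaults channel while scikit-learn comes from conda-forge. channels_by_package(conda_records()) shows which package came from where.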
People in the situation described by @ogrisel might be interested in reading those notes from threadpoolctl.
Also note, we just merged #27167 that might or might not impact the crash you observed in this issue. If anybody who could reproduce #23574 can tell me if #27167 fixes the problem for them, that would be very nice.
It would be very helpful if you could:
- post the output of conda list for future reference
- create a new conda environment using the conda-forge channel, e.g. conda create -n test-env -c conda-forge scikit-learn
- use gdb or lldb to get a stack-trace from the segmentation fault, see https://github.com/scikit-learn/scikit-learn/issues/23574#issuecomment-1152379936 for gdb
@EoinKenny as a work-around you may want to try to create a new environment from conda-forge and see whether the problem persists, i.e. something like this:
You could also use gdb to see whether it gives some useful additional info.
Then type r (for run) and then bt (for backtrace). There may be more info about gdb and Python in https://wiki.python.org/moin/DebuggingWithGdb.
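As a lighter-weight complement to gdb or lldb, the standard-library faulthandler module can print the Python-level traceback when the interpreter receives a segmentation fault. It does not show native frames, so a debugger is still needed to see where inside the compiled code the crash happens. The sketch below wraps the reproducer from this issue; the helper names run_with_traceback_on_crash and reproducer are made up for illustration.

```python
import faulthandler
import sys


def run_with_traceback_on_crash(func, *args, **kwargs):
    """Call `func` with faulthandler enabled so a segfault dumps the Python stack."""
    if not faulthandler.is_enabled():
        # Prefer the original stderr in case sys.stderr has been replaced.
        stream = sys.__stderr__ or sys.stderr
        faulthandler.enable(file=stream, all_threads=True)
    return func(*args, **kwargs)


def reproducer():
    """The snippet from this issue; imports are local so they run under faulthandler."""
    import numpy as np
    from sklearn.cluster import KMeans

    X_train = np.random.RandomState(0).random((10, 2))
    return KMeans(n_clusters=3, random_state=0).fit(X_train)
```

Calling run_with_traceback_on_crash(reproducer) in the affected environment should print the Python call stack of each thread to stderr before the process dies. The same effect is available without code changes by running python -X faulthandler debugging.py.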