OpenBLAS: Segfault on dgemm_oncopy_HASWELL triggered by numpy.matmul inside a docker container (v0.3.13.dev)

Hey, I’m still seeing segfaults when doing numpy.matmul with two big matrices (numpy v1.20.1, OpenBLAS v0.3.13.dev).

This looks potentially related to #2728?

```
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fd34c490fa3, pid=1, tid=0x00007fd34f134740
#
# JRE version: OpenJDK Runtime Environment (8.0_242-b08) (build 1.8.0_242-b08)
# Java VM: OpenJDK 64-Bit Server VM (25.242-b08 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libopenblasp-r0-5bebc122.3.13.dev.so+0xe77fa3]  dgemm_oncopy_HASWELL+0x193
```

The stack trace points to a line that does something like np.matmul(samples, samples.T).
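For reference, a minimal sketch of the failing pattern (the shape below is hypothetical; the issue only says the matrices are big):

```python
import numpy as np

# Hypothetical shape; the real matrices in the failing job were much larger.
samples = np.random.rand(2000, 100)

# The crash occurred inside OpenBLAS's dgemm copy kernel during this call.
gram = np.matmul(samples, samples.T)
print(gram.shape)  # (2000, 2000)
```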

I was running the code in a Docker container (enterprise environment), where NumPy was installed via pip. The spec of the compute cluster, from which 6 CPUs were allocated to the container, was attached as a screenshot. [image: compute cluster spec]

threadpool_info via threadpoolctl shows the following, which confirms that OpenBLAS v0.3.13.dev is in use and that num_threads was correctly detected as 6.

```
[{'filepath': '/job/.local/lib/python3.7/site-packages/numpy.libs/libopenblasp-r0-5bebc122.3.13.dev.so',
  'internal_api': 'openblas',
  'num_threads': 6,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.13.dev'},
 {'filepath': '/job/.local/lib/python3.7/site-packages/torch/lib/libgomp-7c85b1e2.so.1',
  'internal_api': 'openmp',
  'num_threads': 6,
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'version': None},
 {'filepath': '/job/.local/lib/python3.7/site-packages/scipy.libs/libopenblasp-r0-085ca80a.3.9.so',
  'internal_api': 'openblas',
  'num_threads': 6,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.9'}]
```
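For anyone reproducing this: the listing above comes from threadpoolctl's threadpool_info() helper, which can be run inside the container like so (assuming threadpoolctl is installed):

```python
from pprint import pprint

import numpy  # noqa: F401  -- import first so its BLAS library is loaded
from threadpoolctl import threadpool_info

# Lists every BLAS/OpenMP runtime loaded into the process, with its
# filepath, version, and the thread count it will use.
pprint(threadpool_info())
```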

Let me know if you need any other information!

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 19 (5 by maintainers)

Most upvoted comments

Maybe, maybe not. Why does the JRE feature in your context at all? Are you perhaps loading Java-based libraries like libhdfs (#2821)?

The Java runtime being involved suggests to me that you might simply be exceeding the default stack size it imposes. Please check whether setting _JAVA_OPTIONS="-Xss4096k" in the environment helps.
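If the JVM is launched from within the Python process (e.g. by a Java-backed library), the variable must be in the environment before the JVM starts. A minimal sketch, assuming the option is set early in the entrypoint script:

```python
import os

# Raise the JVM's default thread stack size to 4 MB. This must run before
# any Java-backed library starts the JVM, so place it at the very top of
# the entrypoint; adjust the value as needed for your workload.
os.environ["_JAVA_OPTIONS"] = "-Xss4096k"
```

Alternatively, export the variable in the container's environment (e.g. via the Dockerfile or job config) so it applies to every process.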