OpenBLAS: Segfault when NUM_THREADS is smaller than host vCPUs

Most context can be found here: https://github.com/xianyi/OpenBLAS/pull/2982

But in summary:

A segfault occurs when OpenBLAS is built with OpenMP (USE_OPENMP=1) and the NUM_THREADS used for the build is lower than the host’s vCPU count.

[Switching to Thread 0x7ffebba6e640 (LWP 104473)]
0x00007fffe95cbe70 in dgemm_itcopy_ZEN () from /nix/store/3v2i5ga27y9al3z0d8ccwaa389qz3ma6-lapack-3/lib/liblapack.so.3
(gdb) backtrace
#0  0x00007fffe95cbe70 in dgemm_itcopy_ZEN () from /nix/store/3v2i5ga27y9al3z0d8ccwaa389qz3ma6-lapack-3/lib/liblapack.so.3
#1  0x00007fffe882416a in inner_thread () from /nix/store/3v2i5ga27y9al3z0d8ccwaa389qz3ma6-lapack-3/lib/liblapack.so.3
#2  0x00007fffe8956bf6 in exec_blas._omp_fn () from /nix/store/3v2i5ga27y9al3z0d8ccwaa389qz3ma6-lapack-3/lib/liblapack.so.3
#3  0x00007fffe347f916 in ?? () from /nix/store/f23sq7lk6xfrvz467ffkpzackyk5q8dm-gfortran-9.3.0-lib/lib/libgomp.so.1
#4  0x00007ffff7c20e9e in start_thread () from /nix/store/5didcr1sjp2rlx8abbzv92rgahsarqd9-glibc-2.32/lib/libpthread.so.0
#5  0x00007ffff793866f in clone () from /nix/store/5didcr1sjp2rlx8abbzv92rgahsarqd9-glibc-2.32/lib/libc.so.6

The current “workaround” is to set the environment variable OMP_NUM_THREADS to a value equal to or lower than the NUM_THREADS used during the build. However, the desired behavior is that it doesn’t crash.
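For Python/numpy users, the workaround can be applied from inside the process, as long as it happens before numpy (and with it OpenBLAS) is loaded. The sketch below is only an illustration: `BUILD_NUM_THREADS` and `cap_omp_threads` are hypothetical names, and the value 64 is assumed from the makeFlags in this report, so adjust it to your own build.

```python
import os

# Assumption: the NUM_THREADS value OpenBLAS was built with (64 in this report).
BUILD_NUM_THREADS = 64


def cap_omp_threads(build_limit=BUILD_NUM_THREADS, environ=os.environ):
    """Clamp OMP_NUM_THREADS so it never exceeds the build-time limit.

    Hypothetical helper, not part of OpenBLAS or numpy.
    """
    requested = environ.get("OMP_NUM_THREADS")
    try:
        # If unset, OpenMP would typically default to the CPU count.
        current = int(requested) if requested else (os.cpu_count() or 1)
    except ValueError:
        # Value is not a plain integer (e.g. a nested-level list); fall back.
        current = os.cpu_count() or 1
    capped = max(1, min(current, build_limit))
    environ["OMP_NUM_THREADS"] = str(capped)
    return capped


# Must run before the first `import numpy`, because the OpenMP runtime
# reads the variable when it initializes.
cap_omp_threads()
```

Exporting the variable in the shell before starting Python has the same effect and is less fragile, since it does not depend on import order.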

EDIT: Also of note, this fails when running the numpy test suite. OpenBLAS is able to run its own test suite without crashing. blas, openblas, and lapack are all aliased to openblas:

$ cat numpy/site.cfg
[blas]
include_dirs=/nix/store/bm0mjpbnaxixma2wvmj99zq2kfq1017h-blas-3-dev/include
library_dirs=/nix/store/35lvfykv13mnva6sipnrqks05mvrldi0-blas-3/lib
runtime_library_dirs=/nix/store/35lvfykv13mnva6sipnrqks05mvrldi0-blas-3/lib

[lapack]
include_dirs=/nix/store/s9rb0jhplgsfcnay1mp7kn1zhxyppq0n-lapack-3-dev/include
library_dirs=/nix/store/w6xjrzfnxsi7jwrlvfy10qcljm6gyhss-lapack-3/lib
runtime_library_dirs=/nix/store/w6xjrzfnxsi7jwrlvfy10qcljm6gyhss-lapack-3/lib

[openblas]
include_dirs=/nix/store/bm0mjpbnaxixma2wvmj99zq2kfq1017h-blas-3-dev/include:/nix/store/s9rb0jhplgsfcnay1mp7kn1zhxyppq0n-lapack-3-dev/include
libraries=lapack,lapacke,blas,cblas
library_dirs=/nix/store/35lvfykv13mnva6sipnrqks05mvrldi0-blas-3/lib:/nix/store/w6xjrzfnxsi7jwrlvfy10qcljm6gyhss-lapack-3/lib
runtime_library_dirs=/nix/store/35lvfykv13mnva6sipnrqks05mvrldi0-blas-3/lib:/nix/store/w6xjrzfnxsi7jwrlvfy10qcljm6gyhss-lapack-3/lib

System / Build Info

$ nix eval -f default.nix openblas.makeFlags
[ "BINARY=64" "CC=cc" "CROSS=0" "DYNAMIC_ARCH=1" "FC=gfortran" "HOSTCC=cc" "INTERFACE64=1" "NO_AVX512=1" "NO_BINARY_MODE=" "NO_SHARED=0" "NO_STATIC=1" "NUM_THREADS=64" "TARGET=ATHLON" "USE_OPENMP=1" ]

CC = gcc-9.3.0
arch = x86_64
platform = linux
CPU = 3990X (64 cores / 128 threads)

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 24 (14 by maintainers)

Most upvoted comments

OK, found the problem. What happens is:

  • the fork handler resets the blas_thread_buffer array and the blas_server_avail flag
  • A BLAS function (in this case cblas_ssyrk) calls num_cpu_avail
  • This detects a mismatch: blas_cpu_number=64 (capped in previous settings) and openmp_nthreads=256
  • This then calls goto_set_num_threads(openmp_nthreads)
  • That function adjusts the blas_thread_buffer allocating memory where required
  • exec_blas is called a bit later (by gemm) which finds blas_server_avail=0 and calls blas_thread_init which reinits the blas_thread_buffer without checking it first
  • This then runs out of buffers to allocate: NUM_BUFFERS=64*2, 32 are allocated by goto_set_num_threads, 1 by gemm directly, then 32 further ones are requested by blas_thread_init, totalling 65 -> the last entry is zero

This also doesn’t come up for small thread counts because the minimum for NUM_BUFFERS is 50, so I guess below 25 cores there won’t be an observable problem. And finally: OpenBLAS does print an error/warning that the allocation failed, but the numpy test suite swallows it.
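To make the buffer arithmetic concrete, here is a toy Python model of the counting described above. It is a sketch under assumptions: the pool size is taken to be max(50, 2 × threads), which is inferred from the "min. 50" remark and the 32 + 1 + 32 = 65 count rather than read from the OpenBLAS source, and all function names are made up for illustration.

```python
def num_buffers(threads, minimum=50):
    """Assumed pool size: two slots per thread, but never fewer than 50."""
    return max(minimum, 2 * threads)


def slots_requested(threads):
    """Slots claimed in the failing sequence described above."""
    by_set_num_threads = threads  # allocated by goto_set_num_threads
    by_gemm = 1                   # allocated by the gemm caller itself
    by_thread_init = threads      # requested again by blas_thread_init
    return by_set_num_threads + by_gemm + by_thread_init


def overflows(threads):
    """True if the sequence asks for more slots than the pool holds."""
    return slots_requested(threads) > num_buffers(threads)


for n in (8, 24, 25, 32):
    print(n, num_buffers(n), slots_requested(n), overflows(n))
```

Under this model the request count is always 2 × threads + 1, so it exceeds the pool exactly when the 50-slot minimum stops helping, which is from 25 threads upward. That matches the guess that smaller machines do not show the problem.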

PR coming up.

Sure. As it turned out, it was indeed my commit, the one you opened a PR to revert, that caused the issue, although the reason was much more complicated than expected.

@brada4 not sure what you are seeing, and I guess no disassembly is needed since gemm_tcopy_4 is plain C. It is trying to copy data from the buffer normally pointed to by b, and that buffer simply is not there because no slot was available for it in the buffers array.