OpenBLAS: COPY memory alignment SIGSEGV bug
Minimal example check_copy.c, compiled via
$ cc -g -o check_copy check_copy.c /usr/lib/libopenblas.so.0
Error occurs for offset == 1:
$ valgrind ./check_copy
==16357== Memcheck, a memory error detector
==16357== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==16357== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==16357== Command: ./check_copy
==16357==
1.000000 1.000000
0.500000 0.500000
0.333333 0.333333
==16357==
==16357== Process terminating with default action of signal 11 (SIGSEGV)
==16357== General Protection Fault
==16357== at 0x64FB980: dcopy_k_HASWELL (in /usr/lib/libopenblasp-r0.2.18.so)
==16357== by 0x6EEE82F: (below main) (libc-start.c:291)
Changing the increment of offset to offset += 8 makes the code run fine.
System:
$ uname -a
Linux *** 4.4.0-67-generic #88-Ubuntu SMP Wed Mar 8 16:34:45 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ cc --version
cc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
$ zcat /usr/share/doc/libopenblas-base/changelog.Debian.gz | head -1
openblas (0.2.18-1ubuntu1) xenial; urgency=medium
$ cat /proc/cpuinfo | head -30
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
stepping : 2
microcode : 0x36
cpu MHz : 1201.921
cache size : 15360 KB
physical id : 0
siblings : 12
core id : 0
cpu cores : 6
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 15
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
bugs :
bogomips : 6599.88
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
I know it is not a good idea to have vectors that are not aligned at 8-byte address boundaries. However, I would expect the code not to crash in this case.
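Until the library handles this, one caller-side workaround is to stage misaligned data in an aligned temporary buffer before handing it to a BLAS routine. The following is a minimal sketch of that idea, not the fix that went into OpenBLAS. The helper names `is_double_aligned` and `stage_aligned` are hypothetical, and the sketch assumes a C11 compiler (for `_Alignof`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Nonzero if p is suitably aligned to be accessed as a double. */
int is_double_aligned(const void *p)
{
    return ((uintptr_t)p % _Alignof(double)) == 0;
}

/*
 * Hypothetical helper: copy n doubles from a possibly misaligned address
 * into a freshly allocated buffer. malloc() returns storage aligned for
 * any fundamental type, so the result is safe to pass on as a double *.
 * Returns NULL on allocation failure; the caller frees the buffer.
 */
double *stage_aligned(const void *src, size_t n)
{
    assert(src != NULL);
    if (n > SIZE_MAX / sizeof(double))
        return NULL;                      /* size computation would overflow */

    /* Allocate at least one element so that n == 0 still yields a buffer. */
    double *buf = malloc(n ? n * sizeof(double) : sizeof(double));
    if (buf != NULL)
        memcpy(buf, src, n * sizeof(double)); /* byte-wise, alignment-agnostic */
    return buf;
}
```

A caller would check `is_double_aligned(ptr)` first and only pay for the extra copy when the check fails.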
About this issue
- State: closed
- Created 7 years ago
- Comments: 33 (12 by maintainers)
@martin-frbg Your solution allows me to correctly run both the check_copy.c example and my application. I am not so surprised that my application runs smoothly after fixing DCOPY: if there was a similar problem in the BLAS Level 2 and Level 3 routines that we mostly use, I guess somebody would have found it much earlier. Thank you for your help. I hope that this fix or a similar one will be included in the next release of OpenBLAS.
@brada4 The data is obviously 4-byte aligned here... The problem is the 16-byte alignment (i.e. the width of 4 floats) required by SSE instructions, whereas the sub-vector or sub-matrix that we want to copy does not necessarily start at an index that is a multiple of 4 within the larger allocated vector or matrix.
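To make the arithmetic concrete, here is a small sketch that computes how far a sub-vector start is from a 16-byte boundary. The function names are hypothetical, and it assumes 4-byte floats and a base pointer that is itself 16-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Distance in bytes from p back to the previous `alignment` boundary.
 * `alignment` must be a nonzero power of two. */
size_t misalignment(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size_t)((uintptr_t)p & (alignment - 1));
}

/* Misalignment, relative to 16 bytes, of the sub-vector that starts at
 * element `start` of the float array `base`. With a 16-byte-aligned base
 * and 4-byte floats this is 0 only when start is a multiple of 4. */
size_t subvector_misalignment(const float *base, size_t start)
{
    return misalignment(base + start, 16);
}
```

So a sub-vector starting at element 1, 2 or 3 of an aligned allocation is still 4-byte aligned, but not 16-byte aligned.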
It is common sense to align 4-byte (float) data to 4 bytes. Yours is not, and glibc memcpy() will crash the same way.
Most kernels assume aligned input anyway, but that does not mean they need to crash in unaligned cases. That is, they start with 2^n-sized data blocks and then process the tail of the array in smaller blocks. To handle unaligned input optimally, they must also process the head of the input in smaller pieces until the start of an aligned block is reached.
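The head/body/tail scheme described above could look roughly like this in C. It is only a sketch, not the OpenBLAS kernel: `copy_doubles_peeled` and `BLOCK_BYTES` are hypothetical names, the 32-byte block width is an assumption, and plain memcpy() stands in for the vector loads and stores a real kernel would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 32u /* assumed block width, e.g. one 256-bit register */

/* Copy n doubles from src to dst. Neither pointer needs to be aligned,
 * so both are taken as void * and accessed byte-wise through memcpy(). */
void copy_doubles_peeled(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    unsigned char *d = dst;
    const unsigned char *s = src;
    const size_t elem = sizeof(double);
    const size_t per_block = BLOCK_BYTES / elem;
    size_t i = 0;

    /* Head: one element at a time until the source reaches a block
     * boundary. If src is not even element-aligned, no boundary is ever
     * reached and the whole copy stays in this element-wise loop. */
    while (i < n && ((uintptr_t)(s + i * elem) % BLOCK_BYTES) != 0) {
        memcpy(d + i * elem, s + i * elem, elem);
        i++;
    }

    /* Body: whole blocks, each starting at an aligned source address.
     * Only the source is aligned here; the destination may still need
     * unaligned stores in a real kernel. */
    for (; i + per_block <= n; i += per_block)
        memcpy(d + i * elem, s + i * elem, BLOCK_BYTES);

    /* Tail: leftover elements that do not fill a block. */
    for (; i < n; i++)
        memcpy(d + i * elem, s + i * elem, elem);
}
```

The point of the head loop is that the fast path never sees an unaligned address, so misaligned input costs some speed instead of raising a fault.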