OpenBLAS: sdot yields wrong results for odd sizes > 2^24 and all sizes > 2^29
When I compute a dot product of a vector of all ones with itself, the result should be the size of the vector. This starts to no longer hold at size 2^24 + 1 as the following program shows:
// file tmp.c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include "cblas.h"
int main(int argc, char **argv) {
int exp = atoi(argv[1]);
int offset = atoi(argv[2]);
size_t size = (1 << exp) + offset;
float *a = (float *) malloc(size * sizeof(float));
assert(a != NULL);
for (size_t i=0; i < size; ++i)
a[i] = 1;
float dot = cblas_sdot(size, a, 1, a, 1);
printf("dot = %f\n", dot);
printf("size = %zu = (1 << %i) + (%i)\n", size, exp, offset);
return 0;
}
I get
$ ./tmp 24 0
dot = 16777216.000000
size = 16777216 = (1 << 24) + (0)
$ ./tmp 24 1
dot = 16777216.000000
size = 16777217 = (1 << 24) + (1)
$ ./tmp 24 2
dot = 16777218.000000
size = 16777218 = (1 << 24) + (2)
$ ./tmp 24 3
dot = 16777220.000000
size = 16777219 = (1 << 24) + (3)
$ ./tmp 24 4
dot = 16777220.000000
size = 16777220 = (1 << 24) + (4)
$ ./tmp 24 3
dot = 16777220.000000
size = 16777221 = (1 << 24) + (5)
and then, for sizes larger than 2^29 the result just stays constant at 2^29:
$ ./tmp 29 1
dot = 536870912.000000
size = 536870913 = (1 << 29) + (1)
$ ./tmp 29 2
dot = 536870912.000000
size = 536870914 = (1 << 29) + (2)
$ ./tmp 29 4
dot = 536870912.000000
size = 536870916 = (1 << 29) + (4)
$ ./tmp 30 -1
dot = 536870912.000000
size = 1073741823 = (1 << 30) + (-1)
$ ./tmp 30 0
dot = 536870912.000000
size = 1073741824 = (1 << 30) + (0)
I’m using the latest masterdevelop version of OpenBLAS, and I statically link the library to not accidentally dynamically link an older version:
$ gcc -o tmp tmp.c [...]/OpenBLAS/build/lib/libopenblas.a -I[...]/OpenBLAS/build/include -std=c99
$ ldd tmp
linux-vdso.so.1 (0x00007ffe2ed79000)
libc.so.6 => /lib64/libc.so.6 (0x00007f58b2fdc000)
/lib64/ld-linux-x86-64.so.2 (0x00007f58b338c000)
Some more info:
$ uname -a
Linux sabayon 4.11.0-sabayon #1 SMP Wed Jul 26 12:57:06 UTC 2017 x86_64 Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz GenuineIntel GNU/Linux
See also this related Numpy issue
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 65 (24 by maintainers)
Seems to me that “float input, double accumulation and result” is what dsdot (cblas_dsdot in your case) was written for, perhaps worth a try.
that yields parity difference a compiler loop vectorisation property. Ill try to get to clang-vs-gcc at least.