arrayfire: af::inverse CUDA error
Hello,
I recently changed my GPU from a Quadro M2000, 4096 MB (CUDA Compute 5.2) to a GeForce GTX 1080 Ti, 11264 MB (CUDA Compute 6.1) to get more speed and fewer memory limitations.
I am trying to run a previously validated algorithm on “bigger” data, but I get a CUDA error: the screen goes black and all GPU applications crash.
ArrayFire v3.6.2 (CUDA, 64-bit Windows, build dc38ef13)
Platform: CUDA Toolkit 10, Driver: CUDA Driver Version: 10010
[0] GeForce GTX 1080 Ti, 11264 MB, CUDA Compute 6.1
Nvidia 3D Vision Controller Driver 390.41
Nvidia 3D Vision Driver 419.35
Nvidia Graphics Driver 419.35
Nvidia CUDA Development 9.1 (not using it but it is installed)
I can reproduce the error with the simple code below. The error occurs randomly regardless of the size, but seems to be systematic with large arrays (N > 5000).
#include <arrayfire.h>
#include <cstdio>
#include <cstdlib>
using namespace af;
void function();
int main(int argc, char *argv[])
{
    try {
        // Select a device and display arrayfire info
        int device = argc > 1 ? atoi(argv[1]) : 0;
        af::setDevice(device);
        af::info();
        // Repeat the computation; the failure appears after a few iterations
        for (int i = 0; i < 1000; i++) {
            function();
        }
    }
    catch (af::exception& e) {
        fprintf(stderr, "%s\n", e.what());
        throw;
    }
    return 0;
}
void function() {
    int N = 2000;
    int D = 3;
    af::array ones = constant(1, N, f32);
    af::array rand1 = randu(N, D, f32);
    af::array q2 = randu(N, N - D - 1, f32); // size is [N][N-D-1]
    af::array K = randu(N, N, f32); // size is [N][N]
    af::array id1 = 0.001 * af::identity(N - D - 1, N - D - 1, f32);
    af::array T = join(1, ones, rand1); // size is [N][D+1]
    // gamma = inverse(q2^T * K * q2 + id1) * (q2^T * T)
    af::array gamma = matmul(inverse(matmul(matmul(q2, K, AF_MAT_TRANS), q2) + id1), matmul(q2, T, AF_MAT_TRANS));
}
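As a sanity check on the dimensions in the reproducer, the shapes of the matmul chain can be traced without a GPU. The sketch below is plain C++ with no ArrayFire dependency. `Shape`, `matmul_shape` and `gamma_shape` are hypothetical helpers introduced only for illustration, and the transpose flag merely mimics how AF_MAT_TRANS is applied to the left operand in the code above.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical helper type: rows x cols of a 2-D array.
struct Shape {
    long rows;
    long cols;
};

// Shape of lhs * rhs, optionally transposing lhs first
// (mirrors matmul(lhs, rhs, AF_MAT_TRANS) in the reproducer).
inline Shape matmul_shape(Shape lhs, Shape rhs, bool transpose_lhs = false) {
    if (transpose_lhs) lhs = Shape{lhs.cols, lhs.rows};
    if (lhs.cols != rhs.rows) throw std::invalid_argument("inner dimensions differ");
    return Shape{lhs.rows, rhs.cols};
}

// Shape of gamma for given N and D, following the expression in function().
inline Shape gamma_shape(long N, long D) {
    assert(N > D + 1 && "N - D - 1 must be positive");
    const Shape q2{N, N - D - 1};
    const Shape K{N, N};
    const Shape T{N, D + 1};  // join(1, ones, rand1)
    const Shape id1{N - D - 1, N - D - 1};

    // q2^T * K * q2 -> [N-D-1][N-D-1], the matrix passed to inverse()
    const Shape A = matmul_shape(matmul_shape(q2, K, true), q2);
    if (A.rows != id1.rows || A.cols != id1.cols)
        throw std::invalid_argument("id1 does not match q2^T * K * q2");

    // q2^T * T -> [N-D-1][D+1]
    const Shape B = matmul_shape(q2, T, true);

    // inverse(A) has the same shape as A, so gamma = inverse(A) * B
    return matmul_shape(A, B);
}
```

For N = 2000 and D = 3, the matrix passed to inverse() is 1996 x 1996 and gamma is 1996 x 4, so the shapes in the reproducer are consistent.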
ArrayFire Exception (Internal error:998):
In function struct cusolverDnContext *__cdecl cuda::solverDnHandle(void)
In file src\backend\cuda\platform.cpp:487
CUDA Error (4): unspecified launch failure
In function class af::array __cdecl af::inverse(const class af::array &,const af_mat_prop)
In file src\api\cpp\lapack.cpp:116
Does anyone have an idea of what could cause this error? The only things I changed are my GPU and my Nvidia drivers. The ArrayFire code is still the same.
Thank you!
Edit: by adding “af::sync();” after the function call, I got the following error (also after a few iterations):
ArrayFire Exception (Internal error:998):
In function void __cdecl cuda::sync(int)
In file src\backend\cuda\platform.cpp:642
CUDA Error (4): unspecified launch failure
In function void __cdecl af::sync(const int)
In file src\api\cpp\device.cpp:115
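Since GPU work is generally queued asynchronously, an error can surface in a later call than the one that triggered it; synchronizing after every iteration, as in the edit above, helps narrow down which iteration fails. A minimal sketch of such a loop follows. `run_with_sync` is a hypothetical helper, not part of ArrayFire; it takes the work and the synchronization step as callables so the loop logic can be tried without a GPU. It assumes failures arrive as exceptions derived from std::exception, which is the case for af::exception.

```cpp
#include <cassert>
#include <cstdio>
#include <exception>
#include <functional>

// Runs `step` up to `iterations` times, calling `sync` after each one.
// Returns the index of the first iteration that threw, or -1 if none did.
inline int run_with_sync(int iterations,
                         const std::function<void()>& step,
                         const std::function<void()>& sync) {
    assert(step && sync);  // both callables must be non-empty
    for (int i = 0; i < iterations; ++i) {
        try {
            step();
            sync();  // wait for pending work so errors surface in this iteration
        } catch (const std::exception& e) {
            std::fprintf(stderr, "iteration %d failed: %s\n", i, e.what());
            return i;
        }
    }
    return -1;
}
```

With the reproducer above, the call would look like run_with_sync(1000, function, [] { af::sync(); }); and the returned index shows how many iterations completed before the failure.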
About this issue
- State: closed
- Created 5 years ago
- Comments: 30 (11 by maintainers)
Update:
Using the AF source code from GitHub to build the small project to stress the code, I get the same crash.
I will try to downgrade the Nvidia drivers tomorrow to see if I have the same problem.
Okay, please give me some time. I will run this program on Windows and get back to you. I am unable to reproduce this on Linux at the moment.