taichi: [Windows + CUDA] Random crashes when launching GPU kernels
I really want to use taichi with CUDA, but my system is Windows 10, so I built the project myself on Win 10 with a Titan Xp and CUDA 10.0.
I think the build succeeded: fractal.py from the tutorial runs fine under x86_64. However, when I switch to CUDA, I get this error:
[E 02/01/20 21:32:06.116] [unified_allocator.cpp:taichi::Tlang::UnifiedAllocator::UnifiedAllocator@27] Cuda Error cudaErrorInvalidDevice: invalid device ordinal
[E 02/01/20 21:32:06.116] Received signal 22 (SIGABRT)
I tracked it down to these two lines in unified_allocator.cpp:
check_cuda_errors(
cudaMemAdvise(_cuda_data, size, cudaMemAdviseSetPreferredLocation, 0));
I tried a simple standalone script based on this and found that these two lines apparently cannot be used on my machine:
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <cuda_runtime.h>

#define CUDA_ERR_CHECK(x) \
    do { cudaError_t err = x; if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error %d \"%s\" at %s:%d\n", \
            (int)err, cudaGetErrorString(err), \
            __FILE__, __LINE__); \
        exit(1); \
    }} while (0)

template<typename T>
__global__ void kernel(size_t* value)
{
    *value = sizeof(T);
}

int main()
{
    size_t size = 1024 * 1024 * 1024;
    int device_id = 0, result = 0;

    // Does device 0 support concurrent managed access?
    CUDA_ERR_CHECK(cudaDeviceGetAttribute(&result, cudaDevAttrConcurrentManagedAccess, device_id));
    //if (result) {
    std::cout << result << std::endl;

    // Unified (managed) memory path, as used by taichi's UnifiedAllocator
    size_t* managed = NULL;
    CUDA_ERR_CHECK(cudaMallocManaged(&managed, size, cudaMemAttachGlobal));
    CUDA_ERR_CHECK(cudaMemAdvise(managed, size,
        cudaMemAdviseSetPreferredLocation, 0));
    CUDA_ERR_CHECK(cudaGetLastError());
    kernel<double><<<1, 1>>>(managed);
    CUDA_ERR_CHECK(cudaGetLastError());
    CUDA_ERR_CHECK(cudaDeviceSynchronize());
    CUDA_ERR_CHECK(cudaFree(managed));

    // Plain device memory path for comparison
    size_t* memory = NULL;
    CUDA_ERR_CHECK(cudaMalloc(&memory, size));
    kernel<double><<<1, 1>>>(memory);
    CUDA_ERR_CHECK(cudaGetLastError());
    CUDA_ERR_CHECK(cudaDeviceSynchronize());
    CUDA_ERR_CHECK(cudaFree(memory));
    //}
    return 0;
}
The output is:
0
CUDA error 10 "invalid device ordinal" at C:/Users/Sireer/source/repos/test/test/kernel.cu:32
I then commented out those two lines in unified_allocator.cpp. After that, if I do not use the GUI, the fractal.py script works perfectly with CUDA. However, with the GUI enabled, the program crashes after some iterations (in some cases, it crashes in the very first iteration). I found that the culprit is the .to_numpy call: when it is called, there is a high probability that the program dies without printing any error message (a sketch of the loop in question follows the log below). Sometimes, the log is:
[T 02/01/20 22:31:16.721] [logging.cpp:taichi::Logger::Logger@68] Taichi core started. Thread ID = 19060
[Taichi version 0.4.0, cuda 10.0, commit 50e621a4]
After preprocessing:
def complex_sqr(z_by_value__):
    z = ti.expr_init(z_by_value__)
    return ti.Vector([ti.subscript(z, 0) * ti.subscript(z, 0) - ti.subscript(z, 1) * ti.subscript(z, 1), ti.subscript(z, 1) * ti.subscript(z, 0) * 2])
[I 02/01/20 22:31:16.807] [memory_pool.cpp:taichi::Tlang::MemoryPool::MemoryPool@14] Memory pool created. Default buffer size per allocator = 1024 MB
[I 02/01/20 22:31:16.965] [llvm_jit_ptx.cpp:taichi::Tlang::CUDAContext::CUDAContext@154] Using CUDA Device [id=0]: TITAN Xp
[I 02/01/20 22:31:16.966] [llvm_jit_ptx.cpp:taichi::Tlang::CUDAContext::CUDAContext@162] CUDA Device Compute Capability: 6.1
[I 02/01/20 22:31:17.055] [taichi_llvm_context.cpp:taichi::Tlang::TaichiLLVMContext::TaichiLLVMContext@57] Creating llvm context for arch: x86_64
[I 02/01/20 22:31:17.076] [C:\notebook\taichi\git\taichi\python\taichi\lang\impl.py:materialize@124] Materializing layout...
[D 02/01/20 22:31:17.076] [snode.cpp:taichi::Tlang::SNode::create_node@48] Non-power-of-two node size 640 promoted to 1024.
[D 02/01/20 22:31:17.076] [snode.cpp:taichi::Tlang::SNode::create_node@48] Non-power-of-two node size 320 promoted to 512.
[T 02/01/20 22:31:17.577] [taichi_llvm_context.cpp:taichi::Tlang::compile_runtime_bitcode@110] Compiling runtime module bitcode...
[I 02/01/20 22:31:18.609] [struct_llvm.cpp:taichi::Tlang::StructCompilerLLVM::run::<lambda_2edad8eb91b2dd8e92352d6e0f41d9de>::operator ()@286] Allocating data structure of size 2097152 B
[I 02/01/20 22:31:18.622] [unified_allocator.cpp:taichi::Tlang::UnifiedAllocator::UnifiedAllocator@17] Allocating unified (CPU+GPU) address space of size 1024 MB
Initializing runtime with 3 snode(s)...
Runtime initialized.
[D 02/01/20 22:31:19.729] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@103] Processing memory request 0
[D 02/01/20 22:31:19.736] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@110] Allocating memory 8232 B (alignment 4096B)
[D 02/01/20 22:31:19.736] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@112] Allocated. Ptr = 0xd005a8000
[D 02/01/20 22:31:19.738] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@103] Processing memory request 1
[D 02/01/20 22:31:19.738] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@110] Allocating memory 8232 B (alignment 4096B)
[D 02/01/20 22:31:19.739] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@112] Allocated. Ptr = 0xd005ab000
[D 02/01/20 22:31:19.740] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@103] Processing memory request 2
[D 02/01/20 22:31:19.740] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@110] Allocating memory 8232 B (alignment 4096B)
[D 02/01/20 22:31:19.740] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@112] Allocated. Ptr = 0xd005ae000
[D 02/01/20 22:31:19.742] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@103] Processing memory request 3
[D 02/01/20 22:31:19.742] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@110] Allocating memory 3145728 B (alignment 4096B)
[D 02/01/20 22:31:19.742] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@112] Allocated. Ptr = 0xd005b1000
[I 02/01/20 22:31:19.750] [taichi_llvm_context.cpp:taichi::Tlang::TaichiLLVMContext::TaichiLLVMContext@57] Creating llvm context for arch: cuda
[T 02/01/20 22:31:20.267] [taichi_llvm_context.cpp:taichi::Tlang::compile_runtime_bitcode@110] Compiling runtime module bitcode...
[I 02/01/20 22:31:21.014] [C:\notebook\taichi\git\taichi\python\taichi\lang\kernel.py:materialize@180] Compiling kernel paint_c4_0_...
Before preprocessing:
@ti.kernel
def paint(t: ti.f32):
    for i, j in pixels:
        c = ti.Vector([-0.8 - 0.2 * ti.sin(t), ti.sin(t) * 0.2])
        z = ti.Vector([float(i) / n - 1, float(j) / n - 0.5]) * 2
        iterations = 0
        while z.norm() < 40 and iterations < 50:
            z = complex_sqr(z) + c
            iterations += 1
        pixels[i, j] = 1 - iterations * 0.02
After preprocessing:
def paint():
    t = ti.decl_scalar_arg(ti.f32)
    if 1:
        i = ti.Expr(ti.core.make_id_expr(''))
        j = ti.Expr(ti.core.make_id_expr(''))
        ___loop_var = pixels
        ___expr_group = ti.make_expr_group(i, j)
        ti.core.begin_frontend_struct_for(___expr_group, ___loop_var.loop_range().ptr)
        c = ti.expr_init(ti.Vector([-0.8 - 0.2 * ti.sin(t), ti.sin(t) * 0.2]))
        z = ti.expr_init(ti.Vector([ti.ti_float(i) / n - 1, ti.ti_float(j) / n - 0.5]) * 2)
        iterations = ti.expr_init(0)
        if 1:
            ti.core.begin_frontend_while(ti.Expr(1).ptr)
            __while_cond = ti.expr_init(ti.logical_and(z.norm() < 40, iterations < 50))
            if 1:
                __cond = __while_cond
                ti.core.begin_frontend_if(ti.Expr(__cond).ptr)
                ti.core.begin_frontend_if_true()
                pass
                ti.core.pop_scope()
                ti.core.begin_frontend_if_false()
                ti.core.insert_break_stmt()
                ti.core.pop_scope()
            z.assign(complex_sqr(z) + c)
            iterations.augassign(1, 'Add')
            ti.core.pop_scope()
            del __while_cond
        ti.subscript(pixels, i, j).assign(1 - iterations * 0.02)
        del iterations
        del z
        del c
        ti.core.end_frontend_range_for()
        del j
        del i
[D 02/01/20 22:31:21.047] [C:\notebook\taichi\git\taichi\python\taichi\lang\kernel.py:__call__@344] Launching kernel paint...
[I 02/01/20 22:31:21.181] [llvm_jit_ptx.cpp:taichi::Tlang::CUDAContext::compile@174] PTX size: 4.34KB
[I 02/01/20 22:31:21.190] [C:\notebook\taichi\git\taichi\python\taichi\lang\kernel.py:materialize@180] Compiling kernel tensor_to_ext_arr_c8_0_...
Before preprocessing:
@ti.kernel
def tensor_to_ext_arr(tensor: ti.template(), arr: ti.ext_arr()):
    for I in ti.grouped(tensor):
        arr[I] = tensor[I]
After preprocessing:
def tensor_to_ext_arr():
    arr = ti.decl_ext_arr_arg(ti.f32, 2)
    if 1:
        ___loop_var = ti.grouped(tensor)
        I = ti.make_var_vector(size=___loop_var.loop_range().dim())
        ___expr_group = ti.make_expr_group(I)
        ti.core.begin_frontend_struct_for(___expr_group, ___loop_var.loop_range().ptr)
        ti.subscript(arr, I).assign(ti.subscript(tensor, I))
        ti.core.end_frontend_range_for()
        del I
[D 02/01/20 22:31:21.202] [C:\notebook\taichi\git\taichi\python\taichi\lang\kernel.py:__call__@344] Launching kernel tensor_to_ext_arr...
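For reference, the loop where the crash happens looks roughly like the sketch below. This is not my exact script: it uses a newer Taichi API (ti.init / ti.field) and a placeholder kernel body instead of the real Julia set computation; the relevant part is only the pixels.to_numpy() call inside the GUI loop.

import taichi as ti

ti.init(arch=ti.cuda)

n = 320
pixels = ti.field(dtype=ti.f32, shape=(n * 2, n))

@ti.kernel
def paint(t: ti.f32):
    # Placeholder kernel; fractal.py computes the Julia set here.
    for i, j in pixels:
        pixels[i, j] = 0.5 + 0.5 * ti.sin(t + 0.01 * (i + j))

gui = ti.GUI("fractal", res=(n * 2, n))
for frame in range(1000000):
    paint(frame * 0.03)
    # The crash seems to happen here: to_numpy() copies the field back
    # to the host (and forces a CUDA synchronize) before display.
    img = pixels.to_numpy()
    gui.set_image(img)
    gui.show()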
What can I do to make it work?
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 20 (12 by maintainers)
Sorry for not mentioning that I made a lot of changes to build it on Win 10. I'd like to list them here in case anyone else wants to try:
I used the script above to make a VS 2017 project. After this, I made several changes in the taichi_core and ti projects:
- change the CUDA include directory and add the pybind11 include directory;
- change the linker paths of cudart.lib and cuda.lib in the taichi_core project;
- change the preprocessor definition TLANG_CUDA_VERSION = '10.0' to TLANG_CUDA_VERSION = "10.0";
- add a few lines to the system header winnt.h;
- comment out two lines in unified_allocator.cpp;
- search for all /usr/local/ paths in the entire VS solution and change them to the correct paths on my computer.

I got the same problem after adding ti.sync() to fractal.py. I also found that the program will also crash if I comment out ti.sync() in to_numpy/from_numpy, or if I run the following script (see the sketch after the next comment):

Everything goes very well! Awesome! 😀
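A sketch of the kind of GUI-free loop referred to in the ti.sync() comment above (written against the newer ti.init / ti.field API; the kernel body and sizes are illustrative, not the original script):

import taichi as ti

ti.init(arch=ti.cuda)

n = 320
pixels = ti.field(dtype=ti.f32, shape=(n * 2, n))

@ti.kernel
def paint(t: ti.f32):
    for i, j in pixels:
        pixels[i, j] = ti.sin(t + 0.01 * i) * ti.cos(t + 0.01 * j)

# No GUI and no to_numpy: just kernel launches followed by an explicit sync.
for frame in range(10000):
    paint(frame * 0.03)
    ti.sync()
    print(frame)  # shows roughly where it dies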
Thanks for reporting! This is really informative.
So there are two issues:
1. to_numpy sometimes crashes on Windows when CUDA is enabled. Could you confirm that this issue has nothing to do with the GUI system, i.e. that to_numpy crashes even when no GUI instance is created? I suspect this is related to the memory pool and virtual memory, but I haven't confirmed it yet.
Here is a thread discussing what sounds like the same problem. They say updating the NVIDIA driver to 418.81 might help.
https://devtalk.nvidia.com/default/topic/1031803/cuda-programming-and-performance/using-unified-memory-causes-system-crash/1
I tried the latest commit. I am not sure if I did it correctly. Here is the code:
In most cases there is no difference, but I did get this log (I mean that in some cases the program still crashes without giving any error message):
Wow! You are so efficient 👍!
The latest commit on the master branch has fixed the build issues you mentioned. It works on my Windows 10 machine with CUDA 10.1. Feel free to open a PR if you run into any other build errors on your end.
to_numpy itself is nothing special, except that it forces CUDA to synchronize (i.e. cudaDeviceSynchronize()). Actually, you can crash fractal.py without to_numpy if you call ti.sync() after kernel launches… Not sure what has caused this.

Thanks! There is the same problem without the GUI when calling to_numpy or from_numpy. Actually, this bug happened at line 28, img = img.to_numpy(), in taichi.misc.gui.GUI.set_image. I ran this script and in some cases the program breaks. The program might break at different locations; a possible log might be:
To make it simple, I tried the script below, which only uses to_numpy or from_numpy. There is the same problem: the program might break when i = 20~40. Note that the program does not break immediately; it seems to pause for several seconds before it breaks. Sometimes it continues to run another 2-3 iterations after the first (or first two) shorter pause(s).
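The loop I mean is roughly like this sketch (field name, size, and fill kernel are illustrative, and it uses the newer ti.init / ti.field API rather than the 0.4.0 syntax; this is not my exact script):

import taichi as ti

ti.init(arch=ti.cuda)

n = 512
x = ti.field(dtype=ti.f32, shape=(n, n))

@ti.kernel
def fill(t: ti.f32):
    for i, j in x:
        x[i, j] = t + 0.001 * (i + j)

for i in range(100):
    fill(float(i))
    arr = x.to_numpy()   # device -> host copy (forces a CUDA sync)
    x.from_numpy(arr)    # host -> device copy
    print(i)             # crash reportedly happens somewhere around i = 20..40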