taichi: [Windows + CUDA] Random crashes when launching GPU kernels
I really want to use taichi with CUDA, but my system is Windows 10, so I built the project myself on Win 10 with a Titan Xp and CUDA 10.0.
I think the build succeeded: fractal.py from the tutorial runs fine under x86_64. However, when I switch to CUDA, I get this error:
[E 02/01/20 21:32:06.116] [unified_allocator.cpp:taichi::Tlang::UnifiedAllocator::UnifiedAllocator@27] Cuda Error cudaErrorInvalidDevice: invalid device ordinal
[E 02/01/20 21:32:06.116] Received signal 22 (SIGABRT)
I tracked it down to these two lines in unified_allocator.cpp:
check_cuda_errors(
cudaMemAdvise(_cuda_data, size, cudaMemAdviseSetPreferredLocation, 0));
I tried a simple standalone script based on this and found that these two lines apparently cannot be used on my machine:
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <cuda_runtime.h>

#define CUDA_ERR_CHECK(x) \
    do { cudaError_t err = x; if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error %d \"%s\" at %s:%d\n", \
            (int)err, cudaGetErrorString(err), \
            __FILE__, __LINE__); \
        exit(1); \
    }} while (0)

template<typename T>
__global__ void kernel(size_t* value)
{
    *value = sizeof(T);
}

int main()
{
    size_t size = 1024 * 1024 * 1024;
    int device_id = 0, result = 0;

    // Does device 0 support concurrent managed access?
    CUDA_ERR_CHECK(cudaDeviceGetAttribute(&result, cudaDevAttrConcurrentManagedAccess, device_id));
    //if (result) {
    std::cout << result << std::endl;

    // Unified (managed) memory path, as used by taichi's UnifiedAllocator
    size_t* managed = NULL;
    CUDA_ERR_CHECK(cudaMallocManaged(&managed, size, cudaMemAttachGlobal));
    CUDA_ERR_CHECK(cudaMemAdvise(managed, size,
        cudaMemAdviseSetPreferredLocation, 0));
    CUDA_ERR_CHECK(cudaGetLastError());
    kernel<double><<<1, 1>>>(managed);
    CUDA_ERR_CHECK(cudaGetLastError());
    CUDA_ERR_CHECK(cudaDeviceSynchronize());
    CUDA_ERR_CHECK(cudaFree(managed));

    // Plain device memory path for comparison
    size_t* memory = NULL;
    CUDA_ERR_CHECK(cudaMalloc(&memory, size));
    kernel<double><<<1, 1>>>(memory);
    CUDA_ERR_CHECK(cudaGetLastError());
    CUDA_ERR_CHECK(cudaDeviceSynchronize());
    CUDA_ERR_CHECK(cudaFree(memory));
    //}
    return 0;
}
The output is:
0
CUDA error 10 "invalid device ordinal" at C:/Users/Sireer/source/repos/test/test/kernel.cu:32
I then commented out those two lines in unified_allocator.cpp. After that, if I do not use the GUI, the fractal.py script works perfectly with CUDA. However, with the GUI enabled, the program crashes after some iterations (in some cases, it crashes in the very first iteration). I found that the culprit is the .to_numpy call: when it is called, there is a high probability that the program dies without printing any error message (a sketch of the loop in question follows the log below). Sometimes, the log is:
[T 02/01/20 22:31:16.721] [logging.cpp:taichi::Logger::Logger@68] Taichi core started. Thread ID = 19060
[Taichi version 0.4.0, cuda 10.0, commit 50e621a4]
After preprocessing:
def complex_sqr(z_by_value__):
    z = ti.expr_init(z_by_value__)
    return ti.Vector([ti.subscript(z, 0) * ti.subscript(z, 0) - ti.subscript(z, 1) * ti.subscript(z, 1), ti.subscript(z, 1) * ti.subscript(z, 0) * 2])
[I 02/01/20 22:31:16.807] [memory_pool.cpp:taichi::Tlang::MemoryPool::MemoryPool@14] Memory pool created. Default buffer size per allocator = 1024 MB
[I 02/01/20 22:31:16.965] [llvm_jit_ptx.cpp:taichi::Tlang::CUDAContext::CUDAContext@154] Using CUDA Device [id=0]: TITAN Xp
[I 02/01/20 22:31:16.966] [llvm_jit_ptx.cpp:taichi::Tlang::CUDAContext::CUDAContext@162] CUDA Device Compute Capability: 6.1
[I 02/01/20 22:31:17.055] [taichi_llvm_context.cpp:taichi::Tlang::TaichiLLVMContext::TaichiLLVMContext@57] Creating llvm context for arch: x86_64
[I 02/01/20 22:31:17.076] [C:\notebook\taichi\git\taichi\python\taichi\lang\impl.py:materialize@124] Materializing layout...
[D 02/01/20 22:31:17.076] [snode.cpp:taichi::Tlang::SNode::create_node@48] Non-power-of-two node size 640 promoted to 1024.
[D 02/01/20 22:31:17.076] [snode.cpp:taichi::Tlang::SNode::create_node@48] Non-power-of-two node size 320 promoted to 512.
[T 02/01/20 22:31:17.577] [taichi_llvm_context.cpp:taichi::Tlang::compile_runtime_bitcode@110] Compiling runtime module bitcode...
[I 02/01/20 22:31:18.609] [struct_llvm.cpp:taichi::Tlang::StructCompilerLLVM::run::<lambda_2edad8eb91b2dd8e92352d6e0f41d9de>::operator ()@286] Allocating data structure of size 2097152 B
[I 02/01/20 22:31:18.622] [unified_allocator.cpp:taichi::Tlang::UnifiedAllocator::UnifiedAllocator@17] Allocating unified (CPU+GPU) address space of size 1024 MB
Initializing runtime with 3 snode(s)...
Runtime initialized.
[D 02/01/20 22:31:19.729] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@103] Processing memory request 0
[D 02/01/20 22:31:19.736] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@110] Allocating memory 8232 B (alignment 4096B)
[D 02/01/20 22:31:19.736] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@112] Allocated. Ptr = 0xd005a8000
[D 02/01/20 22:31:19.738] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@103] Processing memory request 1
[D 02/01/20 22:31:19.738] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@110] Allocating memory 8232 B (alignment 4096B)
[D 02/01/20 22:31:19.739] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@112] Allocated. Ptr = 0xd005ab000
[D 02/01/20 22:31:19.740] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@103] Processing memory request 2
[D 02/01/20 22:31:19.740] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@110] Allocating memory 8232 B (alignment 4096B)
[D 02/01/20 22:31:19.740] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@112] Allocated. Ptr = 0xd005ae000
[D 02/01/20 22:31:19.742] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@103] Processing memory request 3
[D 02/01/20 22:31:19.742] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@110] Allocating memory 3145728 B (alignment 4096B)
[D 02/01/20 22:31:19.742] [memory_pool.cpp:taichi::Tlang::MemoryPool::daemon@112] Allocated. Ptr = 0xd005b1000
[I 02/01/20 22:31:19.750] [taichi_llvm_context.cpp:taichi::Tlang::TaichiLLVMContext::TaichiLLVMContext@57] Creating llvm context for arch: cuda
[T 02/01/20 22:31:20.267] [taichi_llvm_context.cpp:taichi::Tlang::compile_runtime_bitcode@110] Compiling runtime module bitcode...
[I 02/01/20 22:31:21.014] [C:\notebook\taichi\git\taichi\python\taichi\lang\kernel.py:materialize@180] Compiling kernel paint_c4_0_...
Before preprocessing:
@ti.kernel
def paint(t: ti.f32):
    for i, j in pixels:
        c = ti.Vector([-0.8 - 0.2 * ti.sin(t), ti.sin(t) * 0.2])
        z = ti.Vector([float(i) / n - 1, float(j) / n - 0.5]) * 2
        iterations = 0
        while z.norm() < 40 and iterations < 50:
            z = complex_sqr(z) + c
            iterations += 1
        pixels[i, j] = 1 - iterations * 0.02
After preprocessing:
def paint():
    t = ti.decl_scalar_arg(ti.f32)
    if 1:
        i = ti.Expr(ti.core.make_id_expr(''))
        j = ti.Expr(ti.core.make_id_expr(''))
        ___loop_var = pixels
        ___expr_group = ti.make_expr_group(i, j)
        ti.core.begin_frontend_struct_for(___expr_group, ___loop_var.loop_range().ptr)
        c = ti.expr_init(ti.Vector([-0.8 - 0.2 * ti.sin(t), ti.sin(t) * 0.2]))
        z = ti.expr_init(ti.Vector([ti.ti_float(i) / n - 1, ti.ti_float(j) / n - 0.5]) * 2)
        iterations = ti.expr_init(0)
        if 1:
            ti.core.begin_frontend_while(ti.Expr(1).ptr)
            __while_cond = ti.expr_init(ti.logical_and(z.norm() < 40, iterations < 50))
            if 1:
                __cond = __while_cond
                ti.core.begin_frontend_if(ti.Expr(__cond).ptr)
                ti.core.begin_frontend_if_true()
                pass
                ti.core.pop_scope()
                ti.core.begin_frontend_if_false()
                ti.core.insert_break_stmt()
                ti.core.pop_scope()
            z.assign(complex_sqr(z) + c)
            iterations.augassign(1, 'Add')
            ti.core.pop_scope()
            del __while_cond
        ti.subscript(pixels, i, j).assign(1 - iterations * 0.02)
        del iterations
        del z
        del c
        ti.core.end_frontend_range_for()
        del j
        del i
[D 02/01/20 22:31:21.047] [C:\notebook\taichi\git\taichi\python\taichi\lang\kernel.py:__call__@344] Launching kernel paint...
[I 02/01/20 22:31:21.181] [llvm_jit_ptx.cpp:taichi::Tlang::CUDAContext::compile@174] PTX size: 4.34KB
[I 02/01/20 22:31:21.190] [C:\notebook\taichi\git\taichi\python\taichi\lang\kernel.py:materialize@180] Compiling kernel tensor_to_ext_arr_c8_0_...
Before preprocessing:
@ti.kernel
def tensor_to_ext_arr(tensor: ti.template(), arr: ti.ext_arr()):
    for I in ti.grouped(tensor):
        arr[I] = tensor[I]
After preprocessing:
def tensor_to_ext_arr():
    arr = ti.decl_ext_arr_arg(ti.f32, 2)
    if 1:
        ___loop_var = ti.grouped(tensor)
        I = ti.make_var_vector(size=___loop_var.loop_range().dim())
        ___expr_group = ti.make_expr_group(I)
        ti.core.begin_frontend_struct_for(___expr_group, ___loop_var.loop_range().ptr)
        ti.subscript(arr, I).assign(ti.subscript(tensor, I))
        ti.core.end_frontend_range_for()
        del I
[D 02/01/20 22:31:21.202] [C:\notebook\taichi\git\taichi\python\taichi\lang\kernel.py:__call__@344] Launching kernel tensor_to_ext_arr...
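For reference, the loop where the crash happens looks roughly like the sketch below. This is not my exact script: it uses a newer Taichi API (ti.init / ti.field) and a placeholder kernel body instead of the real Julia set computation; the relevant part is only the pixels.to_numpy() call inside the GUI loop.

import taichi as ti

ti.init(arch=ti.cuda)

n = 320
pixels = ti.field(dtype=ti.f32, shape=(n * 2, n))

@ti.kernel
def paint(t: ti.f32):
    # Placeholder kernel; fractal.py computes the Julia set here.
    for i, j in pixels:
        pixels[i, j] = 0.5 + 0.5 * ti.sin(t + 0.01 * (i + j))

gui = ti.GUI("fractal", res=(n * 2, n))
for frame in range(1000000):
    paint(frame * 0.03)
    # The crash seems to happen here: to_numpy() copies the field back
    # to the host (and forces a CUDA synchronize) before display.
    img = pixels.to_numpy()
    gui.set_image(img)
    gui.show()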
What can I do to make it work?
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 20 (12 by maintainers)
Sorry for not mentioning that I made a lot of changes to build it on Win 10. I'd like to list them here in case anyone else wants to try:
I used the script above to make a VS 2017 project. After this, I made several changes in the taichi_core and ti projects:
- change the CUDA include directory and add the pybind11 include directory;
- change the linker paths of cudart.lib and cuda.lib in the taichi_core project;
- change the preprocessor definition TLANG_CUDA_VERSION = '10.0' to TLANG_CUDA_VERSION = "10.0";
- add a few lines to the system header winnt.h;
- comment out two lines in unified_allocator.cpp;
- search for all /usr/local/ paths in the entire VS solution and change them to the correct paths on my computer.

I got the same problem after adding ti.sync() to fractal.py. I also found that the program will also crash if I comment out ti.sync() in to_numpy/from_numpy, or if I run the following script (see the sketch after the next comment):

Everything goes very well! Awesome! 😀
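A sketch of the kind of GUI-free loop referred to in the ti.sync() comment above (written against the newer ti.init / ti.field API; the kernel body and sizes are illustrative, not the original script):

import taichi as ti

ti.init(arch=ti.cuda)

n = 320
pixels = ti.field(dtype=ti.f32, shape=(n * 2, n))

@ti.kernel
def paint(t: ti.f32):
    for i, j in pixels:
        pixels[i, j] = ti.sin(t + 0.01 * i) * ti.cos(t + 0.01 * j)

# No GUI and no to_numpy: just kernel launches followed by an explicit sync.
for frame in range(10000):
    paint(frame * 0.03)
    ti.sync()
    print(frame)  # shows roughly where it dies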
Thanks for reporting! This is really informative.
So there are two issues:
1. to_numpy sometimes crashes on Windows when CUDA is enabled. Could you confirm that this issue has nothing to do with the GUI system, i.e. that to_numpy crashes even when no GUI instance is created? I suspect this is related to the memory pool and virtual memory, but I haven't confirmed it yet.
Here is a thread discussing what sounds like the same problem. They say updating the NVIDIA driver to 418.81 might help.
https://devtalk.nvidia.com/default/topic/1031803/cuda-programming-and-performance/using-unified-memory-causes-system-crash/1
I tried the latest commit. I am not sure if I did it correctly. Here is the code:
In most cases there is no difference, but I did get this log (I mean that in some cases the program still crashes without giving any error message):
Wow! You are so efficient 👍!
The latest commit on the master branch has fixed the build issues you mentioned. It works on my Windows 10 machine with CUDA 10.1. Feel free to open a PR if you run into any other build errors on your end.
to_numpy itself is nothing special, except that it forces CUDA to synchronize (i.e. cudaDeviceSynchronize()). Actually, you can crash fractal.py without to_numpy if you call ti.sync() after kernel launches… Not sure what has caused this.

Thanks! There is the same problem without the GUI when calling to_numpy or from_numpy. Actually, this bug happened at line 28, img = img.to_numpy(), in taichi.misc.gui.GUI.set_image. I ran this script and in some cases the program breaks. The program might break at different locations; a possible log might be:
To make it simple, I tried the script below, which only uses to_numpy or from_numpy. There is the same problem: the program might break when i = 20~40. Note that the program does not break immediately; it seems to pause for several seconds before it breaks. Sometimes it continues to run another 2-3 iterations after the first (or first two) shorter pause(s).
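The loop I mean is roughly like this sketch (field name, size, and fill kernel are illustrative, and it uses the newer ti.init / ti.field API rather than the 0.4.0 syntax; this is not my exact script):

import taichi as ti

ti.init(arch=ti.cuda)

n = 512
x = ti.field(dtype=ti.f32, shape=(n, n))

@ti.kernel
def fill(t: ti.f32):
    for i, j in x:
        x[i, j] = t + 0.001 * (i + j)

for i in range(100):
    fill(float(i))
    arr = x.to_numpy()   # device -> host copy (forces a CUDA sync)
    x.from_numpy(arr)    # host -> device copy
    print(i)             # crash reportedly happens somewhere around i = 20..40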