taichi: [CUDA] detected as supported but crashes on a card without unified memory

Describe the bug: CUDA is detected as SUPPORTED on a machine without CUDA. This is because is_cuda_api_avaliable returned true even though I don't have CUDA.
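From the trace below, the availability check apparently passes as soon as the CUDA driver library can be loaded, regardless of whether a usable device is present. As a hedged illustration (the helper name is hypothetical, not Taichi's actual API), a stricter probe against the raw driver API would also initialize the driver and count devices:

import ctypes

def cuda_device_usable() -> bool:
    # Hypothetical stricter probe: loading libcuda.so only proves a driver
    # is installed, so also initialize it and count the visible devices.
    try:
        libcuda = ctypes.CDLL("libcuda.so")
    except OSError:
        return False
    if libcuda.cuInit(0) != 0:  # CUDA_SUCCESS == 0
        return False
    count = ctypes.c_int(0)
    if libcuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return False
    return count.value > 0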

Log/Screenshots

(yuanming-hu/glfw) [bate@archit taichi]$ python examples/mpm128.py  
[Taichi] mode=development
[Taichi] preparing sandbox at /tmp/taichi-7haz507t
[Taichi] sandbox prepared
[Taichi] <dev mode>, supported archs: [cpu, cuda, opengl], commit 4e2e5605, python 3.8.2
[Hint] Use WSAD/arrow keys to control gravity. Use left/right mouse bottons to attract/repel. Press R to reset.
[W 04/13/20 09:29:21.266] [cuda_driver.h:call_with_warning@60] CUDA Error CUDA_ERROR_INVALID_DEVICE: invalid device ordinal while calling mem_advise (cuMemAdvise)
[E 04/13/20 09:29:21.860] Received signal 7 (Bus error)


***********************************
* Taichi Compiler Stack Traceback *                                                          
***********************************                                                          
/tmp/taichi-7haz507t/taichi_core.so: taichi::Logger::error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)                                  
/tmp/taichi-7haz507t/taichi_core.so: taichi::signal_handler(int)                             
/usr/lib/libc.so.6(+0x3bd70) [0x7f359062bd70]                                                
/tmp/taichi-7haz507t/taichi_core.so: taichi::lang::MemoryPool::daemon()
/usr/lib/libstdc++.so.6(+0xcfb24) [0x7f357ff41b24]
/usr/lib/libpthread.so.0(+0x946f) [0x7f359021746f]
/usr/lib/libc.so.6: clone
GNU gdb (GDB) 9.1
Attaching to process 8383
[New LWP 8388]
[New LWP 8389]
[New LWP 8390]
[New LWP 8391]
[New LWP 8396]
[New LWP 8397]
[New LWP 8398]
[New LWP 8399]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f3580fc910c in llvm::Twine::toVector(llvm::SmallVectorImpl<char>&) const ()
   from /tmp/taichi-7haz507t/taichi_core.so
(gdb) 

To Reproduce: just run examples/mpm128.py.


About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 15 (3 by maintainers)

Most upvoted comments

Yes, it did.

(gdbtrig) [bate@archit taichi]$ TI_USE_UNIFIED_MEMORY=0 p examples/fractal.py 
[Taichi] mode=development
[Taichi] preparing sandbox at /tmp/taichi-mxhjexut
[Taichi] sandbox prepared
[I 04/13/20 09:42:03.300] [cuda_driver.cpp:CUDADriver@30] CUDA DETECTED
[Taichi] <dev mode>, supported archs: [cpu, cuda, opengl], commit 4e2e5605, python 3.8.2
X connection to :0 broken (explicit kill or server shutdown).

However, with_cuda still returns true, as confirmed by my TI_INFO("CUDA_DETECTED") instrumentation.
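Since the device is visible to the driver but cuMemAdvise fails, one plausible guard (sketched below with ctypes; the attribute constant is copied from cuda.h, and the helper name is hypothetical) is to ask the driver whether the device actually supports managed memory before taking the unified-memory path at all:

import ctypes

CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY = 83  # from cuda.h

def supports_managed_memory(ordinal: int = 0) -> bool:
    # Hypothetical guard: query whether the device can allocate managed
    # (unified) memory before ever calling cuMemAdvise on it.
    libcuda = ctypes.CDLL("libcuda.so")
    assert libcuda.cuInit(0) == 0
    dev = ctypes.c_int(0)
    assert libcuda.cuDeviceGet(ctypes.byref(dev), ordinal) == 0
    managed = ctypes.c_int(0)
    assert libcuda.cuDeviceGetAttribute(
        ctypes.byref(managed), CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY, dev) == 0
    return managed.value == 1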

I guess solution 1 is probably easier. Or we can just ask people not to use too many threads when GPU memory is scarce. (Sorry about my delayed reply; the workday is starting on my end, so I have meetings in the morning…)

Solution 1: set device_memory_fraction = 1 / (threads + 1) in tests (a sketch follows below). Solution 2: spin in tests until enough GPU memory is available.
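For illustration, solution 1 could look like the following in a test script. device_memory_fraction is a real ti.init option in later Taichi releases; whether it exists under this exact name in the dev build discussed here is an assumption.

import taichi as ti

NUM_TEST_THREADS = 4  # hypothetical parallelism of the test runner

# Sketch of solution 1: give each concurrent test process only a slice
# of GPU memory so parallel tests cannot exhaust the device together.
ti.init(arch=ti.cuda,
        device_memory_fraction=1 / (NUM_TEST_THREADS + 1))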

Does setting envvar TI_USE_UNIFIED_MEMORY=0 fix your problem?