taichi: Starting multiple Taichi instances causes bus error under development mode

Describe the bug

Starting multiple Taichi instances simultaneously causes "Fatal Python error: Bus error". For an example, see https://github.com/taichi-dev/taichi/issues/481#issuecomment-586720382

To Reproduce

This almost always happens when running multithreaded tests with >= 4 threads. (The more threads, the higher the probability of a crash.)

Cause

In development mode, Taichi copies build/libtaichi_core.so to build/taichi_core.so. This ensures that writing to build/libtaichi_core.so (i.e. when you are compiling Taichi itself) does not crash any running Taichi instances, which depend on build/taichi_core.so.

However, when two Taichi instances start at the same time, they can race with each other. Specifically, instance A may be importing build/taichi_core.so while instance B is removing the current build/taichi_core.so and creating its own copy. The shared object being loaded by instance A is then deleted, which results in a bus error.

How to fix

Create a folder for each process, named using e.g. the process id + current time + a random number, so that each Taichi instance has its own sandbox for build/taichi_core.so. You'll have to modify the code here:

https://github.com/taichi-dev/taichi/blob/97dbf64f735598e64dc13690e5f237dedf20f091/python/taichi/core/util.py#L178

Actually, this has been a known issue for a long time, but I totally forgot about it: https://github.com/taichi-dev/taichi/blob/97dbf64f735598e64dc13690e5f237dedf20f091/python/taichi/core/util.py#L204
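
As a rough illustration of the per-process sandbox idea, here is a minimal sketch. It swaps the hand-built "process id + time + random number" name for tempfile.mkdtemp, which picks a unique directory name itself. The names prepare_sandbox and core_lib_path are hypothetical and are not taken from Taichi's util.py.

```python
import os
import shutil
import tempfile


def prepare_sandbox(core_lib_path, prefix="taichi-"):
    """Copy the core library into a fresh, uniquely named directory.

    Hypothetical helper for illustration only.
    """
    # mkdtemp creates a new directory with a unique name, so two
    # processes started at the same time never share a sandbox.
    sandbox_dir = tempfile.mkdtemp(prefix=prefix)
    target = os.path.join(sandbox_dir, "taichi_core.so")
    # Each instance imports its own private copy of the library.
    shutil.copy(core_lib_path, target)
    return sandbox_dir, target
```

With this layout no instance ever deletes a file that another instance is loading, at the cost of one copy per process.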

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 15 (5 by maintainers)

Most upvoted comments

To chime in, the Metal backend also doesn't generate any temporary files. The solution is similar to the OpenGL backend: Apple provides the newLibraryWithSource:options:completionHandler: API, see https://developer.apple.com/documentation/metal/mtldevice/1433351-newlibrarywithsource()

A new problem arising from #501: too many /tmp/taichi-* directories are filling up my /tmp!!! We want to use something like atexit.register(lambda: shutil.rmtree(tmp_dir))!
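
A minimal sketch of that cleanup, assuming the sandbox is a directory created with tempfile.mkdtemp (make_sandbox_with_cleanup is a hypothetical name). One detail worth noting: os.unlink only removes files, so removing a sandbox directory and its contents needs shutil.rmtree.

```python
import atexit
import os
import shutil
import tempfile


def make_sandbox_with_cleanup(prefix="taichi-"):
    """Create a temporary directory that is removed on normal interpreter exit.

    Hypothetical helper for illustration only.
    """
    tmp_dir = tempfile.mkdtemp(prefix=prefix)
    # shutil.rmtree deletes the directory and everything inside it.
    # ignore_errors=True keeps shutdown quiet if it is already gone.
    atexit.register(shutil.rmtree, tmp_dir, ignore_errors=True)
    return tmp_dir
```

atexit handlers do not run when the process is killed by a signal or hits a fatal interpreter error, so a crashed instance would still leave its directory behind.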

Thanks for your suggestion! Yes, all runtime-generated files should be put in the sandbox. But for the OpenGL backend this is not a problem: some backends invoke an external program to compile, while OpenGL uses an API called glShaderSource that takes const char *src directly as an argument, so no temp file is needed! I don't know whether other backends use temp files, though. Any idea, @yuanming-hu?

Thanks archibate. Actually, I think source-to-source backends are also still going to run into problems since they emit code to temp files. These should also be moved to the temp dir. It should be easy to pass the temp dir path over to CodeGenBase.

Yeah, I also find the bitcode compilation to be problematic when multiple instances start (under development mode only). We should consider moving this to the tmp dir as well. An easy solution is to pass in the tmpdir generated by Python via pybind11 (set_tmp_dir(std::string)) and save that value in CoreState. Then, when we compile the runtime bitcode, we just use that tmp_dir.

Maybe related. Running a Taichi application with mpirun fails most of the time since multiple ranks are trying to compile and load the same runtime file. Could we move this inside the sandbox as well?

running: mpirun -np 4 python3 laplace.py

leads to:

[taichi] prepared sandbox at /tmp/taichi-3odr7ws9
[taichi] prepared sandbox at /tmp/taichi-7c93liyx
[taichi] prepared sandbox at /tmp/taichi-bm011fe0
[taichi] prepared sandbox at /tmp/taichi-lcitw8x2
[Taichi version 0.5.2, cpu only, commit a8490052]
[Taichi version 0.5.2, cpu only, commit a8490052]
[Taichi version 0.5.2, cpu only, commit a8490052]
[Taichi version 0.5.2, cpu only, commit a8490052]
[W 02/22/20 08:46:11.554] [taichi_llvm_context.cpp:module_from_bitcode_file@170] Bitcode loading error message:
[E 02/22/20 08:46:11.554] [taichi_llvm_context.cpp:module_from_bitcode_file@172] Bitcode /home/klozes/Documents/software/taichi/taichi/runtime//runtime_x86_64.bc load failure.
[E 02/22/20 08:46:11.554] Received signal 6 (Aborted)
Invalid bitcode signature
***********************************
* Taichi Compiler Stack Traceback *
***********************************
/tmp/taichi-3odr7ws9/taichi_core.so: taichi::signal_handler(int)
/lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f367b02cf20]
/lib/x86_64-linux-gnu/libc.so.6: gsignal
/tmp/taichi-3odr7ws9/taichi_core.so: taichi::Tlang::module_from_bitcode_file(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::LLVMContext*)
/tmp/taichi-3odr7ws9/taichi_core.so: taichi::Tlang::TaichiLLVMContext::clone_runtime_module()
/tmp/taichi-3odr7ws9/taichi_core.so: taichi::Tlang::TaichiLLVMContext::get_init_module()
/tmp/taichi-3odr7ws9/taichi_core.so: taichi::Tlang::StructCompilerLLVM::StructCompilerLLVM(taichi::Tlang::Program*, taichi::Tlang::Arch)
/tmp/taichi-3odr7ws9/taichi_core.so: taichi::Tlang::StructCompiler::make(bool, taichi::Tlang::Program*, taichi::Tlang::Arch)
/tmp/taichi-3odr7ws9/taichi_core.so: taichi::Tlang::Program::materialize_layout()
/tmp/taichi-3odr7ws9/taichi_core.so: taichi::Tlang::layout(std::function<void ()> const&)
/tmp/taichi-3odr7ws9/taichi_core.so(+0x84db29) [0x7f3654fc9b29]
/tmp/taichi-3odr7ws9/taichi_core.so(+0x63e484) [0x7f3654dba484]
python3() [0x50abc5]
python3(_PyEval_EvalFrameDefault+0x449) [0x50c549]
python3() [0x5081d5]
python3() [0x50a020]
python3() [0x50aa1d]
python3(_PyEval_EvalFrameDefault+0x449) [0x50c549]
python3(_PyFunction_FastCallDict+0xf5) [0x5093e5]
python3() [0x5951c1]
python3(PyObject_Call+0x3e) [0x5a04ce]
python3() [0x557878]
python3() [0x541d40]
python3(_PyEval_EvalFrameDefault+0xed8) [0x50cfd8]
python3() [0x5081d5]
python3(PyEval_EvalCode+0x23) [0x50b3a3]
python3() [0x635082]
python3(PyRun_FileExFlags+0x97) [0x635137]
python3(PyRun_SimpleFileExFlags+0x17f) [0x6388ef]
python3(Py_Main+0x591) [0x639491]
python3(main+0xe0) [0x4b0f60]
/lib/x86_64-linux-gnu/libc.so.6: __libc_start_main
python3(_start+0x2a) [0x5b2eaa]

I think solution 1 is better. Even better would be to remove this feature, since nobody wants to build while tests are running. If they do, the change should go in the Makefiles instead of in Taichi. Also note that rewriting meaningless data again and again is not friendly to SSD users like me.

Using file timestamp is a great idea to avoid unnecessary copies! We will need some file locking for safety though.

Alternate solution 1: Instead of making another copy as taichi_core.so, we write-protect libtaichi_core.so.

Alternate solution 2: Instead of making a copy of taichi_core.so with identical contents for each instance, we make just one taichi_core.so and do not replace it until timestamp(libtaichi_core.so) > timestamp(taichi_core.so). This also helps Taichi start up more quickly.
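
Alternate solution 2, combined with the file locking mentioned above, could look roughly like the sketch below. It assumes a Unix system (fcntl is not available on Windows); refresh_copy_if_stale and the .lock / .tmp suffixes are hypothetical choices for illustration, not existing Taichi code.

```python
import fcntl
import os
import shutil


def refresh_copy_if_stale(src, dst):
    """Copy src to dst only when dst is missing or older than src.

    Returns True if a copy was made. Hypothetical helper, Unix-only.
    """
    lock_path = dst + ".lock"
    with open(lock_path, "w") as lock_file:
        # Exclusive advisory lock: one process at a time checks and copies.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            stale = (not os.path.exists(dst)
                     or os.path.getmtime(src) > os.path.getmtime(dst))
            if stale:
                # Copy under a temporary name, then rename over dst, so
                # dst is never seen half-written. copy2 preserves the
                # modification time, so dst is not stale afterwards.
                tmp = dst + ".tmp"
                shutil.copy2(src, tmp)
                os.replace(tmp, dst)
            return stale
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Because the new file is renamed into place rather than written over the old one, the existing file is never modified in place. The lock only protects processes that go through this helper.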