taichi: pip-installed Taichi crashes on Google Colab kernels

Opening an empty CPU-backed notebook at https://colab.research.google.com and running the following code leads to a crash:

!apt install clang-7
!apt install clang-format
!pip install taichi-nightly
import taichi as ti

x, y = ti.var(ti.f32), ti.var(ti.f32)

@ti.layout
def xy():
  ti.root.dense(ti.ij, 16).place(x, y)

@ti.kernel
def laplace():
  for i, j in x:
    if (i + j) % 3 == 0:
      y[i, j] = 4.0 * x[i, j] - x[i - 1, j] - x[i + 1, j] - x[i, j - 1] - x[i, j + 1]
    else:
      y[i, j] = 0.0

for i in range(10):
  x[i, i + 1] = 1.0

laplace()

for i in range(10):
  print(y[i, i + 1])

And the relevant runtime logs say:

Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so:
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so:
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::Kernel::operator()()
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::Kernel::compile()
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::Program::compile(taichi::Tlang::Kernel&)
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::KernelCodeGen::compile(taichi::Tlang::Program&, taichi::Tlang::Kernel&)
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::CPUCodeGen::lower_cpp()
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::irpass::lower(taichi::Tlang::IRNode*)
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::LowerAST::visit(taichi::Tlang::Block*)
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so:
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so:
Oct 30, 2019, 3:47:15 PM | WARNING | /lib/x86_64-linux-gnu/libc.so.6: abort
Oct 30, 2019, 3:47:15 PM | WARNING | /lib/x86_64-linux-gnu/libc.so.6: gsignal
Oct 30, 2019, 3:47:15 PM | WARNING | /lib/x86_64-linux-gnu/libc.so.6:
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::signal_handler(int)
Oct 30, 2019, 3:47:15 PM | WARNING | ***************************
Oct 30, 2019, 3:47:15 PM | WARNING | * Taichi Core Stack Trace *
Oct 30, 2019, 3:47:15 PM | WARNING | ***************************
Oct 30, 2019, 3:47:15 PM | WARNING | [E 10/30/19 14:47:15.371] Received signal 6 (Aborted)
Oct 30, 2019, 3:47:15 PM | WARNING | [I 10/30/19 14:47:15.340] [base.cpp:generate_binary@125] Compilation time: 2889.9 ms
Oct 30, 2019, 3:47:12 PM | WARNING | [T 10/30/19 14:47:12.056] [logging.cpp:Logger@67] Taichi core started. Thread ID = 122

Can you please provide some insight into the possible root cause of the problem, if you have any off the top of your head?

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 69 (65 by maintainers)

Most upvoted comments

FINALLY!!! I identified the problem! Colab kernels have the libtcmalloc library installed and the environment variable LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 set. Somehow this causes libstdc++ to use libunwind instead of libgcc_s for stack unwinding on exceptions, and for some reason that leads to an abort while unwinding complex calls.

Running LD_PRELOAD= python t.py, where t.py is some Taichi program, works, even on GPU kernels. I’m looking for a way to make this work inside Colab cells as well.
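
For illustration, a minimal sketch of that workaround from inside a notebook cell (the file name t.py and the use of subprocess are just illustrative; the key point is that the child process starts with LD_PRELOAD cleared, so libtcmalloc is never preloaded into it):

import os
import subprocess

# Copy the kernel's environment but clear LD_PRELOAD, so that
# libtcmalloc is not preloaded into the child process.
env = dict(os.environ)
env["LD_PRELOAD"] = ""

# Run the Taichi program in a fresh Python process - the equivalent of
# `LD_PRELOAD= python t.py` in a shell.
output = subprocess.check_output(["python", "t.py"], env=env,
                                 stderr=subprocess.STDOUT)
print(output.decode())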

@ppwwyyxx Thanks for pointing this out. I agree that closing stale issues using bots is not a good idea, and I will prevent further misuse like this.

@znah After some searching, it turns out that we are now blocked at https://github.com/taichi-dev/taichi/issues/1059 - if we can remove all C++ exceptions (which I believe is necessary), the system will no longer involve libunwind and we can run Taichi on Colab. It may take some time for people (@sjwsl and @lin-hitonami) to fully remove throw IRModified etc. - if you’d like to help, that would be awesome!

@strongoier I stand corrected - it appears the ‘minimal’ Taichi code I was using was incorrect (though the lack of error messages makes things a bit hard to decipher). Apologies for pinging you all; it seems to work well now 😃 Excited to try Taichi out

FYI and off-topic: this opinion from the PyTorch author, https://twitter.com/soumithchintala/status/1451213207750721538, may lead the maintainers to reconsider whether it’s a good idea to “auto-close stale issues”. I personally agree with his opinion. What’s more valid (and also used in projects I maintained) is to auto-close invalid issues (e.g. those missing necessary information).

Sorry about that. The bitcode loading issue should be fixed in v0.5.6. The buildbots are currently working on compiling/releasing the new version.

I’d like to reopen this issue. The problem is still there, and I think supporting the Colab environment would greatly increase Taichi user adoption.

Interesting observation from the Colab team: Taichi works when using tcmalloc_minimal instead of tcmalloc. Relevant bits of documentation:

To use TCMalloc, just link TCMalloc into your application via the "-ltcmalloc" linker flag.

You can use TCMalloc in applications you didn't compile yourself, by using LD_PRELOAD:

   $ LD_PRELOAD="/usr/lib/libtcmalloc.so" <binary>
LD_PRELOAD is tricky, and we don't necessarily recommend this mode of usage.

TCMalloc includes a heap checker and heap profiler as well.

If you'd rather link in a version of TCMalloc that does not include the heap profiler and checker (perhaps to reduce binary size for a static binary), you can link in libtcmalloc_minimal instead.

also this

NOTE: When compiling with programs with gcc, that you plan to link
with libtcmalloc, it's safest to pass in the flags

 -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free

when compiling.  gcc makes some optimizations assuming it is using its
own, built-in malloc; that assumption obviously isn't true with
tcmalloc.  In practice, we haven't seen any problems with this, but
the expected risk is highest for users who register their own malloc
hooks with tcmalloc (using gperftools/malloc_hook.h).  The risk is
lowest for folks who use tcmalloc_minimal (or, of course, who pass in
the above flags :-) ).

I’m continuing the investigation.
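
As a side note, a quick diagnostic sketch (not part of any fix) to check from a Colab cell whether libtcmalloc is actually mapped into the running kernel process:

import os

# The preload variable the Colab kernel is started with.
print("LD_PRELOAD =", os.environ.get("LD_PRELOAD", "<unset>"))

# /proc/self/maps lists every shared object mapped into this process,
# so a preloaded tcmalloc variant will show up here.
with open("/proc/self/maps") as maps:
    tcmalloc_paths = sorted({line.split()[-1] for line in maps if "tcmalloc" in line})

print("tcmalloc mappings:", tcmalloc_paths or "none")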

The real way to rectify this issue is to fix a bug somewhere in either clang, (nongnu) libunwind, or tcmalloc. I don’t feel capable of doing this myself. I’ll discuss potential solutions with the Colab team.

It’s even trickier. I suspect some ABI incompatibility between clang and libunwind that manifests itself only when unwinding complex virtual calls, so probably only a few programs are affected.