tensorflow: Error in `python': double free or corruption (!prev)

I’m consistently getting this error when stopping training (CTRL+C) on a version built from head on Jan 17. On the other hand, a version built from the Jan 5 head does not exhibit this behavior.

tf.__git_version__ = '0.12.1-1934-g27fca7d-dirty'

session.run completed in 0.01 sec with .0.500000 acc
session.run completed in 0.02 sec with .0.000000 acc
^CTraceback (most recent call last):
  File "train.py", line 247, in <module>
    a,_ = sess.run([train_acc,optimizer], feed_dict)
  File "/home/yaroslav/.conda/envs/tim-jan17/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/home/yaroslav/.conda/envs/tim-jan17/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/home/yaroslav/.conda/envs/tim-jan17/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/home/yaroslav/.conda/envs/tim-jan17/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call
    return fn(*args)
  File "/home/yaroslav/.conda/envs/tim-jan17/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn
    status, run_metadata)
KeyboardInterrupt
*** Error in `python': double free or corruption (!prev): 0x00000000016c55d0 ***
Aborted (core dumped)

Looking at the core dump, it appears to be a dictionary deallocation during interpreter shutdown:

#0  0x00007fe9cbf8a01f in _int_free (av=0x7fe9cc2c9760 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:3996
#1  0x00007fe9cceb500a in dict_dealloc (mp=0x7fe9558073c8) at Objects/dictobject.c:1596
#2  0x00007fe9cced121f in subtype_dealloc (self=0x7fe95580a080) at Objects/typeobject.c:1193
#3  0x00007fe9cceb023f in free_keys_object (keys=0x24f9620) at Objects/dictobject.c:354
#4  0x00007fe9cced3936 in type_clear (type=0x24f9c68) at Objects/typeobject.c:3270
#5  0x00007fe9ccf8a97c in delete_garbage (old=<optimized out>, collectable=<optimized out>) at Modules/gcmodule.c:866
#6  collect (generation=2, n_collected=0x0, n_uncollectable=0x0, nofail=1) at Modules/gcmodule.c:1014
#7  0x00007fe9ccf8aedd in _PyGC_CollectNoFail () at Modules/gcmodule.c:1605
#8  0x00007fe9ccf5e6d5 in PyImport_Cleanup () at Python/import.c:428
#9  0x00007fe9ccf6a90e in Py_Finalize () at Python/pylifecycle.c:576
#10 0x00007fe9ccf891b9 in Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:789
#11 0x0000000000400add in main (argc=2, argv=0x7ffde1cf3f98) at ./Programs/python.c:65

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 49 (25 by maintainers)

Most upvoted comments

Can confirm this. I ran into this on my CI server, and the following fixed it:

sudo apt-get install libtcmalloc-minimal4
export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"

To narrow down the issue: someone internally noticed that this crash only happens when numpy is installed with OpenBLAS support on Ubuntu 14.04. I haven’t tested whether upgrading libopenblas fixes it.

So, if you’re on Ubuntu, the workaround is to make sure you don’t have libopenblas-dev installed and pip install --no-binary=:all: numpy

If someone encounters this bug and has time to test out newer versions of libopenblas-dev, that’d be useful.
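
Putting those two steps together, a minimal sketch of the workaround (assuming an apt-based Ubuntu box with a pip-managed numpy; adjust for virtualenvs as needed):

# Remove the system OpenBLAS so numpy can't link against it,
# then rebuild numpy from source instead of using a prebuilt binary.
sudo apt-get remove libopenblas-dev
pip install --no-binary=:all: --force-reinstall numpy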

I used pip install --no-binary=:all: --force-reinstall numpy, which solved the problem.

Can confirm this. I ran into this on my CI server, and the following fixed it:

sudo apt-get install libtcmalloc-minimal4
export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"

I am on an Ubuntu 16.04 machine. After following the above steps, it still gave the same error.
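
If the preload appears to have no effect, it is worth confirming that the library is actually being loaded; on some Ubuntu releases it may live under /usr/lib/x86_64-linux-gnu/ rather than /usr/lib/, in which case the export above is a silent no-op. A quick check (the training script name is a placeholder):

# Find where the package actually installed the library
dpkg -L libtcmalloc-minimal4 | grep libtcmalloc_minimal.so.4
# Launch with the preload, then confirm tcmalloc is mapped into the process
LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 python train.py &
grep tcmalloc /proc/$!/maps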

I used jemalloc instead of tcmalloc, and that also solved the problem on CentOS 6. Just install jemalloc and run with LD_PRELOAD=/usr/local/lib/libjemalloc.so (see this post: https://zapier.com/engineering/celery-python-jemalloc/)
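
For reference, the jemalloc variant of the workaround looks roughly like this (assuming jemalloc was installed under /usr/local, e.g. built from source as in the linked post; train.py is a placeholder for your script):

# Preload jemalloc in place of the glibc allocator for the Python process
export LD_PRELOAD=/usr/local/lib/libjemalloc.so
python train.py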

A quick glance at the pytorch code: it’s exporting some libstdc++ symbols with RTLD_GLOBAL when it shouldn’t be. The bug is likely in pytorch.

Yeah, Docker caches image versions and we had 14.04 cached. @caisq is planning to upgrade those.

We still don’t think there’s a bug in TensorFlow here. Any Python module that messes with memory allocation can cause this, so perhaps try importing those modules last?
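
A quick way to test whether import order matters is to compare the two orders in a throwaway interpreter; torch is used here only as an example of a module that, per the comment above, loads its own libstdc++ symbols, and the crash (if any) shows up at interpreter exit:

# Does the shutdown crash depend on which library is imported first?
python -c "import tensorflow as tf; import torch"   # TensorFlow first
python -c "import torch; import tensorflow as tf"   # torch first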

I got this same issue trying to run the real data benchmark with Horovod on TF 1.4.0rc1, using the Open MPI OpenIB transport (which installs memory hooks). The TCP transport is unaffected.

Generating model
*** Error in `python': double free or corruption (!prev): 0x0000000001f721f0 ***
[opusgpu39-wbu2:32804] *** Process received signal ***
[opusgpu39-wbu2:32804] Signal: Aborted (6)
[opusgpu39-wbu2:32804] Signal code:  (-6)
[opusgpu39-wbu2:32804] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7f0c06bba890]
[opusgpu39-wbu2:32804] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f0c05f12067]
[opusgpu39-wbu2:32804] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f0c05f13448]
[opusgpu39-wbu2:32804] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x731b4)[0x7f0c05f501b4]
[opusgpu39-wbu2:32804] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x7898e)[0x7f0c05f5598e]
[opusgpu39-wbu2:32804] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x79696)[0x7f0c05f56696]
[opusgpu39-wbu2:32804] [ 6] /lib64/ld-linux-x86-64.so.2(_dl_deallocate_tls+0x58)[0x7f0c06dd9958]
[opusgpu39-wbu2:32804] [ 7] /lib/x86_64-linux-gnu/libpthread.so.0(+0x7107)[0x7f0c06bb2107]
[opusgpu39-wbu2:32804] [ 8] /lib/x86_64-linux-gnu/libpthread.so.0(+0x721f)[0x7f0c06bb221f]
[opusgpu39-wbu2:32804] [ 9] /lib/x86_64-linux-gnu/libpthread.so.0(pthread_join+0xe4)[0x7f0c06bb44d4]
[opusgpu39-wbu2:32804] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSt6thread4joinEv+0x27)[0x7f0b69baa837]
[opusgpu39-wbu2:32804] [11] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so(+0x5131d0)[0x7f0b708b61d0]
[opusgpu39-wbu2:32804] [12] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN10tensorflow6thread10ThreadPool4ImplD0Ev+0xbb)[0x7f0b7088d43b]
[opusgpu39-wbu2:32804] [13] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN10tensorflow6thread10ThreadPoolD1Ev+0x1a)[0x7f0b7088d73a]
[opusgpu39-wbu2:32804] [14] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN10tensorflow10FileSystem16GetMatchingPathsERKSsPSt6vectorISsSaISsEE+0x56c)[0x7f0b708b2aec]
[opusgpu39-wbu2:32804] [15] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN10tensorflow3Env16GetMatchingPathsERKSsPSt6vectorISsSaISsEE+0xa3)[0x7f0b708aca43]
[opusgpu39-wbu2:32804] [16] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_Z16GetMatchingFilesRKSsP9TF_Status+0x4b)[0x7f0b7227090b]
[opusgpu39-wbu2:32804] [17] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x1080604)[0x7f0b72273604]
[opusgpu39-wbu2:32804] [18] python(PyEval_EvalFrameEx+0x614)[0x4cddf4]
[opusgpu39-wbu2:32804] [19] python(PyEval_EvalCodeEx+0x401)[0x4cc4f1]
[opusgpu39-wbu2:32804] [20] python(PyEval_EvalFrameEx+0x6500)[0x4d3ce0]
[opusgpu39-wbu2:32804] [21] python(PyEval_EvalCodeEx+0x401)[0x4cc4f1]
[opusgpu39-wbu2:32804] [22] python(PyEval_EvalFrameEx+0x5e0a)[0x4d35ea]
[opusgpu39-wbu2:32804] [23] python(PyEval_EvalCodeEx+0x401)[0x4cc4f1]
[opusgpu39-wbu2:32804] [24] python(PyEval_EvalFrameEx+0x5e0a)[0x4d35ea]
[opusgpu39-wbu2:32804] [25] python(PyEval_EvalCodeEx+0x401)[0x4cc4f1]
[opusgpu39-wbu2:32804] [26] python(PyEval_EvalFrameEx+0x6500)[0x4d3ce0]
[opusgpu39-wbu2:32804] [27] python(PyEval_EvalCodeEx+0x401)[0x4cc4f1]
[opusgpu39-wbu2:32804] [28] python(PyEval_EvalFrameEx+0x6500)[0x4d3ce0]
[opusgpu39-wbu2:32804] [29] python(PyEval_EvalCodeEx+0x401)[0x4cc4f1]
[opusgpu39-wbu2:32804] *** End of error message ***

I was able to make it work by adding -x LD_PRELOAD=/usr/local/lib/libtcmalloc.so.4.4.5 to the mpirun command line.
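
For anyone else hitting this with Horovod, the launch looks roughly like this (the process count and script name are placeholders; -x tells Open MPI's mpirun to export the variable to every launched rank):

# Forward the tcmalloc preload to all Horovod ranks
mpirun -np 4 \
    -x LD_PRELOAD=/usr/local/lib/libtcmalloc.so.4.4.5 \
    python train.py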

It seems other folks are still hitting this issue in other use cases, too. Any ideas or plans for a fix?

For people who run into this type of issue on Arch Linux:

*** glibc detected *** python: double free or corruption (!prev): 0x00000000013fcda0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3e22e75e66]
/lib64/libc.so.6[0x3e22e789b3]
/lib64/ld-linux-x86-64.so.2(_dl_deallocate_tls+0x67)[0x3e226112f7]
/lib64/libpthread.so.0[0x3e2320675d]
/lib64/libpthread.so.0[0x3e232078ea]
/lib64/libpthread.so.0(pthread_join+0xd4)[0x3e232081f4]
… and so on

Install the gperftools package

sudo pacman -S gperftools

It will most likely solve the issue.
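
If installing the package alone doesn't do it, preloading tcmalloc explicitly should; the path below assumes the default install location of Arch's gperftools package, and train.py is a placeholder for your script:

export LD_PRELOAD=/usr/lib/libtcmalloc.so
python train.py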

I checked and it started working after removing pytorch, so pytorch was the likely culprit.

@brando90: Nightly TF Docker images are pushed to Docker Hub. Example command lines to use them:

docker run -it --rm tensorflow/tensorflow:nightly /bin/bash
nvidia-docker run -it --rm tensorflow/tensorflow:nightly-gpu /bin/bash
nvidia-docker run -it --rm tensorflow/tensorflow:nightly-gpu-py3 /bin/bash

Hi, I got a similar Error in `python': double free or corruption (!prev) error with TF 0.12, built from source today, on Ubuntu 14.04.

This was solved with @dennybritz’s fix above. Thank you very much, @dennybritz!

Will try upgrading to 16.04 as @jhseu recommends

I suspect this error is connected to jemalloc, since that was added recently. Turning on tcmalloc through export LD_PRELOAD="/usr/lib/libtcmalloc.so.4" gets rid of the error.