tensorflow: Error in `python': double free or corruption (!prev)

I’m consistently getting this error when stopping training (CTRL+C) on a version built from head on Jan 17. On the other hand, a version built from the Jan 5 head does not exhibit this behavior.

tf.__git_version__ = '0.12.1-1934-g27fca7d-dirty'

session.run completed in 0.01 sec with .0.500000 acc
session.run completed in 0.02 sec with .0.000000 acc
^CTraceback (most recent call last):
  File "train.py", line 247, in <module>
    a,_ = sess.run([train_acc,optimizer], feed_dict)
  File "/home/yaroslav/.conda/envs/tim-jan17/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/home/yaroslav/.conda/envs/tim-jan17/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/home/yaroslav/.conda/envs/tim-jan17/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/home/yaroslav/.conda/envs/tim-jan17/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call
    return fn(*args)
  File "/home/yaroslav/.conda/envs/tim-jan17/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn
    status, run_metadata)
KeyboardInterrupt
*** Error in `python': double free or corruption (!prev): 0x00000000016c55d0 ***
Aborted (core dumped)

Looking at the core dump, it appears to be a dictionary deallocation during interpreter shutdown:

#0  0x00007fe9cbf8a01f in _int_free (av=0x7fe9cc2c9760 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:3996
#1  0x00007fe9cceb500a in dict_dealloc (mp=0x7fe9558073c8) at Objects/dictobject.c:1596
#2  0x00007fe9cced121f in subtype_dealloc (self=0x7fe95580a080) at Objects/typeobject.c:1193
#3  0x00007fe9cceb023f in free_keys_object (keys=0x24f9620) at Objects/dictobject.c:354
#4  0x00007fe9cced3936 in type_clear (type=0x24f9c68) at Objects/typeobject.c:3270
#5  0x00007fe9ccf8a97c in delete_garbage (old=<optimized out>, collectable=<optimized out>) at Modules/gcmodule.c:866
#6  collect (generation=2, n_collected=0x0, n_uncollectable=0x0, nofail=1) at Modules/gcmodule.c:1014
#7  0x00007fe9ccf8aedd in _PyGC_CollectNoFail () at Modules/gcmodule.c:1605
#8  0x00007fe9ccf5e6d5 in PyImport_Cleanup () at Python/import.c:428
#9  0x00007fe9ccf6a90e in Py_Finalize () at Python/pylifecycle.c:576
#10 0x00007fe9ccf891b9 in Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:789
#11 0x0000000000400add in main (argc=2, argv=0x7ffde1cf3f98) at ./Programs/python.c:65

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 49 (25 by maintainers)

Most upvoted comments

Can confirm this. I ran into this on my CI server, and the following fixed it:

sudo apt-get install libtcmalloc-minimal4
export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"

To narrow down the issue: someone internally noticed that this crash only happens when numpy is installed with OpenBLAS support on Ubuntu 14.04. I haven’t tested whether upgrading libopenblas fixes it.

So, if you’re on Ubuntu, the workaround is to make sure you don’t have libopenblas-dev installed and pip install --no-binary=:all: numpy

If someone encounters this bug and has time to test out newer versions of libopenblas-dev, that’d be useful.
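
Putting those two steps together, a minimal sketch of the workaround (assuming an apt-based Ubuntu box with a pip-managed numpy; adjust for virtualenvs as needed):

# Remove the system OpenBLAS so numpy can't link against it,
# then rebuild numpy from source instead of using a prebuilt binary.
sudo apt-get remove libopenblas-dev
pip install --no-binary=:all: --force-reinstall numpy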

I used pip install --no-binary=:all: --force-reinstall numpy, which solved the problem.

Can confirm this. I ran into this on my CI server, and the following fixed it:

sudo apt-get install libtcmalloc-minimal4
export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"

I am on an Ubuntu 16.04 machine. After following the above steps, it still gave the same error.
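
If the preload appears to have no effect, it is worth confirming that the library is actually being loaded; on some Ubuntu releases it may live under /usr/lib/x86_64-linux-gnu/ rather than /usr/lib/, in which case the export above is a silent no-op. A quick check (the training script name is a placeholder):

# Find where the package actually installed the library
dpkg -L libtcmalloc-minimal4 | grep libtcmalloc_minimal.so.4
# Launch with the preload, then confirm tcmalloc is mapped into the process
LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 python train.py &
grep tcmalloc /proc/$!/maps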

I used jemalloc instead of tcmalloc, and that also solved the problem on CentOS 6. Just install jemalloc and run with LD_PRELOAD=/usr/local/lib/libjemalloc.so (see this post: https://zapier.com/engineering/celery-python-jemalloc/)
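
For reference, the jemalloc variant of the workaround looks roughly like this (assuming jemalloc was installed under /usr/local, e.g. built from source as in the linked post; train.py is a placeholder for your script):

# Preload jemalloc in place of the glibc allocator for the Python process
export LD_PRELOAD=/usr/local/lib/libjemalloc.so
python train.py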

A quick glance at the pytorch code: it’s exporting some libstdc++ symbols with RTLD_GLOBAL when it shouldn’t be. The bug is likely in pytorch.

Yeah, Docker caches image versions and we had 14.04 cached. @caisq is planning to upgrade those.

We still don’t think there’s a bug in TensorFlow here. Any Python module that messes with memory allocation can cause this, so perhaps try importing those modules last?
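
A quick way to test whether import order matters is to compare the two orders in a throwaway interpreter; torch is used here only as an example of a module that, per the comment above, loads its own libstdc++ symbols, and the crash (if any) shows up at interpreter exit:

# Does the shutdown crash depend on which library is imported first?
python -c "import tensorflow as tf; import torch"   # TensorFlow first
python -c "import torch; import tensorflow as tf"   # torch first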

I got this same issue trying to run the real data benchmark with Horovod on TF 1.4.0rc1, using the Open MPI OpenIB transport (which installs memory hooks). The TCP transport is unaffected.

Generating model
*** Error in `python': double free or corruption (!prev): 0x0000000001f721f0 ***
[opusgpu39-wbu2:32804] *** Process received signal ***
[opusgpu39-wbu2:32804] Signal: Aborted (6)
[opusgpu39-wbu2:32804] Signal code:  (-6)
[opusgpu39-wbu2:32804] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7f0c06bba890]
[opusgpu39-wbu2:32804] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f0c05f12067]
[opusgpu39-wbu2:32804] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f0c05f13448]
[opusgpu39-wbu2:32804] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x731b4)[0x7f0c05f501b4]
[opusgpu39-wbu2:32804] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x7898e)[0x7f0c05f5598e]
[opusgpu39-wbu2:32804] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x79696)[0x7f0c05f56696]
[opusgpu39-wbu2:32804] [ 6] /lib64/ld-linux-x86-64.so.2(_dl_deallocate_tls+0x58)[0x7f0c06dd9958]
[opusgpu39-wbu2:32804] [ 7] /lib/x86_64-linux-gnu/libpthread.so.0(+0x7107)[0x7f0c06bb2107]
[opusgpu39-wbu2:32804] [ 8] /lib/x86_64-linux-gnu/libpthread.so.0(+0x721f)[0x7f0c06bb221f]
[opusgpu39-wbu2:32804] [ 9] /lib/x86_64-linux-gnu/libpthread.so.0(pthread_join+0xe4)[0x7f0c06bb44d4]
[opusgpu39-wbu2:32804] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSt6thread4joinEv+0x27)[0x7f0b69baa837]
[opusgpu39-wbu2:32804] [11] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so(+0x5131d0)[0x7f0b708b61d0]
[opusgpu39-wbu2:32804] [12] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN10tensorflow6thread10ThreadPool4ImplD0Ev+0xbb)[0x7f0b7088d43b]
[opusgpu39-wbu2:32804] [13] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN10tensorflow6thread10ThreadPoolD1Ev+0x1a)[0x7f0b7088d73a]
[opusgpu39-wbu2:32804] [14] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN10tensorflow10FileSystem16GetMatchingPathsERKSsPSt6vectorISsSaISsEE+0x56c)[0x7f0b708b2aec]
[opusgpu39-wbu2:32804] [15] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN10tensorflow3Env16GetMatchingPathsERKSsPSt6vectorISsSaISsEE+0xa3)[0x7f0b708aca43]
[opusgpu39-wbu2:32804] [16] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_Z16GetMatchingFilesRKSsP9TF_Status+0x4b)[0x7f0b7227090b]
[opusgpu39-wbu2:32804] [17] /home/asergeev/mpi/venv-nccl/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x1080604)[0x7f0b72273604]
[opusgpu39-wbu2:32804] [18] python(PyEval_EvalFrameEx+0x614)[0x4cddf4]
[opusgpu39-wbu2:32804] [19] python(PyEval_EvalCodeEx+0x401)[0x4cc4f1]
[opusgpu39-wbu2:32804] [20] python(PyEval_EvalFrameEx+0x6500)[0x4d3ce0]
[opusgpu39-wbu2:32804] [21] python(PyEval_EvalCodeEx+0x401)[0x4cc4f1]
[opusgpu39-wbu2:32804] [22] python(PyEval_EvalFrameEx+0x5e0a)[0x4d35ea]
[opusgpu39-wbu2:32804] [23] python(PyEval_EvalCodeEx+0x401)[0x4cc4f1]
[opusgpu39-wbu2:32804] [24] python(PyEval_EvalFrameEx+0x5e0a)[0x4d35ea]
[opusgpu39-wbu2:32804] [25] python(PyEval_EvalCodeEx+0x401)[0x4cc4f1]
[opusgpu39-wbu2:32804] [26] python(PyEval_EvalFrameEx+0x6500)[0x4d3ce0]
[opusgpu39-wbu2:32804] [27] python(PyEval_EvalCodeEx+0x401)[0x4cc4f1]
[opusgpu39-wbu2:32804] [28] python(PyEval_EvalFrameEx+0x6500)[0x4d3ce0]
[opusgpu39-wbu2:32804] [29] python(PyEval_EvalCodeEx+0x401)[0x4cc4f1]
[opusgpu39-wbu2:32804] *** End of error message ***

I was able to make it work by adding -x LD_PRELOAD=/usr/local/lib/libtcmalloc.so.4.4.5 to the mpirun command line.
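
For anyone else hitting this with Horovod, the launch looks roughly like this (the process count and script name are placeholders; -x tells Open MPI's mpirun to export the variable to every launched rank):

# Forward the tcmalloc preload to all Horovod ranks
mpirun -np 4 \
    -x LD_PRELOAD=/usr/local/lib/libtcmalloc.so.4.4.5 \
    python train.py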

It seems other folks are still hitting this issue in other use cases, too. Any ideas or plans for a fix?

For people who run into this type of issue on Arch Linux:

*** glibc detected *** python: double free or corruption (!prev): 0x00000000013fcda0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3e22e75e66]
/lib64/libc.so.6[0x3e22e789b3]
/lib64/ld-linux-x86-64.so.2(_dl_deallocate_tls+0x67)[0x3e226112f7]
/lib64/libpthread.so.0[0x3e2320675d]
/lib64/libpthread.so.0[0x3e232078ea]
/lib64/libpthread.so.0(pthread_join+0xd4)[0x3e232081f4]
… and so on

Install the gperftools package

sudo pacman -S gperftools

It will most likely solve the issue.
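
If installing the package alone doesn't do it, preloading tcmalloc explicitly should; the path below assumes the default install location of Arch's gperftools package, and train.py is a placeholder for your script:

export LD_PRELOAD=/usr/lib/libtcmalloc.so
python train.py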

I checked and it started working after removing pytorch, so pytorch was the likely culprit.

@brando90: Nightly TF Docker images are pushed to Docker Hub. Example command lines to use them:

docker run -it --rm tensorflow/tensorflow:nightly /bin/bash
nvidia-docker run -it --rm tensorflow/tensorflow:nightly-gpu /bin/bash
nvidia-docker run -it --rm tensorflow/tensorflow:nightly-gpu-py3 /bin/bash

Hi, I got a similar Error in `python': double free or corruption (!prev) error with TF 0.12, built from source today, on Ubuntu 14.04.

This was solved with @dennybritz’s fix above. Thank you very much, @dennybritz!

Will try upgrading to 16.04 as @jhseu recommends

I suspect this error is connected to jemalloc, since that was added recently. Turning on tcmalloc through export LD_PRELOAD="/usr/lib/libtcmalloc.so.4" gets rid of the error.