audio: free(): invalid pointer | data loader + torchaudio + SoxEffectsChain
I’m seeing various memory errors, such as “free(): invalid pointer” and “double free or corruption (!prev)”, printed seemingly from DataLoader and crashing training. I’m using multithreaded data loading with a torchaudio sox pipeline on an AWS p3.8xlarge machine.
If I run the training script under gdb, the crash produces “no stack”. However, by setting MALLOC_CHECK_=3, I managed to get a core dump with the following stack:
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007f4100f3c801 in __GI_abort () at abort.c:79
#2 0x00007f4100f85897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f41010b2b9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007f4100f8c90a in malloc_printerr (str=str@entry=0x7f41010b0d88 "free(): invalid pointer") at malloc.c:5350
#4 0x00007f4100f8e84c in free_check (mem=<optimized out>, caller=<optimized out>) at hooks.c:274
#5 0x00007f4100f93c27 in __GI___libc_free (mem=0x560e09312b80) at malloc.c:3094
#6 0x00007f40c11c7fa4 in c10::TensorImpl::release_resources() [clone .localalias.182] () from /miniconda/lib/python3.7/site-packages/torch/lib/libc10.so
#7 0x00007f40f22c8014 in c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_() () from /miniconda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#8 0x00007f40f250e42b in THPVariable_clear(THPVariable*) () from /miniconda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#9 0x00007f40f250e461 in THPVariable_dealloc(THPVariable*) () from /miniconda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#10 0x0000560dafb0198f in subtype_dealloc () at /tmp/build/80754af9/python_1553721932202/work/Objects/typeobject.c:1256
#11 0x0000560dafb28dc7 in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:1098
#12 0x0000560dafa6a4f9 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3930
...
The stack trace points at PyTorch structures, so torchaudio/SoxEffectsChain may not be the root cause, which is why I’m reporting it here as well.
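For anyone trying to reproduce the core dump, here is a rough sketch of how the MALLOC_CHECK_=3 run could be launched from Python. The helper names (build_debug_env, allow_core_dumps, run_with_malloc_check) and the script name train.py are placeholders of mine, not part of PyTorch or torchaudio, and the sketch assumes a Linux/glibc machine where the hard core-file limit is not zero.

```python
import os
import resource  # POSIX only
import subprocess
import sys


def build_debug_env(base_env=None, check_level=3):
    """Return a copy of the environment with glibc heap checking enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # glibc reads MALLOC_CHECK_ at process start; with 3 it prints a
    # diagnostic and aborts when it detects heap corruption.
    env["MALLOC_CHECK_"] = str(check_level)
    return env


def allow_core_dumps():
    """Raise the soft core-file size limit to the hard limit."""
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def run_with_malloc_check(argv):
    """Run argv as a child process with MALLOC_CHECK_=3 and core dumps allowed."""
    # preexec_fn runs in the child just before exec, so the raised limit
    # applies only to the launched script, not to this launcher.
    return subprocess.run(argv, env=build_debug_env(), preexec_fn=allow_core_dumps)


# Hypothetical usage ("train.py" stands in for the real training script):
# result = run_with_malloc_check([sys.executable, "train.py"])
# print("exit code:", result.returncode)
```

If the child aborts, the resulting core file can be opened with gdb against the same Python binary to get a backtrace like the one above; where the core file lands depends on the system's core_pattern setting.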
PyTorch version: 1.2.0
cc @SsnL
About this issue
- State: closed
- Created 5 years ago
- Comments: 19 (4 by maintainers)
@skrah it seems that https://github.com/pytorch/pytorch/pull/24464 fixed the problem for me. I’ll also verify later today how it works out on AWS.