deepspeech.pytorch: Segmentation fault during training (Volta, others)

Training on TED as extracted by python ted.py ..., on an AWS p3.2xlarge instance with CUDA 9.0, cuDNN 7.0.3, Ubuntu 16.04, and Python 3.5.4, results in Segmentation fault (core dumped) at some point during the first epoch (usually around 70-80% of the way through the batches), seemingly regardless of batch size (tried 32, 26, 12, and 4; also tried a p3.8xlarge with batch size 20). Worth mentioning: I did not install MAGMA as per the pytorch conda installation instructions:

# Add LAPACK support for the GPU
conda install -c soumith magma-cuda80 # or magma-cuda75 if CUDA 7.5

as it seems that the versions mentioned there are incompatible with CUDA 9.0.

Edit: last output from dmesg

[14531.790543] python[2191]: segfault at 100324c2400 ip 00007f165177a04a sp 00007f15c1c28c98 error 4 in libcuda.so.384.90[7f16515b2000+b1f000]

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 19 (9 by maintainers)

Most upvoted comments

For the first epoch, the batches are sampled in increasing sequence length order, so you progressively need more and more memory. What happens is that before the failing cuDNN call there is very little memory left on the GPU (in my tests, about 300 KB); you can confirm this by adding

import ctypes  # needed for the snippet below

lib = ctypes.cdll.LoadLibrary(None)  # resolve symbols from the running process, which links the CUDA runtime
free_mem = ctypes.c_long()
total_mem = ctypes.c_long()
lib.cudaMemGetInfo(ctypes.byref(free_mem), ctypes.byref(total_mem))
print("free mem", free_mem.value)

before the cudnnRNNForwardTraining call. CUDA requires some free memory for internal operation (stack space for kernels, device memory for events, etc.), and when so little memory is left, something is not handled correctly. (FWIW, the same thing can happen on the CPU when you are pushing the memory limits: an allocation can be reported as successful, and when you try to use it, you hang or segfault.) My output for the code modified as above, run on a single GPU with a batch size of 64, is

Epoch: [1][515/1422]	Time 1.177 (0.583)	Data 0.008 (0.006)	Loss 172.8604 (119.4332)	
free mem 7252279296
reserve size 801177600 1105817763840 workspace_size 313344000 1107793281024 input size torch.Size([326, 64, 672])
free mem 7252279296
reserve size 801177600 1106620973056 workspace_size 315801600 1107793281024 input size torch.Size([326, 64, 800])
free mem 7252279296
reserve size 801177600 1108502118400 workspace_size 315801600 1107793281024 input size torch.Size([326, 64, 800])
free mem 7252279296
reserve size 801177600 1117253533696 workspace_size 315801600 1107793281024 input size torch.Size([326, 64, 800])
free mem 7252279296
reserve size 801177600 1118056742912 workspace_size 315801600 1107926843392 input size torch.Size([326, 64, 800])
Epoch: [1][516/1422]	Time 0.915 (0.584)	Data 0.008 (0.006)	Loss 151.3297 (119.4950)	
free mem 6446972928
reserve size 803635200 1109305327616 workspace_size 314163200 1107793281024 input size torch.Size([327, 64, 672])
free mem 5641666560
reserve size 803635200 1110110633984 workspace_size 316620800 1107793281024 input size torch.Size([327, 64, 800])
free mem 4836360192
reserve size 803635200 1110915940352 workspace_size 316620800 1107793281024 input size torch.Size([327, 64, 800])
free mem 4031053824
reserve size 803635200 1111721246720 workspace_size 316620800 1107793281024 input size torch.Size([327, 64, 800])
free mem 3225747456
reserve size 803635200 1112526553088 workspace_size 316620800 1107927236608 input size torch.Size([327, 64, 800])
Epoch: [1][517/1422]	Time 0.930 (0.585)	Data 0.007 (0.006)	Loss 165.3099 (119.5836)	
free mem 2420441088
reserve size 803635200 1109305327616 workspace_size 314163200 1107793281024 input size torch.Size([327, 64, 672])
free mem 2420441088
reserve size 803635200 1110110633984 workspace_size 316620800 1107793281024 input size torch.Size([327, 64, 800])
free mem 2420441088
reserve size 803635200 1110915940352 workspace_size 316620800 1107793281024 input size torch.Size([327, 64, 800])
free mem 2420441088
reserve size 803635200 1111721246720 workspace_size 316620800 1107793281024 input size torch.Size([327, 64, 800])
free mem 2420441088
reserve size 803635200 1112526553088 workspace_size 316620800 1107927236608 input size torch.Size([327, 64, 800])
Epoch: [1][518/1422]	Time 0.919 (0.585)	Data 0.008 (0.006)	Loss 179.5324 (119.6994)	
free mem 1613037568
reserve size 806092800 1114137165824 workspace_size 314982400 1107793281024 input size torch.Size([328, 64, 672])
free mem 805634048
reserve size 806092800 1114944569344 workspace_size 317440000 1107793281024 input size torch.Size([328, 64, 800])
free mem 327680
reserve size 806092800 1115751972864 workspace_size 317440000 1107793281024 input size torch.Size([328, 64, 800])
Segmentation fault (core dumped)

Ignore the “reserve size” lines; I was checking that the allocations for the reserve and the workspace look OK. So, there are a few things to blame for this outcome:

  1. CUDA should handle OOM more gracefully than just segfaulting.
  2. IIRC, there are still some unresolved issues with pytorch using more memory for RNNs than it should.
  3. Monotonically increasing sequence lengths, as in this example, are particularly bad for the caching allocator: it cannot reuse blocks and is forced to constantly free and reallocate them (see the sketch after this list).
  4. The Tedlium dataset has pretty long sequences; I segfaulted at less than 40% of the epoch, and the sequences were already of length 328.
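
As a rough illustration of point 3 (this sketch is mine, not from the thread): each iteration below asks for a slightly larger block than the one just freed, so the caching allocator cannot reuse its cached blocks and the reserved pool keeps growing. The tensor sizes are made up, and torch.cuda.memory_reserved is named memory_cached on older PyTorch versions.

import torch

device = torch.device("cuda")
for seq_len in range(100, 1100, 100):
    # hypothetical activation tensor; each one is bigger than the last
    x = torch.randn(seq_len, 64, 800, device=device)
    print(seq_len,
          "allocated", torch.cuda.memory_allocated(device),
          "reserved", torch.cuda.memory_reserved(device))  # memory_cached() on older PyTorch
    del x  # the freed block goes to the cache but is too small for the next request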

Emptying the memory allocator cache was recently exposed in upstream pytorch; maybe doing that, or collecting garbage before each iteration, will let you move a bit further. Or use a smaller batch, or a smaller LSTM hidden state; in short, all the standard things you would do to try to fit your problem in memory.
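
For concreteness, a minimal sketch of what those mitigations might look like in a generic training loop; this is not the actual deepspeech.pytorch train loop, and model, train_loader, criterion, and optimizer are placeholders assumed to be defined elsewhere.

import gc
import torch

for i, (inputs, targets) in enumerate(train_loader):
    # collect Python garbage and return cached blocks to CUDA before each step,
    # so the allocator starts with as much free device memory as possible
    gc.collect()
    torch.cuda.empty_cache()

    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()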