jax: cuda failed to allocate errors
When running a training script using the new memory allocation backend (https://github.com/google/jax/issues/417), I see a bunch of non-fatal errors like this:
[1] 2019-05-29 23:55:55.555823: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 528.00M (553648128 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1] 2019-05-29 23:55:55.581962: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/cudnn_conv_algorithm_picker.cc:525] Resource exhausted: Failed to allocate request for 528.00MiB (553648128B) on device ordinal 0
[7] 2019-05-29 23:55:55.594693: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 528.00M (553648128 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[7] 2019-05-29 23:55:55.606314: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/cudnn_conv_algorithm_picker.cc:525] Resource exhausted: Failed to allocate request for 528.00MiB (553648128B) on device ordinal 0
[1] 2019-05-29 23:55:55.633261: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 1.14G (1224736768 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1] 2019-05-29 23:55:55.635169: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 1.05G (1132822528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1] 2019-05-29 23:55:55.646031: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 561.11M (588365824 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1] 2019-05-29 23:55:55.647926: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 592.04M (620793856 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[7] 2019-05-29 23:55:55.655470: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 1.14G (1224736768 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Is this a known issue? The errors go away when using XLA_PYTHON_CLIENT_ALLOCATOR=platform.
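For reference, this is a minimal sketch of how I'm applying that workaround; the variable has to be set before jax initializes its GPU backend, so either export it in the shell or set it at the very top of the script:

```python
# Minimal sketch: switch XLA's Python client to the "platform" allocator so
# GPU memory is allocated on demand rather than from a large preallocated pool.
# The variable must be set before jax initializes its backends, i.e. before
# the first import/use of jax (or export it in the shell instead).
import os
os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"

import jax
import jax.numpy as jnp

x = jnp.ones((1024, 1024))
print(jax.devices())
print(float(x.sum()))  # with the platform allocator, the OOM spam above disappears
```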
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 4
- Comments: 32 (7 by maintainers)
@christopherhesse if you update to the latest jaxlib (0.1.20, currently Linux-only; let me know if you need the Mac build), you should see fewer OOM messages. (https://github.com/tensorflow/tensorflow/commit/701f7e5a24590206c1ff32a50852f6cd040df1af reduces the amount of GPU memory needed in your script, and https://github.com/tensorflow/tensorflow/commit/84e3ae12ba9d6c64efae7884776810825bf82989 suppresses some spurious OOM log messages.) Give it a try?
There’s another issue that I haven’t addressed yet, which is that https://github.com/tensorflow/tensorflow/commit/805b7ccc2ec86c9dd59fa3550c57109a4a71c0d3 reduces GPU memory utilization (with the upshot that jax no longer allocates all your GPU memory up-front). I noticed that this makes your script OOM sooner than it did prior to that change. This is harder to fix; I might just add a toggle to re-enable the old behavior for now. I’ll file a separate issue for this once I can better quantify how much worse the utilization is.
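As a rough sketch of what such a toggle could look like (I'm assuming here that it corresponds to the XLA_PYTHON_CLIENT_PREALLOCATE environment variable described on the gpu_memory_allocation docs page, which controls exactly this kind of up-front allocation):

```python
# Sketch only: assuming the toggle lands as the XLA_PYTHON_CLIENT_PREALLOCATE
# environment variable described on the gpu_memory_allocation docs page.
# "true" restores eager up-front allocation of a large GPU memory pool
# (better utilization per the comment above); "false" allocates lazily.
import os
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "true"

import jax  # the flag is read when the GPU backend is initialized
```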
I ended up making it a WARNING, since it can have a significant performance impact. The change is committed to XLA in https://github.com/tensorflow/tensorflow/commit/1423eab5e000c304f332c2a2a322bee76ca3fdfa and will be included in the next jaxlib.
@mgbukov the error is referring to GPU memory and GPU convolution algorithms, so you won’t see it on CPU. You might also try the techniques for reducing GPU memory usage as described in https://jax.readthedocs.io/en/latest/gpu_memory_allocation.html.
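As a minimal sketch of one of those techniques (capping each process's share of a GPU via XLA_PYTHON_CLIENT_MEM_FRACTION, which can help when several worker processes end up on the same device; the ".50" below is just an example value):

```python
# Minimal sketch of capping a process's GPU memory share, per the
# gpu_memory_allocation docs page. XLA_PYTHON_CLIENT_MEM_FRACTION limits the
# fraction of total GPU memory that the preallocated pool may take, which
# helps when several processes share one device. Set before jax starts up.
import os
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = ".50"  # example: use at most ~50%

import jax
import jax.numpy as jnp

print(jax.devices())
print(float(jnp.zeros((8, 8)).sum()))  # backend initializes under the cap
```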
Looks like an internal “error” log message that should be downgraded to “info”. Safe to ignore, but I’ll leave this open until we get rid of the spurious error message.
Awesome, thanks for your patience with this! I’ll go ahead and close the issue.
@skye the errors are gone, thanks for fixing this!
@christopherhesse I’m able to repro with your updated script, thanks! Agreed that these “errors” aren’t necessary, they’re way too noisy and not actionable (since the script still runs, at least for a while). Now I can find out exactly where they’re coming from and hopefully put a stop to them 😃