jax: cuda failed to allocate errors
When running a training script using the new memory allocation backend (https://github.com/google/jax/issues/417), I see a bunch of non-fatal errors like this:
[1] 2019-05-29 23:55:55.555823: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 528.00M (553648128 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1] 2019-05-29 23:55:55.581962: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/cudnn_conv_algorithm_picker.cc:525] Resource exhausted: Failed to allocate request for 528.00MiB (553648128B) on device ordinal 0
[7] 2019-05-29 23:55:55.594693: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 528.00M (553648128 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[7] 2019-05-29 23:55:55.606314: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/cudnn_conv_algorithm_picker.cc:525] Resource exhausted: Failed to allocate request for 528.00MiB (553648128B) on device ordinal 0
[1] 2019-05-29 23:55:55.633261: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 1.14G (1224736768 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1] 2019-05-29 23:55:55.635169: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 1.05G (1132822528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1] 2019-05-29 23:55:55.646031: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 561.11M (588365824 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1] 2019-05-29 23:55:55.647926: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 592.04M (620793856 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[7] 2019-05-29 23:55:55.655470: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 1.14G (1224736768 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Is this a known issue? The errors go away when using XLA_PYTHON_CLIENT_ALLOCATOR=platform.
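For reference, this is a minimal sketch of how I'm applying that workaround; the variable has to be set before jax initializes its GPU backend, so either export it in the shell or set it at the very top of the script:

```python
# Minimal sketch: switch XLA's Python client to the "platform" allocator so
# GPU memory is allocated on demand rather than from a large preallocated pool.
# The variable must be set before jax initializes its backends, i.e. before
# the first import/use of jax (or export it in the shell instead).
import os
os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"

import jax
import jax.numpy as jnp

x = jnp.ones((1024, 1024))
print(jax.devices())
print(float(x.sum()))  # with the platform allocator, the OOM spam above disappears
```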
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 4
- Comments: 32 (7 by maintainers)
@christopherhesse if you update to the latest jaxlib (0.1.20, currently Linux-only; let me know if you need the Mac build), you should see fewer OOM messages. (https://github.com/tensorflow/tensorflow/commit/701f7e5a24590206c1ff32a50852f6cd040df1af reduces the amount of GPU memory needed in your script, and https://github.com/tensorflow/tensorflow/commit/84e3ae12ba9d6c64efae7884776810825bf82989 suppresses some spurious OOM log messages.) Give it a try?
There’s another issue that I haven’t addressed yet, which is that https://github.com/tensorflow/tensorflow/commit/805b7ccc2ec86c9dd59fa3550c57109a4a71c0d3 reduces GPU memory utilization (with the upshot that jax no longer allocates all your GPU memory up-front). I noticed that this makes your script OOM sooner than it did prior to that change. This is harder to fix; I might just add a toggle to re-enable the old behavior for now. I’ll file a separate issue for this once I can better quantify how much worse the utilization is.
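As a rough sketch of what such a toggle could look like (I'm assuming here that it corresponds to the XLA_PYTHON_CLIENT_PREALLOCATE environment variable described on the gpu_memory_allocation docs page, which controls exactly this kind of up-front allocation):

```python
# Sketch only: assuming the toggle lands as the XLA_PYTHON_CLIENT_PREALLOCATE
# environment variable described on the gpu_memory_allocation docs page.
# "true" restores eager up-front allocation of a large GPU memory pool
# (better utilization per the comment above); "false" allocates lazily.
import os
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "true"

import jax  # the flag is read when the GPU backend is initialized
```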
I ended up making it a WARNING, since it can have a significant performance impact. The change is committed to XLA in https://github.com/tensorflow/tensorflow/commit/1423eab5e000c304f332c2a2a322bee76ca3fdfa and will be included in the next jaxlib.
@mgbukov the error is referring to GPU memory and GPU convolution algorithms, so you won’t see it on CPU. You might also try the techniques for reducing GPU memory usage as described in https://jax.readthedocs.io/en/latest/gpu_memory_allocation.html.
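As a minimal sketch of one of those techniques (capping each process's share of a GPU via XLA_PYTHON_CLIENT_MEM_FRACTION, which can help when several worker processes end up on the same device; the ".50" below is just an example value):

```python
# Minimal sketch of capping a process's GPU memory share, per the
# gpu_memory_allocation docs page. XLA_PYTHON_CLIENT_MEM_FRACTION limits the
# fraction of total GPU memory that the preallocated pool may take, which
# helps when several processes share one device. Set before jax starts up.
import os
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = ".50"  # example: use at most ~50%

import jax
import jax.numpy as jnp

print(jax.devices())
print(float(jnp.zeros((8, 8)).sum()))  # backend initializes under the cap
```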
Looks like an internal “error” log message that should be downgraded to “info”. Safe to ignore, but I’ll leave this open until we get rid of the spurious error message.
Awesome, thanks for your patience with this! I’ll go ahead and close the issue.
@skye the errors are gone, thanks for fixing this!
@christopherhesse I’m able to repro with your updated script, thanks! Agreed that these “errors” aren’t necessary, they’re way too noisy and not actionable (since the script still runs, at least for a while). Now I can find out exactly where they’re coming from and hopefully put a stop to them 😃