jax: illegal memory access when two GPUs are used

Hi @skye, this is a continuation of our discussion in issue #3158.

I have implemented an optimization program with JAX. The program works great on a single GPU (TITAN RTX 24GB), but it fails when multiple GPUs are given, with an error that reports an “illegal memory access”.

Below is the full error output. Please let me know what else would help track down this issue. Thanks, Eyal

2020-06-22 21:17:03.780428: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_blas.cc:426] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2020-06-22 21:17:03.780492: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_client.cc:1484] Execution of replica 0 failed: Internal: Failed to launch CUDA kernel: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-06-22 21:17:03.780554: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_client.cc:1484] Execution of replica 1 failed: Internal: Unable to launch fft for thunk 0x55def2e06330 with type IFFT
Traceback (most recent call last):
  File "/home/name/sp/sp.py", line 201, in <module>
    batch_loss, opt_state = update(opt_state, idx, x, At, Bt)
  File "/home/name/anaconda3/envs/sp/lib/python3.7/site-packages/jax/api.py", line 1165, in f_pmapped
    donated_invars=tuple(donated_invars))
  File "/home/name/anaconda3/envs/sp/lib/python3.7/site-packages/jax/core.py", line 1085, in _call_bind
    outs = primitive.impl(f, *args, **params)
  File "/home/name/anaconda3/envs/sp/lib/python3.7/site-packages/jax/interpreters/pxla.py", line 653, in xla_pmap_impl
    return compiled_fun(*args)
  File "/home/name/anaconda3/envs/sp/lib/python3.7/site-packages/jax/interpreters/pxla.py", line 1092, in execute_replicated
    out_bufs = compiled.execute_on_local_devices(list(input_bufs))
RuntimeError: Internal: Failed to launch CUDA kernel: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered: while running replica 0 and partition 0 of areplicated computation (other replicas may have failed as well).
2020-06-22 21:17:03.869297: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:940] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered :: *** Begin stack trace ***
	PyDict_SetItem
	_PyModule_ClearDict
	PyImport_Cleanup
	Py_FinalizeEx
	_Py_UnixMain
	__libc_start_main
	
*** End stack trace ***

2020-06-22 21:17:03.869484: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_executable.cc:88] Check failed: pair.first->SynchronizeAllActivity() 

Process finished with exit code 134

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 17 (3 by maintainers)

Most upvoted comments

Hi, sorry for the delay! I received your email and am able to repro the error, woohoo. Now to debug it…

@brianwa84 thanks for the pointer. I’m not familiar with the internals of XLA multi-GPU execution either, so your guess sounds as plausible as any 😃 I’ll see if I can verify the device ordinal.