legion: Regression in GPU executions

Our GPU test suite last succeeded at Legion commit 61a919f8 and failed last night at da9cefee. I suspect it’s due to this merge?

We’re getting errors like:

prometeo_ConstPropMix.exec: /home/hpcc/gitlabci/psaap-ci/artifacts/5065111388/legion/runtime/realm/cuda/cuda_module.cc:370: bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit): Assertion `0' failed.
prometeo_ConstPropMix.exec: /home/hpcc/gitlabci/psaap-ci/artifacts/5065111388/legion/runtime/realm/cuda/cuda_module.cc:370: bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit): Assertion `0' failed.
[0 - 7f55f347ec80]    2.785033 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)
[0 - 7f55e16f8c80]    2.785032 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)
[0 - 7f55f348ac80]    2.785078 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)

I can debug further, just let me know.
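
If bisecting would help, the range between the last good commit (61a919f8) and the first known bad one (da9cefee) could be searched automatically with `git bisect run`. Below is a minimal sketch of a predicate script for that, not something from the Legion repository. It assumes the test command prints the Realm log to stdout or stderr, and it matches only the two failure signatures that appear in the log above. The script name `bisect_check.py` and the wrapped command are hypothetical.

```python
#!/usr/bin/env python3
"""Predicate for `git bisect run` (sketch; names here are hypothetical)."""
import subprocess
import sys

# Failure signatures copied from the log in this report.
FAILURE_SIGNATURES = (
    "CUDA_ERROR_ASSERT",
    "GPUStream::reap_events",
)

# Exit codes understood by `git bisect run`:
# 0 = good, 1..127 except 125 = bad, 125 = skip, above 127 = abort.
GOOD, BAD, SKIP, ABORT = 0, 1, 125, 255


def bisect_exit_code(returncode, output):
    """Map one test run to an exit code for `git bisect run`."""
    if any(sig in output for sig in FAILURE_SIGNATURES):
        return BAD   # the regression described in this report
    if returncode == 0:
        return GOOD
    return SKIP      # unrelated failure (e.g. broken build): do not blame this commit


def main(command):
    """Run the test command, echo its output, and classify the result."""
    try:
        proc = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # the assertion message goes to stderr
            text=True,
        )
    except OSError as err:
        print(f"cannot run {command!r}: {err}", file=sys.stderr)
        return ABORT
    sys.stdout.write(proc.stdout)
    return bisect_exit_code(proc.returncode, proc.stdout)


# Only act when a test command was passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

Assuming a hypothetical wrapper `./build_and_test.sh` that rebuilds Legion and runs one failing GPU test, this would be driven with `git bisect start da9cefee 61a919f8` followed by `git bisect run python3 bisect_check.py ./build_and_test.sh`. Returning 125 for failures that do not match the signatures keeps commits that merely fail to build from being marked as the culprit.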

@elliottslaughter, could you please add this to https://github.com/StanfordLegion/legion/issues/1032?

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 48 (40 by maintainers)

Most upvoted comments

The fix is out for review. I will submit it as soon as it has been reviewed.

Yes, I will update this issue as soon as I merge the fix.