legion: Regression in GPU executions
Our GPU test suite last succeeded at Legion commit 61a919f8 and failed last night at da9cefee. I suspect it’s due to this merge?
We’re getting errors like:
prometeo_ConstPropMix.exec: /home/hpcc/gitlabci/psaap-ci/artifacts/5065111388/legion/runtime/realm/cuda/cuda_module.cc:370: bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit): Assertion `0' failed.
prometeo_ConstPropMix.exec: /home/hpcc/gitlabci/psaap-ci/artifacts/5065111388/legion/runtime/realm/cuda/cuda_module.cc:370: bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit): Assertion `0' failed.
[0 - 7f55f347ec80] 2.785033 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)
[0 - 7f55e16f8c80] 2.785032 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)
[0 - 7f55f348ac80] 2.785078 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)
I can debug further, just let me know.
@elliottslaughter, could you please add this to https://github.com/StanfordLegion/legion/issues/1032?
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Comments: 48 (40 by maintainers)
The fix is out. I am going to be submitting it soon once reviewed.
Yes, will update it here as soon as merge the fix in