legion: Non-deterministic hangs while running HTR

I am running a 45-node simulation on Lassen using HTR and Legion commit f4f80752 (from 11/24) on the control_replication branch. The runtime hangs non-deterministically, usually after ~10-100 time steps. Attached are two backtraces captured several minutes apart during one of these hangs. Legion is compiled with CXXFLAGS='-g', and the GASNet version is 2021.9.0. HTR is run with DEBUG=1.

bt-first.tar.gz bt-second.tar.gz

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 65 (33 by maintainers)

Most upvoted comments

@elliottslaughter can you add this issue to #1032? I do not know whether this is Realm or Legion related, but it is definitely top priority.