legion: [HTR] Segmentation faults at 16 nodes
I am able to run one of our applications (16 ranks, 1 rank per node) on Legion commit cba415a857c2586b2ad2f4848d6d1cd75de7df00.
However, on 9c6c90b9e3857196da2659a29140f2d7686832bb, I get segmentation faults and non-deterministic errors such as:
prometeo_ConstPropMix.exec: prometeo_variables.cc:75: static void UpdatePropertiesFromPrimitiveTask::cpu_base_impl(const UpdatePropertiesFromPrimitiveTask::Args&, const std::vector<Legion::PhysicalRegion>&, const std::vector<Legion::Future>&, Legion::Context, Legion::Runtime*): Assertion `args.mix.CheckMixture(acc_MolarFracs[p])' failed.
[5 - 7fbc93ba8840] 1193.644387 {6}{realm}: invalid event handle: id=7fbcab057570
prometeo_ConstPropMix.exec: /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/runtime_impl.cc:2509: Realm::EventImpl* Realm::RuntimeImpl::get_event_impl(Realm::Event): Assertion `0 && "invalid event handle"' failed.
This program does run successfully with DEBUG=1. I am actively running this test case with smaller configurations to see if I can reproduce outside of this specific config.
Edit:
16 ranks, 4 ranks per node works
About this issue
- State: closed
- Created a year ago
- Comments: 105 (87 by maintainers)
@mariodirenzo Pull the most recent control replication, rebuild, regenerate your logs, and then run at least the logical analysis verification on both a small run and a bigger run that fails. The logical analysis verification algorithm should be considerably faster. I did have to change the logging of predicate operations (which you are using) in order to get Legion Spy’s new logical verification algorithm to work, so you will need the latest control replication and newly generated log files.
Be very careful with your reductions, especially if you’re getting raw pointers from your reduction accessors. Make sure you always perform actual reductions and don’t overwrite the reduction buffers (the cuNumeric folks had a very nasty privilege violation bug from doing exactly this, and it resulted in non-deterministic failures).
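To make the reduction point concrete, here is a minimal sketch of the difference between folding into a reduction buffer and overwriting it through a raw pointer. This is plain C++ and deliberately does not use the Legion API: SumReduction, contribute_fold, contribute_overwrite and run_two_tasks are hypothetical names made up for illustration, and the two "tasks" are simulated by two sequential calls on one buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a sum-reduction operator (an identity plus a
// fold). Illustration only; this is NOT the Legion reduction-op interface.
struct SumReduction {
  static double identity() { return 0.0; }
  static void fold(double &lhs, double rhs) { lhs += rhs; }
};

// Correct: every write goes through the reduction operator, so contributions
// already present in the buffer survive.
void contribute_fold(double *buffer, const std::vector<double> &values) {
  for (std::size_t i = 0; i < values.size(); ++i)
    SumReduction::fold(buffer[i], values[i]);
}

// Wrong: a plain store through the raw pointer discards whatever was
// folded in earlier.
void contribute_overwrite(double *buffer, const std::vector<double> &values) {
  for (std::size_t i = 0; i < values.size(); ++i)
    buffer[i] = values[i];
}

// Simulates two tasks contributing 1.5 and 2.5 to the same one-element
// buffer, which starts at the reduction identity.
double run_two_tasks(bool use_fold) {
  std::vector<double> buffer(1, SumReduction::identity());
  const std::vector<double> a{1.5};
  const std::vector<double> b{2.5};
  if (use_fold) {
    contribute_fold(buffer.data(), a);
    contribute_fold(buffer.data(), b);
  } else {
    contribute_overwrite(buffer.data(), a);
    contribute_overwrite(buffer.data(), b);
  }
  return buffer[0];
}
```

With the fold, both contributions are kept. With the overwrite, the result depends on which writer goes last, and in a real run that order is not fixed, which is one way to end up with failures that look non-deterministic.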
If you think something is wrong with Legion, then Legion Spy is always the sanity check and will validate multi-node runs with control replication now.
Please start by getting a backtrace of the runtime crash with line numbers.