legion: HTR crash on multiple nodes
Latest control_replication. 4 ranks, 1 rank per node with GPUs. The crash is non-deterministic and only occurs with specific test cases of the solver.
I think this is the same crash as #1415. Feel free to move this back there if you’d prefer.
crash:
legion/runtime/realm/transfer/transfer.cc:161: bool Realm::TransferIteratorBase<N, T>::done() [with int N = 1; T = long long int]: Assertion `inst_impl->metadata.is_valid()' failed.
backtrace:
(gdb) bt
#0 0x00007f88fc82f9fd in nanosleep () from /lib64/libc.so.6
#1 0x00007f88fc82f894 in sleep () from /lib64/libc.so.6
#2 0x00007f88fe13f162 in Realm::realm_freeze (signal=<optimized out>) at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/runtime_impl.cc:200
#3 <signal handler called>
#4 0x00007f88fc7a0387 in raise () from /lib64/libc.so.6
#5 0x00007f88fc7a1a78 in abort () from /lib64/libc.so.6
#6 0x00007f88fc7991a6 in __assert_fail_base () from /lib64/libc.so.6
#7 0x00007f88fc799252 in __assert_fail () from /lib64/libc.so.6
#8 0x00007f88fe17878c in Realm::TransferIteratorBase<1, long long>::done (this=0x7f88b83c6bc0) at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/metadata.h:43
#9 0x00007f88fe18df0a in Realm::TransferIteratorBase<1, long long>::get_addresses (this=0x7f88b83c6bc0, addrlist=..., nonaffine=@0x7f88797ff398: 0x0)
at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/transfer/transfer.cc:523
#10 0x00007f88fe1f833c in Realm::XferDes::get_addresses (this=0x7f88b83d4200, min_xfer_size=8, rseqcache=<optimized out>, in_nonaffine=@0x7f88797ff390: 0x0, out_nonaffine=@0x7f88797ff398: 0x0)
at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/transfer/channel.cc:1509
#11 0x00007f88fe1f8471 in Realm::XferDes::get_addresses (this=<optimized out>, min_xfer_size=<optimized out>, rseqcache=<optimized out>)
at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/transfer/channel.cc:1430
#12 0x00007f88fe1f9a0a in Realm::MemreduceXferDes::progress_xd(Realm::MemreduceChannel*, Realm::TimeLimit) ()
at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/transfer/channel.cc:3472
#13 0x00007f88fe206521 in Realm::XDQueue<Realm::MemreduceChannel, Realm::MemreduceXferDes>::do_work (this=0x3d3dd38, work_until=...)
at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/transfer/channel.inl:53
#14 0x00007f88fe15ca20 in Realm::BackgroundWorkManager::Worker::do_work(long long, Realm::atomic<bool>*) () at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/timers.inl:288
#15 0x00007f88fe233087 in wait_for_work (old_work_counter=<optimized out>, this=0x3c1c800) at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/tasks.cc:1291
#16 Realm::ThreadedTaskScheduler::wait_for_work (this=0x3c1c800, old_work_counter=<optimized out>) at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/tasks.cc:1275
#17 0x00007f88fe2387d3 in Realm::ThreadedTaskScheduler::scheduler_loop() () at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/tasks.cc:1260
#18 0x00007f88fe21f4cf in Realm::UserThread::uthread_entry() () at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/threads.cc:1355
#19 0x00007f88fc7b2190 in ?? () from /lib64/libc.so.6
#20 0x0000000000000000 in ?? ()
@elliottslaughter, please add to #1032, thanks!
About this issue
- State: closed
- Created 8 months ago
- Comments: 21 (17 by maintainers)
@elliottslaughter, can you please add this issue to #1032?
@apryakhin I think the idea was to replace this line: https://github.com/StanfordLegion/legion/blob/stable/runtime/realm/transfer/transfer.cc#L161 with something more directly helpful, e.g.:
@cmelone I haven’t tried compiling the above, but hopefully it’ll work. Can you try adding it to your build and running again with the increased logging that was requested?
(edited the code from the first version - I forgot that fatal logging messages automatically terminate the application now)
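The snippet referred to in these comments is not reproduced in this thread. As a rough illustration of the general idea only (turning a bare assertion into a check that reports which instance had invalid metadata), here is a minimal, self-contained sketch. `FakeMetadata`, `FakeInstance`, `describe_invalid_metadata`, and `check_metadata` are hypothetical stand-ins, not Realm types or APIs, and the sketch writes to stderr instead of using Realm's logger.

```cpp
#include <cstdio>
#include <sstream>
#include <string>

// Hypothetical stand-ins for illustration; these are NOT Realm's real types.
struct FakeMetadata {
  bool valid = false;
  bool is_valid() const { return valid; }
};

struct FakeInstance {
  unsigned long long id = 0;
  FakeMetadata metadata;
};

// Builds the diagnostic text that a bare assert() would not provide:
// which iterator was running and which instance lacked valid metadata.
std::string describe_invalid_metadata(const FakeInstance &inst,
                                      const void *iterator) {
  std::ostringstream os;
  os << "transfer iterator " << iterator << " used instance 0x" << std::hex
     << inst.id << " without valid metadata";
  return os.str();
}

// Returns true when the metadata is valid. Otherwise prints the diagnostic
// to stderr and returns false, leaving the decision to abort to the caller.
bool check_metadata(const FakeInstance &inst, const void *iterator) {
  if (inst.metadata.is_valid())
    return true;
  std::fprintf(stderr, "%s\n",
               describe_invalid_metadata(inst, iterator).c_str());
  return false;
}
```

In a real build the caller would terminate after a failed check. Per the comment above, a fatal-level log message already terminates the application, so a version using the runtime's own logger would presumably not need a separate abort call.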
@lightsighter I can help with debugging this