legion: Seg fault in receive_message on Perlmutter
I'm seeing a seg fault in receive_message when I scale S3D on Perlmutter using gasnet1 and the ucx conduit. At 384 nodes I start seeing this:
[39] Thread 8 (Thread 0x7f3d012f8700 (LWP 60436) "s3d.x"):
[39] #0 0x00007f3d19b55217 in waitpid () from /lib64/libc.so.6
[39] #1 0x00007f3d19ad276f in do_system () from /lib64/libc.so.6
[39] #2 0x00007f3d14b864d0 in gasneti_system_redirected () from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
[39] #3 0x00007f3d14b86c42 in gasneti_bt_gdb () from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
[39] #4 0x00007f3d14b8a535 in gasneti_print_backtrace () from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
[39] #5 0x00007f3d14364cdb in gasneti_defaultSignalHandler () from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
[39] #6 <signal handler called>
[39] #7 0x00007f3d1599e143 in Legion::Internal::MessageManager::receive_message (this=0x1d00560000000001, args=0x7f37fd2b7a70, arglen=48) at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/runtime.cc:13153
[39] #8 0x00007f3d1599e199 in Legion::Internal::Runtime::process_message_task (this=<optimized out>, args=<optimized out>, arglen=<optimized out>) at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/runtime.cc:25599
[39] #9 0x00007f3d1599e290 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f37fd2b7a60, arglen=56, userdata=<optimized out>, userlen=<optimized out>, p=...) at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/runtime.cc:31076
[39] #10 0x00007f3d145ebac2 in Realm::LocalTaskProcessor::execute_task (this=0x5670190, func_id=4, task_args=...) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/bytearray.inl:58
[39] #11 0x00007f3d146305c3 in Realm::Task::execute_on_processor (this=0x1963ca50, p=...) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:306
[39] #12 0x00007f3d14630666 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:1646
[39] #13 0x00007f3d14632cb2 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x1280b010) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:1127
[39] #14 0x00007f3d1463a4a0 in Realm::UserThread::uthread_entry () at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/threads.cc:1337
[39] #15 0x00007f3d19adaca0 in ?? () from /lib64/libc.so.6
[39] #16 0x0000000000000000 in ?? ()
About this issue
- State: closed
- Created 3 years ago
- Comments: 23 (23 by maintainers)
I added some “goop” to the CMake config to expose these as parameters and to verify that each one is a power of two. I’ll provide a PR once I kick the tires a bit more.
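For readers unfamiliar with the power-of-two check mentioned above, here is a minimal sketch of the usual bit trick, written in C++ rather than CMake. The helper name `is_power_of_two` and the constant `HYPOTHETICAL_BUFFER_COUNT` are placeholders invented for illustration; they are not the actual parameters exposed by the change, which this thread does not name.

```cpp
#include <cstddef>

// Hypothetical helper: a nonzero value is a power of two exactly when it has
// a single bit set, so clearing its lowest set bit leaves zero.
constexpr bool is_power_of_two(std::size_t value) {
  return value != 0 && (value & (value - 1)) == 0;
}

// Placeholder constant standing in for a build-time parameter; the name and
// the value 64 are assumptions, not real Legion or GASNet settings.
constexpr std::size_t HYPOTHETICAL_BUFFER_COUNT = 64;

// Reject a bad value at compile time instead of failing at scale.
static_assert(is_power_of_two(HYPOTHETICAL_BUFFER_COUNT),
              "HYPOTHETICAL_BUFFER_COUNT must be a power of two");
```

Validating at configure or compile time means a bad value is caught before the job is launched, rather than showing up as a crash on a large run.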