Legion: seg fault in dependence analysis

Environment

I ran FlexFlow with the following Legion version in a multi-node environment.

  • branch: control_replication
  • commit ID: 3354a70192188e5eca7d0c4f96fc3b370541c8fc

Here is the experiment configuration for the example above:

  • Summit cluster
  • Number of nodes / GPUs: 4 / 16 (4 GPUs per node)
  • Number of tasks per GPU: 64
  • Total number of tasks: 1024 (64 tasks / GPU * 16 GPUs).

Issues

I got a segmentation fault in the following function: Legion::Internal::Operation::perform_registration(unsigned int, Legion::Internal::Operation*, unsigned int, bool&, Legion::Internal::Operation::MappingDependenceTracker*, Legion::Internal::RtEvent)

May I get some advice on 1) why this segfault would happen and 2) how to investigate it?

For further detail, here are the full backtraces from gdb: one for the main process (which is waiting in Realm::Runtime::wait_for_shutdown()) and one for the faulting thread (which is sleeping in nanosleep() inside Realm's freeze handler):

Process backtrace

#0  0x0000200016655e84 in syscall () from /lib64/power9/libc.so.6
#1  0x000020001456e2a4 in Realm::Doorbell::wait_slow() ()
   from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/librealm.so.1
#2  0x000020001456fa08 in Realm::UnfairCondVar::wait() ()
   from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/librealm.so.1
#3  0x00002000144a4cb8 in Realm::RuntimeImpl::wait_for_shutdown() ()
   from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/librealm.so.1
#4  0x00002000144a675c in Realm::Runtime::wait_for_shutdown() ()
   from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/librealm.so.1
#5  0x0000200012e632a8 in Legion::Internal::Runtime::start(int, char**, bool, bool) ()
   from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/liblegion.so.1
#6  0x0000200012a45ae8 in Legion::Runtime::start(int, char**, bool, bool) ()
   from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/liblegion.so.1
#7  0x0000000010008078 in main ()

Thread backtrace

#0  0x000020001661a114 in nanosleep () from /lib64/power9/libc.so.6
#1  0x0000200016619f44 in sleep () from /lib64/power9/libc.so.6
#2  0x00002000144a12a8 in Realm::realm_freeze(int) ()
   from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/librealm.so.1
#3  <signal handler called>
#4  0x00002000128b8a6c in std::_Rb_tree<Legion::Internal::RtEvent, Legion::Internal::RtEvent, std::_Identity<Legion::Internal::RtEvent>, std::less<Legion::Internal::RtEvent>, std::allocator<Legion::Internal::RtEvent> >::_M_get_insert_unique_pos(Legion::Internal::RtEvent const&) ()
   from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/liblegion.so.1
#5  0x00002000128b8b78 in std::pair<std::_Rb_tree_iterator<Legion::Internal::RtEvent>, bool> std::_Rb_tree<Legion::Internal::RtEvent, Legion::Internal::RtEvent, std::_Identity<Legion::Internal::RtEvent>, std::less<Legion::Internal::RtEvent>, std::allocator<Legion::Internal::RtEvent> >::_M_insert_unique<Legion::Internal::RtEvent const&>(Legion::Internal::RtEvent const&) () from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/liblegion.so.1
#6  0x0000200012ae42c4 in Legion::Internal::Operation::perform_registration(unsigned int, Legion::Internal::Operation*, unsigned int, bool&, Legion::Internal::Operation::MappingDependenceTracker*, Legion::Internal::RtEvent) ()
   from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/liblegion.so.1
#7  0x0000200012ae724c in Legion::Internal::Operation::register_dependence(Legion::Internal::Operation*, unsigned int) ()
   from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/liblegion.so.1
#8  0x0000200012954090 in Legion::Internal::InnerContext::register_implicit_dependences(Legion::Internal::Operation*) ()
   from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/liblegion.so.1
#9  0x0000200012a97e2c in Legion::Internal::Operation::begin_dependence_analysis() ()
   from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/liblegion.so.1
#10 0x0000200012ad0cd4 in Legion::Internal::Operation::execute_dependence_analysis() ()
   from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/liblegion.so.1
#11 0x000020001299aafc in Legion::Internal::InnerContext::process_dependence_stage() ()
   from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/liblegion.so.1
#12 0x000020001299ad0c in Legion::Internal::InnerContext::handle_dependence_stage(void const*) ()
   from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/liblegion.so.1
#13 0x0000200012e6d7d8 in Legion::Internal::Runtime::legion_runtime_task(void const*, unsigned long, void const*, unsigned long, Realm::Processor) () from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/liblegion.so.1
#14 0x000020001447cf30 in Realm::LocalTaskProcessor::execute_task(unsigned int, Realm::ByteArrayRef const&) ()
   from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/librealm.so.1
#15 0x00002000144d9fc0 in Realm::Task::execute_on_processor(Realm::Processor) ()
   from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/librealm.so.1
#16 0x00002000144da184 in Realm::UserThreadTaskScheduler::execute_task(Realm::Task*) ()
   from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/librealm.so.1
#17 0x00002000144dd174 in Realm::ThreadedTaskScheduler::scheduler_loop() ()
   from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/librealm.so.1
#18 0x00002000144e6790 in Realm::UserThread::uthread_entry() ()
   from /ccs/home/jbsimdicd/gpp/ff-gpp-legion-debug/build/deps/legion/lib/librealm.so.1
#19 0x0000200016587ffc in makecontext () from /lib64/power9/libc.so.6
#20 0x0000000000001c3e in ?? ()

FYI, I have only seen the segfault in multi-node configurations so far; specifically, it deterministically fails with 4 or 8 nodes, while it works with 1 or 2 nodes. Another thing to note about our application is that it launches a lot of tasks (64 tasks per GPU, 1024 tasks in total).

Feel free to let me know if you need more info or context.

Most upvoted comments

Just run with valgrind first on a single node and see what it says.

Those backtraces are actually an unrelated bug that has already been fixed. Can you update to the most recent control replication branch?

The minimal reproducer now only requires a single node and far fewer tasks.

Now that you can reproduce it on a single node, can you run that configuration with valgrind and see what it reports? Valgrind will be able to tell us as soon as anything does an illegal memory write. Ignore any warnings that you get from valgrind; just pay attention to the actual errors.

From my experience, you can increase the chance of the segfault by increasing the first argument (loosely, the number of tasks) to a bigger power of 2 (e.g., 128); please make sure it is a power of 2.

That would be consistent with an issue in the transitive reduction. The bigger the graph of instructions to optimize, the more likely it is to walk off the end of some data structure and corrupt the heap.

If you think it is an out-of-bounds access and you're using Legion FieldAccessors for all your accesses, then you can turn on bounds checks. It will slow things down, but it will report any out-of-bounds accesses. If you're getting raw pointers from your accessors, then the analysis is no longer sound.
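
For reference, a minimal sketch of the accessor pattern this advice assumes; the task, region, and field names here are hypothetical, and bounds checking is a compile-time option of the Legion build (check your build system for the exact flag name):

// Hypothetical task body illustrating checked FieldAccessor accesses.
#include "legion.h"
using namespace Legion;

enum FieldIDs { FID_VAL = 0 };   // placeholder field id for this sketch

void example_task(const Task *task,
                  const std::vector<PhysicalRegion> &regions,
                  Context ctx, Runtime *runtime)
{
  // Accesses made through an accessor like this are checked when Legion is
  // built with bounds checks enabled.
  const FieldAccessor<READ_WRITE, double, 1> acc(regions[0], FID_VAL);
  Rect<1> rect = runtime->get_index_space_domain(ctx,
      task->regions[0].region.get_index_space());
  for (PointInRectIterator<1> pir(rect); pir(); pir++)
    acc[*pir] += 1.0;            // checked access

  // By contrast, grabbing a raw pointer (e.g. acc.ptr(rect.lo)) and indexing
  // it yourself bypasses these checks, which is why the analysis is no
  // longer sound in that case.
}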

FWIW, I suspect that an out-of-bounds access from an accessor is not very likely to cause heap corruption (unless you've attached lots of external memory allocations to logical regions), because the part of the heap where instances live is usually separated by a large distance from the rest of the heap (Realm pre-allocates all the space in its memories). Heap corruption of STL data structures more likely comes from your code doing something bad with an STL (or STL-like) data structure somewhere else in the program.
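
As an illustration (not code from this issue), the kind of unrelated STL misuse being described could be as simple as an unsynchronized data race on a shared container; the racy tree rebalancing can scribble over neighboring heap allocations, and the crash then surfaces much later inside an innocent-looking map or set insert, just like the _Rb_tree frame in the backtrace above:

// Sketch of heap corruption caused by a data race on a shared std::map.
#include <map>
#include <thread>

int main()
{
  std::map<int, int> shared;                 // hypothetical shared state
  auto writer = [&shared](int base) {
    for (int i = 0; i < 100000; ++i)
      shared[base + i] = i;                  // concurrent, unsynchronized inserts
  };
  // Two threads mutate the same red-black tree without a lock (undefined
  // behavior); the corruption may only be detected later, elsewhere.
  std::thread t1(writer, 0);
  std::thread t2(writer, 1 << 20);
  t1.join();
  t2.join();
  return 0;
}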

Other than turning on bounds checks (only valid if you aren't getting raw pointers from your accessors), there's not much you can do to look for memory corruption in a multi-node setting, since you can't run valgrind there. I suggest trying to make a minimal reproducer.