legion: [HTR] Control replication violation

This issue comes up with two configurations: 16 nodes with 4 ranks per node, and 16 nodes with 1 rank per node. Both runs use GPUs.

Both configurations fail in different ways in release mode.

In debug mode, both configurations produce the following errors:

[15 - 20009d04f8b0]    7.559467 {5}{runtime}: Detected control replication violation when invoking from_value in task workSingle (UID 31) on shard 15 [Provenance: unknown]. The hash summary for the function does not align with the hash summaries from other call sites. We'll run the hash algorithm again to try to recognize what value differs between the shards, hang tight...
[4 - 20009d04f8b0]    7.575286 {5}{runtime}: [error 607] LEGION ERROR: Specific control replication violation occurred from member future (from file /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/legion/legion_context.cc:13823)

Backtrace:

#0  0x000020000601eb88 in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x000020000601e8bc in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:137
#2  0x0000200003636304 in Realm::realm_freeze (signal=6) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/realm/runtime_impl.cc:206
#3  <signal handler called>
#4  0x0000200005f7fcb0 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#5  0x0000200005f8200c in __GI_abort () at abort.c:90
#6  0x0000200002b4b8b8 in Legion::Internal::Runtime::report_error_message (id=607, file_name=0x2000046be1d0 "/usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/legion/legion_context.cc", line=13823, 
    message=0x20009cf86f68 "Specific control replication violation occurred from member future") at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/legion/runtime.cc:32133
#7  0x0000200001f70c78 in Legion::Internal::ReplicateContext::verify_hash (this=0x2000a800f520, hash=0x20009cf87fe0, description=0x2000046c43b0 "future", provenance=0x0, verify_every_call=true)
    at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/legion/legion_context.cc:13821
#8  0x0000200001fbab54 in Legion::Internal::ReplicateContext::HashVerifier::verify (this=0x20009cf88138, description=0x2000046c43b0 "future", every_call=true)
    at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/legion/legion_context.h:2445
#9  0x0000200001fbaab8 in Legion::Internal::ReplicateContext::HashVerifier::hash (this=0x20009cf88138, value=0x2000b65c2bc0, size=64, description=0x2000046c43b0 "future")
    at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/legion/legion_context.h:2439
#10 0x0000200001f6e6bc in Legion::Internal::ReplicateContext::hash_future (this=0x2000a800f520, hasher=..., safe_level=2, future=..., description=0x2000046c43b0 "future")
    at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/legion/legion_context.cc:13449
#11 0x0000200001f6b3cc in Legion::Internal::ReplicateContext::from_value (this=0x2000a800f520, value=0x2000b65c40a0, size=64, owned=false, provenance=0x0, shard_local=false)
    at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/legion/legion_context.cc:12881
#12 0x0000200001baa030 in Legion::Future::from_untyped_pointer (value=0x2000b65c40a0, value_size=64, owned=false, prov=0x0, shard_local=false) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/legion/legion.cc:2502
#13 0x0000200001bc596c in Legion::LegionSerialization::from_value_helper (value=0x2000b65c40a0, value_size=64) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/legion/legion.inl:69
#14 0x0000200001bef2d0 in Legion::LegionSerialization::NonPODSerializer<Legion::Domain, false, false>::from_value (value=0x2000b65c40a0) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/legion/legion.inl:256
#15 0x0000200001be5f00 in Legion::LegionSerialization::StructHandler<Legion::Domain, true>::from_value (value=0x2000b65c40a0) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/legion/legion.inl:305
#16 0x0000200001bdb51c in Legion::LegionSerialization::from_value<Legion::Domain> (value=0x2000b65c40a0) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/legion/legion.inl:348
#17 0x0000200001bd0a0c in Legion::Future::from_value<Legion::Domain> (value=...) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/legion/legion.inl:19583
Python Exception <type 'exceptions.ValueError'> Cannot find type const std::map<Legion::DomainPoint, Legion::Domain, std::less<Legion::DomainPoint>, std::allocator<std::pair<Legion::DomainPoint const, Legion::Domain> > >::_Rep_type: 
#18 0x0000200001bb3130 in Legion::Runtime::create_partition_by_domain (this=0x1af965e0, ctx=0x2000a800f520, parent=..., domains=std::map with 64 elements, color_space=..., perform_intersections=true, part_kind=LEGION_ALIASED_KIND, 
    color=4294967295, prov=0x0) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/legion/legion.cc:4319
#19 0x0000200001c0d2f4 in legion_index_partition_create_multi_domain_point_coloring (runtime_=..., ctx_=..., parent_=..., color_space_=..., coloring_=..., part_kind=LEGION_ALIASED_KIND, color=4294967295)
    at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/legion/legion_c.cc:1291
#20 0x000000001019fb94 in $<workSingle> () at ...n-latest/gpu-debug/language/terra.build/src/terralib.lua:2169
#21 0x0000000010170de4 in $__regent_task_workSingle_primary () at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/language/src/regent/std_base.t:1218
#22 0x0000200003b387e8 in Realm::LocalTaskProcessor::execute_task (this=0x19981230, func_id=5110, task_args=...) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/realm/proc_impl.cc:1176
#23 0x0000200003874bb4 in Realm::Task::execute_on_processor (this=0x2000a8013c30, p=...) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/realm/tasks.cc:326
#24 0x0000200003879e6c in Realm::KernelThreadTaskScheduler::execute_task (this=0x199815c0, task=0x2000a8013c30) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/realm/tasks.cc:1421
#25 0x0000200003878890 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x199815c0) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/realm/tasks.cc:1160
#26 0x0000200003878f88 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x199815c0) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/realm/tasks.cc:1272
#27 0x0000200003885868 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x199815c0)
    at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/realm/threads.inl:97
#28 0x00002000038330c8 in Realm::KernelThread::pthread_entry (data=0x1f0c3a40) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1826263/legion/runtime/realm/threads.cc:831
#29 0x0000200006168cd4 in start_thread (arg=0x20009d04f8b0) at pthread_create.c:309
#30 0x0000200006067f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104
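
For context on what the check in frames #7 to #11 is reporting: the runtime message says that the hash of the value passed to from_value on one shard does not match the hashes from the other shards. The following is a minimal Python sketch of that idea, not Legion's actual implementation. The names hash_value, find_divergent_shards and pack_rect are hypothetical, SHA-256 is a stand-in for whatever hash Legion uses, and the rectangle values are made up for illustration.

```python
import hashlib
import struct
from collections import Counter
from typing import Dict, List, Tuple


def hash_value(raw: bytes) -> str:
    """Hash the raw bytes of a value, as a stand-in for the runtime's hash."""
    return hashlib.sha256(raw).hexdigest()


def find_divergent_shards(values_by_shard: Dict[int, bytes]) -> List[int]:
    """Return the shard ids whose value hash differs from the majority hash.

    Under control replication every shard is expected to make the same
    call with the same arguments, so every hash should be identical.
    """
    hashes = {shard: hash_value(raw) for shard, raw in values_by_shard.items()}
    majority_hash, _ = Counter(hashes.values()).most_common(1)[0]
    return sorted(s for s, h in hashes.items() if h != majority_hash)


def pack_rect(lo: Tuple[int, int, int], hi: Tuple[int, int, int]) -> bytes:
    """Serialize a 3D rectangle (hypothetical layout: six little-endian int64)."""
    return struct.pack("<6q", *lo, *hi)


# Hypothetical scenario: 16 shards build the same rectangle, except that
# shard 15 computes a different upper bound.
values = {shard: pack_rect((0, 0, 0), (31, 1, 0)) for shard in range(16)}
values[15] = pack_rect((0, 0, 0), (31, 1, 1))

print(find_divergent_shards(values))  # prints [15]
```

In this model, any per-shard difference in the bytes handed to the call (here, one differing bound) is enough to trip the check, even when the partition that results would otherwise look reasonable.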

I think that these problems might be too large to fit on Sapling.

VortexAdvection2D 32x2x1 (16 nodes, 4 ranks per node)

3DPeriodic_Air 32x2x2 (16 nodes, 1 rank per node)

Reproduced on Shepard and Lassen.

About this issue

  • State: open
  • Created 2 months ago
  • Comments: 16 (14 by maintainers)

Most upvoted comments

Right, I added more checking. Just because the checks weren’t there before doesn’t mean that your code was correct.