legion: Realm::TimeLimit floating point exception

Our FleCSI application runs fine on 2 GPUs (1 per rank), but on 3 and 4 GPUs Realm throws a floating point exception. Here is the backtrace:

(gdb) bt
#0  0x00001471c73c5cc1 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1  0x00001471c73cb9c3 in nanosleep () from /lib64/libc.so.6
#2  0x00001471c73cb8da in sleep () from /lib64/libc.so.6
#3  0x00001471cdcb8bea in Realm::realm_freeze(int) ()
   from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#4  <signal handler called>
#5  0x00001471cdd063a2 in Realm::Cuda::GPUIndirectXferDes::progress_xd(Realm::Cuda::GPUIndirectChannel*, Realm::TimeLimit) ()
   from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#6  0x00001471cdd0e05d in Realm::XDQueue<Realm::Cuda::GPUIndirectChannel, Realm::Cuda::GPUIndirectXferDes>::do_work(Realm::TimeLimit) ()
   from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#7  0x00001471cdc17b01 in Realm::BackgroundWorkManager::Worker::do_work(long long, Realm::atomic<bool>*) ()
   from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#8  0x00001471cdc174b8 in Realm::BackgroundWorkThread::main_loop() ()
   from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#9  0x00001471cdcdff7e in Realm::KernelThread::pthread_entry(void*) ()
   from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#10 0x00001471ccf986ea in start_thread () from /lib64/libpthread.so.0
#11 0x00001471c7401a6f in clone () from /lib64/libc.so.6

This run used the following commit:

commit 21500c7e3eb7f123b8e6c3ec2cbf8356febe3989 (HEAD)
Author: Mike <mebauer@cs.stanford.edu>
Date:   Fri Feb 23 01:02:04 2024 -0800
 
    legion: fix a bug in the application of remote overwrite physical analyses

Our application works fine with this older commit:

commit 45afa8e658ae06cb19d8f0374de699b7fe4a197c (HEAD)
Merge: 0db333c9d 4dd12470a
Author: Mike <mebauer@cs.stanford.edu>
Date:   Mon Jul 31 00:57:19 2023 -0700
 
    legion: merge master into control replication and resolve conflicts

About this issue

  • State: closed
  • Created 4 months ago
  • Comments: 16 (14 by maintainers)

Most upvoted comments

@apryakhin Is there an MR attached to this?

Please ping me directly before merging any changes this week.