legion: Realm::TimeLimit floating point exception
Our FleCSI application runs fine on 2 GPUs (1 per rank), but on 3 or 4 GPUs Realm raises a floating point exception. Here is the backtrace:
(gdb) bt
#0 0x00001471c73c5cc1 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1 0x00001471c73cb9c3 in nanosleep () from /lib64/libc.so.6
#2 0x00001471c73cb8da in sleep () from /lib64/libc.so.6
#3 0x00001471cdcb8bea in Realm::realm_freeze(int) ()
from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#4 <signal handler called>
#5 0x00001471cdd063a2 in Realm::Cuda::GPUIndirectXferDes::progress_xd(Realm::Cuda::GPUIndirectChannel*, Realm::TimeLimit) ()
from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#6 0x00001471cdd0e05d in Realm::XDQueue<Realm::Cuda::GPUIndirectChannel, Realm::Cuda::GPUIndirectXferDes>::do_work(Realm::TimeLimit) ()
from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#7 0x00001471cdc17b01 in Realm::BackgroundWorkManager::Worker::do_work(long long, Realm::atomic<bool>*) ()
from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#8 0x00001471cdc174b8 in Realm::BackgroundWorkThread::main_loop() ()
from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#9 0x00001471cdcdff7e in Realm::KernelThread::pthread_entry(void*) ()
from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#10 0x00001471ccf986ea in start_thread () from /lib64/libpthread.so.0
#11 0x00001471c7401a6f in clone () from /lib64/libc.so.6
This run used the following commit:
commit 21500c7e3eb7f123b8e6c3ec2cbf8356febe3989 (HEAD)
Author: Mike <mebauer@cs.stanford.edu>
Date: Fri Feb 23 01:02:04 2024 -0800
legion: fix a bug in the application of remote overwrite physical analyses
Our application works fine with this older commit:
commit 45afa8e658ae06cb19d8f0374de699b7fe4a197c (HEAD)
Merge: 0db333c9d 4dd12470a
Author: Mike <mebauer@cs.stanford.edu>
Date: Mon Jul 31 00:57:19 2023 -0700
legion: merge master into control replication and resolve conflicts
About this issue
- State: closed
- Created 4 months ago
- Comments: 16 (14 by maintainers)
@apryakhin Is there an MR attached to this?
Please ping me directly before merging any changes this week.