cmssw: ECAL unpacker: HLT hangs with CMSSW_12_3_0_pre6 when running on GPUs
During a high statistics stress test running the HLT over 2018 data on GPUs with CMSSW_12_3_0_pre6, some of the jobs are getting stuck.
The original configuration from one of the stuck jobs can be extracted from config.zip. The fully expanded python configuration is available in config.py.
I’ve copied the configuration and input file and tried running locally, and the job does get stuck after about 10800 events:
23-Mar-2022 13:13:12 CET Initiating request to open file file:d0a3d5b6-9ab2-436b-bc18-f7a77ebbdb37.root
23-Mar-2022 13:13:12 CET Successfully opened file file:d0a3d5b6-9ab2-436b-bc18-f7a77ebbdb37.root
2022-03-23 13:13:18.962097: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operation
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
PersistencyIO INFO +++ Set Streamer to dd4hep::OpaqueDataBlock
DD4hep WARN ++ Using globally Geant4 unit system (mm,ns,MeV)
DD4CMS INFO +++ Processing the CMS detector description xml-memory-buffer
Detector INFO *********** Created World volume with size: 101000 101000 450000
Detector INFO +++ Patching names of anonymous shapes....
DDDefinition INFO +++ Finished processing xml-memory-buffer
Begin processing the 1st record. Run 346455, Event 885899, LumiSection 1 on stream 5 at 23-Mar-2022 13:13:48.511 CET
Begin processing the 2nd record. Run 346455, Event 643725, LumiSection 1 on stream 0 at 23-Mar-2022 13:13:48.512 CET
Begin processing the 3rd record. Run 346455, Event 388872, LumiSection 1 on stream 6 at 23-Mar-2022 13:13:48.513 CET
...
Begin processing the 10866th record. Run 346455, Event 9913356, LumiSection 11 on stream 7 at 23-Mar-2022 13:20:13.087 CET
Begin processing the 10867th record. Run 346455, Event 9789461, LumiSection 11 on stream 6 at 23-Mar-2022 13:20:13.099 CET
Begin processing the 10868th record. Run 346455, Event 9955596, LumiSection 11 on stream 1 at 23-Mar-2022 13:20:13.128 CET
Inspecting the stuck job with GDB hints at a possible deadlock between cudaEventRecord() (called by the host caching allocator) and cudaFreeHost() (called by the ECAL code, which does not use the caching allocator).
Thread 1 (Thread 0x7f943984c440 (LWP 2355091) "cmsRun"):
#0 0x00007f943a6b03f5 in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1 0x00007f941dac1968 in ?? () from /lib64/libcuda.so.1
#2 0x00007f941d80ebf7 in ?? () from /lib64/libcuda.so.1
#3 0x00007f941d8a5093 in ?? () from /lib64/libcuda.so.1
#4 0x00007f94267ae1a8 in ?? () from /data/cmssw/cs8_amd64_gcc10/cms/cmssw/CMSSW_12_3_0_pre6/external/cs8_amd64_gcc10/lib/libcudart.so.11.0
#5 0x00007f942680235b in cudaEventRecord () from /data/cmssw/cs8_amd64_gcc10/cms/cmssw/CMSSW_12_3_0_pre6/external/cs8_amd64_gcc10/lib/libcudart.so.11.0
#6 0x00007f9435ae09f5 in notcub::CachingHostAllocator::HostFree(void*) () from /data/cmssw/cs8_amd64_gcc10/cms/cmssw/CMSSW_12_3_0_pre6/lib/cs8_amd64_gcc10/libHeterogeneousCoreCUDAUtilities.so
#7 0x00007f9435adf81a in cms::cuda::free_host(void*) () from /data/cmssw/cs8_amd64_gcc10/cms/cmssw/CMSSW_12_3_0_pre6/lib/cs8_amd64_gcc10/libHeterogeneousCoreCUDAUtilities.so
#8 0x00007f93c97e2b78 in edm::Wrapper<HeterogeneousSoA<TrackSoAHeterogeneousT<32768> > >::~Wrapper() () from /data/cmssw/cs8_amd64_gcc10/cms/cmssw/CMSSW_12_3_0_pre6/lib/cs8_amd64_gcc10/pluginRecoPixelVertexingPixelTrackFittingPlugins.so
...
Thread 10 (Thread 0x7f93c3dfe700 (LWP 2355165) "cmsRun"):
#0 0x00007f941d9c6270 in ?? () from /lib64/libcuda.so.1
#1 0x00007f941d85a43f in ?? () from /lib64/libcuda.so.1
#2 0x00007f941dad5652 in ?? () from /lib64/libcuda.so.1
#3 0x00007f941dac15e6 in ?? () from /lib64/libcuda.so.1
#4 0x00007f941dac28b1 in ?? () from /lib64/libcuda.so.1
#5 0x00007f941d94d667 in ?? () from /lib64/libcuda.so.1
#6 0x00007f941dad59d9 in ?? () from /lib64/libcuda.so.1
#7 0x00007f941d820590 in ?? () from /lib64/libcuda.so.1
#8 0x00007f941d8d9cb5 in ?? () from /lib64/libcuda.so.1
#9 0x00007f94267dbcfd in ?? () from /data/cmssw/cs8_amd64_gcc10/cms/cmssw/CMSSW_12_3_0_pre6/external/cs8_amd64_gcc10/lib/libcudart.so.11.0
#10 0x00007f94267b01ea in ?? () from /data/cmssw/cs8_amd64_gcc10/cms/cmssw/CMSSW_12_3_0_pre6/external/cs8_amd64_gcc10/lib/libcudart.so.11.0
#11 0x00007f94267e8d70 in cudaFreeHost () from /data/cmssw/cs8_amd64_gcc10/cms/cmssw/CMSSW_12_3_0_pre6/external/cs8_amd64_gcc10/lib/libcudart.so.11.0
#12 0x00007f93da8aeb7a in ecal::UncalibratedRecHit<calo::common::VecStoragePolicy<calo::common::CUDAHostAllocatorAlias> >::~UncalibratedRecHit() () from /data/cmssw/cs8_amd64_gcc10/cms/cmssw/CMSSW_12_3_0_pre6/lib/cs8_amd64_gcc10/pluginRecoLocalCaloEcalRecProducersPlugins.so
#13 0x00007f93da8aee0c in edm::Wrapper<ecal::UncalibratedRecHit<calo::common::VecStoragePolicy<calo::common::CUDAHostAllocatorAlias> > >::~Wrapper() () from /data/cmssw/cs8_amd64_gcc10/cms/cmssw/CMSSW_12_3_0_pre6/lib/cs8_amd64_gcc10/pluginRecoLocalCaloEcalRecProducersPlugins.so
...
The full stack trace is attached: backtrace01.txt.
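For reference, the call pattern suggested by the two backtraces can be distilled into the stand-alone sketch below. It is not CMSSW code and not a guaranteed reproducer, just an illustration of the suspected interaction: one thread keeps calling cudaEventRecord() on the free path of a caching allocator while another thread frees non-cached pinned memory with cudaFreeHost(), with both calls taking locks inside the CUDA runtime/driver.

```cpp
// Stand-alone illustration (hypothetical, not from CMSSW) of the two concurrent
// call patterns seen in backtrace01.txt.
#include <cstdio>
#include <cstdlib>
#include <thread>

#include <cuda_runtime.h>

static void check(cudaError_t status, const char* what) {
  if (status != cudaSuccess) {
    std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(status));
    std::exit(EXIT_FAILURE);
  }
}

int main() {
  cudaStream_t stream;
  check(cudaStreamCreate(&stream), "cudaStreamCreate");

  // Caching-allocator-like path: when a pinned block is returned to the cache,
  // an event is recorded on the associated stream so the block can only be
  // reused after the work in flight has completed (thread 1 in the backtrace).
  std::thread cachingFree([&] {
    void* block = nullptr;
    cudaEvent_t ready;
    check(cudaMallocHost(&block, 1 << 20), "cudaMallocHost");
    check(cudaEventCreateWithFlags(&ready, cudaEventDisableTiming), "cudaEventCreateWithFlags");
    for (int i = 0; i < 10000; ++i) {
      check(cudaEventRecord(ready, stream), "cudaEventRecord");
    }
    check(cudaEventDestroy(ready), "cudaEventDestroy");
    check(cudaFreeHost(block), "cudaFreeHost");
  });

  // Non-caching path: the ECAL product destructor frees its pinned buffers
  // directly with cudaFreeHost (thread 10 in the backtrace).
  std::thread directFree([&] {
    for (int i = 0; i < 10000; ++i) {
      void* buffer = nullptr;
      check(cudaMallocHost(&buffer, 1 << 20), "cudaMallocHost");
      check(cudaFreeHost(buffer), "cudaFreeHost");
    }
  });

  cachingFree.join();
  directFree.join();
  check(cudaStreamDestroy(stream), "cudaStreamDestroy");
  return 0;
}
```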
To investigate this hypothesis, I’ve modified the ECAL and HCAL data formats to use the CachingHostAllocator:
diff --git a/HeterogeneousCore/CUDAUtilities/interface/HostAllocator.h b/HeterogeneousCore/CUDAUtilities/interface/HostAllocator.h
index 142b0c354686..f735950949a5 100644
--- a/HeterogeneousCore/CUDAUtilities/interface/HostAllocator.h
+++ b/HeterogeneousCore/CUDAUtilities/interface/HostAllocator.h
@@ -6,6 +6,7 @@
 #include <cuda_runtime.h>
 
 #include "FWCore/Utilities/interface/thread_safety_macros.h"
+#include "HeterogeneousCore/CUDAUtilities/interface/allocate_host.h"
 
 namespace cms {
   namespace cuda {
@@ -32,22 +33,15 @@ namespace cms {
       CMS_THREAD_SAFE T* allocate(std::size_t n) const __attribute__((warn_unused_result)) __attribute__((malloc))
           __attribute__((returns_nonnull)) {
-        void* ptr = nullptr;
-        cudaError_t status = cudaMallocHost(&ptr, n * sizeof(T), FLAGS);
-        if (status != cudaSuccess) {
-          throw bad_alloc(status);
-        }
+        void* ptr = allocate_host(n * sizeof(T), cudaStreamDefault);
         if (ptr == nullptr) {
-          throw std::bad_alloc();
+          throw bad_alloc(cudaErrorMemoryAllocation);
         }
         return static_cast<T*>(ptr);
       }
 
       void deallocate(T* p, std::size_t n) const {
-        cudaError_t status = cudaFreeHost(p);
-        if (status != cudaSuccess) {
-          throw bad_alloc(status);
-        }
+        free_host((void*) p);
       }
     };
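For context, HostAllocator is a standard-library-style allocator; the backtrace above shows the ECAL data formats reaching it through calo::common::CUDAHostAllocatorAlias inside VecStoragePolicy. The sketch below (the alias and function are illustrative, not CMSSW code) shows what the patch changes for any container backed by this allocator:

```cpp
#include <vector>

#include "HeterogeneousCore/CUDAUtilities/interface/HostAllocator.h"

// Illustrative pinned-host vector, analogous to the vectors held by the
// ECAL/HCAL SoA data formats.
template <typename T>
using PinnedHostVector = std::vector<T, cms::cuda::HostAllocator<T>>;

void example() {
  PinnedHostVector<float> amplitudes;
  // With the patch above, this allocation is served by allocate_host() from the
  // caching host allocator instead of a direct cudaMallocHost() call ...
  amplitudes.resize(4096);
}  // ... and the destructor returns the memory via free_host() instead of cudaFreeHost().
```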
With these changes the job still gets stuck, around event 10880, but the stack trace no longer contains any CUDA frames: backtrace02.txt. All threads are now waiting:
- 3 TensorFlow worker threads in `pthread_cond_wait`
- 7 TBB worker threads in `syscall`, inside `tbb::detail::r1::futex_wait`
- 1 TBB thread in `tbb::detail::r1::task_dispatcher::local_wait_for_all`
- 2 CUDA event handler threads in `poll`
- the stack trace helper thread in `read`, inside `full_read.constprop` called from `edm::service::InitRootHandlers::stacktraceHelperThread()`
- one last thread in `do_futex_wait.constprop`
Some more cross-checks:
- re-running the job in the original CMSSW_12_3_0_pre6 release reproducibly hangs at the same event
- re-running the job with the `HostAllocator` changes reproducibly hangs at a nearby event
- re-running the job without GPUs processes the whole 50k input events without getting stuck
Here are the files stallMonitor.log and stuck.log.