cmssw: [GPU] Multiple RelVals failing with memory allocation error

Hello,

There are multiple RelVals failing with the following exception in GPU IBs:

----- Begin Fatal Exception 05-Feb-2024 04:19:50 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 366727 lumi: 89 event: 131642946 stream: 3
   [1] Running path 'MC_Run3_PFScoutingPixelTracking_v22'
   [2] Calling method for module HBHERecHitProducerGPU/'hltHbherecoGPU'
Exception Message:
A std::exception was thrown.

/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/5569e690981e3c5d49d7743adaadedca/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_GPU_X_2024-02-04-2300/src/HeterogeneousCore/CUDAUtilities/src/CachingDeviceAllocator.h, line 489:
cudaCheck(error = cudaMalloc(&search_key.d_ptr, search_key.bytes));
cudaErrorMemoryAllocation: out of memory
----- End Fatal Exception -------------------------------------------------

It seems to be caused by the modifications in https://github.com/cms-sw/cmssw/pull/43804.

FYI, @iarspider

Thanks, Andrea

About this issue

  • State: open
  • Created 5 months ago
  • Comments: 21 (20 by maintainers)

Most upvoted comments

type tracking (even though the association is not strong, it does look related to the pixel tracking Alpaka migration)

assign heterogeneous, hlt