cmssw: HLT crashes in GPU and CPU in collision runs

Dear experts,

During the week of June 13-20, the following three types of HLT crashes occurred in collision runs. HLT was running CMSSW_12_3_5.

  1. type 1
cmsRun: /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_5-slc7_amd64_gcc10/build/CMSSW_12_3_5-build/tmp/BUILDROOT/32f4c0d8c5d5ff0fb0f1b58023d4424d/opt/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/src/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h:293: void GPUCACell::find_ntuplets(const Hits&, GPUCACell*, GPUCACell::CellTracksVector&, GPUCACell::HitContainer&, cms::cuda::AtomicPairCounter&, GPUCACell::Quality*, GPUCACell::TmpTuple&, unsigned int, bool) const [with int DEPTH = 2; GPUCACell::Hits = TrackingRecHit2DSOAView; GPUCACell::CellTracksVector = cms::cuda::SimpleVector<cms::cuda::VecArray; GPUCACell::HitContainer = cms::cuda::OneToManyAssoc; GPUCACell::Quality = pixelTrack::Quality; GPUCACell::TmpTuple = cms::cuda::VecArray]: Assertion `tmpNtuplet.size() <= 4' failed.


A fatal system signal has occurred: abort signal

This crash happened on June 13th, during stable beams, with collisions at 900 GeV (run number 353709). The crash happened on a CPU node (fu-c2a05-35-01). Elog: http://cmsonline.cern.ch/cms-elog/1143438. Full crash report: https://swmukher.web.cern.ch/swmukher/hltcrash_June13_StableBeam.txt

  2. type 2
Current Modules:

Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: none
Module: PathStatusInserter:Dataset_ExpressPhysics
Module: EcalRawToDigi:hltEcalDigisLegacy

A fatal system signal has occurred: segmentation violation
Current Modules:

Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: CAHitNtupletCUDA:hltPixelTracksCPU
Module: none
Module: none

A fatal system signal has occurred: segmentation violation
Current Modules:

Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: none
Module: none
Module: HcalCPURecHitsProducer:hltHbherecoFromGPU

A fatal system signal has occurred: segmentation violation

This type of crash happened on GPU nodes (for example fu-c2a02-35-01). It occurred during collision runs while no real collisions were ongoing: on June 14th (run 353744, Pixel subdetector out), and on June 18th (runs 353932, 353935, 353941, Pixel and tracker subdetectors out).

  3. type 3
[2] Prefetching for module MeasurementTrackerEventProducer/'hltSiStripClusters'
[3] Prefetching for module SiPixelDigiErrorsFromSoA/'hltSiPixelDigisFromSoA'
[4] Calling method for module SiPixelDigiErrorsSoAFromCUDA/'hltSiPixelDigiErrorsSoA'
Exception Message:
A std::exception was thrown.
cannot create std::vector larger than max_size()

This crash happened on fu-c2a02-39-01 (a GPU node), in collision run 353941 (Pixel and tracker subdetectors were out); no real collisions were ongoing.

The causes of crashes (2) and (3) might well be related. Relevant elog for (2) and (3): http://cmsonline.cern.ch/cms-elog/1143515

Regards, Swagata (HLT DOC during June 13-20).

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 50 (50 by maintainers)

Most upvoted comments

btw: printf from GPU is not guaranteed to appear if there are too many.

I think this is indeed the explanation [*]. Case closed, and sorry again for the noise.

[*] I checked this by keeping the large number of printouts, but also adding

#ifdef __CUDACC__
    if (tmpNtuplet.size() > 4) {
      __trap();
    }
#endif

and the program crashed 10/10 times on GPU (running only on the event in question), meaning each time there was a sextuplet on GPU.

Thanks for having a look.
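For reference on the printf remark above: device-side printf writes into a fixed-size FIFO that is only transferred to the host at synchronization points, and once the FIFO is full, further output is silently dropped. A minimal sketch of enlarging that buffer (hypothetical kernel name; requires a CUDA-capable device and nvcc, shown for illustration only):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread prints one line. With enough threads,
// the default device printf FIFO fills up and later lines are dropped.
__global__ void chatty() {
  printf("thread %d\n", blockIdx.x * blockDim.x + threadIdx.x);
}

int main() {
  // Enlarge the device printf FIFO before the first kernel launch
  // (cudaLimitPrintfFifoSize defaults to 1 MB).
  cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 64 * 1024 * 1024);
  chatty<<<1024, 256>>>();
  // Device printf output only reaches the host at a synchronization point.
  cudaDeviceSynchronize();
  return 0;
}
```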

I checked that (unsurprisingly) the HLT runs fine on these ‘error events’, on both CPU and GPU, after changing the 4 to a 5 in the asserts, so in the meantime I’ll open PRs with that change to buy time.

@swagata87 thank you for providing these instructions !

@tsusa you can use the online GPU machines to reproduce the issue:

ssh gpu-c2a02-39-01.cms
mkdir -p /data/$USER
cd /data/$USER
source /data/cmssw/cmsset_default.sh
cmsrel CMSSW_12_3_5
cd CMSSW_12_3_5
mkdir run
cd run
cp ~hltpro/error/hlt_error_run353941.py .
cmsRun hlt_error_run353941.py

In my test the problem did not happen every time; I had to run the job a few times before it crashed:

while cmsRun hlt_error_run353941.py; do clear; rm -rf output; done

It eventually crashed, though I’m not 100% sure it was due to the same problem 😕
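For what it’s worth, the retry loop above keeps iterating as long as cmsRun exits with status 0 and stops at the first failure. A self-contained sketch of that control flow, with a stand-in function (flaky_job, hypothetical) in place of the real cmsRun invocation:

```shell
# flaky_job stands in for 'cmsRun hlt_error_run353941.py': it succeeds
# on the first two runs and fails (non-zero exit status) on the third.
attempt=0
flaky_job() {
  attempt=$((attempt + 1))
  [ "$attempt" -le 2 ]
}

# 'while CMD; do ...; done' loops while CMD succeeds, so the body
# (cleanup between runs) executes only after successful runs and the
# loop exits at the first crash.
successes=0
while flaky_job; do
  successes=$((successes + 1))
done
echo "job succeeded $successes times before failing on run $attempt"
```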