cmssw: HLT crashes on GPU and CPU in collision runs
Dear experts,
During the week of June 13-20, the following three types of HLT crashes occurred in collision runs. HLT was running CMSSW_12_3_5.
type 1
cmsRun: /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_5-slc7_amd64_gcc10/build/CMSSW_12_3_5-build/tmp/BUILDROOT/32f4c0d8c5d5ff0fb0f1b58023d4424d/opt/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/src/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h:293: void GPUCACell::find_ntuplets(const Hits&, GPUCACell*, GPUCACell::CellTracksVector&, GPUCACell::HitContainer&, cms::cuda::AtomicPairCounter&, GPUCACell::Quality*, GPUCACell::TmpTuple&, unsigned int, bool) const [with int DEPTH = 2; GPUCACell::Hits = TrackingRecHit2DSOAView; GPUCACell::CellTracksVector = cms::cuda::SimpleVector<cms::cuda::VecArray; GPUCACell::HitContainer = cms::cuda::OneToManyAssoc; GPUCACell::Quality = pixelTrack::Quality; GPUCACell::TmpTuple = cms::cuda::VecArray]: Assertion `tmpNtuplet.size() <= 4' failed.
A fatal system signal has occurred: abort signal
This crash happened on June 13th, during stable beams with collisions at 900 GeV, in run 353709, on a CPU node (fu-c2a05-35-01). Elog: http://cmsonline.cern.ch/cms-elog/1143438. Full crash report: https://swmukher.web.cern.ch/swmukher/hltcrash_June13_StableBeam.txt
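For context, here is a minimal sketch of the pattern behind this assertion, in plain C++. This is not the actual CMSSW code: Cell, findNtuplets, outerNeighbors and the plain std::vector are stand-ins for GPUCACell, find_ntuplets and the CUDA containers, and the assumption that each cell is a hit doublet (so a sextuplet of hits corresponds to a chain of 5 cells) is mine.

#include <cassert>
#include <vector>

// Each "cell" stands for a hit doublet; outerNeighbors lists the compatible
// cells it can be chained with during the cellular-automaton track seeding.
struct Cell {
  std::vector<int> outerNeighbors;
};

// Walk every chain starting from cellIndex, keeping the current chain in
// tmpNtuplet (the role played by GPUCACell::TmpTuple in the real code).
void findNtuplets(std::vector<Cell> const& cells, int cellIndex,
                  std::vector<int>& tmpNtuplet) {
  tmpNtuplet.push_back(cellIndex);
  // The bound that failed online: the chain is expected to contain at most
  // 4 cells (up to 5 hits); a sextuplet means 5 cells and trips the assert.
  assert(tmpNtuplet.size() <= 4);
  for (int next : cells[cellIndex].outerNeighbors) {
    findNtuplets(cells, next, tmpNtuplet);
  }
  tmpNtuplet.pop_back();
}

int main() {
  // A linear chain of 5 connected cells (6 hits, i.e. a sextuplet) reproduces
  // the "Assertion `tmpNtuplet.size() <= 4' failed" abort.
  std::vector<Cell> cells(5);
  for (int i = 0; i < 4; ++i) {
    cells[i].outerNeighbors.push_back(i + 1);
  }
  std::vector<int> tmpNtuplet;
  findNtuplets(cells, 0, tmpNtuplet);
}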
type 2
Current Modules:
Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: none
Module: PathStatusInserter:Dataset_ExpressPhysics
Module: EcalRawToDigi:hltEcalDigisLegacy
A fatal system signal has occurred: segmentation violation
Current Modules:
Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: CAHitNtupletCUDA:hltPixelTracksCPU
Module: none
Module: none
A fatal system signal has occurred: segmentation violation
Current Modules:
Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: none
Module: none
Module: HcalCPURecHitsProducer:hltHbherecoFromGPU
A fatal system signal has occurred: segmentation violation
This type of crash happened on GPU nodes (for example, fu-c2a02-35-01), during collision runs when no real collisions were ongoing: on June 14th (run 353744, with the Pixel subdetector out) and on June 18th (runs 353932, 353935 and 353941, with the Pixel and Tracker subdetectors out).
type 3
[2] Prefetching for module MeasurementTrackerEventProducer/'hltSiStripClusters'
[3] Prefetching for module SiPixelDigiErrorsFromSoA/'hltSiPixelDigisFromSoA'
[4] Calling method for module SiPixelDigiErrorsSoAFromCUDA/'hltSiPixelDigiErrorsSoA'
Exception Message:
A std::exception was thrown.
cannot create std::vector larger than max_size()
This happened on fu-c2a02-39-01 (a GPU node), in collision run 353941 (Pixel and Tracker subdetectors out), while no real collisions were ongoing.
The causes of crashes (2) and (3) might well be related. Relevant elog for (2) and (3): http://cmsonline.cern.ch/cms-elog/1143515
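For what it's worth, the message in (3) is the std::length_error that libstdc++'s std::vector raises when asked to construct more elements than max_size() allows, which typically points to a nonsensical size read from corrupted or uninitialized memory. Below is a minimal sketch; the corruptedSize value is invented for illustration, and whether this is what actually happens inside hltSiPixelDigiErrorsSoA when the Pixel detector provides no data is only a guess.

#include <cstdint>
#include <exception>
#include <iostream>
#include <vector>

int main() {
  // A bogus element count, e.g. as it might be read from corrupted or
  // uninitialized memory; the value here is made up purely for illustration.
  std::uint64_t corruptedSize = 0xffffffffffff0000ull;
  try {
    // With libstdc++, constructing a vector with more elements than
    // max_size() throws std::length_error carrying exactly this message.
    std::vector<int> v(corruptedSize);
  } catch (std::exception const& e) {
    std::cout << e.what() << '\n';  // cannot create std::vector larger than max_size()
  }
  return 0;
}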
Regards, Swagata (HLT DOC during June 13-20).
I think this is indeed the explanation [*]. Case closed, and sorry again for the noise.
[*] I checked this by keeping the large number of printouts, but also adding
and the program crashed 10 out of 10 times on the GPU (running only on the event in question), meaning that each time a sextuplet was formed on the GPU.
Thanks for having a look.
I checked that (unsurprisingly) the HLT runs fine on these ‘error events’, for both CPU and GPU, after changing the 4 to a 5 in the asserts, so in the meantime I’ll open PRs with that change to gain time.
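As a rough sketch of what that interim change amounts to (checkChainLength is a hypothetical stand-in; the actual asserts sit in GPUCACell.h and its CPU counterpart, and the real patch may differ):

#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the chain-length asserts: the bound is relaxed
// from 4 to 5 cells so that sextuplets (6 hits) no longer abort the job.
void checkChainLength(std::size_t nCellsInChain) {
  // was: assert(nCellsInChain <= 4);
  assert(nCellsInChain <= 5);
}

int main() {
  checkChainLength(5);  // a sextuplet: 5 cells / 6 hits, now accepted
  return 0;
}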
@swagata87 thank you for providing these instructions!
@tsusa you can use the online GPU machines to reproduce the issue:
In my test the problem did not happen every time; I had to run the job a few times before it crashed:
It eventually crashed, though I’m not 100% sure whether it was due to the same problem 😕