cmssw: HLT crash caused by `SiPixelDigisClustersFromSoA` (run 357271)
In run-357271, one HLT job crashed with the following error message:
[2] Calling method for module SiPixelDigisClustersFromSoA/'hltSiPixelClustersFromSoA'
Exception Message:
DetSetVector::inserv called with index already in collection;
index value: 344794116
(the monitoring tool does not provide the full error message from cmsRun, afaik)
The error is reproducible (see recipe below). Since it originates from the GPU branch of the reconstruction sequence, it can be reproduced only on a machine with a GPU. The input file is currently on lxplus.
FYI: @fwyzard @silviodonato
cmsrel CMSSW_12_4_6
cd CMSSW_12_4_6/src
cmsenv
hltGetConfiguration \
run:357271 \
--globaltag 124X_dataRun3_HLT_v4 \
--process HLT \
--data \
--unprescale \
--output all \
--input file:/afs/cern.ch/work/m/missirol/public/fog/edm_run357271_ls1351.root \
> hlt.py
cmsRun hlt.py &> hlt.log
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 57 (56 by maintainers)
By the way, the release is ready now: https://github.com/cms-sw/cmssw/releases/tag/CMSSW_12_4_10_patch1
Status as of today: https://github.com/cms-sw/cmssw/pull/39711 is merged, will wait for IBs this evening and then merge the 12_4 PR when completed, and cut a 12_4_10_patch1.
Having tested the configuration posted by @missirol above with a dummy fix in which we sort the digis in
SiPixelDigisClustersFromSoAby the raw id:everything run smoothly. I don’t see any obvious drawback (the sorting itself takes $0.71 \pm 0.12 \text{ms}$ on a gpu machine at P5). If this makes sense I may quickly open a PR.
Shouldn’t that be
I’m surprised the compiler didn’t warn, modern compilers are generally pretty good at spotting implicit type conversions in printf.
I did a bit of investigation and the code is crashing in https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_6/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc#L114 where
edmNew::DetSetVector<SiPixelCluster>::FastFilleris constructed. The constructor eventually end up callingedmNew::DetSetVector<T>::addItemwhich in turn ends up callingedmNew::dstvdetails::errorIdExistswhere an exception is thrown. The reason for the exception is that theedmNew::DetSetVector<SiPixelCluster>::FastFillerassumes that a single DetId appears only once in a loop. Instead, what I observe in run: 357271 lumi: 1351 event: 2627443577 are DetIds which are not contiguous in the loop over digis inSiPixelDigisClustersFromSoA. Here is the printoutobtained by adding the following changes to
SiPixelDigisClustersFromSoAThe root cause of the problem therefore seems to be somewhere else, presumably in either
SiPixelDigisSoAFromCUDAorSiPixelRawToClusterCUDA. @VinInn or @AdrianoDee might have a better idea what could be the cause.