cmssw: HLT crash caused by `SiPixelDigisClustersFromSoA` (run 357271)
In run-357271, one HLT job crashed with the following error message:
[2] Calling method for module SiPixelDigisClustersFromSoA/'hltSiPixelClustersFromSoA'
Exception Message:
DetSetVector::inserv called with index already in collection;
index value: 344794116
(the monitoring tool does not provide the full error message from cmsRun
, afaik)
The error is reproducible (see recipe below). Since it originates from the GPU branch of the reconstruction sequence, it can be reproduced only on a machine with a GPU. The input file is currently on lxplus
.
FYI: @fwyzard @silviodonato
cmsrel CMSSW_12_4_6
cd CMSSW_12_4_6/src
cmsenv
hltGetConfiguration \
run:357271 \
--globaltag 124X_dataRun3_HLT_v4 \
--process HLT \
--data \
--unprescale \
--output all \
--input file:/afs/cern.ch/work/m/missirol/public/fog/edm_run357271_ls1351.root \
> hlt.py
cmsRun hlt.py &> hlt.log
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 57 (56 by maintainers)
By the way, the release is ready now: https://github.com/cms-sw/cmssw/releases/tag/CMSSW_12_4_10_patch1
Status as of today: https://github.com/cms-sw/cmssw/pull/39711 is merged, will wait for IBs this evening and then merge the 12_4 PR when completed, and cut a 12_4_10_patch1.
Having tested the configuration posted by @missirol above with a dummy fix in which we sort the digis in
SiPixelDigisClustersFromSoA
by the raw id:everything run smoothly. I don’t see any obvious drawback (the sorting itself takes $0.71 \pm 0.12 \text{ms}$ on a gpu machine at P5). If this makes sense I may quickly open a PR.
Shouldn’t that be
I’m surprised the compiler didn’t warn, modern compilers are generally pretty good at spotting implicit type conversions in printf.
I did a bit of investigation and the code is crashing in https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_6/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc#L114 where
edmNew::DetSetVector<SiPixelCluster>::FastFiller
is constructed. The constructor eventually end up callingedmNew::DetSetVector<T>::addItem
which in turn ends up callingedmNew::dstvdetails::errorIdExists
where an exception is thrown. The reason for the exception is that theedmNew::DetSetVector<SiPixelCluster>::FastFiller
assumes that a single DetId appears only once in a loop. Instead, what I observe in run: 357271 lumi: 1351 event: 2627443577 are DetIds which are not contiguous in the loop over digis inSiPixelDigisClustersFromSoA
. Here is the printoutobtained by adding the following changes to
SiPixelDigisClustersFromSoA
The root cause of the problem therefore seems to be somewhere else, presumably in either
SiPixelDigisSoAFromCUDA
orSiPixelRawToClusterCUDA
. @VinInn or @AdrianoDee might have a better idea what could be the cause.