cmssw: GsfTracking crash at HLT in run 359686

This issue is related to https://github.com/cms-sw/cmssw/issues/39026: similar crashes happened before, and a fix for them was put in https://github.com/cms-sw/cmssw/pull/39074. It is unclear to me why the crash came back despite the fix.

Some details of the crash are here

With the error stream file, I could reproduce the crash on GPU machines.

The error stream file (.raw), which is available on the Hilton nodes, has also been copied to this location: /eos/cms/store/group/phys_egamma/swmukher/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw

In case it is useful, the file is also available in .root format: /eos/cms/store/group/phys_egamma/swmukher/outputFileGSF.root

The recipe to reproduce the crash on a GPU machine is given below:

cmsrel CMSSW_12_4_9 
cd CMSSW_12_4_9/src 
cmsenv
hltConfigFromDB --runNumber 359686 > hlt.py
cat >> hlt.py <<@EOF
process.source.fileListMode = True
process.source.fileNames = ['file:/store/error_stream/run359686/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw']
@EOF
cmsRun hlt.py 

It crashes with the following messages:

%MSG-i CUDAService:  (NoModuleName) 01-Oct-2022 22:49:54 CEST pre-events
CUDA runtime version 11.5, driver version 11.6, NVIDIA driver version 510.47.03
CUDA device 0: Tesla T4 (sm_75)
CUDA device 1: Tesla T4 (sm_75)
%MSG
.....
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:hltEgammaGsfTracks  01-Oct-2022 22:50:29 CEST Run: 359686 Event: 165691107
KF updated state 3 is invalid. skipping.
%MSG
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:hltEgammaGsfTracks  01-Oct-2022 22:50:29 CEST Run: 359686 Event: 165691107
KF updated state 4 is invalid. skipping.
.....
----- Begin Fatal Exception 01-Oct-2022 22:50:29 CEST-----------------------
An exception of category 'LogicError' occurred while
   [0] Processing  Event run: 359686 lumi: 263 event: 165691107 stream: 0
   [1] Running path 'DST_Run3_PFScoutingPixelTracking_v18'
   [2] Calling method for module GsfTrackProducer/'hltEgammaGsfTracks'
Exception Message:
MultiTrajectoryState mixes states with and without errors
----- End Fatal Exception -------------------------------------------------
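
When several such crashes need to be compared, the key fields of the fatal-exception block above (run, lumi, event, path, module, message) can be pulled out of a cmsRun log with a small script. The following is a minimal sketch, assuming the log follows the layout shown above; the function name `parse_fatal_exception` and the embedded `SAMPLE_LOG` are illustrative and not part of CMSSW.

```python
import re

# Patterns for the fatal-exception block layout shown in the log above.
CATEGORY_RE = re.compile(r"An exception of category '([^']+)'")
EVENT_RE = re.compile(
    r"Processing\s+Event run:\s*(\d+)\s+lumi:\s*(\d+)\s+event:\s*(\d+)"
)
PATH_RE = re.compile(r"Running path '([^']+)'")
MODULE_RE = re.compile(r"Calling method for module (\w+)/'([^']+)'")

# Excerpt of the log from this issue, used as sample input.
SAMPLE_LOG = """\
----- Begin Fatal Exception 01-Oct-2022 22:50:29 CEST-----------------------
An exception of category 'LogicError' occurred while
   [0] Processing  Event run: 359686 lumi: 263 event: 165691107 stream: 0
   [1] Running path 'DST_Run3_PFScoutingPixelTracking_v18'
   [2] Calling method for module GsfTrackProducer/'hltEgammaGsfTracks'
Exception Message:
MultiTrajectoryState mixes states with and without errors
----- End Fatal Exception -------------------------------------------------
"""


def parse_fatal_exception(log_text):
    """Return a dict describing the first fatal-exception block, or None."""
    begin = log_text.find("----- Begin Fatal Exception")
    if begin < 0:
        return None
    end = log_text.find("----- End Fatal Exception", begin)
    block = log_text[begin:end if end >= 0 else len(log_text)]

    info = {}
    match = CATEGORY_RE.search(block)
    if match:
        info["category"] = match.group(1)
    match = EVENT_RE.search(block)
    if match:
        info["run"], info["lumi"], info["event"] = (int(g) for g in match.groups())
    match = PATH_RE.search(block)
    if match:
        info["path"] = match.group(1)
    match = MODULE_RE.search(block)
    if match:
        info["module_type"], info["module_label"] = match.groups()

    # Everything after "Exception Message:" inside the block is the message.
    lines = block.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "Exception Message:":
            info["message"] = "\n".join(
                l.strip() for l in lines[i + 1:] if l.strip()
            )
            break
    return info


if __name__ == "__main__":
    print(parse_fatal_exception(SAMPLE_LOG))
```

Only the first fatal-exception block is parsed, since the job stops at that point in the log shown here.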

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 45 (43 by maintainers)

Most upvoted comments

Thanks. I’ve now managed to check the prompt-reco crash as well, and I confirm that your patch solves it. So I think it makes sense to put in the fix to avoid further crashes.

Your fix was originally meant for a crash with the message "MultiTrajectoryState mixes states with different signs of local p_z", which happened both offline and at HLT and, as far as I know, could not be reproduced. Thus the fix was not put in back then. That crash has not happened at HLT recently. But now the other crash, "MultiTrajectoryState mixes states with and without errors", is also fixed by the same patch. So it looks like these two crashes are related to each other after all?

We surely want to fix the crashes urgently, so let’s go ahead with the PR, but I’ve copied the raw file so that it remains accessible in case we want to do any other checks:
/eos/cms/store/group/phys_egamma/ec/swmukher/gsfcrashfile/00266a70-1a8f-4101-8391-ca99df9b51ca.root

The HLT error_stream files are also there, but none of them reproduce the crash on lxplus-gpu or lxplus (CPU). This generally makes it difficult to debug with HLT error files, as one needs access to one of the P5 GPU machines. The prompt-reco crash raw file should be easier to work with, as anyone can use it for debugging on any ordinary lxplus machine.

At some point it would be great for egamma to understand these strange warnings triggered from BasicTrajectoryState here. It’s unclear to me right now where the missing errors or NaN matrices come from. But given the complexity of the GSF algorithm, it might take a while to fully understand it.
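
To make the failure mode concrete, here is a toy sketch of the kind of consistency check the exception message describes: a mixture of state components in which some carry an error matrix and others do not (or carry one containing NaN). This is not the CMSSW code and not the patch discussed in this thread; the names `Component`, `check_mixture`, and `drop_invalid` are hypothetical and exist only for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Component:
    """One component of a toy Gaussian mixture (hypothetical, not a CMSSW type)."""
    weight: float
    params: List[float]
    cov: Optional[List[List[float]]]  # None means "no error matrix"


def has_valid_errors(c: Component) -> bool:
    """True if the component has an error matrix with only finite entries."""
    if c.cov is None:
        return False
    return all(math.isfinite(x) for row in c.cov for x in row)


def check_mixture(components: List[Component]) -> None:
    """Raise if some components carry an error matrix and others do not."""
    flags = [c.cov is not None for c in components]
    if any(flags) and not all(flags):
        raise ValueError("mixture mixes states with and without errors")


def drop_invalid(components: List[Component]) -> List[Component]:
    """Keep only components whose error matrix is present and finite."""
    return [c for c in components if has_valid_errors(c)]


if __name__ == "__main__":
    good = Component(0.6, [1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]])
    no_err = Component(0.4, [1.1, 2.1], None)
    try:
        check_mixture([good, no_err])
    except ValueError as exc:
        print("rejected:", exc)
    check_mixture(drop_invalid([good, no_err]))  # passes after filtering
```

In this toy picture, filtering invalid components before combining them avoids the mixed-state error, but it does not explain where the missing or NaN matrices originate, which is the open question above.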