cmssw: `Fatal Exception` in Prompt Reco of Run 367232, dataset `JetMET0`
Dear all,
there is one job failing Prompt Reco for run 367232, dataset JetMET0, with a Fatal Exception, as described in https://cms-talk.web.cern.ch/t/fatal-exception-in-prompt-reco-of-run-367232-datatset-jetmet0/23996
The crash seems to originate from the module L1TObjectsTiming:
----- Begin Fatal Exception 12-May-2023 03:17:41 CEST-----------------------
An exception of category 'StdException' occurred while
[0] Processing Event run: 367232 lumi: 190 event: 378449946 stream: 6
[1] Running path 'dqmoffline_1_step'
[2] Calling method for module L1TObjectsTiming/'l1tObjectsTiming'
Exception Message:
A std::exception was thrown.
vector::_M_range_check: __n (which is 9) >= this->size() (which is 5)
----- End Fatal Exception -------------------------------------------------
The exception is reproducible on an lxplus8 node under CMSSW_13_0_5_patch2 (el8_amd64_gcc11).
Full logs and PSet.py can be found at https://eoscmsweb.cern.ch/eos/cms/store/logs/prod/recent/PromptReco/PromptReco_Run367232_JetMET0/Reco/vocms014.cern.ch-415905-3-log.tar.gz
With this modified PSet.py file the crash occurs immediately:
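import FWCore.ParameterSet.Config as cms
import pickle
# Load the pickled process shipped with the job logs
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)
# Run single-threaded and jump straight to the failing event
process.options.numberOfThreads = 1
process.source.skipEvents = cms.untracked.uint32(2683)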
Note that the crash is preceded by these warnings (perhaps related):
%MSG-w L1TStage2uGTTiming: L1TStage2uGTTiming:l1tStage2uGTTiming@streamBeginRun 12-May-2023 08:33:34 CEST Run: 367232 Stream: 0
Algo "L1_SingleJet60er2p5" not found in the trigger menu L1Menu_Collisions2023_v1_0_0. Could not retrieve algo bit number.
%MSG
%MSG-w L1TStage2uGTTiming: L1TStage2uGTTiming:l1tStage2uGTTiming@streamBeginRun 12-May-2023 08:33:34 CEST Run: 367232 Stream: 0
Algo "L1_SingleJet60_FWD3p0" not found in the trigger menu L1Menu_Collisions2023_v1_0_0. Could not retrieve algo bit number.
%MSG
About this issue
- State: open
- Created a year ago
- Comments: 49 (49 by maintainers)
@mmusich @makortel Following what I think was the desired short-term solution, I have created a filter for the GT digis that checks for corruption (for now, this is defined only as an output BX vector whose size differs from a configured size). It also attempts to produce either an empty BX vector if corruption is detected, or an identical BX vector if it is not. The EDFilter returns false when any corruption of this kind is detected, and true when there is none.
@makortel My understanding is that this filter then needs to be inserted after the GT unpacking step, and that anything relying on the GT digis needs to be told to read them from the filter's output instead. Is that correct?
I will attempt to test this on the current problem recipe in any case.
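The filter logic described above can be sketched in plain Python, independent of CMSSW. This is only an illustration under stated assumptions: the function name `filter_gt_digis`, the dictionary-of-lists stand-in for BX vectors, and the per-collection replacement of a corrupt vector with an empty one are all hypothetical and not taken from the actual EDFilter.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a BX vector: one list of objects per bunch crossing.
BXVector = List[List[int]]


def filter_gt_digis(
    collections: Dict[str, BXVector], expected_n_bx: int
) -> Tuple[bool, Dict[str, BXVector]]:
    """Return (passed, outputs) for a set of unpacked GT collections.

    A collection counts as corrupt when its number of bunch crossings
    differs from the configured expected_n_bx (the only corruption
    criterion mentioned in the thread so far).
    """
    outputs: Dict[str, BXVector] = {}
    passed = True
    for name, bxv in collections.items():
        if len(bxv) == expected_n_bx:
            # Healthy collection: forward an identical copy.
            outputs[name] = [list(bx) for bx in bxv]
        else:
            # Corrupt collection: forward an empty vector (assumption:
            # only the offending collection is emptied) and fail the filter.
            outputs[name] = []
            passed = False
    return passed, outputs
```

With an illustrative configured size of 5, a set of collections that all have 5 bunch crossings passes and is copied unchanged, while a set in which one collection has 10 bunch crossings makes the function return `False` and yields an empty vector for that collection only.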
Since this is corrupt data of just the muons in the GT record, I would agree that the best strategy would be to have the GT unpacker detect the error, report it (in a place where a human will eventually notice it; is anyone really checking the LogErrors in the certification workflow?), and produce an empty GT muon collection (no need to fail the other non-muon triggers because of this).
Our approach so far has been to not throw exceptions if the processing can otherwise continue. To me the best option would be for the unpacker to produce a data product that in some way conveys the information that the raw data was corrupt. Then everything downstream can easily ignore it if necessary, and e.g. HLT could even have a filter rejecting the event.
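As a rough illustration of that idea, the plain-Python sketch below has an unpacker record corruption in its output instead of throwing, so that consumers can decide what to do. Everything here is hypothetical: the `UnpackedMuons` type, the `raw_data_corrupt` flag, the word-count check and the placeholder decoding are assumptions for the example, not the real unpacker or data format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UnpackedMuons:
    """Hypothetical data product: payload plus a raw-data status flag."""
    muons: List[float] = field(default_factory=list)
    raw_data_corrupt: bool = False


def unpack(raw_words: List[int], expected_words: int) -> UnpackedMuons:
    # Instead of throwing, flag the corruption and leave the payload empty.
    if len(raw_words) != expected_words:
        return UnpackedMuons(raw_data_corrupt=True)
    # Placeholder decoding, purely illustrative.
    return UnpackedMuons(muons=[word * 0.5 for word in raw_words])


def timing_monitor(product: UnpackedMuons) -> int:
    """Downstream consumer (e.g. a DQM-like module) that ignores corrupt input."""
    if product.raw_data_corrupt:
        return 0
    return len(product.muons)


def reject_corrupt(product: UnpackedMuons) -> bool:
    """Filter-style decision: keep the event only if the raw data was fine."""
    return not product.raw_data_corrupt
```

The point of the design is that no consumer has to index into a payload of unexpected size: a monitoring module can skip the event quietly, while a filter can reject it explicitly.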
While we do have a facility to “skip events for which specific exceptions are thrown”, as discussed in https://github.com/cms-sw/cmssw/issues/41512, it has not been battle-tested, and generally exceptions should not be used for control flow. In principle
should do the job. Based on some experimentation, if the OutputModule consumes the data product from the module that threw the exception, the OutputModule seems to automatically skip the event. If there is no data dependence chain from the throwing module to the OutputModule (which I believe is the case here given the exception coming from a DQM module), the OutputModule needs to be somehow instructed to act based on the Path where the module is. This could be
OutputModule.SelectEvents.SelectEvents or inserting TriggerResultsFilter in front of the OutputModule in the EndPath.
@mmusich I’ll treat this as a priority tomorrow.