cmssw: Segmentation violation in PromptReco for FastjetJetProducer:ak4PFJets
There is one job failing Reco for Run 366451, dataset ParkingDoubleElectronLowMass, with a segmentation violation, as described in https://cms-talk.web.cern.ch/t/segmentation-error-in-promptreco-for-run-366451-dataset-parkingdoubleelectronlowmass/23152
The crash seems to be from module FastjetJetProducer:
%MSG-w TrackProducerBase: TrackRefitter:hltTrackRefitterForSiStripMonitorTrack 24-Apr-2023 18:58:38 CEST Run: 366451 Event: 418574346
BeamSpot is not valid
%MSG
%MSG-e TrackRefitter: TrackRefitter:hltTrackRefitterForSiStripMonitorTrack 24-Apr-2023 18:58:38 CEST Run: 366451 Event: 418574346
BeamSpot is (0,0,0), it is probably because is not valid in the event
%MSG
A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.
...
Current Modules:
Module: FastjetJetProducer:ak4PFJets (crashed)
Module: MultiHitFromChi2EDProducer:pixelLessStepHitTriplets
Module: PFClusterProducer:particleFlowClusterHBHE
Module: RecHitTask:recHitTask
Module: TrackProducer:mixedTripletStepTracks
Module: MuonIdProducer:muons1stStep
Module: TrackProducer:initialStepTracks
Module: CAHitQuadrupletEDProducer:detachedQuadStepHitQuadruplets
A fatal system signal has occurred: segmentation violation
The full log is at /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2023B/job_248341/job/WMTaskSpace/cmsRun1 as described in the original email.
I was able to reproduce the failure locally.
About this issue
- State: closed
- Created a year ago
- Comments: 40 (40 by maintainers)
for the record, on an lxplus8 node, using the recipe above and a slightly modified PSet, it will segfault consistently at the first event processed.
yes, there was a similar finding last year, which caused the photon isolation to be NaN when a bad PF candidate ended up in the photon isolation cone. A preliminary fix was to loop over the PF candidate collection, check for NaN, remove the affected candidates, and build a pfCandNoNaN collection, which was then passed on to the isolation calculation. This is where it was done: https://github.com/cms-sw/cmssw/pull/39120/files
maybe something similar can be done for jet/met if this is easier and quicker to do. But of course the real issue needs to be solved upstream.
Even if it is fixed at PF level, such extra protections in POG code are probably not a bad idea, as PF code (and logic) is complex and can go wrong in various unforeseen ways, especially in the startup phase, where alignment/calibrations are not perfect and several special checks/tests are ongoing using special modes (the interplay of those with PF logic can be hard to predict).
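To illustrate the kind of protection described above, here is a minimal sketch in plain Python of filtering a candidate list for NaN before it is handed to a downstream consumer. It is not the code from the linked PR and does not use CMSSW classes: `Candidate`, `has_nan`, `drop_nan_candidates`, and the sample values are all hypothetical names and numbers chosen for illustration only.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Candidate:
    """Hypothetical stand-in for a PF candidate (not reco::PFCandidate)."""
    pt: float
    eta: float
    phi: float


def has_nan(cand: Candidate) -> bool:
    # True if any kinematic quantity of the candidate is NaN.
    return any(math.isnan(v) for v in (cand.pt, cand.eta, cand.phi))


def drop_nan_candidates(cands: Iterable[Candidate]) -> List[Candidate]:
    # Build a new list without NaN candidates; the input is left untouched.
    return [c for c in cands if not has_nan(c)]


# Made-up input: the second candidate has a NaN pt.
pf_cands = [
    Candidate(25.3, 0.4, 1.2),
    Candidate(float("nan"), 0.1, -2.0),
    Candidate(8.7, -1.9, 3.0),
]
pf_cands_no_nan = drop_nan_candidates(pf_cands)
print(len(pf_cands), "candidates in,", len(pf_cands_no_nan), "kept")
```

The filtered list plays the role of the pfCandNoNaN collection mentioned above: the consumer (isolation, jets, MET) would read it instead of the original collection, so a single bad candidate cannot propagate NaN into the clustering or sums.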
type pf
Thanks Marco! I tried it on lxplus8 and I reproduced the crash, but I had not tried it on regular lxplus.
I also tried last night, and if you use the regular arch one gets in lxplus (not lxplus8), slc7_amd64_gcc11, the crash is not there.