cmssw: Segmentation violation in PromptReco for FastjetJetProducer:ak4PFJets

There is one job failing Reco for Run 366451, dataset ParkingDoubleElectronLowMass, with a segmentation violation, as described in https://cms-talk.web.cern.ch/t/segmentation-error-in-promptreco-for-run-366451-dataset-parkingdoubleelectronlowmass/23152

The crash seems to be from module FastjetJetProducer:

%MSG-w TrackProducerBase:  TrackRefitter:hltTrackRefitterForSiStripMonitorTrack  24-Apr-2023 18:58:38 CEST Run: 366451 Event: 418574346
 BeamSpot is not valid
%MSG
%MSG-e TrackRefitter:  TrackRefitter:hltTrackRefitterForSiStripMonitorTrack  24-Apr-2023 18:58:38 CEST Run: 366451 Event: 418574346
 BeamSpot is (0,0,0), it is probably because is not valid in the event
%MSG

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

...

Current Modules:

Module: FastjetJetProducer:ak4PFJets (crashed)
Module: MultiHitFromChi2EDProducer:pixelLessStepHitTriplets
Module: PFClusterProducer:particleFlowClusterHBHE
Module: RecHitTask:recHitTask
Module: TrackProducer:mixedTripletStepTracks
Module: MuonIdProducer:muons1stStep
Module: TrackProducer:initialStepTracks
Module: CAHitQuadrupletEDProducer:detachedQuadStepHitQuadruplets

A fatal system signal has occurred: segmentation violation

The full log is at /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2023B/job_248341/job/WMTaskSpace/cmsRun1 as described in the original email.

I was able to reproduce the failure locally.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 40 (40 by maintainers)

Most upvoted comments

I tried it on lxplus8 and I reproduced the crash, but I had not tried it on regular lxplus.

For the record, on an lxplus8 node, using the recipe above and a slightly modified PSet:

import FWCore.ParameterSet.Config as cms
import pickle

with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)

process.options.numberOfThreads = 1
process.source.skipEvents = cms.untracked.uint32(586)

it segfaults consistently at the first event processed.
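As an aside, the load-then-modify pattern used in the recipe is plain Python pickling; a self-contained sketch of the same pattern, with a stand-in object instead of the real cms.Process (so it runs without PSet.pkl), would be:

```python
import pickle
import types

# Stand-in for the cms.Process object that PSet.pkl would contain.
cfg = types.SimpleNamespace(numberOfThreads=8, skipEvents=0)

blob = pickle.dumps(cfg)       # what writing PSet.pkl amounts to
restored = pickle.loads(blob)  # equivalent of pickle.load(handle)

# The same kind of in-place tweaks as in the recipe above:
restored.numberOfThreads = 1
restored.skipEvents = 586

print(restored.numberOfThreads, restored.skipEvents)
```

Any attribute of the unpickled object can be adjusted this way before handing the configuration to cmsRun, which is what makes this a convenient way to reproduce a crash on a specific event.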

Here is an issue from 2022 about a PFCandidate with NaN values: #39110 (I did not attempt to work out whether it is related, though).

Yes, there was a similar finding last year, which caused the photon isolation to be NaN when the bad PF candidate ended up in the photon's isolation cone. A preliminary fix was to loop over the PF candidate collection, check for NaNs, remove the affected candidates, and make a pfCandNoNaN collection, which was then passed on to calculate the isolation. This is where it was done: https://github.com/cms-sw/cmssw/pull/39120/files

Maybe something similar can be done for jet/MET if that is easier and quicker to do. But of course the real issue needs to be solved upstream.
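The cleaning described above is done in C++ on reco::PFCandidate objects in the linked PR; a minimal Python sketch of the same idea, using plain dicts as stand-in candidates and a hypothetical filter_nan_candidates helper (neither is CMSSW API), would be:

```python
import math

def filter_nan_candidates(candidates):
    """Drop candidates with NaN kinematics, mirroring the pfCandNoNaN idea.

    `candidates` is an iterable of dicts with pt/eta/phi keys here;
    the real fix operates on reco::PFCandidate collections in C++.
    """
    clean = []
    for cand in candidates:
        if any(math.isnan(cand[k]) for k in ('pt', 'eta', 'phi')):
            continue  # skip the bad candidate instead of crashing downstream
        clean.append(cand)
    return clean

# One good candidate and one with NaN pt:
cands = [{'pt': 25.0, 'eta': 0.5, 'phi': 1.2},
         {'pt': float('nan'), 'eta': 0.1, 'phi': -2.0}]
print(len(filter_nan_candidates(cands)))  # -> 1
```

The same guard could in principle sit in front of any consumer of the PF collection (jets, MET, isolation), at the cost of silently dropping candidates, which is why fixing the upstream producer is still the real solution.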

Even if it’s fixed at the PF level, such extra protections in POG code are probably not a bad idea, as the PF code (and logic) is complex and can go wrong in various unforeseen ways, especially in the startup phase, where alignment/calibrations are not yet perfect and several special checks/tests are ongoing using special modes (the interplay of those with the PF logic can be hard to predict).

type pf

Were you using scram arch el8_amd64_gcc11?

I also tried last night: with the regular arch one gets on lxplus (not lxplus8), slc7_amd64_gcc11, the crash is not there.
