cmssw: igprof pp segfault in 12_3_0, 12_4_0_pre3 in the Run3 reco step

In two recent releases, igprof pp crashes with a segmentation violation in the reco step (step3) of workflow 11834.21:

$ tail -n10 /eos/cms/store/user/cmsbuild/profiling/data/CMSSW_12_3_0/slc7_amd64_gcc10/11834.21/step3_igprof_cpu.txt
#14 0x00007f36c6833f45 in IgHookTrace::stacktrace (addresses=addresses@entry=0x7ffc58699700, nmax=nmax@entry=800) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/igprof/5.9.16-f8a2b39c36d2a318d6c7c0f619242bdb/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/walk-syms.cc:175
#15 0x00007f36c683d508 in profileSignalHandler () at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/igprof/5.9.16-f8a2b39c36d2a318d6c7c0f619242bdb/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/profile-perf.cc:66
#16 <signal handler called>
#17 0x00007f368f898ee2 in mkfit::kalmanOperation(int, Matriplex::MatriplexSym<float, 6, 4> const&, Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::MatriplexSym<float, 3, 4> const&, Matriplex::Matriplex<float, 3, 1, 4> const&, Matriplex::MatriplexSym<float, 6, 4>&, Matriplex::Matriplex<float, 6, 1, 4>&, Matriplex::Matriplex<float, 1, 1, 4>&, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so

Current Modules:

Module: MkFitProducer:detachedTripletStepTrackCandidatesMkFit (crashed)

A fatal system signal has occurred: segmentation violation
$ tail -n10 /eos/cms/store/user/cmsbuild/profiling/data/CMSSW_12_4_0_pre3/slc7_amd64_gcc10/11834.21/step3_igprof_cpu.txt
#14 0x00007f9899389f45 in IgHookTrace::stacktrace (addresses=addresses@entry=0x7ffd8674d480, nmax=nmax@entry=800) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/igprof/5.9.16-95dc8f7dd3ee3d76c20fd25518fc6fa9/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/walk-syms.cc:175
#15 0x00007f9899393508 in profileSignalHandler () at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/igprof/5.9.16-95dc8f7dd3ee3d76c20fd25518fc6fa9/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/profile-perf.cc:66
#16 <signal handler called>
#17 0x00007f9860bcf8ea in mkfit::propagateHelixToZMPlex(Matriplex::MatriplexSym<float, 6, 4> const&, Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::MatriplexSym<float, 6, 4>&, Matriplex::Matriplex<float, 6, 1, 4>&, int, mkfit::PropagationFlags, Matriplex::Matriplex<int, 1, 1, 4> const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre3/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so

Current Modules:

Module: MkFitProducer:highPtTripletStepTrackCandidatesMkFit (crashed)

A fatal system signal has occurred: segmentation violation

In both cases, the crashed module is an MkFitProducer instance, and the interrupted frame is in libRecoTrackerMkFitCore.so. Is this a coincidence, or do we have a regression?
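To make this comparison less manual across releases, the two facts of interest can be pulled out of each log tail: the frame just below `<signal handler called>` (the code that was running when the profiling signal arrived) and the module marked `(crashed)`. The sketch below is only an illustration; `summarize_crash` is a hypothetical helper, and the regular expressions assume the log lines look exactly like the excerpts above.

```python
import re

# Assumed frame format: "#17 0x00007f... in some::function(args) () from /path/lib.so"
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(.*)$")
# Assumed module format: "Module: <type>:<label> (crashed)"
MODULE_RE = re.compile(r"^Module:\s+(\S+):(\S+)\s+\(crashed\)")


def summarize_crash(log_text):
    """Return (interrupted_function, (module_type, module_label)) for one log tail.

    Either element is None if the corresponding line is not found.
    """
    interrupted = None
    module = None
    take_next_frame = False
    for raw in log_text.splitlines():
        line = raw.strip()
        frame = FRAME_RE.match(line)
        if frame:
            body = frame.group(2)
            if body.startswith("<signal handler called>"):
                # The next frame is the code interrupted by the signal.
                take_next_frame = True
            elif take_next_frame:
                # Keep only the function name, dropping the argument list.
                # If several handler markers appear, the last one wins.
                interrupted = body.split("(", 1)[0].strip()
                take_next_frame = False
            continue
        mod = MODULE_RE.match(line)
        if mod:
            module = (mod.group(1), mod.group(2))
    return interrupted, module
```

Feeding it the output of each `tail -n10` command separately gives one (function, module) pair per release, which makes it easy to see that the module type is the same while the label and the interrupted function differ.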

Note that igprof mp does not crash in these workflows, and the crash happens around event 230-260. Since Jenkins retries igprof several times on failure and every attempt fails, the crash appears to be reproducible.

@slava77 @gartung

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 60 (60 by maintainers)

Most upvoted comments

@smuzaffar could this issue be reopened, just to avoid continuing the discussion on a closed issue? (Reco never signed it.)

It looks like a different bug in libunwind. I will also test with gperftools, since it uses libunwind too.

Looks like this bug might be addressed by updating libunwind.

The crash is not at the same event in the different releases (though it is always roughly 230 to 260 events in), nor in the same module (highPtTripletStepTrackCandidatesMkFit vs. detachedTripletStepTrackCandidatesMkFit).

There were no recent updates in the propagate or Kalman update routines. This still seems similar to the previous case, where the issue was with the profiler itself having some outdated dependencies (was it TBB or pthread?).
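The reasoning in these comments rests on how a sampling CPU profiler works: a timer signal interrupts whatever code is running, and the handler walks the stack from there, so the frame below `<signal handler called>` is simply where the sample landed. The sketch below is a Python analogue of that mechanism, not igprof's implementation (which, per the traces above, goes through `profileSignalHandler` and `IgHookTrace::stacktrace` in C++). `sample_cpu` and `busy` are hypothetical names, and it assumes a Unix system and the main thread.

```python
import collections
import signal
import time


def sample_cpu(workload, interval=0.001):
    """Run workload() under a SIGPROF timer; count which function each sample hit."""
    hits = collections.Counter()

    def on_sigprof(signum, frame):
        # 'frame' is whatever code the timer happened to interrupt.
        if frame is not None:
            hits[frame.f_code.co_name] += 1

    previous = signal.signal(signal.SIGPROF, on_sigprof)
    # ITIMER_PROF counts CPU time used by the process and raises SIGPROF.
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    try:
        workload()
    finally:
        # Stop the timer first, then restore the earlier handler.
        signal.setitimer(signal.ITIMER_PROF, 0, 0)
        signal.signal(
            signal.SIGPROF,
            previous if previous is not None else signal.SIG_DFL,
        )
    return hits


def busy(seconds=0.2):
    """Hypothetical CPU-bound stand-in for a reconstruction module."""
    end = time.process_time() + seconds
    total = 0
    while time.process_time() < end:
        total += 1
    return total


if __name__ == "__main__":
    print(sample_cpu(busy).most_common(3))
```

Because most CPU time goes to the busy loop, most samples land there. In the same way, a CPU-heavy module such as MkFitProducer would be a frequent place for a sample to land, so its appearance in both traces may not by itself point to a regression in that module.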