cmssw: HLT crash in run-367906 (`sistrip::FEDBuffer::findChannels()`)

In run-367906 (pp collisions), DAQ reported 1 CMSSW crash at HLT (release: CMSSW_13_0_6) [link to HLT elog].

The full stack trace is attached (f3mon_run367906.txt). The part of the stack trace that looks most relevant is in [1].

The corresponding error-stream files are available, but the first attempts to reproduce the crash offline failed (tried on a “Hilton” HLT node).

The recipe used for those failed attempts is given in [2], adapted so that it also works on lxplus and lxplus-gpu.

FYI: @cms-sw/hlt-l2 @silviodonato @fwyzard @mzarucki @trtomei

[1]

msgtime:2023-05-24 22:37:12
doc_type:cmsswlog
date:2023-05-24T20:37:12.106Z
run:367906
host:fu-c2b03-18-01
pid:2793118
doctype:stacktrace
severity:FATAL
severityVal:4
instance:global
lexicalId:549852445
message:A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.
Wed May 24 22:36:52 CEST 2023

(..)

Thread 6 (Thread 0x7fe97ea4f700 (LWP 2794125) "cmsRun"):
#0  0x00007fe9f3d60a71 in poll () from /lib64/libc.so.6
#1  0x00007fe9eac9846f in full_read.constprop () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#2  0x00007fe9eac63b6c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  0x00007fe9eac6433b in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fe990ee5092 in sistrip::FEDBuffer::findChannels() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libEventFilterSiStripRawToDigi.so
#6  0x00007fe990f5a21e in (anonymous namespace)::ClusterFiller::fill(edmNew::DetSetVector<SiStripCluster>::TSFastFiller&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoLocalTrackerSiStripCluste\
rizerPlugins.so
#7  0x00007fe9940a04bd in StMeasurementDetSet::getDetSet(int) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#8  0x00007fe9940a08a6 in TkStripMeasurementDet::empty(MeasurementTrackerEvent const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#9  0x00007fe9940a30f1 in TkGluedMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw\
/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#10 0x00007fe99400e347 in LayerMeasurements::groupedMeasurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_\
amd64_gcc11/libTrackingToolsMeasurementDet.so
#11 0x00007fe8f21a01b1 in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<T\
empTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#12 0x00007fe8f219338d in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&\
) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#13 0x00007fe8f2196846 in GroupedCkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /opt/offline/el8_amd64_gcc11/cms\
/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#14 0x00007fe8f2150263 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_a\
md64_gcc11/libRecoTrackerCkfPattern.so
#15 0x00007fe8f2151ceb in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#16 0x00007fe9f67ad95d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gc\
c11/libFWCoreFramework.so
#17 0x00007fe9f6794072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCo\
reFramework.so
#18 0x00007fe9f67206da in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm:\
:EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /opt/offline/el8_amd64_gcc11/c\
ms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCoreFramework.so
#19 0x00007fe9f6720b88 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCoreFramework.so
#20 0x00007fe9f6475f79 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCore\
Concurrency.so
#21 0x00007fe9f4ef2304 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7fe82e94ab00, waiter=..., this=0x7fe9efd53780) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_\
2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/task_dispatcher.h:322
#22 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7fe9efd53780) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-bui\
ld/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/task_dispatcher.h:458
#23 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f\
6c08f7b1/tbb-v2021.8.0/src/tbb/arena.cpp:137
#24 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6\
c08f7b1/tbb-v2021.8.0/src/tbb/market.cpp:599
#25 0x00007fe9f4ef44c6 in tbb::detail::r1::rml::private_worker::run (this=0x7fe9efd30100) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb\
5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/private_server.cpp:271
#26 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fe9efd30100) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6\
d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/private_server.cpp:221
#27 0x00007fe9f403e17a in start_thread () from /lib64/libpthread.so.0
#28 0x00007fe9f3d6bdf3 in clone () from /lib64/libc.so.6

(..)

Current Modules:
Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidates (crashed)
Module: CkfTrackCandidateMaker:hltMuCkfTrackCandidates
Module: PFBlockProducer:hltParticleFlowBlockForDisplTaus
Module: PFBlockProducer:hltParticleFlowBlock
Module: CkfTrackCandidateMaker:hltIter0IterL3FromL1MuonCkfTrackCandidates
Module: PFClusterProducer:hltParticleFlowClusterHBHE
Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidates
Module: HcalDigisProducerGPU:hltHcalDigisGPU
Module: none
Module: BeamSpotToCUDA:hltOnlineBeamSpotToGPU
Module: TrackProducer:hltIter0PFlowCtfWithMaterialTracks
Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidates
Module: none
Module: PFMultiDepthClusterProducer:hltParticleFlowClusterHCAL
Module: none
Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidates
Module: HcalCPURecHitsProducer:hltHbherecoFromGPU
Module: CkfTrackCandidateMaker:hltDisplacedhltIter4PFlowCkfTrackCandidatesForTau
Module: PFRecHitProducer:hltParticleFlowRecHitPSUnseeded
Module: PixelTrackProducerFromSoAPhase1:hltPixelTracks
Module: CkfTrackCandidateMaker:hltDisplacedhltIter4PFlowCkfTrackCandidatesForTau
Module: none
Module: none
Module: SiPixelRecHitCUDAPhase1:hltSiPixelRecHitsGPU
Module: SiPixelRecHitFromCUDAPhase1:hltSiPixelRecHitsFromGPU
Module: HBHERecHitProducerGPU:hltHbherecoGPU
Module: EcalUncalibRecHitProducerGPU:hltEcalUncalibRecHitGPU
Module: FastjetJetProducer:hltAK4CaloJets
Module: CAHitNtupletCUDAPhase1:hltPixelTracksGPU
Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidatesNoVtx
Module: SiPixelDigisSoAFromCUDA:hltSiPixelDigisSoA
Module: PFBlockProducer:hltParticleFlowBlockCPUOnly
A fatal system signal has occurred: segmentation violation
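
For reference, a minimal sketch of how the faulting frame and the crashed module can be pulled out of an excerpt like [1]. It assumes the layout shown above (gdb-style `#N  0x... in <function> () from <library>` frames, and `Module: <type>:<label> (crashed)` lines); the sample lines are abbreviated copies of lines in [1] with the library paths dropped, and the helper names are hypothetical.

```python
import re

# Abbreviated copies of lines from [1] (library paths dropped for brevity).
TRACE = """\
#3  0x00007fe9eac6433b in sig_dostack_then_abort () from pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fe990ee5092 in sistrip::FEDBuffer::findChannels() () from libEventFilterSiStripRawToDigi.so
#6  0x00007fe990f5a21e in (anonymous namespace)::ClusterFiller::fill(edmNew::DetSetVector<SiStripCluster>::TSFastFiller&) () from pluginRecoLocalTrackerSiStripClusterizerPlugins.so
Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidates (crashed)
Module: CkfTrackCandidateMaker:hltMuCkfTrackCandidates
Module: none
"""

# Frame number, function signature, library (assumed frame layout).
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-f]+ in (.+?) \(\) from (\S+)$")
# Module type and label of a module flagged as crashed.
MODULE_RE = re.compile(r"^Module: (\w+):(\w+) \(crashed\)$")


def faulting_frame(lines):
    """Return (function, library) of the first frame after the signal handler."""
    after_handler = False
    for line in lines:
        if "<signal handler called>" in line:
            after_handler = True
            continue
        if after_handler:
            match = FRAME_RE.match(line)
            if match:
                return match.group(2), match.group(3)
    return None


def crashed_modules(lines):
    """Return (type, label) for every module line flagged '(crashed)'."""
    found = []
    for line in lines:
        match = MODULE_RE.match(line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


frame = faulting_frame(TRACE.splitlines())
modules = crashed_modules(TRACE.splitlines())
print(frame)
print(modules)
```

On the excerpt above this points at `sistrip::FEDBuffer::findChannels()`, called while running `hltIter0PFlowCkfTrackCandidates`.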

[2]

#!/bin/bash

# cmsrel CMSSW_13_0_6
# cd CMSSW_13_0_6/src
# cmsenv
# # save this file as test.sh
# chmod u+x test.sh
# ./test.sh 367906 4 # runNumber nThreads

[ $# -eq 2 ] || exit 1

RUNNUM="${1}"
NUMTHREADS="${2}"

ERRDIR=/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream
RUNDIR="${ERRDIR}"/run"${RUNNUM}"

for dirPath in "${RUNDIR}"*; do
  [ -d "${dirPath}" ] || continue
  # require at least one non-empty FRD file (*.raw)
  [ $(find "${dirPath}" -maxdepth 1 -size +0 -name '*.raw' | wc -l) -gt 0 ] || continue
  runNumber="${dirPath: -6}"
  JOBTAG=test_run"${runNumber}"
  HLTMENU="--runNumber ${runNumber}"
  hltConfigFromDB ${HLTMENU} > "${JOBTAG}".py
  cat <<EOF >> "${JOBTAG}".py
process.options.numberOfThreads = ${NUMTHREADS}
process.options.numberOfStreams = 0
process.hltOnlineBeamSpotESProducer.timeThreshold = int(1e6)
del process.PrescaleService
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')
import os
import glob
process.source.fileListMode = True
process.source.fileNames = sorted([foo for foo in glob.glob("${dirPath}/*raw") if os.path.getsize(foo) > 0])
process.EvFDaqDirector.buBaseDir = "${ERRDIR}"
process.EvFDaqDirector.runNumber = ${runNumber}
process.hltDQMFileSaverPB.runNumber = ${runNumber}
# remove paths containing OutputModules
streamPaths = [pathName for pathName in process.finalpaths_()]
for foo in streamPaths:
    process.__delattr__(foo)
EOF
  rm -rf run"${runNumber}"
  mkdir run"${runNumber}"
  echo "run${runNumber} .."
  cmsRun "${JOBTAG}".py &> "${JOBTAG}".log
  echo "run${runNumber} .. done (exit code: $?)"
  unset runNumber
done
unset dirPath
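
To make the input selection in [2] easier to check in isolation, here is a small sketch of the two steps that decide what the job reads: keeping only the non-empty `*raw` files of a directory, and taking the run number from the last six characters of the directory path. It mirrors the recipe's logic, but the function names are hypothetical and the directory is a temporary stand-in, not the EOS error-stream area.

```python
import glob
import os
import tempfile


def nonempty_raw_files(dir_path):
    """Sorted '*raw' files in dir_path with size > 0 (same filter as in [2])."""
    pattern = os.path.join(dir_path, "*raw")
    return sorted(p for p in glob.glob(pattern) if os.path.getsize(p) > 0)


def run_number_from_dir(dir_path):
    """Last six characters of the path, like ${dirPath: -6} in the recipe."""
    return dir_path[-6:]


# Stand-in directory with one non-empty .raw, one empty .raw and one .jsn file.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = os.path.join(tmp, "run367906")
    os.mkdir(run_dir)
    for name, payload in [("a.raw", b"x"), ("b.raw", b""), ("c.jsn", b"{}")]:
        with open(os.path.join(run_dir, name), "wb") as handle:
            handle.write(payload)
    selected = [os.path.basename(p) for p in nonempty_raw_files(run_dir)]
    run_number = run_number_from_dir(run_dir)

print(selected, run_number)
```

Only the non-empty `.raw` file is selected; a directory with no such file would be skipped by the loop in [2].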

About this issue

  • State: open
  • Created a year ago
  • Comments: 39 (35 by maintainers)

Most upvoted comments

type tracking

If it’s clear that it is a fix (even partial), I would be in favor of backporting it, since we will still use 13_0_X online for a while.

Thanks, I’ll prepare the backports after the review of #41872 completes (in the current form it is easily cherry-pickable).

The backports are in https://github.com/cms-sw/cmssw/pull/41909 (13_1_X) and https://github.com/cms-sw/cmssw/pull/41910 (13_0_X)

(assuming the stack trace is from an HLT job that does the on-demand strip unpacking and clustering

I think this is the case, as the config had

I meant Dan’s stack trace on the assertion failure on aarch64 (sorry for being unclear).

Event ID ranges in the two raw files:

run367906_ls0056_index000213_fu-c2b03-18-01_pid2793118.raw
128082587 - 128091658

run367906_ls0056_index000236_fu-c2b03-18-01_pid2793118.raw
128183442 - 128186805

The last message in the log is from an earlier event (in a previous file):

%MSG-e TrajectoryNotPosDef:   TrackProducer:hltL3NoFiltersTkTracksFromL2IOHitNoVtx 24-May-2023 22:36:51 CEST  Run: 367906 Event:  127979616
Trajectory covariance is not positive-definite
%MSG
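
A quick check of that statement, as a sketch: it assumes the two numbers quoted per raw file are the inclusive lowest and highest event IDs in that file (the helper name is hypothetical).

```python
# Event ID ranges quoted above, assumed to be inclusive (lowest, highest).
RAW_FILE_EVENT_RANGES = {
    "run367906_ls0056_index000213_fu-c2b03-18-01_pid2793118.raw": (128082587, 128091658),
    "run367906_ls0056_index000236_fu-c2b03-18-01_pid2793118.raw": (128183442, 128186805),
}

# Event number of the last message in the log.
LAST_LOGGED_EVENT = 127979616


def files_containing(event_id, ranges):
    """Names of the files whose event ID range includes event_id."""
    return [name for name, (low, high) in ranges.items() if low <= event_id <= high]


matches = files_containing(LAST_LOGGED_EVENT, RAW_FILE_EVENT_RANGES)
lowest_saved = min(low for low, _ in RAW_FILE_EVENT_RANGES.values())
print(matches, LAST_LOGGED_EVENT < lowest_saved)
```

The logged event falls in neither range and lies below both, so the last log message does not identify the event that crashed.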

Timestamps of the last few files (last 3) that appeared locally at hltd for that process:

INFO:2023-05-24 22:36:49 - processIndexFile - RUN:367906 - run367906_ls0056_index000189_pid2793118.jsn

INFO:2023-05-24 22:36:51 - processIndexFile - RUN:367906 - run367906_ls0056_index000213_pid2793118.jsn
INFO:2023-05-24 22:36:52 - processIndexFile - RUN:367906 - run367906_ls0056_index000236_pid2793118.jsn
INFO:2023-05-24 22:37:04 - processCRASHfile - RUN:367906 - 'run367906_ls0000_crash_pid2793118.jsn' with errcode: -11
INFO:2023-05-24 22:37:04 - processCRASHFile - RUN:367906 - inputFileList: run367906_ls0056_index000213_fu-c2b03-18-01_pid2793118.raw,run367906_ls0056_index000236_fu-c2b03-18-01_pid2793118.raw

However, this looks ok. The last two files opened by the process were also saved; older ones had already been handled and closed. The source keeps up to 2 files open and buffered at a time.
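
The hltd timestamps quoted above can be cross-checked with a short sketch like the one below. It assumes the line layout seen in those log lines (`INFO:<date> <time> - <action> - RUN:<run> - <rest>`); the regular expression and helper name are hypothetical, not part of hltd.

```python
import re
from datetime import datetime

# Lines copied from the hltd log quoted above.
HLTD_LINES = [
    "INFO:2023-05-24 22:36:49 - processIndexFile - RUN:367906 - run367906_ls0056_index000189_pid2793118.jsn",
    "INFO:2023-05-24 22:36:51 - processIndexFile - RUN:367906 - run367906_ls0056_index000213_pid2793118.jsn",
    "INFO:2023-05-24 22:36:52 - processIndexFile - RUN:367906 - run367906_ls0056_index000236_pid2793118.jsn",
    "INFO:2023-05-24 22:37:04 - processCRASHfile - RUN:367906 - 'run367906_ls0000_crash_pid2793118.jsn' with errcode: -11",
]

LINE_RE = re.compile(
    r"^INFO:(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (\w+) - RUN:(\d+) - (.*)$"
)


def parse(line):
    """Split one log line into (timestamp, action, run, rest), or None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    stamp, action, run, rest = match.groups()
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), action, int(run), rest


records = [rec for rec in map(parse, HLTD_LINES) if rec is not None]
index_times = [t for t, action, _, _ in records if action == "processIndexFile"]
# The log spells the action both 'processCRASHfile' and 'processCRASHFile'.
crash_times = [t for t, action, _, _ in records if action.lower() == "processcrashfile"]

delay_seconds = (crash_times[0] - max(index_times)).total_seconds()
print(delay_seconds)
```

With these lines, the crash record is processed 12 seconds after the last index file, which appeared at 22:36:52, the same second as the timestamp printed at the top of the stack trace in [1].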

For a crash, no event ID information is available (the event ID is known only for exceptions).