cmssw: HLT crash in run-367906 (`sistrip::FEDBuffer::findChannels()`)
In run-367906 (pp collisions), DAQ reported 1 CMSSW crash at HLT (release: CMSSW_13_0_6) [link to HLT elog].
The full stack trace is attached (f3mon_run367906.txt). The part of the stack trace that is possibly relevant is in [1].
The corresponding error-stream files are available, but first attempts to reproduce the crash offline failed (tried on a “Hilton” HLT node).
The recipe used for those failed attempts is adapted in [2] to be valid for lxplus and lxplus-gpu.
FYI: @cms-sw/hlt-l2 @silviodonato @fwyzard @mzarucki @trtomei
[1]
msgtime:2023-05-24 22:37:12
doc_type:cmsswlog
date:2023-05-24T20:37:12.106Z
run:367906
host:fu-c2b03-18-01
pid:2793118
doctype:stacktrace
severity:FATAL
severityVal:4
instance:global
lexicalId:549852445
message:A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.
Wed May 24 22:36:52 CEST 2023
(..)
Thread 6 (Thread 0x7fe97ea4f700 (LWP 2794125) "cmsRun"):
#0 0x00007fe9f3d60a71 in poll () from /lib64/libc.so.6
#1 0x00007fe9eac9846f in full_read.constprop () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#2 0x00007fe9eac63b6c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3 0x00007fe9eac6433b in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007fe990ee5092 in sistrip::FEDBuffer::findChannels() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libEventFilterSiStripRawToDigi.so
#6 0x00007fe990f5a21e in (anonymous namespace)::ClusterFiller::fill(edmNew::DetSetVector<SiStripCluster>::TSFastFiller&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoLocalTrackerSiStripCluste\
rizerPlugins.so
#7 0x00007fe9940a04bd in StMeasurementDetSet::getDetSet(int) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#8 0x00007fe9940a08a6 in TkStripMeasurementDet::empty(MeasurementTrackerEvent const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#9 0x00007fe9940a30f1 in TkGluedMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw\
/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerMeasurementDetPlugins.so
#10 0x00007fe99400e347 in LayerMeasurements::groupedMeasurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_\
amd64_gcc11/libTrackingToolsMeasurementDet.so
#11 0x00007fe8f21a01b1 in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<T\
empTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#12 0x00007fe8f219338d in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&\
) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#13 0x00007fe8f2196846 in GroupedCkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /opt/offline/el8_amd64_gcc11/cms\
/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/pluginRecoTrackerCkfPatternPlugins.so
#14 0x00007fe8f2150263 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_a\
md64_gcc11/libRecoTrackerCkfPattern.so
#15 0x00007fe8f2151ceb in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libRecoTrackerCkfPattern.so
#16 0x00007fe9f67ad95d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gc\
c11/libFWCoreFramework.so
#17 0x00007fe9f6794072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCo\
reFramework.so
#18 0x00007fe9f67206da in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm:\
:EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /opt/offline/el8_amd64_gcc11/c\
ms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCoreFramework.so
#19 0x00007fe9f6720b88 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCoreFramework.so
#20 0x00007fe9f6475f79 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_6/lib/el8_amd64_gcc11/libFWCore\
Concurrency.so
#21 0x00007fe9f4ef2304 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7fe82e94ab00, waiter=..., this=0x7fe9efd53780) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_\
2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/task_dispatcher.h:322
#22 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7fe9efd53780) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-bui\
ld/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/task_dispatcher.h:458
#23 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f\
6c08f7b1/tbb-v2021.8.0/src/tbb/arena.cpp:137
#24 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6\
c08f7b1/tbb-v2021.8.0/src/tbb/market.cpp:599
#25 0x00007fe9f4ef44c6 in tbb::detail::r1::rml::private_worker::run (this=0x7fe9efd30100) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb\
5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/private_server.cpp:271
#26 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fe9efd30100) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6\
d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/private_server.cpp:221
#27 0x00007fe9f403e17a in start_thread () from /lib64/libpthread.so.0
#28 0x00007fe9f3d6bdf3 in clone () from /lib64/libc.so.6
(..)
Current Modules:
Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidates (crashed)
Module: CkfTrackCandidateMaker:hltMuCkfTrackCandidates
Module: PFBlockProducer:hltParticleFlowBlockForDisplTaus
Module: PFBlockProducer:hltParticleFlowBlock
Module: CkfTrackCandidateMaker:hltIter0IterL3FromL1MuonCkfTrackCandidates
Module: PFClusterProducer:hltParticleFlowClusterHBHE
Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidates
Module: HcalDigisProducerGPU:hltHcalDigisGPU
Module: none
Module: BeamSpotToCUDA:hltOnlineBeamSpotToGPU
Module: TrackProducer:hltIter0PFlowCtfWithMaterialTracks
Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidates
Module: none
Module: PFMultiDepthClusterProducer:hltParticleFlowClusterHCAL
Module: none
Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidates
Module: HcalCPURecHitsProducer:hltHbherecoFromGPU
Module: CkfTrackCandidateMaker:hltDisplacedhltIter4PFlowCkfTrackCandidatesForTau
Module: PFRecHitProducer:hltParticleFlowRecHitPSUnseeded
Module: PixelTrackProducerFromSoAPhase1:hltPixelTracks
Module: CkfTrackCandidateMaker:hltDisplacedhltIter4PFlowCkfTrackCandidatesForTau
Module: none
Module: none
Module: SiPixelRecHitCUDAPhase1:hltSiPixelRecHitsGPU
Module: SiPixelRecHitFromCUDAPhase1:hltSiPixelRecHitsFromGPU
Module: HBHERecHitProducerGPU:hltHbherecoGPU
Module: EcalUncalibRecHitProducerGPU:hltEcalUncalibRecHitGPU
Module: FastjetJetProducer:hltAK4CaloJets
Module: CAHitNtupletCUDAPhase1:hltPixelTracksGPU
Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidatesNoVtx
Module: SiPixelDigisSoAFromCUDA:hltSiPixelDigisSoA
Module: PFBlockProducer:hltParticleFlowBlockCPUOnly
A fatal system signal has occurred: segmentation violation
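When scanning a log like [1], the two items of interest are the module flagged `(crashed)` in the "Current Modules" list and the first frame below `<signal handler called>`. The following is a minimal sketch of how these could be extracted automatically; the function names are hypothetical, and the sample text is an abbreviated copy of [1] (library paths shortened), not a complete log.

```python
import re


def crashed_modules(log_text):
    """Return the module labels flagged '(crashed)' in the 'Current Modules' list."""
    pattern = re.compile(r"^Module: (\S+) \(crashed\)\s*$", re.MULTILINE)
    return pattern.findall(log_text)


def frame_after_signal_handler(log_text):
    """Return the first frame line following '<signal handler called>', or None."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if "<signal handler called>" in line and i + 1 < len(lines):
            return lines[i + 1].strip()
    return None


# Abbreviated excerpt of [1]; the library paths are shortened for readability.
sample = """\
#3 0x00007fe9eac6433b in sig_dostack_then_abort () from pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007fe990ee5092 in sistrip::FEDBuffer::findChannels() () from libEventFilterSiStripRawToDigi.so
Current Modules:
Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidates (crashed)
Module: CkfTrackCandidateMaker:hltMuCkfTrackCandidates
Module: none
"""

if __name__ == "__main__":
    print(crashed_modules(sample))
    print(frame_after_signal_handler(sample))
```

The sketch assumes the frame directly after the signal-handler marker is the origin of the signal, which is how the trace in [1] is laid out; wrapped lines (ending in `\`) would need to be joined first.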
[2]
#!/bin/bash
# cmsrel CMSSW_13_0_6
# cd CMSSW_13_0_6/src
# cmsenv
# # save this file as test.sh
# chmod u+x test.sh
# ./test.sh 367906 4 # runNumber nThreads
[ $# -eq 2 ] || exit 1
RUNNUM="${1}"
NUMTHREADS="${2}"
ERRDIR=/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream
RUNDIR="${ERRDIR}"/run"${RUNNUM}"
for dirPath in $(ls -d "${RUNDIR}"*); do
# require at least one non-empty FRD file
[ $(cd "${dirPath}" ; find -maxdepth 1 -size +0 | grep .raw | wc -l) -gt 0 ] || continue
runNumber="${dirPath: -6}"
JOBTAG=test_run"${runNumber}"
HLTMENU="--runNumber ${runNumber}"
hltConfigFromDB ${HLTMENU} > "${JOBTAG}".py
cat <<EOF >> "${JOBTAG}".py
process.options.numberOfThreads = ${NUMTHREADS}
process.options.numberOfStreams = 0
process.hltOnlineBeamSpotESProducer.timeThreshold = int(1e6)
del process.PrescaleService
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')
import os
import glob
process.source.fileListMode = True
process.source.fileNames = sorted([foo for foo in glob.glob("${dirPath}/*raw") if os.path.getsize(foo) > 0])
process.EvFDaqDirector.buBaseDir = "${ERRDIR}"
process.EvFDaqDirector.runNumber = ${runNumber}
process.hltDQMFileSaverPB.runNumber = ${runNumber}
# remove paths containing OutputModules
streamPaths = [pathName for pathName in process.finalpaths_()]
for foo in streamPaths:
    process.__delattr__(foo)
EOF
rm -rf run"${runNumber}"
mkdir run"${runNumber}"
echo "run${runNumber} .."
cmsRun "${JOBTAG}".py &> "${JOBTAG}".log
echo "run${runNumber} .. done (exit code: $?)"
unset runNumber
done
unset dirPath
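The recipe in [2] skips directories that contain no non-empty FRD file and then passes only the non-empty `*raw` files to `process.source.fileNames`. As a sketch of that selection step in isolation (the helper name `nonempty_raw_files` is hypothetical and not part of CMSSW), the same filter can be written as:

```python
import glob
import os


def nonempty_raw_files(dir_path):
    """Return the sorted non-empty files in dir_path whose names end in 'raw'."""
    candidates = glob.glob(os.path.join(dir_path, "*raw"))
    return sorted(
        path
        for path in candidates
        if os.path.isfile(path) and os.path.getsize(path) > 0
    )


def has_input(dir_path):
    """Mirror the shell check in [2]: at least one non-empty raw file is required."""
    return len(nonempty_raw_files(dir_path)) > 0
```

Note that the shell check uses `grep .raw`, where `.` matches any character, so it is slightly looser than the `*raw` glob used for `process.source.fileNames`; for directories that only contain `.raw` files the two should agree.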
About this issue
- State: open
- Created a year ago
- Comments: 39 (35 by maintainers)
type tracking
The backports are in https://github.com/cms-sw/cmssw/pull/41909 (13_1_X) and https://github.com/cms-sw/cmssw/pull/41910 (13_0_X)
I meant Dan’s stack trace on the assertion failure on aarch64 (sorry for being unclear).
Event IDs in two raw files:
The last message in the log is from one of the previous events (file):
Timestamps of the last few files that appeared locally at hltd for that process (last 3).
However, this looks OK. The last two files opened by the process were also saved; older ones were already handled and closed. The source keeps up to 2 files open and buffered at a time.
For the crash, there is no information on the event ID (this is known only for an Exception).