cmssw: HLT Farm crashes (tracking related) in run >= 378981
We are observing a large number of HLT crashes, O(50) per run at the time of writing (while the number of colliding bunches is still low), starting from run 378981, the first 13.6 TeV run of 2024.
Here is the recipe to reproduce the crashes (tested with CMSSW_14_0_4 on lxplus8-gpu):
cmsrel CMSSW_14_0_4
cd CMSSW_14_0_4/src
cmsenv
then prepare the reproducer as:
#!/bin/bash -ex
# CMSSW_14_0_4
hltGetConfiguration run:378981 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input /store/group/tsg/FOG/debug/240405_run378981/files/run378981_ls0002_index000000_fu-c2b03-22-01_pid1720286.root \
> hlt.py
cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
process.options.accelerators = ["*"]
@EOF
CUDA_LAUNCH_BLOCKING=1 \
cmsRun hlt.py &> hlt.log
There are several lumisections (LS) affected by the same type of crash; error stream files for the affected lumisections are available at /store/group/tsg/FOG/debug/240405_run378981/files/
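Since more than one lumisection is affected, it may be convenient to generate the same reproducer for each error stream file. The following is only a sketch: the helper names `hlt_config_command` and `shell_line` are hypothetical, and the list of input files has to be supplied by hand. The options are the ones used in the script above.

```python
import shlex


def hlt_config_command(input_file, run=378981, globaltag="140X_dataRun3_HLT_v3"):
    """Return the hltGetConfiguration argument list for one input file.

    Hypothetical helper: it only assembles the options already used in the
    reproducer script, with the input file as the single varying argument.
    """
    return [
        "hltGetConfiguration", f"run:{run}",
        "--globaltag", globaltag,
        "--data",
        "--no-prescale",
        "--no-output",
        "--max-events", "-1",
        "--input", input_file,
    ]


def shell_line(input_file, output_py):
    """Return a shell command line that writes the configuration to output_py."""
    return shlex.join(hlt_config_command(input_file)) + " > " + shlex.quote(output_py)
```

Printing `shell_line(f, "hlt_0.py")` for each file `f` gives one command per lumisection file; the customisation lines and the `cmsRun` call from the script above still have to be applied to each generated configuration.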
The stack trace in all cases contains:
Thread 1 (Thread 0x7fe3ea70b640 (LWP 2004765) "cmsRun"):
#0 0x00007fe3e994291f in poll () from /lib64/libc.so.6
#1 0x00007fe3e345a62f in full_read.constprop () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2 0x00007fe3e340ee3c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3 0x00007fe3e340f7a0 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007fe3b06adf14 in TIDRing::groupedCompatibleDetsV(TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&, std::vector<DetGroup, std::allocator<DetGroup> >&) const () from /tmp/musich/CMSSW_14_0_4/lib/el9_amd64_gcc12/libRecoTrackerTkDetLayers.so
#6 0x00007fe3b06ae671 in TIDLayer::groupedCompatibleDetsV(TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&, std::vector<DetGroup, std::allocator<DetGroup> >&) const () from /tmp/musich/CMSSW_14_0_4/lib/el9_amd64_gcc12/libRecoTrackerTkDetLayers.so
#7 0x00007fe3e11cf39a in GeometricSearchDet::compatibleDetsV(TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&, std::vector<std::pair<GeomDet const*, TrajectoryStateOnSurface>, std::allocator<std::pair<GeomDet const*, TrajectoryStateOnSurface> > >&) const () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/libTrackingToolsDetLayers.so
#8 0x00007fe3e11ce7a1 in GeometricSearchDet::compatibleDets(TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/libTrackingToolsDetLayers.so
#9 0x00007fe3ac9f84a0 in LayerMeasurements::recHits(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/libTrackingToolsMeasurementDet.so
#10 0x00007fe3ac577d55 in RectangularEtaPhiTrackingRegion::hits(SeedingLayerSetsHits::SeedingLayer const&) const () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/libRecoTrackerTkTrackingRegions.so
#11 0x00007fe3ac59d2c3 in LayerHitMapCache::operator()(SeedingLayerSetsHits::SeedingLayer const&, TrackingRegion const&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/libRecoTrackerTkHitPairs.so
#12 0x00007fe3ac5a4c66 in HitPairGeneratorFromLayerPair::doublets(TrackingRegion const&, edm::Event const&, edm::EventSetup const&, SeedingLayerSetsHits::SeedingLayer const&, SeedingLayerSetsHits::SeedingLayer const&, LayerHitMapCache&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/libRecoTrackerTkHitPairs.so
#13 0x00007fe374f78c70 in (anonymous namespace)::Impl<(anonymous namespace)::DoNothing, (anonymous namespace)::ImplIntermediateHitDoublets, (anonymous namespace)::RegionsLayersSeparate>::produce(bool, edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoTrackerTkHitPairsPlugins.so
#14 0x00007fe3ebe483c1 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/libFWCoreFramework.so
#15 0x00007fe3ebe2c04e in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/libFWCoreFramework.so
#16 0x00007fe3ebdb9159 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/libFWCoreFramework.so
#17 0x00007fe3ebdb96c4 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/libFWCoreFramework.so
#18 0x00007fe3ebf43f28 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/libFWCoreConcurrency.so
#19 0x00007fe3eb4ef241 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fe3e7fc3e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#20 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fe3e7fc3e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#21 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#22 0x00007fe3ebd3da6b in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/libFWCoreFramework.so
#23 0x00007fe3ebd471ea in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/libFWCoreFramework.so
#24 0x00007fe3ebd47741 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/libFWCoreFramework.so
#25 0x00000000004074f5 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#26 0x00007fe3eb4db96d in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/arena.cpp:688
#27 0x0000000000408ee2 in main::{lambda()#1}::operator()() const ()
#28 0x000000000040517c in main ()
Current Modules:
Module: HitPairEDProducer:hltDisplacedhltIter4PixelLessHitDoubletsForDisplacedTkMuons (crashed)
Module: none
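To check that other error stream files fail in the same place, the log can be scanned for the first frame below the signal handler and for the module flagged as crashed. This is a minimal sketch that assumes the log has the same layout as the excerpt above (gdb-style `#N` frames and `Module: <type>:<label> (crashed)` lines); the function name `summarize_crash` is hypothetical.

```python
import re

# gdb-style frame, e.g. "#5 0x00007f... in Foo::bar(...) const () from lib.so"
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+ in )?(.*)$")
# "Module: <type>:<label> (crashed)" line from the "Current Modules" block
CRASHED_RE = re.compile(r"^Module:\s+(\S+):(\S+)\s+\(crashed\)$")


def summarize_crash(log_lines):
    """Return (first frame after the signal handler, [(type, label), ...]).

    Only the first "<signal handler called>" marker in the input is used, so
    for a log with several thread dumps this reports the first thread listed.
    The function name is cut at the first "(", which is enough for frames like
    the one above but not for names starting with "(anonymous namespace)".
    """
    first_frame = None
    after_handler = False
    modules = []
    for raw in log_lines:
        line = raw.strip()
        crashed = CRASHED_RE.match(line)
        if crashed:
            modules.append((crashed.group(1), crashed.group(2)))
            continue
        frame = FRAME_RE.match(line)
        if not frame:
            continue
        if frame.group(1).startswith("<signal handler called>"):
            after_handler = True
        elif after_handler and first_frame is None:
            first_frame = frame.group(1).split("(", 1)[0].strip()
    return first_frame, modules
```

The function accepts any iterable of lines, so an open file handle on `hlt.log` can be passed directly. For the trace above, the expected frame is `TIDRing::groupedCompatibleDetsV` and the expected module type is `HitPairEDProducer`.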
type tracking
assign hlt, reconstruction