cmssw: HLT farm crash in run 379617 (part-2)
While reviewing the full list of error stream files from run 379617 (related issue https://github.com/cms-sw/cmssw/issues/44769), stored in /eos/cms/store/group/tsg/FOG/debug/240417_run379617/, to check whether CMSSW_14_0_5_patch2 fixes all of them (using the script in [1]), I found a single file that still crashes when used as input: /eos/cms/store/group/tsg/FOG/debug/240417_run379617/run379617_ls0329_index000242_fu-c2b02-12-01_pid3327112.root
To reproduce:
cmsrel CMSSW_14_0_5_patch2
cd CMSSW_14_0_5_patch2/src
cmsenv
and then run:
#!/bin/bash -ex
# CMSSW_14_0_5_patch2
hltGetConfiguration run:379617 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input /store/group/tsg/FOG/debug/240417_run379617/run379617_ls0329_index000242_fu-c2b02-12-01_pid3327112.root > hlt.py
cmsRun hlt.py &> hlt.log
On lxplus-gpu the following assertion is hit:
terminate called after throwing an instance of 'std::runtime_error'
what():
src/HeterogeneousCore/CUDAUtilities/src/CachingDeviceAllocator.h, line 617:
cudaCheck(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream));
cudaErrorAssert: device-side assert triggered
A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.
src/RecoTracker/PixelSeeding/plugins/alpaka/BrokenLineFit.dev.cc:167: void alpaka_cuda_async::Kernel_BLFastFit<N, TrackerTraits>::operator()(const TAcc &, const reco::TrackSoA<TrackerTraits>::HitContainer *, const cms::alpakatools::OneToManyAssocRandomAccess<TrackerTraits::tindex_type, <expression>, TrackerTraits::maxNumberOfTuples> *, TrackingRecHitSoA<TrackerTraits>::Layout::ConstView, const pixelCPEforDevice::ParamsOnDeviceT<TrackerTraits> *, TrackerTraits::tindex_type *, double *, float *, double *, unsigned int, unsigned int, signed int) const [with TAcc = alpaka::AccGpuUniformCudaHipRt<alpaka::ApiCudaRt, std::integral_constant<unsigned long, 1UL>, unsigned int>; <template-parameter-2-2> = void; int N = 3; TrackerTraits = pixelTopology::Phase1]: block:[69,0,0], thread: [2,0,0] Assertion `fast_fit(3) == fast_fit(3)` failed.
while on lxplus (so on CPU) no crash is observed.
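To see at a glance which device-side assertion fired in a log such as hlt.log, the assertion lines can be pulled out and counted. The following is a minimal sketch, not part of CMSSW: the function name summarize_assertions is hypothetical, and the pattern assumes the log line format shown in the excerpt above (source file and line number at the start of the line, and the assertion text between backticks before "failed.").

```shell
#!/bin/bash
# Hypothetical helper (not part of CMSSW): summarise device-side assertion
# failures found in one or more cmsRun log files.
# Prints "<count> <file>:<line> <assertion text>" for each distinct failure.
# Assumes the line format seen in the log excerpt of this issue.
summarize_assertions() {
    sed -n -E 's/^([^ :]+:[0-9]+):.*Assertion `([^`]*)` failed\.?$/\1 \2/p' "$@" \
        | sort | uniq -c | sed -E 's/^ +//'
}
```

For example, summarize_assertions hlt.log would list each distinct source location and assertion together with the number of GPU threads that reported it.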
@cms-sw/hlt-l2 FYI @cms-sw/heterogeneous-l2 FYI
[1]
#!/bin/bash -ex
# CMSSW_14_0_5_patch2
hltGetConfiguration run:379617 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input file:converted.root > hlt.py
cat <<@EOF >> hlt.py
process.options.numberOfThreads = 32
process.options.numberOfStreams = 32
@EOF
# Define a function to execute each iteration of the loop
process_file() {
    inputfile="$1"
    outputfile="${inputfile%.root}"
    cp hlt.py "hlt_${outputfile}.py"
    sed -i "s/file:converted\.root/\/store\/group\/tsg\/FOG\/debug\/240417_run379617\/${inputfile}/g" "hlt_${outputfile}.py"
    cmsRun "hlt_${outputfile}.py" &> "${outputfile}.log"
}
# Export the function so it can be used by parallel
export -f process_file
# Find the root files and run the function in parallel using GNU Parallel
eos ls /eos/cms/store/group/tsg/FOG/debug/240417_run379617/ | grep '\.root$' | parallel -j 8 process_file
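Once the loop above has finished, the per-file logs still need to be checked for crashes. A minimal sketch of one way to do that follows, assuming the logs are written to the current directory as in the script and that a crash leaves one of the two messages quoted in the log excerpt of this issue. The function name list_crashed_logs is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list the *.log files in a directory that contain
# one of the crash signatures quoted in this issue.
# Usage: list_crashed_logs [directory]   (defaults to the current directory)
list_crashed_logs() {
    local logdir="${1:-.}"
    grep -l -E 'device-side assert triggered|A fatal system signal has occurred' \
        "${logdir}"/*.log 2>/dev/null | sort
}
```

Running list_crashed_logs . in the working directory prints one log file name per line for each input file that crashed, and prints nothing if no log matches.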
About this issue
- State: closed
- Created 2 months ago
- Comments: 27 (27 by maintainers)
+hlt
CMSSW_14_0_6_MULTIARCHS (intermediate patches were not used online) together with HLT menu GRun 2024 V1.1. The first stable-beams run with CMSSW_14_0_6_MULTIARCHS, using the “v1.1.X” HLT menu (see CMSHLT-3164), was run-380306 (Run2024D). No further issues were spotted.

For the record, we checked all the available error stream files for that run with the following script [1] in CMSSW_14_0_5_patch2 and found no other crashes.

This particular issue should be dealt with by https://github.com/cms-sw/cmssw/pull/44808 (if accepted).
[1]
type tracking
assign hlt, heterogeneous, reconstruction