cmssw: HLT farm crash in run 379617 (part-2)

While reviewing the full list of error stream files from run 379617 (related issue https://github.com/cms-sw/cmssw/issues/44769), stored in /eos/cms/store/group/tsg/FOG/debug/240417_run379617/, to check whether CMSSW_14_0_5_patch2 fixes all of the crashes (using the script in [1]), I found a single instance that still crashes. The input file is /eos/cms/store/group/tsg/FOG/debug/240417_run379617/run379617_ls0329_index000242_fu-c2b02-12-01_pid3327112.root .

To reproduce:

cmsrel CMSSW_14_0_5_patch2
cd CMSSW_14_0_5_patch2/src
cmsenv

and then run:

#!/bin/bash -ex

# CMSSW_14_0_5_patch2

hltGetConfiguration run:379617 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input /store/group/tsg/FOG/debug/240417_run379617/run379617_ls0329_index000242_fu-c2b02-12-01_pid3327112.root  > hlt.py
  
cmsRun hlt.py &> hlt.log

On lxplus-gpu the following assertion is hit:

terminate called after throwing an instance of 'std::runtime_error'
  what():  
src/HeterogeneousCore/CUDAUtilities/src/CachingDeviceAllocator.h, line 617:
cudaCheck(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream));
cudaErrorAssert: device-side assert triggered



A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.

src/RecoTracker/PixelSeeding/plugins/alpaka/BrokenLineFit.dev.cc:167: void alpaka_cuda_async::Kernel_BLFastFit<N, TrackerTraits>::operator()(const TAcc &, const reco::TrackSoA<TrackerTraits>::HitContainer *, const cms::alpakatools::OneToManyAssocRandomAccess<TrackerTraits::tindex_type, <expression>, TrackerTraits::maxNumberOfTuples> *, TrackingRecHitSoA<TrackerTraits>::Layout::ConstView, const pixelCPEforDevice::ParamsOnDeviceT<TrackerTraits> *, TrackerTraits::tindex_type *, double *, float *, double *, unsigned int, unsigned int, signed int) const [with TAcc = alpaka::AccGpuUniformCudaHipRt<alpaka::ApiCudaRt, std::integral_constant<unsigned long, 1UL>, unsigned int>; <template-parameter-2-2> = void; int N = 3; TrackerTraits = pixelTopology::Phase1]: block:[69,0,0], thread: [2,0,0] Assertion `fast_fit(3) == fast_fit(3)` failed.

On lxplus (i.e. running on CPU) no crash is observed.
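
Since the failed assertion is buried in a long log, a small helper can pull it out when checking a reproduction. The following is only a minimal sketch, assuming the log was written to hlt.log as in the command above. The function name `find_device_asserts` is hypothetical, and the pattern is simply the "Assertion ... failed" text visible in the excerpt above, so other failure modes would need other patterns.

```shell
#!/bin/bash

# Hypothetical helper (not part of CMSSW): print each distinct failed-assertion
# message found in a cmsRun log, with the number of times it occurs.
# The pattern matches text of the form: Assertion `<expression>` failed
find_device_asserts() {
    local logfile="$1"
    # -o prints only the matching part of each line; -E enables extended regex.
    # Several GPU threads can trip the same assert, so collapse duplicates.
    grep -oE 'Assertion `[^`]+` failed' "$logfile" | sort | uniq -c
}
```

For example, `find_device_asserts hlt.log` would list the `fast_fit(3) == fast_fit(3)` assertion once, with a count of how many threads reported it.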

@cms-sw/hlt-l2 FYI @cms-sw/heterogeneous-l2 FYI

[1]

#!/bin/bash -ex

# CMSSW_14_0_5_patch2

hltGetConfiguration run:379617 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input file:converted.root  > hlt.py

cat <<@EOF >> hlt.py
process.options.numberOfThreads = 32
process.options.numberOfStreams = 32
@EOF

# Define a function to execute each iteration of the loop
process_file() {
    inputfile="$1"
    outputfile="${inputfile%.root}"
    cp hlt.py hlt_${outputfile}.py
    sed -i "s/file:converted\.root/\/store\/group\/tsg\/FOG\/debug\/240417_run379617\/${inputfile}/g" hlt_${outputfile}.py
    cmsRun hlt_${outputfile}.py &> "${outputfile}.log"
}

# Export the function so it can be used by parallel
export -f process_file

# Find the root files and run the function in parallel using GNU Parallel
eos ls /eos/cms/store/group/tsg/FOG/debug/240417_run379617/ | grep '\.root$' | parallel -j 8 process_file
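
Once the parallel jobs finish, the per-file logs still have to be checked for crashes. A minimal sketch of that step follows, assuming the logs sit in one directory with the `.log` suffix used by `process_file`. The function name `list_crashed_logs` is hypothetical, and the marker is the "A fatal system signal has occurred" line that appears in the crash log above; a job that fails in some other way would not be caught by it.

```shell
#!/bin/bash

# Hypothetical helper (not part of CMSSW): list the logs that contain the
# fatal-signal marker seen in the crash output above.
# Usage: list_crashed_logs [directory]   (defaults to the current directory)
list_crashed_logs() {
    local logdir="${1:-.}"
    # grep -l prints only the names of files with at least one match;
    # errors (e.g. no *.log files present) are silenced.
    grep -l 'A fatal system signal has occurred' "$logdir"/*.log 2>/dev/null
}
```

Each name printed corresponds to one input file, since `process_file` names the log after the input file with `.root` stripped.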

About this issue

State: closed
Created 2 months ago
Comments: 27 (27 by maintainers)

Most upvoted comments

+hlt

issue fixed by https://github.com/cms-sw/cmssw/pull/44808 included in CMSSW_14_0_6
deployed online via CMSSW_14_0_6_MULTIARCHS (intermediate patches were not used online) together with HLT menu GRun 2024 V1.1. The first stable-beams run taken with CMSSW_14_0_6_MULTIARCHS, using the “v1.1.X” HLT menu (see CMSHLT-3164), was run 380306 (Run2024D). No further issues were spotted.

mmusich on May 7, 2024

we still need to look at run 379613 though

for the record, we checked all the available error stream files for that run with the following script [1] in CMSSW_14_0_5_patch2 and found no other crashes.

This particular issue should be dealt with by https://github.com/cms-sw/cmssw/pull/44808 (if accepted).

[1]

#!/bin/bash -ex

# CMSSW_14_0_5_patch2

hltGetConfiguration run:379613 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input file:converted.root  > hlt.py

cat <<@EOF >> hlt.py
process.options.numberOfThreads = 32
process.options.numberOfStreams = 32
@EOF

# Define a function to execute each iteration of the loop
process_file() {
    inputfile="$1"
    outputfile="${inputfile%.root}"
    cp hlt.py hlt_${outputfile}.py
    sed -i "s/file:converted\.root/\/store\/group\/tsg\/FOG\/debug\/240417_run379613\/${inputfile}/g" hlt_${outputfile}.py
    cmsRun hlt_${outputfile}.py &> "${outputfile}.log"
}

# Export the function so it can be used by parallel
export -f process_file

# Find the root files and run the function in parallel using GNU Parallel
eos ls /eos/cms/store/group/tsg/FOG/debug/240417_run379613/ | grep '\.root$' | parallel -j 8 process_file

mmusich on Apr 23, 2024

We never assert on NaN. Please remove the assert; I do not know who introduced those, and in any case none of them are safe under fast-math (on host).
Why are we asserting in Alpaka? Is this not making the GPU version dramatically slower?
We need to emit an error: a) we do not (yet) have a mechanism to propagate errors from device to host; b) most probably we can just let the NaN percolate and catch it later on the host; c) we should just invalidate that track.

VinInn on Apr 22, 2024

type tracking

slava77 on Apr 21, 2024

assign hlt, heterogeneous, reconstruction

makortel on Apr 19, 2024