cmssw: HLT Farm crashes in run 378366~378369

Reporting the large number of GPU-related HLT crashes that occurred yesterday (elog):

  • Related to illegal memory accesses
  • The special Status_OnCPU path had a non-zero rate, which is unexpected, as this path only fires when no GPU is available (see the sketch after this list)
  • Not fully understood, as the HLT menus were unchanged with respect to the previous runs
  • To suppress the crashes, all HLT menus were updated to not use any GPUs (elog)
  • DAQ experts confirmed these to be late crashes from the previous runs (elog), showing the same symptoms: illegal memory accesses and a non-zero rate in the Status_OnCPU path
  • Suspected to be related to the GPU drivers → in contact with DAQ experts
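
For reference, the sketch below shows roughly how the Status_OnGPU / Status_OnCPU paths are expressed in a GPU-era HLT menu: a SwitchProducer resolves to its cuda branch only when a usable GPU is present, and Status_OnCPU selects the opposite case. The module types and labels used here (SwitchProducerCUDA, BooleanProducer, BooleanFilter, statusOnGPU) follow the usual menu-dump pattern but are illustrative and may not match the deployed menu exactly.

import FWCore.ParameterSet.Config as cms
from HeterogeneousCore.CUDACore.SwitchProducerCUDA import SwitchProducerCUDA

process = cms.Process("SKETCH")

# the 'cuda' branch is selected only when a usable GPU is available;
# otherwise the 'cpu' branch runs and produces False
process.statusOnGPU = SwitchProducerCUDA(
    cpu  = cms.EDProducer("BooleanProducer", value = cms.bool(False)),
    cuda = cms.EDProducer("BooleanProducer", value = cms.bool(True))
)

# passes only if the chosen branch produced True, i.e. only when running on GPU
process.statusOnGPUFilter = cms.EDFilter("BooleanFilter",
    src = cms.InputTag("statusOnGPU")
)

# Status_OnCPU should therefore fire only when no GPU is available,
# which is why a non-zero rate on the HLT farm is unexpected
process.Status_OnGPU = cms.Path(process.statusOnGPU + process.statusOnGPUFilter)
process.Status_OnCPU = cms.Path(process.statusOnGPU + ~process.statusOnGPUFilter)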

Here’s a recipe to reproduce the crashes (tested with CMSSW_14_0_3 on lxplus8-gpu):

#!/bin/bash -ex

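# dump the online HLT menu (archived ConfDB configuration) as a standalone
# cmsRun configuration, running over an error-stream file from run 378367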
hltGetConfiguration adg:/cdaq/cosmic/commissioning2024/v1.1.0/HLT/V2 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input /store/group/tsg/FOG/debug/240325_run378367/files/run378367_ls0016_index000315_fu-c2b05-11-01_pid2219084.root \
  > hlt.py

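# append customisations: print the trigger summary and run with a single
# thread (numberOfStreams = 0 means "as many streams as threads")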
cat <<@EOF >> hlt.py
process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

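# run the job; the output, including any crash report, ends up in hlt.log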
cmsRun hlt.py &> hlt.log

Here’s another way to reproduce the crashes:

# log in to an online GPU development machine (or lxplus8-gpu) and create a CMSSW area for 14.0.2
cmsrel CMSSW_14_0_2
cd CMSSW_14_0_2/src
cmsenv
# copy the HLT configuration that reproduces the crash and run it
https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 378366 > hlt_run378366.py
cat after_menu.py >> hlt_run378366.py ### See after_menu.py below
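# create the local run directory expected by the DAQ director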
mkdir run378366
cmsRun hlt_run378366.py &> run378366.log

vi after_menu.py

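# point the DAQ director at the error-stream files archived on EOS for run 378366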
from EventFilter.Utilities.EvFDaqDirector_cfi import EvFDaqDirector as _EvFDaqDirector
process.EvFDaqDirector = _EvFDaqDirector.clone(
    buBaseDir = '/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream',
    runNumber = 378366
)
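
# read the listed raw file directly (file-list mode) instead of polling the DAQ ramdisk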
from EventFilter.Utilities.FedRawDataInputSource_cfi import source as _source
process.source = _source.clone(
    fileListMode = True,
    fileNames = (
        '/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/run378366/run378366_ls0001_index000000_fu-c2b03-05-01_pid1739399.raw',
    )
)
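
# single thread and single stream, to get a reproducible stack trace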
process.options.numberOfThreads = 1
process.options.numberOfStreams = 1

@cms-sw/hlt-l2 FYI @cms-sw/heterogeneous-l2 FYI

Most upvoted comments

@mmusich yes, the origin of the DQM crashes is the same. It (the SOI move) revealed a lack of protection in one of the HCAL reconstruction components (the signal-time fit in MAHI) added at the end of 2022. It was tracked down to a couple of “suboptimal” lines. A protection/workaround is being discussed.

I’ll make sure the Alpaka implementation has some protection against different SOI/TS configurations

@syuvivida
sure, we’ll report to this open issue (to eventually ask for its closure).

… if and when we have a full Alpaka implementation of the HCAL reconstruction, we will have a single code base to maintain 😃

assign hlt, heterogeneous