cmssw: HLT Farm crashes in run 378366~378369

Reporting the large number of GPU-related HLT crashes that occurred yesterday (elog):

  • Related to illegal memory accesses
  • The special Status_OnCPU path had a non-zero rate, which is unexpected, as this path only fires when no GPU is available (see the sketch after this list)
  • Not fully understood, as the HLT menus were unchanged with respect to the previous runs
  • To suppress the crashes, all HLT menus were updated to not use any GPUs (elog)
  • DAQ experts confirmed these to be late crashes from the previous runs (elog), showing the same symptoms: illegal memory accesses and a non-zero rate in the Status_OnCPU path
  • Suspected to be related to the GPU drivers → in contact with DAQ experts
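
For reference, the sketch below shows roughly how the Status_OnGPU / Status_OnCPU paths are expressed in a GPU-era HLT menu: a SwitchProducer resolves to its cuda branch only when a usable GPU is present, and Status_OnCPU selects the opposite case. The module types and labels used here (SwitchProducerCUDA, BooleanProducer, BooleanFilter, statusOnGPU) follow the usual menu-dump pattern but are illustrative and may not match the deployed menu exactly.

import FWCore.ParameterSet.Config as cms
from HeterogeneousCore.CUDACore.SwitchProducerCUDA import SwitchProducerCUDA

process = cms.Process("SKETCH")

# the 'cuda' branch is selected only when a usable GPU is available;
# otherwise the 'cpu' branch runs and produces False
process.statusOnGPU = SwitchProducerCUDA(
    cpu  = cms.EDProducer("BooleanProducer", value = cms.bool(False)),
    cuda = cms.EDProducer("BooleanProducer", value = cms.bool(True))
)

# passes only if the chosen branch produced True, i.e. only when running on GPU
process.statusOnGPUFilter = cms.EDFilter("BooleanFilter",
    src = cms.InputTag("statusOnGPU")
)

# Status_OnCPU should therefore fire only when no GPU is available,
# which is why a non-zero rate on the HLT farm is unexpected
process.Status_OnGPU = cms.Path(process.statusOnGPU + process.statusOnGPUFilter)
process.Status_OnCPU = cms.Path(process.statusOnGPU + ~process.statusOnGPUFilter)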

Here’s a recipe to reproduce the crashes (tested with CMSSW_14_0_3 on lxplus8-gpu):

#!/bin/bash -ex

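# dump the online HLT menu (archived ConfDB configuration) as a standalone
# cmsRun configuration, running over an error-stream file from run 378367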
hltGetConfiguration adg:/cdaq/cosmic/commissioning2024/v1.1.0/HLT/V2 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input /store/group/tsg/FOG/debug/240325_run378367/files/run378367_ls0016_index000315_fu-c2b05-11-01_pid2219084.root \
  > hlt.py

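# append customisations: print the trigger summary and run with a single
# thread (numberOfStreams = 0 means "as many streams as threads")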
cat <<@EOF >> hlt.py
process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

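# run the job; the output, including any crash report, ends up in hlt.log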
cmsRun hlt.py &> hlt.log

Here’s another way to reproduce the crashes:

# log in to an online GPU development machine (or lxplus8-gpu) and create a CMSSW area for 14.0.2
cmsrel CMSSW_14_0_2
cd CMSSW_14_0_2/src
cmsenv
# copy the HLT configuration that reproduces the crash and run it
https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 378366 > hlt_run378366.py
cat after_menu.py >> hlt_run378366.py ### See after_menu.py below
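# create the local run directory expected by the DAQ director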
mkdir run378366
cmsRun hlt_run378366.py &> run378366.log

vi after_menu.py

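# point the DAQ director at the error-stream files archived on EOS for run 378366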
from EventFilter.Utilities.EvFDaqDirector_cfi import EvFDaqDirector as _EvFDaqDirector
process.EvFDaqDirector = _EvFDaqDirector.clone(
    buBaseDir = '/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream',
    runNumber = 378366
)
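
# read the listed raw file directly (file-list mode) instead of polling the DAQ ramdisk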
from EventFilter.Utilities.FedRawDataInputSource_cfi import source as _source
process.source = _source.clone(
    fileListMode = True,
    fileNames = (
        '/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/run378366/run378366_ls0001_index000000_fu-c2b03-05-01_pid1739399.raw',
    )
)
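
# single thread and single stream, to get a reproducible stack trace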
process.options.numberOfThreads = 1
process.options.numberOfStreams = 1

@cms-sw/hlt-l2 FYI @cms-sw/heterogeneous-l2 FYI

Most upvoted comments

@mmusich yes, the origin of the DQM crashes is the same. It (the SOI move) revealed a lack of protection in one of the HCAL reconstruction components (the signal-time fit in MAHI) added at the end of 2022. It was tracked down to a couple of “suboptimal” lines. A protection/workaround is being discussed.

I’ll make sure the Alpaka implementation has some protection against different SOI/TS configurations

@syuvivida
sure, we’ll report to this open issue (to eventually ask for its closure).

… if and when we have a full Alpaka implementation of the HCAL reconstruction, we will have a single code base to maintain 😃

assign hlt, heterogeneous