cmssw: HLT Farm crashes in runs 378366-378369
Report of the large number of GPU-related HLT crashes yesterday (elog)
- Related to illegal memory access
- The special Status_OnCPU path had a non-zero rate, which is unexpected as this only occurs when no GPU is available
- Not fully understood, as the HLT menus were unchanged with respect to the previous runs
- To suppress the crashes, all HLT menus were updated to disable all GPUs (elog); see the configuration sketch after this list
- DAQ experts confirmed these to be late crashes from the previous runs (elog)
- Same signature: illegal memory access, with the special Status_OnCPU path at a non-zero rate, unexpected as this only occurs when no GPU is available
- Suspected to be related to the GPU drivers → in contact with DAQ experts
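For reference, a minimal sketch of how a single CMSSW job can be forced to run CPU-only, using the standard process.options.accelerators switch (appended to hlt.py as in the recipe below). This is only an illustration of the effect; it is not necessarily the mechanism used to update the online menus:

# illustration only: restrict the job to the CPU accelerator, so no
# CUDA/Alpaka GPU modules are scheduled (the actual online change
# may have been implemented differently)
process.options.accelerators = ['cpu']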
Here’s the recipe to reproduce the crashes (tested with CMSSW_14_0_3 on lxplus8-gpu):
#!/bin/bash -ex
hltGetConfiguration adg:/cdaq/cosmic/commissioning2024/v1.1.0/HLT/V2 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input /store/group/tsg/FOG/debug/240325_run378367/files/run378367_ls0016_index000315_fu-c2b05-11-01_pid2219084.root \
> hlt.py
cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF
cmsRun hlt.py &> hlt.log
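If the crash only affects specific events, the input can be narrowed down. A minimal sketch, assuming the source created by hltGetConfiguration for this .root input is a PoolSource; the event ID below is a placeholder showing the syntax, and the real run:lumi:event of a crashing event has to be taken from hlt.log:

# appended to hlt.py: process only one (placeholder) event
# 378367:16:12345678 is NOT a known bad event, just an example of the syntax
import FWCore.ParameterSet.Config as cms
process.source.eventsToProcess = cms.untracked.VEventRange('378367:16:12345678')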
Here’s another way to reproduce the crashes.
# log in to an online GPU development machine (or lxplus8-gpu) and create a CMSSW area for 14.0.2
cmsrel CMSSW_14_0_2
cd CMSSW_14_0_2/src
cmsenv
# copy the HLT configuration that reproduces the crash and run it
https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 378366 > hlt_run378366.py
cat after_menu.py >> hlt_run378366.py ### See after_menu.py below
mkdir run378366
cmsRun hlt_run378366.py &> run378366.log
after_menu.py:
from EventFilter.Utilities.EvFDaqDirector_cfi import EvFDaqDirector as _EvFDaqDirector
process.EvFDaqDirector = _EvFDaqDirector.clone(
    buBaseDir = '/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream',
    runNumber = 378366
)

from EventFilter.Utilities.FedRawDataInputSource_cfi import source as _source
process.source = _source.clone(
    fileListMode = True,
    fileNames = (
        '/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/run378366/run378366_ls0001_index000000_fu-c2b03-05-01_pid1739399.raw',
    )
)

process.options.numberOfThreads = 1
process.options.numberOfStreams = 1
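To process every error-stream file of the run instead of a single one, the file list can be built from the directory contents. A minimal sketch, assuming all run378366_ls*.raw files in that EOS directory belong to the run:

# alternative to the single hard-coded file: glob all raw files of the run
import glob
process.source.fileNames = sorted(glob.glob(
    '/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/run378366/run378366_ls*.raw'
))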
@mmusich yes, the origin of the DQM crashes is the same. It (the SOI move) revealed a lack of protection in one of the HCAL reco components (the signal-time fit in MAHI) added at the end of 2022. Tracked down to a couple of “suboptimal” lines. A protection/workaround is being discussed.
I’ll make sure the Alpaka implementation has some protection against different SOI/TS configurations.
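As a purely hypothetical illustration (not the actual MAHI code or its fix), the kind of protection discussed amounts to making sure the time samples the fit uses around the SOI actually exist in the digi:

# hypothetical sketch: clamp the window of time samples used by a
# pulse-time fit so it never indexes outside the digi, whatever the
# SOI/TS configuration is
def fit_window(soi, n_samples, half_width=1):
    first = max(0, soi - half_width)
    last = min(n_samples - 1, soi + half_width)
    return range(first, last + 1)

# with 8 time samples and SOI=0, an unprotected [soi-1, soi+1] window
# would use the out-of-range index -1; the clamped window does not
assert list(fit_window(soi=0, n_samples=8)) == [0, 1]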
@syuvivida
sure, we’ll report back in this open issue (to eventually ask for its closure).
… if and when we have a full Alpaka implementation of the HCAL reconstruction, we will have a single code base to maintain 😃
assign hlt, heterogeneous