cmssw: HLT Farm crashes (PFRecHitSoAProducerHCAL@alpaka) when HCAL is out

The HLT farm reported a large number of errors in run 379174, after HCAL was removed from the global run.

The error is:

An exception of category 'StdException' occurred while
[0] Processing Event run: 379174 lumi: 1 event: 4626 stream: 12
[1] Running path 'DST_PFScouting_DatasetMuon_v1'
[2] Calling method for module PFRecHitSoAProducerHCAL@alpaka/'hltParticleFlowRecHitHBHESoA'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_4-el8_amd64_gcc12/build/CMSSW_14_0_4-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp(259) 'TApi::setDevice(queue.m_spQueueImpl->m_dev.getNativeHandle())' A previous API call (not this one) set the error : 'cudaErrorInvalidConfiguration': 'invalid configuration argument'!

I will add the recipes to reproduce this error as soon as the data from the run without HCAL is available.

In run 379178, HCAL was added back and everything worked fine.

http://cmsonline.cern.ch/cms-elog/1209406

@cms-sw/hlt-l2 @cms-sw/heterogeneous-l2

About this issue

  • State: closed
  • Created 3 months ago
  • Comments: 23 (23 by maintainers)

Most upvoted comments

@cmsbuild, please close

+heterogeneous

+hlt

I think the issue comes from using the number of rechits to set the number of blocks in the alpaka kernel launches. I am currently trying to skip the kernel launches when there are 0 HCAL rechits (no HCAL).

Just have to make the fix a bit more elegant and I will get a branch together for further testing.
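To illustrate the suspected failure mode, here is a minimal sketch in plain Python rather than the actual alpaka code. It assumes the grid size is derived from the rechit count by a ceiling division; the helper names `divide_up_by` and `plan_launch` and the block size of 256 are hypothetical. With zero rechits the computed grid has zero blocks, which CUDA rejects with 'invalid configuration argument', so the guard skips the launch instead.

```python
def divide_up_by(value, divisor):
    """Ceiling division: number of blocks needed to cover `value` items."""
    return (value + divisor - 1) // divisor


def plan_launch(n_rechits, threads_per_block=256):
    """Return (blocks, threads_per_block), or None if no launch is needed.

    Hypothetical helper: threads_per_block=256 is an assumed value,
    not taken from the PFRecHit producer.
    """
    if n_rechits == 0:
        # No HCAL rechits: a grid with 0 blocks is an invalid launch
        # configuration, so skip the kernel launch entirely.
        return None
    return (divide_up_by(n_rechits, threads_per_block), threads_per_block)


if __name__ == "__main__":
    for n in (0, 1, 256, 257):
        print(n, plan_launch(n))
```

The early return keeps the zero-rechit case out of the launch path altogether. The actual fix may handle it differently, for example by still producing an empty output collection.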

I am taking a look

type pf

Not directly related to this particular issue with HCAL, but based on earlier cases in Run 3, I wanted to add that this should also be checked for ECAL and Pixel (I am not sure whether this should prompt a separate ticket or a tag of the corresponding DPG contacts). From our (FOG) side, we could check whether there are recent runs available with either detector out.

@mzarucki, as discussed elsewhere, by applying the same recipe as above but adjusting the FED selection to exclude either Pixel or ECAL:

# FED ID ranges to exclude (the upper bound of range() is exclusive)
_siPixelFEDs = list(range(1200, 1349))
_ECALFEDs = list(range(600, 670))

one can also produce data without those FEDs. Running the same test as above doesn’t produce a crash.
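As a rough illustration of what the FED selection amounts to, the sketch below filters a list of FED IDs against the two ranges quoted above. It is plain Python, not the CMSSW recipe itself: the helper `drop_feds` and the sample IDs are hypothetical, and only the two ranges come from the comment.

```python
# FED ID ranges quoted in the thread (range() upper bounds are exclusive).
SI_PIXEL_FEDS = set(range(1200, 1349))
ECAL_FEDS = set(range(600, 670))


def drop_feds(fed_ids, excluded):
    """Return the FED IDs that are not in the `excluded` set, in order.

    Hypothetical helper for illustration only.
    """
    return [fed for fed in fed_ids if fed not in excluded]


if __name__ == "__main__":
    # Made-up sample IDs straddling the boundaries of both ranges.
    sample = [599, 600, 669, 670, 1200, 1348, 1349]
    print(drop_feds(sample, ECAL_FEDS))      # emulate "ECAL out"
    print(drop_feds(sample, SI_PIXEL_FEDS))  # emulate "Pixel out"
```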

assign hlt, heterogeneous

@cms-sw/pf-l2 FYI