cmssw: Raw data object rarely goes missing in multi-threaded HLT jobs

During recent output module tests that spend a lot of CPU time in the output modules (for 2021 Heavy Ion running), I have been encountering spurious errors in cmsRun jobs with 4 threads and 4 streams, and would like to report them.

The issue appears when a module cannot find a product inserted by the input source (invalid handle). This causes the following error message in the L1 HLT module (found in the VirginRaw HLT menu):

%MSG-e L1T:  L1TRawToDigi:hltGtStage2Digis 03-Mar-2020 18:59:42 CET  Run: 1000001680 Event: 6996208
Cannot unpack: no FEDRawDataCollection found
%MSG

(followed by another, fatal exception).

It happens here: https://github.com/cms-sw/cmssw/blob/master/EventFilter/L1TRawToDigi/plugins/L1TRawToDigi.cc#L140
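
To make the failure mode concrete for readers who do not have the source open: the message is what a module prints when the handle it asked for turns out to be invalid. The block below is only a standalone sketch of that kind of check, not the CMSSW code. `MockRawData`, `MockHandle`, `MockEvent` and `unpack` are hypothetical names invented for illustration; only the error text is taken from the log above.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the raw data product (not the CMSSW class).
struct MockRawData {
  std::vector<unsigned char> bytes;
};

// Hypothetical stand-in for a product handle: valid only if it points to a product.
class MockHandle {
public:
  MockHandle() = default;
  explicit MockHandle(const MockRawData* product) : product_(product) {}
  bool isValid() const { return product_ != nullptr; }
  const MockRawData& operator*() const { return *product_; }

private:
  const MockRawData* product_ = nullptr;
};

// Hypothetical stand-in for an event that may or may not hold the raw data.
struct MockEvent {
  bool hasRaw = false;
  MockRawData raw;
  // Returns a valid handle only when the product has been inserted.
  MockHandle getRaw() const { return hasRaw ? MockHandle(&raw) : MockHandle(); }
};

// Returns true if the product was found; otherwise stores the error text
// seen in the log and returns false without touching the product.
bool unpack(const MockEvent& event, std::string& error) {
  MockHandle feds = event.getRaw();
  if (!feds.isValid()) {
    error = "Cannot unpack: no FEDRawDataCollection found";
    return false;
  }
  error.clear();
  return true;
}
```

In this sketch an event with `hasRaw == false` takes the error branch, which corresponds to the situation reported here: the consumer runs, but the product it expects is not visible to it.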

It is relatively rare and hard to reproduce. I was running on a scaled-down HLT farm of 12 nodes (32 cores per node, 1 process assigned per 4 cores). In this setup the problem appears every 30 minutes. It also appears only when the CPU load is near maximum (which I achieved by using expensive LZMA compression in the output module, or by doing other equivalent work with similar CPU time). It did not appear with a 1-stream/1-thread setup. While I first saw this with CMSSW_11_0_1, further testing showed the same problem with CMSSW_10_6_8.

I did some debugging, and it appears that the code in FedRawDataInputSource which inserts the FEDRawDataCollection does run (according to a logged message) before this issue appears in the same event, i.e. the product seems to go missing in between.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 93 (92 by maintainers)

Most upvoted comments

I have been running for several hours in total (the job fills the disk after 1 hour), but so far no assertion has fired (normally I would get it within minutes). It seems that the problem is solved by proper ordering of the modules.