cmssw: Failures in Run 3 data reprocessing

We are seeing failures in the ongoing Run 3 data reprocessing, presumably related to the DeepTau implementation. Here is just one example of the failure: https://cms-unified.web.cern.ch/cms-unified/report/haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171338_6693

The crash message is:

Exception Message: invalid prediction = nan for tau_index = 0, pred_index = 0
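For reference, the message comes from a sanity check on the network output. A minimal Python illustration of that kind of guard (not the actual DeepTauId C++ code, just a sketch of the check the message implies):

```python
import numpy as np

# Illustration only: the job aborts when any DeepTau output score is not a
# finite number.
def check_predictions(pred, tau_index):
    for pred_index, value in enumerate(pred):
        if not np.isfinite(value):
            raise RuntimeError(
                f"invalid prediction = {value} for tau_index = {tau_index}, "
                f"pred_index = {pred_index}"
            )

# A NaN in the network output triggers the same kind of message:
check_predictions(np.array([float("nan"), 0.7]), tau_index=0)
```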

Most upvoted comments

At T2_US_Nebraska, we’re running a newer kernel (6.3.3-1.el8.elrepo.x86_64 at present) as it seemed to be more stable on our hardware.

Linux v6.0+ includes an additional field in smaps, Pss_Dirty, showing the portion of PSS made up of dirty pages. The awk command used to total the entries matches any line beginning with Pss, so it now also picks up Pss_Dirty.

Since dirty pages are already included in Pss, I believe this would double-count them and lead to jobs being killed unnecessarily.
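A minimal Python sketch of the effect (standing in for the actual awk pattern and WMCore accounting, which are not reproduced here):

```python
# Minimal sketch (not the actual awk/WMCore code): sum the PSS of a process
# from /proc/<pid>/smaps.  On kernels >= 6.0 a prefix match on "Pss" also
# picks up the new Pss_Dirty field, so dirty pages are counted twice.

def total_pss_kb(smaps_text, prefix_match=False):
    total = 0
    for line in smaps_text.splitlines():
        key = line.split(":", 1)[0]
        matches = key.startswith("Pss") if prefix_match else key == "Pss"
        if matches:
            total += int(line.split()[1])  # values are reported in kB
    return total

# Simplified smaps fragment for one mapping on a >= 6.0 kernel.
sample = "Pss:        100 kB\nPss_Dirty:   40 kB\n"
print(total_pss_kb(sample))                     # 100 -> correct PSS
print(total_pss_kb(sample, prefix_match=True))  # 140 -> dirty pages double-counted
```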

My summary: There were four, most probably independent, issues:

  1. TauId throws inside a loop that should not be executed at all, because the size of the collection is supposed to be zero. Not solved, not reproduced; most probably memory corruption.
  2. PSS > RSS. Identified; a solution has been proposed at the WMCore level.
  3. RSS suddenly grows “out of control”. Reproduced. It seems related to jemalloc and transparent huge pages (THP) not cooperating well under scarce resources and memory fragmentation. Proposed solution: switch to TCMalloc, which is explicitly designed to cooperate with THP.
  4. RelVal needs more than 2 GB per stream (see the other issue). This seems related to SimHit replay. The solution is to run with Nstreams < 0.5 * Nthreads (pending a reassessment of whether SimHit replay is needed, in particular in the Tracker); see the configuration sketch after this list.
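A minimal configuration sketch for point 4, assuming the standard multi-threading options of a cmsRun configuration (the thread and stream counts are only illustrative, and the exact option handling may vary between releases):

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("REPRO")  # stands in for the actual reprocessing config

# Point 4: keep the number of concurrent streams below half the number of
# threads, e.g. 8 threads but only 3 streams, so each stream gets a larger
# share of the per-job memory budget.
process.options = cms.untracked.PSet(
    numberOfThreads = cms.untracked.uint32(8),
    numberOfStreams = cms.untracked.uint32(3),
)
```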

Conclusion: the behaviour is not reproducible in detail, and it is not correlated with the event content.

Do the above checks indicate that deepTau is not causing the memory issues? The reason I ask is that when we looked into this previously, we determined that the exception message ‘invalid prediction = nan for tau_index = 0, pred_index = 0’ occurs when there is excessive memory usage, which causes TensorFlow to return NaN values, but we never confirmed that the deepTau modules were responsible for the excessive memory usage in the first place. Is it possible that some other module(s) are responsible and the exception message is a red herring, since the deepTau module tends to be the first one to crash when there are memory issues? Is it possible to check the memory usage of all other modules to confirm this?
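One way to get per-module numbers is the SimpleMemoryCheck service. A hedged sketch (the parameter names are quoted from memory and may differ between CMSSW releases):

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("MEMCHECK")  # in practice, append to the existing config

# Sketch only: SimpleMemoryCheck reports memory growth and can summarise it
# per module, which would show whether the deepTau producers or some other
# module drive the RSS/PSS increase.
process.SimpleMemoryCheck = cms.Service("SimpleMemoryCheck",
    ignoreTotal         = cms.untracked.int32(1),     # skip the first event
    oncePerEventMode    = cms.untracked.bool(False),
    moduleMemorySummary = cms.untracked.bool(True),   # end-of-job per-module summary
)
```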