cmssw: Failures in Run 3 data reprocessing
We are seeing failures in the ongoing Run 3 data reprocessing, presumably related to the DeepTau implementation. Here is just one example of the failure: https://cms-unified.web.cern.ch/cms-unified/report/haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171338_6693
The crash message is:
Exception Message: invalid prediction = nan for tau_index = 0, pred_index = 0
PdmV
About this issue
- State: open
- Created a year ago
- Comments: 186 (178 by maintainers)
At T2_US_Nebraska, we’re running a newer kernel (6.3.3-1.el8.elrepo.x86_64 at present) as it seemed to be more stable on our hardware.
Linux v6.0+ includes an additional field in smaps, Pss_Dirty, showing the portion of PSS with dirty pages. The awk command to total the entries looks for any lines beginning with "Pss". I believe this would lead to double-counting of any dirty pages, and jobs being killed unnecessarily.
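To illustrate the double-counting concern, here is a minimal Python sketch. The actual awk command is not quoted in this thread, so the prefix-matching behaviour is modelled from the description above, and the sample smaps text and its values are made up for illustration. The function names are hypothetical.

```python
# Hypothetical smaps excerpt for two mappings on a Linux 6.0+ kernel.
# The numbers are invented; only the field names follow the real format.
SAMPLE_SMAPS = """\
Size:                132 kB
Rss:                  20 kB
Pss:                  12 kB
Pss_Dirty:             4 kB
Shared_Clean:          8 kB
Size:                 64 kB
Rss:                  16 kB
Pss:                  16 kB
Pss_Dirty:            16 kB
"""


def total_kb(smaps_text, key_matches):
    """Sum the kB value of every line whose field name satisfies key_matches."""
    total = 0
    for line in smaps_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and key_matches(fields[0]):
            total += int(fields[1])
    return total


def pss_prefix_match(smaps_text):
    """Mimics a pattern that matches any line beginning with 'Pss'."""
    return total_kb(smaps_text, lambda key: key.startswith("Pss"))


def pss_exact_match(smaps_text):
    """Counts only the 'Pss:' field, ignoring Pss_Dirty and similar fields."""
    return total_kb(smaps_text, lambda key: key == "Pss:")


if __name__ == "__main__":
    print("prefix match:", pss_prefix_match(SAMPLE_SMAPS), "kB")
    print("exact match: ", pss_exact_match(SAMPLE_SMAPS), "kB")
```

With this sample, the prefix match adds the Pss_Dirty values on top of Pss, so the reported total is larger than the true PSS. Matching the exact field name avoids that on both older and newer kernels.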
My summary: There were four, most probably independent, issues:
Conclusion: the behaviour is not reproducible in detail and is not correlated to the event content.
Do the above checks indicate that deepTau is not causing the memory issues? The reason I ask is that when we looked into this previously, we determined that the exception message ‘invalid prediction = nan for tau_index = 0, pred_index = 0’ occurs when there is excessive memory usage causing TensorFlow to return NaN values, but we never confirmed that the deepTau modules were responsible for the excessive memory usage in the first place. Is it possible that some other module(s) are responsible and the exception message is a red herring, as the deepTau module tends to be the first one to crash when there are memory issues? Is it possible to check the memory usage of all other modules to confirm this?