cmssw: TFormula::ReInitializeEvalMethod error in IBs
Two workflows (28234.0 step2 and 23434.999 step3, the DIGI step in both cases) failed in the CMSSW_11_2_X_2020-08-07-1100 IB with:
----- Begin Fatal Exception 07-Aug-2020 15:56:43 CEST-----------------------
An exception of category 'FatalRootError' occurred while
[0] Processing Event run: 1 lumi: 1 event: 3 stream: 2
[1] Running path 'FEVTDEBUGHLToutput_step'
[2] Prefetching for module PoolOutputModule/'FEVTDEBUGHLToutput'
[3] Prefetching for module FastjetJetProducer/'ak4PFL1Calo'
[4] Prefetching for module L1TPFCandMultiMerger/'l1pfCandidates'
[5] Prefetching for module L1TPFProducer/'l1pfProducerBarrel'
[6] Calling method for module L1TkMuonProducer/'L1TkMuons'
Additional Info:
[a] Fatal Root Error: @SUB=TFormula::ReInitializeEvalMethod
Formula is NOT properly initialized - try calling again TFormula::PrepareEvalMethod
----- End Fatal Exception -------------------------------------------------
Full logs:
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc820/CMSSW_11_2_X_2020-08-07-1100/pyRelValMatrixLogs/run/28234.0_TTbar_14TeV+2026D60+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HARVESTGlobal/step2_TTbar_14TeV+2026D60+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HARVESTGlobal.log#/
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc820/CMSSW_11_2_X_2020-08-07-1100/pyRelValMatrixLogs/run/23434.999_TTbar_14TeV+2026D49PU_PMXS1S2PR+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14INPUT+PREMIX_PremixHLBeamSpot14PU+DigiTriggerPU+RecoGlobalPU+HARVESTGlobalPU/step3_TTbar_14TeV+2026D49PU_PMXS1S2PR+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14INPUT+PREMIX_PremixHLBeamSpot14PU+DigiTriggerPU+RecoGlobalPU+HARVESTGlobalPU.log#/
About this issue
- State: closed
- Created 4 years ago
- Comments: 19 (16 by maintainers)
Ok this time for sure, there is a race condition here:
https://github.com/root-project/root/blob/084292ab638923f9260d3a9f813227ada0728565/hist/hist/src/TFormula.cxx#L3347
fLazyInitialization can be cleared by one thread while another is waiting on the mutex, and there's no recheck after acquiring the mutex. The flags also need to be atomic for the recheck to be safe. It's probably worth filing a bug report with ROOT, and auditing the rest of TFormula for missing atomics, which matter even more on architectures with weaker memory-ordering guarantees than amd64.
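For illustration, the recheck pattern described above could look roughly like the sketch below. This is not ROOT's TFormula code: `LazyFormula`, `needsInit_`, `prepare()` and `runConcurrentEval()` are hypothetical names made up for this example, and the "formula" is just a fixed slope. It only shows an atomic flag that is tested once outside the mutex and again after the mutex is acquired.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a lazily initialized formula (NOT ROOT's TFormula).
class LazyFormula {
public:
  double eval(double x) {
    // Fast path: the acquire load pairs with the release store in prepare().
    if (needsInit_.load(std::memory_order_acquire)) {
      std::lock_guard<std::mutex> lock(mutex_);
      // Recheck under the lock: another thread may have finished the
      // initialization while this thread was waiting on the mutex.
      if (needsInit_.load(std::memory_order_relaxed)) {
        prepare();
      }
    }
    return slope_ * x;
  }

  int prepareCalls() const { return prepareCalls_.load(); }

private:
  // Called only with mutex_ held.
  void prepare() {
    ++prepareCalls_;
    slope_ = 2.0;  // stands in for the expensive set-up work
    // Publish the initialized state only after all set-up writes are done.
    needsInit_.store(false, std::memory_order_release);
  }

  std::atomic<bool> needsInit_{true};
  std::atomic<int> prepareCalls_{0};
  std::mutex mutex_;
  double slope_ = 0.0;
};

// Calls eval() on one shared instance from nThreads threads.
// Returns how many times prepare() ran, or -1 if any thread saw a wrong value.
int runConcurrentEval(int nThreads) {
  assert(nThreads > 0);
  LazyFormula formula;
  std::vector<double> results(nThreads, 0.0);
  std::vector<std::thread> threads;
  for (int i = 0; i < nThreads; ++i) {
    threads.emplace_back([&formula, &results, i] { results[i] = formula.eval(1.0); });
  }
  for (auto& t : threads) {
    t.join();
  }
  for (double r : results) {
    if (r != 2.0) {
      return -1;
    }
  }
  return formula.prepareCalls();
}
```

With the recheck in place, `runConcurrentEval(8)` should report exactly one call to `prepare()`, however the threads interleave. Without the second test of the flag, a thread that was blocked on the mutex would go on to initialize again, or act on a state that no longer holds.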
Making it a stream producer should fix it, since then the threads aren’t sharing a single instance.
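The idea behind the stream-producer suggestion can be sketched in plain C++, without any CMSSW headers. Everything here is an assumption made for illustration: `UnsafeLazyFormula` and `runPerStream()` are hypothetical names, and ordinary `std::thread`s stand in for framework streams. Each "stream" owns its own instance, so the unsynchronized lazy initialization is only ever touched by one thread.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper with unsynchronized lazy state; safe only if a
// single thread uses a given instance.
struct UnsafeLazyFormula {
  bool ready = false;
  double slope = 0.0;
  int prepareCalls = 0;

  double eval(double x) {
    if (!ready) {  // no lock and no atomic: relies on exclusive ownership
      slope = 2.0;
      ready = true;
      ++prepareCalls;
    }
    return slope * x;
  }
};

// Runs nStreams threads, each with its own formula instance.
// Returns the total number of initializations, or -1 on a wrong result.
int runPerStream(int nStreams, int eventsPerStream) {
  assert(nStreams > 0);
  std::vector<UnsafeLazyFormula> perStream(nStreams);  // one instance per stream
  std::vector<double> sums(nStreams, 0.0);
  std::vector<std::thread> threads;
  for (int s = 0; s < nStreams; ++s) {
    threads.emplace_back([&perStream, &sums, s, eventsPerStream] {
      for (int e = 0; e < eventsPerStream; ++e) {
        sums[s] += perStream[s].eval(1.0);  // touches only element s
      }
    });
  }
  for (auto& t : threads) {
    t.join();
  }
  int totalPrepareCalls = 0;
  for (int s = 0; s < nStreams; ++s) {
    if (sums[s] != 2.0 * eventsPerStream) {
      return -1;
    }
    totalPrepareCalls += perStream[s].prepareCalls;
  }
  return totalPrepareCalls;
}
```

The trade-off is that the set-up work runs once per stream instead of once overall (`runPerStream(4, 100)` initializes four times), in exchange for not needing any synchronization around the formula.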
Sorry about spamming.