cmssw: `DeepTauId` failures in RelVals (`Incompatible shapes`)
While running RelVals we are observing failures due to a TensorFlow exception coming from the `DeepTauId` module. Some examples are listed below.
1) 2023 Data reHLT + reRECO

In the HLTDR3_2023 step, in path `HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7`, in 14_0_0_pre3 RelVals:
```
Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 367131 lumi: 11 event: 22076365 stream: 0
[1] Running path 'HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7'
[2] Calling method for module DeepTauId/'hltHpsPFTauDeepTauProducerForVBFIsoTau'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,64] vs. [154]
[[{{node inner_muon_norm_1/FusedBatchNorm_1/Mul}}]]
```
With the config here, this is what we get from wf 141.035 running `L1REPACK:Full,HLT:@relval2024` (HLT pointing at GRun here). The error is here; the wf is on Stats2.
Also in the same step, in 13_3_0_pre5, for RunDisplacedJet2023C in a different path (`HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_PFJet60_v6`) run with `HLT:@relval2023`. The error is here; the wf is on Stats2.
2) 2022 Data reHLT + reRECO

Much rarer: in the AODNANORUN3_reHLT_2022 step, in `deepTau2017v2p1ForMini`, in RunJetMET2022D with 14_0_0. The error is here; the wf is on Stats2:
```
Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 357735 lumi: 20 event: 32782226 stream: 0
[1] Running path 'NANOEDMAODoutput_step'
[2] Prefetching for module PoolOutputModule/'NANOEDMAODoutput'
[3] Prefetching for module SimpleCandidateFlatTableProducer/'boostedTauTable'
[4] Prefetching for module PATObjectCrossLinker/'linkedObjects'
[5] Prefetching for module PATJetRefSelector/'finalJetsPuppi'
[6] Prefetching for module PATJetUserDataEmbedder/'updatedJetsPuppiWithUserData'
[7] Prefetching for module PATJetUpdater/'updatedJetsPuppi'
[8] Prefetching for module PATJetSelector/'slimmedJetsPuppi'
[9] Prefetching for module PATJetUpdater/'updatedPatJetsTransientCorrectedSlimmedPuppiWithDeepTags'
[10] Prefetching for module BoostedJetONNXJetTagsProducer/'pfParticleNetFromMiniAODAK4PuppiCentralJetTagsSlimmedPuppiWithDeepTags'
[11] Prefetching for module ParticleNetFeatureEvaluator/'pfParticleNetFromMiniAODAK4PuppiCentralTagInfosSlimmedPuppiWithDeepTags'
[12] Prefetching for module PATTauIDEmbedder/'slimmedTaus'
[13] Calling method for module DeepTauId/'deepTau2017v2p1ForMini'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,64] vs. [154]
[[{{node inner_muon_norm_1/FusedBatchNorm_1/Mul}}]]
```
3) MC 2023

In the DigiPU_2023PU step, in `hltHpsPFTauDeepTauProducer`, in RelValTenTau_15_500 with 13_3_0_pre1 (at the moment the first occurrence I found). The error is here; the wf is on Stats2:
```
Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 1 lumi: 18 event: 1707 stream: 1
[1] Running path 'HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_OneProng_M5to80_v2'
[2] Calling method for module DeepTauId/'hltHpsPFTauDeepTauProducer'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,38] vs. [92]
[[{{node inner_hadrons_norm_1/FusedBatchNorm_1/Mul}}]]
```
CPU

At the moment it appears that in all cases the jobs were running on an Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz (or on a Gold one), i.e. Cascade Lake (see https://github.com/cms-sw/cmssw/issues/44333#issuecomment-1983672263).
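Since the failure appears to depend on the AVX-512 instructions the host exposes, a quick way to compare nodes is to query the CPU feature flags directly. A minimal sketch using the GCC/Clang `__builtin_cpu_supports` builtin (illustrative only, not part of the issue):

```cpp
// Minimal check for the AVX-512 features implicated above; prints 1 if the
// host CPU supports the feature (assumes GCC/Clang builtins are available).
#include <cstdio>

int main() {
  __builtin_cpu_init();
  std::printf("avx512f:    %d\n", __builtin_cpu_supports("avx512f"));
  std::printf("avx512vnni: %d\n", __builtin_cpu_supports("avx512vnni"));
  return 0;
}
```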
Hi all! I investigated the reproducer and I think I found the issue.

The number of `valid_grid_cells` here is 0 for this event, and this creates a `TF::Tensor` with shape [0, 1, 1, N]. In TensorFlow this is a valid tensor: it has a well-defined shape but it is empty. Apparently, when this input is passed to a TF model executed on a CPU without `AVX512F`/`AVX512_VNNI`, the model is executed and returns an empty output without complaining. When `AVX512F`/`AVX512_VNNI` instructions are present, the jitting is different and the TF executor complains. I'm not saying that it is understood why this happens, but this is the reason for the crash. I can prepare a PR with guards to avoid executing the model with empty inputs, and in parallel investigate this TF behaviour more deeply.
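To make the guard idea concrete, here is a hedged sketch using the CMSSW `PhysicsTools/TensorFlow` interface; the function, input and output names are hypothetical, not the actual `DeepTauId` code:

```cpp
// Sketch of the empty-input guard idea (names are illustrative only).
#include "PhysicsTools/TensorFlow/interface/TensorFlow.h"

#include <cstdint>
#include <vector>

std::vector<tensorflow::Tensor> runInnerBlock(tensorflow::Session* session,
                                              const tensorflow::Tensor& cells,
                                              int64_t nOutputs) {
  // 'cells' has shape [nValidGridCells, 1, 1, nFeatures]; when no grid cell
  // is valid, the first dimension is 0: a valid but empty tensor.
  if (cells.dim_size(0) == 0) {
    // Skip the session run entirely and return an empty output of the
    // expected shape, instead of handing an empty batch to the TF kernels.
    return {tensorflow::Tensor(tensorflow::DT_FLOAT,
                               tensorflow::TensorShape{0, nOutputs})};
  }
  std::vector<tensorflow::Tensor> outputs;
  tensorflow::run(session, {{"inner_cells", cells}}, {"inner_output"}, &outputs);
  return outputs;
}
```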
urgent
@valsdav, we have established that this issue can affect Prompt Reconstruction and (potentially, when the new nodes for the HLT farm arrive) also online trigger operations. Please prepare PRs with guards to avoid the execution of the model with empty inputs. Thank you.
Marco (as ORM)
@cmsbuild, please close
+hlt
@cms-sw/reconstruction-l2 this looks like it needs a separate issue. Can you open one?
+pdmv (really only the reporter)
+ml
Basic guards to solve the empty-input problem in DeepTauId are in place, but the reason for the empty grid needs to be investigated with Tau experts.
A more general guard for empty inputs will be added (see https://github.com/cms-sw/cmssw/issues/44481).
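For illustration, one possible shape of such a general guard, sketched as a hypothetical wrapper around `tensorflow::run` (the actual design is being worked out in the issue above and may differ):

```cpp
// Hypothetical generic guard around tensorflow::run: refuse empty inputs
// with a clear CMSSW exception instead of the opaque TF "Incompatible
// shapes" error. Illustrative only; the real fix is tracked in #44481.
#include "FWCore/Utilities/interface/Exception.h"
#include "PhysicsTools/TensorFlow/interface/TensorFlow.h"

#include <string>
#include <vector>

void runChecked(tensorflow::Session* session,
                const tensorflow::NamedTensorList& inputs,
                const std::vector<std::string>& outputNames,
                std::vector<tensorflow::Tensor>* outputs) {
  for (const auto& [name, tensor] : inputs) {
    if (tensor.NumElements() == 0) {
      throw cms::Exception("EmptyTensorInput")
          << "Input tensor '" << name << "' is empty (shape "
          << tensor.shape().DebugString() << "); refusing to run the session.";
    }
  }
  tensorflow::run(session, inputs, outputNames, outputs);
}
```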
+1 solved by https://github.com/cms-sw/cmssw/pull/44455
For the record, the proposed fixes are:
I think this happens simply because that particular trigger path (`HLT_DoublePFJets40_Mass500_MediumDeepTauPFTauHPS45_L2NN_MediumDeepTauPFTauHPS20_eta2p1_v`) got removed in the meantime in https://github.com/cms-sw/cmssw/pull/44073 (14_1_X) and https://github.com/cms-sw/cmssw/pull/44074 (14_0_X). I think the reproducer would succeed in CMSSW_14_0_1 as well (but I didn't test it).

Since the process is run multi-threaded, are you sure that the last event that leaves a message logger record is also the one crashing the process?
assign reconstruction
assign ml
This is a different path, so it points to a general problem with `DeepTauId` (not specific to one path).

type tau
@cms-sw/tau-pog-l2 FYI
assign pdmv
assign hlt