cmssw: `DeepTauId` failures in RelVals (`Incompatible shapes`)

While running RelVals we are observing failures due to a TensorFlow exception coming from the DeepTauId module. Some examples are listed below.

1) 2023 Data reHLT + reRECO

In the HLTDR3_2023 step, in path HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7, in the 14_0_0_pre3 RelVals:

Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 367131 lumi: 11 event: 22076365 stream: 0
[1] Running path 'HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7'
[2] Calling method for module DeepTauId/'hltHpsPFTauDeepTauProducerForVBFIsoTau'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,64] vs. [154]
[[{{node inner_muon_norm_1/FusedBatchNorm_1/Mul}}]]

With the config here, this is what we get from wf 141.035 running L1REPACK:Full,HLT:@relval2024 (with HLT pointing at GRun here). The error is here; the wf is on Stats2.

The same step also fails in 13_3_0_pre5, in RunDisplacedJet2023C, in a different path (HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_PFJet60_v6), run with HLT:@relval2023. The error is here; the wf is on Stats2.

2) 2022 Data reHLT + reRECO

Much rarer: in the AODNANORUN3_reHLT_2022 step, in deepTau2017v2p1ForMini, in RunJetMET2022D with 14_0_0. The error is here; the wf is on Stats2.

Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 357735 lumi: 20 event: 32782226 stream: 0
[1] Running path 'NANOEDMAODoutput_step'
[2] Prefetching for module PoolOutputModule/'NANOEDMAODoutput'
[3] Prefetching for module SimpleCandidateFlatTableProducer/'boostedTauTable'
[4] Prefetching for module PATObjectCrossLinker/'linkedObjects'
[5] Prefetching for module PATJetRefSelector/'finalJetsPuppi'
[6] Prefetching for module PATJetUserDataEmbedder/'updatedJetsPuppiWithUserData'
[7] Prefetching for module PATJetUpdater/'updatedJetsPuppi'
[8] Prefetching for module PATJetSelector/'slimmedJetsPuppi'
[9] Prefetching for module PATJetUpdater/'updatedPatJetsTransientCorrectedSlimmedPuppiWithDeepTags'
[10] Prefetching for module BoostedJetONNXJetTagsProducer/'pfParticleNetFromMiniAODAK4PuppiCentralJetTagsSlimmedPuppiWithDeepTags'
[11] Prefetching for module ParticleNetFeatureEvaluator/'pfParticleNetFromMiniAODAK4PuppiCentralTagInfosSlimmedPuppiWithDeepTags'
[12] Prefetching for module PATTauIDEmbedder/'slimmedTaus'
[13] Calling method for module DeepTauId/'deepTau2017v2p1ForMini'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,64] vs. [154]
[[{{node inner_muon_norm_1/FusedBatchNorm_1/Mul}}]]

3) MC 2023

In the DigiPU_2023PU step, in hltHpsPFTauDeepTauProducer, in RelValTenTau_15_500 with 13_3_0_pre1 (at the moment the earliest occurrence I found). The error is here; the wf is on Stats2.

Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 1 lumi: 18 event: 1707 stream: 1
[1] Running path 'HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_OneProng_M5to80_v2'
[2] Calling method for module DeepTauId/'hltHpsPFTauDeepTauProducer'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,38] vs. [92]
[[{{node inner_hadrons_norm_1/FusedBatchNorm_1/Mul}}]]

CPU

At the moment it appears that in all cases the jobs were running on an Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz (or on a Gold one), i.e. Cascade Lake (see https://github.com/cms-sw/cmssw/issues/44333#issuecomment-1983672263).
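For context, a quick way to check whether a given node exposes the instruction sets mentioned above is to read the CPU flags. The sketch below is a minimal, Linux-only illustration (not part of any CMSSW tooling); `avx512f` and `avx512_vnni` are the standard /proc/cpuinfo flag names.

```python
# Minimal, Linux-only sketch: report whether this node advertises the AVX-512
# flags that seem to correlate with the failures described above.
def cpu_flags(path="/proc/cpuinfo"):
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print("avx512f     :", "avx512f" in flags)
print("avx512_vnni :", "avx512_vnni" in flags)
```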

About this issue

  • State: closed
  • Created 4 months ago
  • Comments: 47 (47 by maintainers)

Most upvoted comments

Hi all! I investigated the reproducer and I think I found the issue.

The number of valid_grid_cells here is 0 for this event, and this creates a TensorFlow `Tensor` with shape [0, 1, 1, N].

In TensorFlow this is a valid tensor: it has a well-defined shape but contains no elements.

>>> import tensorflow as tf
>>> tensor = tf.zeros([0, 1, 1, 86])
>>> tensor
<tf.Tensor: shape=(0, 1, 1, 86), dtype=float32, numpy=array([], shape=(0, 1, 1, 86), dtype=float32)>
>>> tf.print(tensor)
[]

Apparently, when this input is passed to a TF model executed on a CPU without AVX512F/AVX512_VNNI, the model runs and returns an empty output without complaining. When the AVX512F/AVX512_VNNI instructions are present, the jitting is different and the TF executor complains. I am not saying that it is understood why this happens, but this is the reason for the crash.
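As a hedged illustration of the "empty input just produces an empty output" part of this claim, here is an ad-hoc Keras batch-norm layer (not the actual DeepTau graph) applied to a [0, 1, 1, 64] input in eager mode:

```python
import tensorflow as tf

# Illustrative stand-in, not the DeepTau graph: a batch-norm layer applied to
# an empty [0, 1, 1, 64] batch simply returns an empty output of the same shape.
bn = tf.keras.layers.BatchNormalization()
empty = tf.zeros([0, 1, 1, 64])
out = bn(empty, training=False)
print(out.shape)  # (0, 1, 1, 64)
```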

I can prepare a PR with guards to avoid the execution of the model with empty inputs, and in parallel investigate this TF behaviour more deeply.
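For illustration only, a guard along those lines could look like the Python sketch below; the names (`run_guarded`, `n_outputs`) are hypothetical, and the real change would live in the C++ DeepTauId producer rather than in Python.

```python
import tensorflow as tf

# Hypothetical sketch of the proposed guard (illustrative names, not the actual
# DeepTauId code): skip the TF call entirely when any input tensor has a
# zero-sized dimension, and hand back an empty result of the expected width.
def run_guarded(model, inputs, n_outputs):
    if any(0 in t.shape for t in inputs):
        # Nothing to evaluate: return an empty [0, n_outputs] score tensor.
        return tf.zeros([0, n_outputs])
    return model(inputs)
```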

urgent

This failure has now been seen in Tier0 PromptReco: https://cms-talk.web.cern.ch/t/update-t0-skim-config-for-2024-pp-collision/36794/5

I can prepare a PR with guards to avoid the execution of the model with empty inputs, and in parallel investigate this TF behaviour more deeply.

@valsdav, we have established that this issue can affect Prompt Reconstruction and (potentially, when the new nodes for the HLT farm arrive) also online trigger operations. Please prepare PRs with guards to avoid the execution of the model with empty inputs. Thank you.

Marco (as ORM)

@cmsbuild, please close

+hlt

  • no issues observed after the 14.0.X PR was merged and tested in the IBs.

but the reason for the empty grid needs to be investigated with Tau experts.

@cms-sw/reconstruction-l2 this looks like it needs a separate issue. Can you open one?

+pdmv (really only the reporter)

+ml

Basic guards to solve the empty-input problem in DeepTauId are in place, but the reason for the empty grid needs to be investigated with Tau experts.

A more general guard for empty inputs will be added (see https://github.com/cms-sw/cmssw/issues/44481)

The reproducer also succeeds in 14_1_0_pre1.

I think this happens simply because that particular trigger path (HLT_DoublePFJets40_Mass500_MediumDeepTauPFTauHPS45_L2NN_MediumDeepTauPFTauHPS20_eta2p1_v) was removed in the meantime in https://github.com/cms-sw/cmssw/pull/44073 (14_1_X) and https://github.com/cms-sw/cmssw/pull/44074 (14_0_X). I think the reproducer would succeed in CMSSW_14_0_1 as well (but I didn't test it).

since it would end up running the same reHLT process on top of the same Event (195390586) of the same Run (367131) for which the failure appears

Since the process is run multi-threaded, are you sure that the last event that leaves a message-logger record is also the one crashing the process?

assign reconstruction

assign ml

HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_PFJet60_v6

This is a different path, so it points to a general, path-independent problem with DeepTauId.

type tau

assign pdmv

assign hlt