cmssw: `DeepTauId` failures in RelVals (`Incompatible shapes`)

While running RelVals we are observing failures due to a TensorFlow exception coming from the DeepTauId module. Some examples are listed below.

1) 2023 Data reHLT + reRECO

In the HLTDR3_2023 step, in path HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7, in the 14_0_0_pre3 RelVals:

Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 367131 lumi: 11 event: 22076365 stream: 0
[1] Running path 'HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7'
[2] Calling method for module DeepTauId/'hltHpsPFTauDeepTauProducerForVBFIsoTau'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,64] vs. [154]
[[{{node inner_muon_norm_1/FusedBatchNorm_1/Mul}}]]

With the config here, this is what we get from wf 141.035 running L1REPACK:Full,HLT:@relval2024 (with HLT pointing at GRun here). The error is here; the wf is on Stats2.

The same step also fails in 13_3_0_pre5, in RunDisplacedJet2023C, in a different path (HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_PFJet60_v6), run with HLT:@relval2023. The error is here; the wf is on Stats2.

2) 2022 Data reHLT + reRECO

Much rarer: in the AODNANORUN3_reHLT_2022 step, in deepTau2017v2p1ForMini, in RunJetMET2022D with 14_0_0. The error is here; the wf is on Stats2.

Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 357735 lumi: 20 event: 32782226 stream: 0
[1] Running path 'NANOEDMAODoutput_step'
[2] Prefetching for module PoolOutputModule/'NANOEDMAODoutput'
[3] Prefetching for module SimpleCandidateFlatTableProducer/'boostedTauTable'
[4] Prefetching for module PATObjectCrossLinker/'linkedObjects'
[5] Prefetching for module PATJetRefSelector/'finalJetsPuppi'
[6] Prefetching for module PATJetUserDataEmbedder/'updatedJetsPuppiWithUserData'
[7] Prefetching for module PATJetUpdater/'updatedJetsPuppi'
[8] Prefetching for module PATJetSelector/'slimmedJetsPuppi'
[9] Prefetching for module PATJetUpdater/'updatedPatJetsTransientCorrectedSlimmedPuppiWithDeepTags'
[10] Prefetching for module BoostedJetONNXJetTagsProducer/'pfParticleNetFromMiniAODAK4PuppiCentralJetTagsSlimmedPuppiWithDeepTags'
[11] Prefetching for module ParticleNetFeatureEvaluator/'pfParticleNetFromMiniAODAK4PuppiCentralTagInfosSlimmedPuppiWithDeepTags'
[12] Prefetching for module PATTauIDEmbedder/'slimmedTaus'
[13] Calling method for module DeepTauId/'deepTau2017v2p1ForMini'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,64] vs. [154]
[[{{node inner_muon_norm_1/FusedBatchNorm_1/Mul}}]]

3) MC 2023

In the DigiPU_2023PU step, in hltHpsPFTauDeepTauProducer, in RelValTenTau_15_500 with 13_3_0_pre1 (at the moment the earliest occurrence I found). The error is here; the wf is on Stats2.

Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 1 lumi: 18 event: 1707 stream: 1
[1] Running path 'HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_OneProng_M5to80_v2'
[2] Calling method for module DeepTauId/'hltHpsPFTauDeepTauProducer'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,38] vs. [92]
[[{{node inner_hadrons_norm_1/FusedBatchNorm_1/Mul}}]]

CPU

At the moment it appears that in all cases the jobs were running on an Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz (or on a Gold one), i.e. Cascade Lake (see https://github.com/cms-sw/cmssw/issues/44333#issuecomment-1983672263).
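For context, a quick way to check whether a given node exposes the instruction sets mentioned above is to read the CPU flags. The sketch below is a minimal, Linux-only illustration (not part of any CMSSW tooling); `avx512f` and `avx512_vnni` are the standard /proc/cpuinfo flag names.

```python
# Minimal, Linux-only sketch: report whether this node advertises the AVX-512
# flags that seem to correlate with the failures described above.
def cpu_flags(path="/proc/cpuinfo"):
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print("avx512f     :", "avx512f" in flags)
print("avx512_vnni :", "avx512_vnni" in flags)
```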

About this issue

  • State: closed
  • Created 4 months ago
  • Comments: 47 (47 by maintainers)

Most upvoted comments

Hi all! I investigated the reproducer and I think I found the issue.

The number of valid_grid_cells here is 0 for this event, and this creates a TensorFlow `Tensor` with shape [0, 1, 1, N].

In TensorFlow this is a valid tensor: it has a well-defined shape but contains no elements.

>>> import tensorflow as tf
>>> tensor = tf.zeros([0, 1, 1, 86])
>>> tensor
<tf.Tensor: shape=(0, 1, 1, 86), dtype=float32, numpy=array([], shape=(0, 1, 1, 86), dtype=float32)>
>>> tf.print(tensor)
[]

Apparently, when this input is passed to a TF model executed on a CPU without AVX512F/AVX512_VNNI, the model runs and returns an empty output without complaining. When the AVX512F/AVX512_VNNI instructions are present, the jitting is different and the TF executor complains. I am not saying that it is understood why this happens, but this is the reason for the crash.
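As a hedged illustration of the "empty input just produces an empty output" part of this claim, here is an ad-hoc Keras batch-norm layer (not the actual DeepTau graph) applied to a [0, 1, 1, 64] input in eager mode:

```python
import tensorflow as tf

# Illustrative stand-in, not the DeepTau graph: a batch-norm layer applied to
# an empty [0, 1, 1, 64] batch simply returns an empty output of the same shape.
bn = tf.keras.layers.BatchNormalization()
empty = tf.zeros([0, 1, 1, 64])
out = bn(empty, training=False)
print(out.shape)  # (0, 1, 1, 64)
```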

I can prepare a PR with guards to avoid the execution of the model with empty inputs, and in parallel investigate this TF behaviour more deeply.
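For illustration only, a guard along those lines could look like the Python sketch below; the names (`run_guarded`, `n_outputs`) are hypothetical, and the real change would live in the C++ DeepTauId producer rather than in Python.

```python
import tensorflow as tf

# Hypothetical sketch of the proposed guard (illustrative names, not the actual
# DeepTauId code): skip the TF call entirely when any input tensor has a
# zero-sized dimension, and hand back an empty result of the expected width.
def run_guarded(model, inputs, n_outputs):
    if any(0 in t.shape for t in inputs):
        # Nothing to evaluate: return an empty [0, n_outputs] score tensor.
        return tf.zeros([0, n_outputs])
    return model(inputs)
```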

urgent

This failure has now been seen in Tier0 PromptReco: https://cms-talk.web.cern.ch/t/update-t0-skim-config-for-2024-pp-collision/36794/5

I can prepare a PR with guards to avoid the execution of the model with empty inputs, and in parallel investigate this TF behaviour more deeply.

@valsdav, we have established that this issue can affect Prompt Reconstruction and (potentially, when the new nodes for the HLT farm arrive) also online trigger operations. Please prepare PRs with guards to avoid the execution of the model with empty inputs. Thank you.

Marco (as ORM)

@cmsbuild, please close

+hlt

  • no issues observed after the 14.0.X PR was merged and tested in the IBs.

but the reason for the empty grid needs to be investigated with Tau experts.

@cms-sw/reconstruction-l2 this looks like it needs a separate issue. Can you open one?

+pdmv (really only the reporter)

+ml

Basic guards to solve the empty-input problem in DeepTauId are in place, but the reason for the empty grid needs to be investigated with Tau experts.

A more general guard for empty inputs will be added (see https://github.com/cms-sw/cmssw/issues/44481)

The reproducer also succeeds in 14_1_0_pre1.

I think this happens simply because that particular trigger path (HLT_DoublePFJets40_Mass500_MediumDeepTauPFTauHPS45_L2NN_MediumDeepTauPFTauHPS20_eta2p1_v) was removed in the meantime in https://github.com/cms-sw/cmssw/pull/44073 (14_1_X) and https://github.com/cms-sw/cmssw/pull/44074 (14_0_X). I think the reproducer would succeed in CMSSW_14_0_1 as well (but I didn't test it).

since it would end up running the same reHLT process on top of the same Event (195390586) of the same Run (367131) for which the failure appears

Since the process is run multi-threaded, are you sure that the last event that leaves a message-logger record is also the one crashing the process?

assign reconstruction

assign ml

HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_PFJet60_v6

This is a different path, so it points to a general, path-independent problem with DeepTauId.

type tau

assign pdmv

assign hlt