cmssw: Crashes in muon HLT reconstruction (`reco::TrackExtra` product not found)

Over the last few weeks, HLT suffered 3 online crashes coming from the HLT-muon reconstruction. The first 2 errors are almost identical; the 3rd is similar to the first 2 but comes from a different producer. The error messages are given below [*].

So far, no one has been able to reproduce any of these errors locally with the relevant error-stream files.

I am opening this issue to keep track of this, and to ask experts whether the error messages suggest anything about what might be going wrong.

This config file should be representative of the HLT menu used online during the crashes (at least for the sequences that contain the problematic modules).

FYI: @JanFSchulte @khaosmos93 (Muon-HLT contacts), @silviodonato @Martin-Grunewald @fwyzard

[*]

  1. Run-356433 (CMSSW_12_4_3, Jul 29th):
[2] Calling method for module TSGFromL2Muon/'hltL3TrajSeedOIStateNoVtx'
Exception Message:
RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2802'
can not be satisfied because the product cannot be found.
Probably the branch containing the product is not stored in the input file.
Additional Info:
[a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
  2. Run-356530 (CMSSW_12_4_3, Aug 1st):
[2] Calling method for module TSGFromL2Muon/'hltL3TrajSeedOIStateNoVtx'
Exception Message:
RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2804'
can not be satisfied because the product cannot be found.
Probably the branch containing the product is not stored in the input file.
Additional Info:
[a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
  3. Run-357442 (CMSSW_12_4_6, Aug 14th):
[2] Calling method for module TSGForOIFromL2/'hltIterL3OISeedsFromL2MuonsNoVtx'
Exception Message:
RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2782'
can not be satisfied because the product cannot be found.
Probably the branch containing the product is not stored in the input file.
Additional Info:
[a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 19 (19 by maintainers)

Most upvoted comments

@Dr15Jones and I discovered a “logical race condition” in the framework that would cause symptoms like this (but we can’t tell if it is really causing these problems).

The L2MuonProducer that produces the Track and TrackExtra collections declares the production of the Track collection first and the TrackExtra collection second https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/RecoMuon/L2MuonProducer/plugins/L2MuonProducer.cc#L114-L117 and the Track objects hold Refs to the TrackExtra objects. Downstream modules consume only the Track collection. The order of the produces() declarations dictates the order in which Event::commit_aux() moves the products into the EventPrincipal (and into the corresponding ProductResolvers) after EDProducer::produce() has finished successfully. When a scheduled module puts a product into a ProductResolver, the consumers of that product become eligible to run (unless some other product they depend on has not yet been produced) https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/FWCore/Framework/src/ProductResolvers.cc#L433-L439

This means that the following can happen:

  1. the Track collection is put in the ProductResolver (at the end of the hltL2Muons module’s produce())
  2. a consumer of that Track collection (e.g. hltIterL3OISeedsFromL2MuonsNoVtx) gets run by another thread
    • the other thread must have nothing else to do in order to steal a task from the thread still running hltL2Muons
  3. the operating system pauses the thread running the hltL2Muons
  4. the consumer calls a method of Track that dereferences the TrackExtraRef, but the TrackExtra collection is not in the corresponding ProductResolver yet, so the ProductNotFound exception is thrown
  5. the hltL2Muons module puts the TrackExtra collection into the ProductResolver, but by then the exception has been thrown and the job is going to terminate

A quick workaround (which I’m going to prepare) is to declare the production of the TrackExtra collection first and only then the Track collection.
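The sequence above, and the effect of swapping the declaration order, can be illustrated with a small toy model. This is only a sketch and not framework code: `ToyEvent`, `commit`, `run` and the `extra_ref` field are hypothetical names invented for the illustration, and the consumer is invoked synchronously on `put`, which stands in for the unlucky thread interleaving in which the consumer runs before the producer has committed its second product.

```python
class ProductNotFound(Exception):
    """Stand-in for the framework's ProductNotFound exception."""


class ToyEvent:
    """Toy event store: products become visible one at a time."""

    def __init__(self):
        self.products = {}
        self.on_put = {}  # product name -> callbacks run as soon as it is put

    def put(self, name, value):
        self.products[name] = value
        # Consumers of `name` become eligible immediately, before the
        # producer has committed its remaining products.
        for callback in self.on_put.get(name, []):
            callback(self)

    def resolve(self, name):
        if name not in self.products:
            raise ProductNotFound(name)
        return self.products[name]


def commit(event, declaration_order, outputs):
    """Move products into the event in the order they were declared."""
    for name in declaration_order:
        event.put(name, outputs[name])


def run(declaration_order):
    """Return 'ok', or 'missing <product>' if the consumer's lookup failed."""
    event = ToyEvent()
    result = []

    def seed_consumer(ev):
        # Consumes only Track, but follows each Track's reference to TrackExtra.
        try:
            for track in ev.resolve("Track"):
                ev.resolve(track["extra_ref"])
            result.append("ok")
        except ProductNotFound as exc:
            result.append(f"missing {exc}")

    event.on_put["Track"] = [seed_consumer]
    outputs = {"TrackExtra": ["extra0"], "Track": [{"extra_ref": "TrackExtra"}]}
    commit(event, declaration_order, outputs)
    return result[0]
```

In this model `run(["Track", "TrackExtra"])` reports the missing TrackExtra product, while `run(["TrackExtra", "Track"])` succeeds, because the referenced collection is already in place when the Track consumer becomes eligible.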

We will need some time to think about a more general solution (I’d guess TrackExtra is not the only collection type we have that is only referenced by other products without being explicitly consumed).

As far as I can tell, unscheduled modules are not affected, because for them the product insertion into the ProductResolver does not impact module scheduling. Instead, upon prefetch, a task that releases all the modules consuming that product is inserted into the WaitingTaskList of the Worker producing the product. https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/FWCore/Framework/src/ProductResolvers.cc#L468-L515
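The difference for unscheduled modules can be sketched in the same toy style. The names `ToyWorker`, `prefetch` and `consumer_sees_extra` are hypothetical. The sketch also assumes that the waiting tasks are released only once the producing module has finished and all of its products are in place. That is my reading of the description above, not something verified against the framework.

```python
class ToyWorker:
    """Toy producer whose waiting consumers are released once, after it has run."""

    def __init__(self, declaration_order):
        self.declaration_order = declaration_order
        self.waiting_tasks = []  # hypothetical analogue of a WaitingTaskList

    def prefetch(self, task):
        # A consumer asking for one of this worker's products queues a task here.
        self.waiting_tasks.append(task)

    def run(self, store, outputs):
        for name in self.declaration_order:
            store[name] = outputs[name]  # putting a product releases nobody
        # Only now, with every product in place, are the waiting tasks run.
        for task in self.waiting_tasks:
            task(store)


def consumer_sees_extra(declaration_order):
    """Return True if TrackExtra is present when the Track consumer runs."""
    store, seen = {}, []
    worker = ToyWorker(declaration_order)
    worker.prefetch(lambda s: seen.append("TrackExtra" in s))
    worker.run(store, {"Track": ["t0"], "TrackExtra": ["e0"]})
    return seen[0]
```

Under that assumption the declaration order no longer matters: the consumer sees the TrackExtra collection for either ordering.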