cmssw: Crashes in muon HLT reconstruction (`reco::TrackExtra` product not found)
Over the last few weeks, HLT suffered 3 online crashes coming from the HLT-muon reconstruction; the first 2 errors are almost identical, while the 3rd one is rather similar to the first 2, but comes from a different producer. The error messages are given below [*].
So far, no one has been able to reproduce any of these errors locally with the relevant error-stream files.
I am opening this issue to keep track of this, and to ask experts whether the error messages suggest anything about what might be going wrong.
This config file should be representative of the HLT menu used online during the crashes (representative at least for what concerns the sequences that contain the problematic modules).
FYI: @JanFSchulte @khaosmos93 (Muon-HLT contacts), @silviodonato @Martin-Grunewald @fwyzard
[*]
- Run-356433 (CMSSW_12_4_3, Jul 29th):
[2] Calling method for module TSGFromL2Muon/'hltL3TrajSeedOIStateNoVtx'
Exception Message:
RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2802'
can not be satisfied because the product cannot be found.
Probably the branch containing the product is not stored in the input file.
Additional Info:
[a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
- Run-356530 (CMSSW_12_4_3, Aug 1st):
[2] Calling method for module TSGFromL2Muon/'hltL3TrajSeedOIStateNoVtx'
Exception Message:
RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2804'
can not be satisfied because the product cannot be found.
Probably the branch containing the product is not stored in the input file.
Additional Info:
[a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
- Run-357442 (CMSSW_12_4_6, Aug 14th):
[2] Calling method for module TSGForOIFromL2/'hltIterL3OISeedsFromL2MuonsNoVtx'
Exception Message:
RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2782'
can not be satisfied because the product cannot be found.
Probably the branch containing the product is not stored in the input file.
Additional Info:
[a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
About this issue
- State: closed
- Created 2 years ago
- Comments: 19 (19 by maintainers)
Commits related to this issue
- Declare production of Track collection after TrackExtra collection. This change works around a rare scheduling bug in the framework when these modules are run as scheduled, see https://github.com/cms-... (committed to makortel/cmssw by makortel 2 years ago)
- Declare production of Track collection after TrackExtra collection. This change works around a rare scheduling bug in the framework when these modules are run as scheduled, see https://github.com/cms-... (committed to missirol/cmssw by makortel 2 years ago)
@Dr15Jones and I discovered a “logical race condition” in the framework that would cause symptoms like this (but we can’t tell if it is really causing these problems).
The `L2MuonProducer` that produces the `Track` and `TrackExtra` collections declares first the production of the `Track` collection and then the `TrackExtra` collection https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/RecoMuon/L2MuonProducer/plugins/L2MuonProducer.cc#L114-L117 and `Track` objects hold Refs to the `TrackExtra`. Downstream modules consume only the `Track` collection. The order of `produces()` declarations dictates the order in which `Event::commit_aux()` moves the products into the `EventPrincipal` (and to the corresponding ProductResolvers) after the `EDProducer::produce()` has successfully finished. When a scheduled module puts a product in a ProductResolver, the consumers of that product become eligible to run (unless some other product they depend on has not yet been produced) https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/FWCore/Framework/src/ProductResolvers.cc#L433-L439

This means that the following can happen:

1. The `Track` collection is put in the ProductResolver (during the end of the `hltL2Muons` module’s produce)
2. A consumer of the `Track` collection (e.g. `hltIterL3OISeedsFromL2MuonsNoVtx`) gets run by another thread
3. The consumer accesses an `hltL2Muons` `Track` that de-references the `TrackExtraRef`, but the `TrackExtra` collection is not in the corresponding ProductResolver yet
4. `hltL2Muons` puts the `TrackExtra` collection into the ProductResolver, but the job is going to terminate

A quick workaround (which I’m going to prepare) is to declare first the production of the `TrackExtra` collection and only then the `Track` collection.

We will need some time to think of a more general solution (I’d guess the `TrackExtra` is not the only collection type we have that is only referenced by other products without explicit consumption).

As far as I can tell, unscheduled modules are not affected, because for them the product insertion into ProductResolver does not impact module scheduling. Instead, upon prefetch a task that releases all the modules consuming that product is inserted into the `WaitingTaskList` of the Worker producing the product. https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/FWCore/Framework/src/ProductResolvers.cc#L468-L515
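
To make the sequence above easier to follow, here is a minimal toy model of the suspected race in plain Python. It is only a sketch and not framework code: `ToyEvent`, `commit`, `seed_consumer` and `run` are hypothetical names, and real threads are replaced by a deterministic worst-case interleaving in which a consumer runs as soon as the product it declared is put. One more simplification: in this model the exception surfaces through the producer's commit call, whereas in the real job it would be raised in the consumer module.

```python
from dataclasses import dataclass, field


class ProductNotFound(Exception):
    """Toy stand-in for the framework's ProductNotFound exception."""


@dataclass
class ToyEvent:
    """Toy stand-in for the EventPrincipal (hypothetical, not CMSSW code)."""
    products: dict = field(default_factory=dict)  # product name -> payload
    waiting: dict = field(default_factory=dict)   # product name -> consumers
    log: list = field(default_factory=list)

    def put(self, name, payload):
        self.products[name] = payload
        # Worst-case interleaving: consumers of this product run right away,
        # before the producer has committed its remaining products.
        for consumer in self.waiting.pop(name, []):
            consumer(self)

    def get(self, name):
        if name not in self.products:
            raise ProductNotFound(name)
        return self.products[name]


def commit(event, declared_order, outputs):
    """Move the products into the event in the order they were declared."""
    for name in declared_order:
        event.put(name, outputs[name])


def seed_consumer(event):
    """Declares only 'Track', but follows each track's reference to 'TrackExtra'."""
    for track in event.get("Track"):
        extras = event.get("TrackExtra")  # implicit dependency through the Ref
        event.log.append(extras[track["extra_index"]])


def run(declared_order):
    event = ToyEvent()
    event.waiting["Track"] = [seed_consumer]
    outputs = {"Track": [{"extra_index": 0}], "TrackExtra": ["extra0"]}
    try:
        commit(event, declared_order, outputs)
    except ProductNotFound as exc:
        return f"ProductNotFound: {exc}"
    return f"ok: {event.log}"


if __name__ == "__main__":
    print(run(["Track", "TrackExtra"]))  # declaration order before the workaround
    print(run(["TrackExtra", "Track"]))  # declaration order after the workaround
```

With the `Track`-first order the toy consumer fails on the missing `TrackExtra`; with the order swapped, the referenced collection is already in place by the time the consumer is released. This only illustrates the ordering argument and does not show that the online crashes have this cause.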