cmssw: [ASAN] Unit Test `testTriggerEventAnalyzers` failing with heap-buffer-overflow

Hello,

Unit Test testTriggerEventAnalyzers (from module HLTrigger/HLTcore) is failing with heap-buffer-overflow. Here the summary:

SUMMARY: AddressSanitizer: heap-buffer-overflow (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02775/el8_amd64_gcc11/cms/cmssw/CMSSW_13_1_ASAN_X_2023-03-10-2300/lib/el8_amd64_gcc11/pluginHLTriggerHLTcorePlugins.so+0xf1638) in void HLTEventAnalyzerRAW::showObjects<std::vector<int, std::allocator<int> >, std::vector<edm::Ref<std::vector<reco::PFJet,
std::allocator<reco::PFJet> >, reco::PFJet, edm::refhelper::FindUsingAdvance<std::vector<reco::PFJet,
std::allocator<reco::PFJet> >, reco::PFJet> >, std::allocator<edm::Ref<std::vector<reco::PFJet,
std::allocator<reco::PFJet> >, reco::PFJet, edm::refhelper::FindUsingAdvance<std::vector<reco::PFJet,
std::allocator<reco::PFJet> >, reco::PFJet> > > > >(std::vector<int, std::allocator<int> > const&, std::vector<edm::Ref<std::vector<reco::PFJet, std::allocator<reco::PFJet> >, reco::PFJet,
edm::refhelper::FindUsingAdvance<std::vector<reco::PFJet, std::allocator<reco::PFJet> >, reco::PFJet> >,
std::allocator<edm::Ref<std::vector<reco::PFJet, std::allocator<reco::PFJet> >, reco::P

Find the full log in this link.

Thanks!

FYI, @missirol (I see you added this test to CMSSW)

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 22 (22 by maintainers)

Most upvoted comments

I guess my question was even more naive, and is about this line in HLTJetTag. There, the key can be larger than the size of *h_Jets. When this happens, is it possible that this could crash at runtime? And why don’t we see ASAN errors from HLTJetTag itself? (sorry if these questions are a bit too basic)

@makortel , would you mind explaining this ?

The edm::Ref constructor does not try to access the element pointed by the index. The invalid memory access can therefore only occur when the Ref is de-referenced.

Slowly converging on this issue.

  • step 0 (#41350): the unit-test failure can be fixed by using a different input file.

  • step 1 (CMSHLT-2750): this issue highlighted a long-standing misconfiguration of a few b-tagging modules of the HLT pp menu (according to my checks, there are 7 affected modules). The first step is to fix this in the HLT menu, by replacing the ill-defined instances of HLTJetTag<T> with instances of HLTJetTagWithMatching<T> (which uses delta-R matching, appropriate for cases where the two jets collection are not identical);

  • step 2 (#41351, #41352): part of the problem is that the implementation of the plugin HLTJetTag<T> is fragile, so it should be improved. This step-2 must go after step-1, because the improved/stricter version of HLTJetTag<T> would lead to runtime errors with the current HLT menu. This also implies that step-2 can be completed only after we remove the frozen 2022 HLT menu from 13_0_X and 13_1_X (otherwise, 2022 RelVals would fail). Retiring the 2022 HLT menu in those releases was going to happen in any case, as HLT usually runs only one non-fake HLT menu in a given release (for 13_0_X and 13_1_X, that is the 2023 HLT menu).

  • step 3: since the new version of HLTJetTag<T> will have an option to use delta-R matching, HLTJetTag<T> will become a superset of HLTJetTagWithMatching<T>, so I plan to replace all instances HLTJetTagWithMatching<T> in the HLT menus with properly-configured instances of the new HLTJetTag<T>. After that, HLTJetTagWithMatching<T> can be removed from the master branch of CMSSW.

Once there is data/MC processed with an updated HLT menu, the input file of the unit test could be updated again (with a proper input, more is tested, as this issue shows).

By the way, thanks a lot for opening this issue (@aandvalenzuela). I doubt we would have caught this bug otherwise (and clearly, the bug went unnoticed for a long time).

One naive question: why didn’t ‘anything’ catch this earlier? It was sort-of caught by accident (in a unit test of something pretty unrelated, in ASAN). Was this issue triggering ‘alarms’ anywhere else (e.g. other tests in ASAN, or else)? If not, why was it going unnoticed?