cmssw: DQM Harvest jobs getting stuck at T0

Tests of CMSSW_12_3_3_patch1 at T0 show some Express jobs of the task ExpressMergewrite_StreamExpressCosmics_DQMIOEndOfRunDQMHarvestMerged taking more than 20h to complete.

Here are the logs of one job that finished after 22h of execution:

%MSG-e HLTConfigProvider:   EgHLTOfflineSummaryClient:egHLTOffDQMSummaryClient@beginRun  21-May-2022 17:33:51 CEST Run: 349840
Falling back to ProcessName-only init using ProcessName 'HLT' !
%MSG
%MSG-e HLTConfigProvider:   EgHLTOfflineSummaryClient:egHLTOffDQMSummaryClient@beginRun  21-May-2022 17:33:51 CEST Run: 349840
 Process name 'HLT' not found in registry!
%MSG
[2022-05-22 10:34:32.107229 +0200][Error  ][AsyncSock         ][  549] [p06253947b90717.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 11:33:25.164556 +0200][Error  ][AsyncSock         ][  549] [p06636710g40375.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 12:27:54.115905 +0200][Error  ][AsyncSock         ][  549] [p06253947g81422.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 13:23:41.509030 +0200][Error  ][AsyncSock         ][  549] [p06253947q54042.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 13:53:39.441991 +0200][Error  ][AsyncSock         ][  549] [st-096-gg50030g.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 14:56:59.860605 +0200][Error  ][AsyncSock         ][  549] [st-096-dd904d00.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 15:06:20.567969 +0200][Error  ][AsyncSock         ][  549] [st-048-388af89c.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 15:12:21.511984 +0200][Error  ][AsyncSock         ][  549] [p06636710n60578.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 15:20:03.375578 +0200][Error  ][AsyncSock         ][  549] [p06636710p75593.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
%MSG-e DQMGenericClient:  DQMGenericClient:hltMuonEfficiencies@endRun  22-May-2022 15:40:50 CEST End Run: 349840
 DQMGenericClient::findAllSubdirectories ==> Missing folder HLT/Muon !!!
%MSG
%MSG-e DQMGenericClient:  DQMGenericClient:hltMuonEfficienciesMR@endRun  22-May-2022 15:40:50 CEST End Run: 349840
 DQMGenericClient::findAllSubdirectories ==> Missing folder HLT/Muon/MR !!!
%MSG

The logs report nothing for 20h, after which a few network error messages, apparently related to XRootD, appear, and the job then continues and finishes within a few minutes. It is worth mentioning that this job had already been attempted once and got stuck in a similar way; that execution was manually interrupted after the first few network error messages showed up.

Full logs and PSet can be found here:

/afs/cern.ch/user/c/cmst0/public/LongExecution/CollisionMay2022/job_41469/SuccessfulExec/job/WMTaskSpace/cmsRun1

More information on the issue can be found here: CMS Talk post

Most upvoted comments

We have some instructions here: https://cmst0.docs.cern.ch/cookbook/debugging/#run-a-job-interactively

A suggestion for un-pickling configurations: you can replace the whole "Unpickling the PSet.pkl file (job configuration file)" section with a much simpler command:

edmConfigDump PSet.py > unpickled.py
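
As far as I know, edmConfigDump simply loads the configuration (including the pickled process that PSet.py wraps) and prints its fully expanded python dump, so the resulting unpickled.py can be inspected or edited directly.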

A test job with a fix for the issue has been done; it takes 27 minutes, measured from the message for the first event. Is that reasonable? (Thanks, @fwyzard, your idea was correct.)

I cannot say what is reasonable (this workflow is a bit outside my area of expertise), but what you could do is

  • run without any GEM DQM at all
  • run with the fix

and compare the time it takes to run the two. Just as a guess, the DQM for a single detector should not add more than a few percent to the total.

Also, I found that the PR that introduced this issue has been reverted in CMSSW_12_3_X.

Yes, given the tight schedule, it was decided at the joint operations meeting this morning to immediately make a patch release reverting the last GEM DQM PR, and to add it back with the fix once it has been validated at a calmer pace in the 12.5.x master branch.

I think I need to make PRs like the following.

  • A PR for the ‘fix’ to the master and its backport to 12_4_X
  • A PR with the reverted contents and this fix to 12_3_X

Is that correct?

Yes, this looks like the correct approach.

I have prepared #38052 which reverts https://github.com/cms-sw/cmssw/pull/37852, to have it ready just in case we decide to merge it.

Operationally I would propose:

  • Revert #37852 in 12_3_X, and prepare a patch release with it
  • Prepare a fix for the issue discussed here, and merge it in master and 12_4_X
  • A later backport of the original PR + fix could be integrated into the DQM 12_3_X release, if really wanted

So, looking at

void GEMDQMHarvester::getGeometryInfo(edm::Service<DQMStore> &store, MonitorElement *h2Src) {
  if (h2Src != nullptr) {  // For online and offline
    listLayer_.clear();
    mapIdxLayer_.clear();
    mapNumChPerChamber_.clear();

maybe the calls to clear() should be moved outside of the if?

void GEMDQMHarvester::getGeometryInfo(edm::Service<DQMStore> &store, MonitorElement *h2Src) {
  listLayer_.clear();
  mapIdxLayer_.clear();
  mapNumChPerChamber_.clear();

  if (h2Src != nullptr) {  // For online and offline

This would at least prevent the size of listLayer_ from growing indefinitely …
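
For illustration, here is a minimal, self-contained sketch of that failure mode, assuming some code path appends to listLayer_ on every call (the names and fill logic are hypothetical stand-ins, not the actual CMSSW code):

#include <iostream>
#include <vector>

// Hypothetical stand-in for the harvester's member container.
static std::vector<int> listLayer_;

// Mimics the buggy pattern: the container is cleared only when the
// source histogram is available, but the fallback path still appends,
// so repeated calls accumulate entries.
void getGeometryInfo(bool haveSourceHistogram) {
  if (haveSourceHistogram) {
    listLayer_.clear();
    // ... fill from the histogram ...
  } else {
    // Fallback path: appends without clearing first.
    for (int layer = 0; layer < 4; ++layer)
      listLayer_.push_back(layer);
  }
}

int main() {
  // Simulate many harvesting calls without the source histogram,
  // e.g. one per luminosity section.
  for (int call = 0; call < 1000; ++call)
    getGeometryInfo(false);
  std::cout << "entries after 1000 calls: " << listLayer_.size() << '\n';
  // Prints 4000; any downstream loop over listLayer_ slows down accordingly.
}

With the clear() calls hoisted out of the conditional, as in the snippet above, every call starts from an empty container and its size stays bounded.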