cmssw: DQM Harvest jobs getting stuck at T0

Tests of CMSSW_12_3_3_patch1 at T0 show some Express jobs of the task ExpressMergewrite_StreamExpressCosmics_DQMIOEndOfRunDQMHarvestMerged taking more than 20h to complete.

Here are the logs of one job that finished after 22h of execution:

%MSG-e HLTConfigProvider:   EgHLTOfflineSummaryClient:egHLTOffDQMSummaryClient@beginRun  21-May-2022 17:33:51 CEST Run: 349840
Falling back to ProcessName-only init using ProcessName 'HLT' !
%MSG
%MSG-e HLTConfigProvider:   EgHLTOfflineSummaryClient:egHLTOffDQMSummaryClient@beginRun  21-May-2022 17:33:51 CEST Run: 349840
 Process name 'HLT' not found in registry!
%MSG
[2022-05-22 10:34:32.107229 +0200][Error  ][AsyncSock         ][  549] [p06253947b90717.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 11:33:25.164556 +0200][Error  ][AsyncSock         ][  549] [p06636710g40375.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 12:27:54.115905 +0200][Error  ][AsyncSock         ][  549] [p06253947g81422.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 13:23:41.509030 +0200][Error  ][AsyncSock         ][  549] [p06253947q54042.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 13:53:39.441991 +0200][Error  ][AsyncSock         ][  549] [st-096-gg50030g.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 14:56:59.860605 +0200][Error  ][AsyncSock         ][  549] [st-096-dd904d00.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 15:06:20.567969 +0200][Error  ][AsyncSock         ][  549] [st-048-388af89c.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 15:12:21.511984 +0200][Error  ][AsyncSock         ][  549] [p06636710n60578.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 15:20:03.375578 +0200][Error  ][AsyncSock         ][  549] [p06636710p75593.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
%MSG-e DQMGenericClient:  DQMGenericClient:hltMuonEfficiencies@endRun  22-May-2022 15:40:50 CEST End Run: 349840
 DQMGenericClient::findAllSubdirectories ==> Missing folder HLT/Muon !!!
%MSG
%MSG-e DQMGenericClient:  DQMGenericClient:hltMuonEfficienciesMR@endRun  22-May-2022 15:40:50 CEST End Run: 349840
 DQMGenericClient::findAllSubdirectories ==> Missing folder HLT/Muon/MR !!!
%MSG

The logs report nothing for 20h, after which a few network error messages, apparently related to XRootD, appear, and the job then continues and finishes within a few minutes. It is worth mentioning that this job had already been attempted once and got stuck in a similar way; that execution was manually interrupted after the first few network error messages showed up.

Full logs and PSet can be found here:

/afs/cern.ch/user/c/cmst0/public/LongExecution/CollisionMay2022/job_41469/SuccessfulExec/job/WMTaskSpace/cmsRun1

More information on the issue can be found here: CMS Talk post

Most upvoted comments

We have some instructions here: https://cmst0.docs.cern.ch/cookbook/debugging/#run-a-job-interactively

A suggestion for un-pickling configurations: you can replace the whole "Unpickling the PSet.pkl file (job configuration file)" section with a much simpler command:

edmConfigDump PSet.py > unpickled.py
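
As far as I know, edmConfigDump simply loads the configuration (including the pickled process that PSet.py wraps) and prints its fully expanded python dump, so the resulting unpickled.py can be inspected or edited directly.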

A test job with a fix for the issue has been done; it takes 27 minutes, measured from the message for the first event. Is that reasonable? (Thanks, @fwyzard, your idea was correct.)

I cannot say what is reasonable (this workflow is a bit outside my area of expertise), but what you could do is

  • run without any GEM DQM at all
  • run with the fix

and compare the time it takes to run the two. Just as a guess, the DQM for a single detector should not add more than a few percent to the total.

Also, I found that the PR that introduced this issue has been reverted in CMSSW_12_3_X.

Yes, given the tight schedule, it was decided at the joint operations meeting this morning to immediately make a patch release reverting the last GEM DQM PR, and to add it back with the fix once it has been validated at a calmer pace in the 12.5.x master branch.

I think I need to make PRs like the following.

  • A PR for the ‘fix’ to the master and its backport to 12_4_X
  • A PR with the reverted contents and this fix to 12_3_X

Is that correct?

Yes, this looks like the correct approach.

I have prepared #38052 which reverts https://github.com/cms-sw/cmssw/pull/37852, to have it ready just in case we decide to merge it.

Operationally I would propose:

  • Revert #37852 in 12_3_X, and prepare a patch release with it
  • Prepare a fix for the issue discussed here, and merge it in master and 12_4_X
  • A later backport of the original PR + fix could be integrated into the DQM 12_3_X release, if really wanted

So, looking at

void GEMDQMHarvester::getGeometryInfo(edm::Service<DQMStore> &store, MonitorElement *h2Src) {
  if (h2Src != nullptr) {  // For online and offline
    listLayer_.clear();
    mapIdxLayer_.clear();
    mapNumChPerChamber_.clear();

maybe the calls to clear() should be moved outside of the if?

void GEMDQMHarvester::getGeometryInfo(edm::Service<DQMStore> &store, MonitorElement *h2Src) {
  listLayer_.clear();
  mapIdxLayer_.clear();
  mapNumChPerChamber_.clear();

  if (h2Src != nullptr) {  // For online and offline

This would at least prevent the size of listLayer_ from growing indefinitely …
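
For illustration, here is a minimal, self-contained sketch of that failure mode, assuming some code path appends to listLayer_ on every call (the names and fill logic are hypothetical stand-ins, not the actual CMSSW code):

#include <iostream>
#include <vector>

// Hypothetical stand-in for the harvester's member container.
static std::vector<int> listLayer_;

// Mimics the buggy pattern: the container is cleared only when the
// source histogram is available, but the fallback path still appends,
// so repeated calls accumulate entries.
void getGeometryInfo(bool haveSourceHistogram) {
  if (haveSourceHistogram) {
    listLayer_.clear();
    // ... fill from the histogram ...
  } else {
    // Fallback path: appends without clearing first.
    for (int layer = 0; layer < 4; ++layer)
      listLayer_.push_back(layer);
  }
}

int main() {
  // Simulate many harvesting calls without the source histogram,
  // e.g. one per luminosity section.
  for (int call = 0; call < 1000; ++call)
    getGeometryInfo(false);
  std::cout << "entries after 1000 calls: " << listLayer_.size() << '\n';
  // Prints 4000; any downstream loop over listLayer_ slows down accordingly.
}

With the clear() calls hoisted out of the conditional, as in the snippet above, every call starts from an empty container and its size stays bounded.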