cmssw: DQM Harvest jobs getting stuck at T0
Tests of CMSSW_12_3_3_patch1 at T0 show some Express jobs of task ExpressMergewrite_StreamExpressCosmics_DQMIOEndOfRunDQMHarvestMerged
taking more than 20h to complete.
Here we have the logs of one job that finished after 22h of execution:
%MSG-e HLTConfigProvider: EgHLTOfflineSummaryClient:egHLTOffDQMSummaryClient@beginRun 21-May-2022 17:33:51 CEST Run: 349840
Falling back to ProcessName-only init using ProcessName 'HLT' !
%MSG
%MSG-e HLTConfigProvider: EgHLTOfflineSummaryClient:egHLTOffDQMSummaryClient@beginRun 21-May-2022 17:33:51 CEST Run: 349840
Process name 'HLT' not found in registry!
%MSG
[2022-05-22 10:34:32.107229 +0200][Error ][AsyncSock ][ 549] [p06253947b90717.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 11:33:25.164556 +0200][Error ][AsyncSock ][ 549] [p06636710g40375.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 12:27:54.115905 +0200][Error ][AsyncSock ][ 549] [p06253947g81422.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 13:23:41.509030 +0200][Error ][AsyncSock ][ 549] [p06253947q54042.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 13:53:39.441991 +0200][Error ][AsyncSock ][ 549] [st-096-gg50030g.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 14:56:59.860605 +0200][Error ][AsyncSock ][ 549] [st-096-dd904d00.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 15:06:20.567969 +0200][Error ][AsyncSock ][ 549] [st-048-388af89c.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 15:12:21.511984 +0200][Error ][AsyncSock ][ 549] [p06636710n60578.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 15:20:03.375578 +0200][Error ][AsyncSock ][ 549] [p06636710p75593.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
%MSG-e DQMGenericClient: DQMGenericClient:hltMuonEfficiencies@endRun 22-May-2022 15:40:50 CEST End Run: 349840
DQMGenericClient::findAllSubdirectories ==> Missing folder HLT/Muon !!!
%MSG
%MSG-e DQMGenericClient: DQMGenericClient:hltMuonEfficienciesMR@endRun 22-May-2022 15:40:50 CEST End Run: 349840
DQMGenericClient::findAllSubdirectories ==> Missing folder HLT/Muon/MR !!!
%MSG
The logs report nothing for 20h, after which there are a few network error messages, apparently related to XRootD, and then the job continues and finishes in a few minutes. It is worth mentioning that this job had already been tried, getting stuck in a similar way. That execution was manually interrupted after the first few network error messages showed up.
Full logs and PSet can be found here:
/afs/cern.ch/user/c/cmst0/public/LongExecution/CollisionMay2022/job_41469/SuccessfulExec/job/WMTaskSpace/cmsRun1
More information on the issue can be found here: CMS Talk post
About this issue
- State: closed
- Created 2 years ago
- Comments: 59 (57 by maintainers)
Commits related to this issue
- Fix on issue #38044 — committed to quark2/cmssw by quark2 2 years ago
- Merge pull request #38101 from quark2/GEM-onlineDQMRevive37852-12_5_X Fix on issue #38044 — committed to cms-sw/cmssw by cmsbuild 2 years ago
- Fix on issue #38044 + Backport of #37995 — committed to quark2/cmssw by quark2 2 years ago
- Merge pull request #38278 from quark2/GEM-onlineDQMFix38044-12_4_X A backport of the fix on issue #38044 to 12_4_X — committed to cms-sw/cmssw by cmsbuild 2 years ago
- Merge pull request #38100 from quark2/GEM-onlineDQMRevive37852-12_3_X Revive of reverted #37852 with a fix on the issue #38044 — committed to cms-sw/cmssw by cmsbuild 2 years ago
A suggestion for un-pickling configurations: you can replace the whole Unpickling the PSet.pkl file (job configuration file) section with a much simpler command:
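As a rough illustration of what un-pickling the job configuration involves (a sketch, not necessarily the command meant above): the helper below loads a pickle file and returns it as text. The function name dump_pickled_config is hypothetical. It assumes the pickled object exposes a dumpPython() method, as CMSSW process objects do, and falls back to repr() otherwise. Loading a real PSet.pkl also needs a CMSSW environment, so that the pickled classes can be imported.

```python
import pickle


def dump_pickled_config(path):
    """Load a pickled configuration and return it as readable text.

    Hypothetical helper for illustration; not part of CMSSW or WMCore.
    """
    with open(path, "rb") as handle:
        config = pickle.load(handle)
    # Assumption: the object offers dumpPython() (true for CMSSW process
    # objects). Anything else is rendered with repr() instead.
    dump = getattr(config, "dumpPython", None)
    return dump() if callable(dump) else repr(config)
```

For example, print(dump_pickled_config("PSet.pkl")) would write the expanded configuration to standard output, which can then be redirected to a file for inspection.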
I cannot say what is reasonable (this workflow is a bit outside my area of expertise), but what you could do is
and compare the time it takes to run the two. Just as a guess, the DQM for a single detector should not add more than a few percent to the total.
Yes, given the tight schedule, at the joint operations meeting this morning it was decided to immediately make a patch release reverting the last GEM DQM PR, and to add it back with the fix once it has been validated more calmly in the 12.5.x master branch.
Yes, this looks like the correct approach.
A test job with a fix for the issue has been run, and it takes 27 minutes (measured from the message for the first event). Is that reasonable? (Thanks, @fwyzard, your idea was correct.)
And also, I found that the PR with this issue has been reverted in CMSSW_12_3_X. I think I need to make PRs like the following.
Is it correct?
I have prepared #38052 which reverts https://github.com/cms-sw/cmssw/pull/37852, to have it ready just in case we decide to merge it.
Operationally I would propose:
So, looking at
maybe the calls to clear() should be moved outside of the if? This would at least prevent the size of listLayer_ from growing indefinitely …
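The concern can be illustrated with a toy Python sketch. Every name in it (LayerBooker, rebuild, geometry_changed, the layer labels) is hypothetical and not taken from the GEM DQM code; list_layer merely plays the role of listLayer_. It only shows the general pattern: a container that is refilled on every call but reset only inside a conditional keeps growing, while resetting it unconditionally keeps its size fixed.

```python
class LayerBooker:
    """Toy stand-in for a module that refills a list of layers on each call."""

    def __init__(self, clear_outside_if):
        self.clear_outside_if = clear_outside_if
        self.list_layer = []  # plays the role of listLayer_

    def rebuild(self, layers, geometry_changed):
        if self.clear_outside_if:
            # Suggested variant: always start from an empty list.
            self.list_layer.clear()
        if geometry_changed and not self.clear_outside_if:
            # Problematic variant: the reset happens only on this branch.
            self.list_layer.clear()
        # The list is refilled on every call, whether or not it was reset.
        self.list_layer.extend(layers)
        return len(self.list_layer)


def sizes_after(calls, clear_outside_if):
    """Return the list size after each of `calls` consecutive rebuilds."""
    booker = LayerBooker(clear_outside_if)
    layers = ["layer-1", "layer-2", "layer-3", "layer-4"]  # made-up labels
    # Only the first call satisfies the (hypothetical) condition.
    return [booker.rebuild(layers, geometry_changed=(i == 0)) for i in range(calls)]


print(sizes_after(4, clear_outside_if=False))  # [4, 8, 12, 16]
print(sizes_after(4, clear_outside_if=True))   # [4, 4, 4, 4]
```

If later code loops over the list, each call in the first variant does more work than the previous one, which is one way a job could slow down progressively rather than fail outright.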