cmssw: Inconsistent exit codes for fatal root errors and xrootd errors

When diagnosing failures in grid jobs, precise exit code meanings are very valuable. Currently, several exit codes are overloaded or used (in my opinion) inconsistently.

One category of errors is “Fatal Root Error”. This occurs when there is some kind of file corruption. Corrupt files have to be fixed centrally, so it is very important to be able to isolate these. Here are some examples of the exit codes that can occur with “Fatal Root Error” and related exception messages:

exit code 84:

----- Begin Fatal Exception 30-Apr-2024 04:07:50 CEST-----------------------
An exception of category 'FileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Calling RootFileSequenceBase::initTheFile()
   Additional Info:
      [a] Input file root://cmsxrootd.fnal.gov//store/data/Run2018D/SingleMuon/MINIAOD/UL2018_MiniAODv2-v3/280000/AB8F5CB8-2B5B-1945-967F-89EDEC3346AD.root could not be opened.
      [b] Fatal Root Error: @SUB=TStorageFactoryFile::Init
root://cmsxrootd.fnal.gov//store/data/Run2018D/SingleMuon/MINIAOD/UL2018_MiniAODv2-v3/280000/AB8F5CB8-2B5B-1945-967F-89EDEC3346AD.root not a ROOT file

----- End Fatal Exception -------------------------------------------------

exit code 85:

----- Begin Fatal Exception 30-Apr-2024 03:08:20 BST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Reading branch EventAuxiliary
   Additional Info:
      [a] Fatal Root Error: @SUB=TBasket::Streamer
The value of fKeylen is incorrect (-25451) ; trying to recover by setting it to zero

----- End Fatal Exception -------------------------------------------------

exit code 86:

----- Begin Fatal Exception 30-Apr-2024 04:07:56 CEST-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   Additional Info:
      [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
fNbytes = 84983, fKeylen = 98, fObjlen = 395391, noutot = 0, nout=0, nin=84885, nbuf=395391

----- End Fatal Exception -------------------------------------------------

Another category of errors is xrootd issues. These are usually transient, though they can indicate that a file is not accessible anywhere on disk. Here are some examples:

exit code 84:

----- Begin Fatal Exception 27-Apr-2024 16:21:06 CEST-----------------------
An exception of category 'FileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Calling RootFileSequenceBase::initTheFile()
   [3] Calling StorageFactory::open()
   [4] Calling XrdFile::open()
Exception Message:
Failed to open the file 'root://cmsxrootd.fnal.gov//store/mc/RunIISummer20UL18MiniAODv2/WJetsToLNu_TuneCP5_13TeV-madgraphMLM-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/00000/3E80D215-AFE7-8D46-A8D8-56D2E670C280.root'
   Additional Info:
      [a] Input file root://cmsxrootd.fnal.gov//store/mc/RunIISummer20UL18MiniAODv2/WJetsToLNu_TuneCP5_13TeV-madgraphMLM-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/00000/3E80D215-AFE7-8D46-A8D8-56D2E670C280.root could not be opened.
      [b] XrdCl::File::Open(name='root://cmsxrootd.fnal.gov//store/mc/RunIISummer20UL18MiniAODv2/WJetsToLNu_TuneCP5_13TeV-madgraphMLM-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/00000/3E80D215-AFE7-8D46-A8D8-56D2E670C280.root', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3011] No servers are available to read the file.
' (errno=3011, code=400). No additional data servers were found.
      [c] Last URL tried: root://cms-xrd-global.cern.ch:1094//store/mc/RunIISummer20UL18MiniAODv2/WJetsToLNu_TuneCP5_13TeV-madgraphMLM-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/00000/3E80D215-AFE7-8D46-A8D8-56D2E670C280.root?tried=+1213cmsxrootd2.fnal.gov1213xrootd.unl.edu,
      [d] Problematic data server: cms-xrd-global.cern.ch:1094
      [e] Disabled source: cms-xrd-global.cern.ch:1094
----- End Fatal Exception -------------------------------------------------

exit code 85:

----- Begin Fatal Exception 30-Apr-2024 11:02:48 CST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Calling InputSource::getNextItemType
   [1] Reading branch EventAuxiliary
   [2] Calling XrdFile::readv()
   [3] XrdAdaptor::ClientRequest::HandleResponse() failure while running connection recovery
   [4] Handling XrdAdaptor::RequestManager::requestFailure()
   [5] In XrdAdaptor::RequestManager::OpenHandler::HandleResponseWithHosts()
Exception Message:
XrdCl::File::Open(name='root://cmsxrootd.fnal.gov//store/mc/RunIISummer20UL17MiniAODv2/QCD_Pt_470to600_TuneCP5_13TeV_pythia8/MINIAODSIM/106X_mc2017_realistic_v9-v1/230000/1D057463-6DF2-5849-B64C-B1CCB513876F.root', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3011] No servers are available to read the file.
' (errno=3011, code=400)
   Additional Info:
      [a] Original error: '[ERROR] Operation expired' (errno=0, code=206, source=t2srv0015.cmsaf.mit.edu:1094 (site T2_US_MIT)).
      [b] Original failed source is t2srv0015.cmsaf.mit.edu:1094 (site T2_US_MIT)
      [c] Disabled source: t2srv0015.cmsaf.mit.edu:1094
----- End Fatal Exception -------------------------------------------------
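
Since the same exit code shows up in both categories, one workaround when triaging jobs is to classify a failure from the exception text instead of the exit code. Below is a minimal sketch, assuming the log format shown in the examples above. The function name `classify_failure` and the labels it returns are hypothetical and not part of CMSSW.

```python
import re

# Markers taken from the example logs above; the returned labels are hypothetical.
CATEGORY_RE = re.compile(r"An exception of category '(\w+)' occurred")
XROOTD_MARKERS = ("XrdCl::", "XrdFile::", "XrdAdaptor::")


def classify_failure(log_text):
    """Return (category, origin) for the text of one Fatal Exception block.

    origin is 'root' if the block contains 'Fatal Root Error',
    'xrootd' if it mentions the xrootd layer, otherwise 'unknown'.
    """
    match = CATEGORY_RE.search(log_text)
    category = match.group(1) if match else None
    if "Fatal Root Error" in log_text:
        origin = "root"
    elif any(marker in log_text for marker in XROOTD_MARKERS):
        origin = "xrootd"
    else:
        origin = "unknown"
    return category, origin
```

With this check, the two exit code 84 examples above separate into ('FileOpenError', 'root') and ('FileOpenError', 'xrootd').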

I propose reserving exit code 84 for xrootd file open errors (indicating the file is missing from disk), exit code 85 for xrootd file read errors (transient), and exit code 86 for all fatal root errors (indicating file corruption). I am open to other proposals to resolve these ambiguities.
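
To make the proposed assignment concrete, it can be written as a small decision function. This is only a sketch of the proposal in this issue, not existing CMSSW behavior; `proposed_exit_code` and its arguments are hypothetical names.

```python
def proposed_exit_code(category, has_fatal_root_error):
    """Exit code under the proposal above (not current CMSSW behavior).

    category: exception category string from the log, e.g. 'FileOpenError'.
    has_fatal_root_error: True if the message contains 'Fatal Root Error'.
    """
    if has_fatal_root_error:
        return 86  # any fatal ROOT error: file corruption
    if category == "FileOpenError":
        return 84  # xrootd open failure: file missing from disk
    if category == "FileReadError":
        return 85  # xrootd read failure: transient
    return None  # outside the scope of this proposal
```

Under this mapping, the first two "Fatal Root Error" examples above, which currently exit with 84 and 85, would both exit with 86.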

About this issue

  • State: open
  • Created 2 months ago
  • Comments: 16 (15 by maintainers)

Most upvoted comments

For the errors originating from the xrootd layer, we could think of adding a RemoteReadError exit code.

assign core