cmssw: Problem with mxnet on CC8

Running multi-threaded workflows that use mxnet on CC8 leads to a crash at the end of the job. This can sometimes be reproduced when running under gdb, which yields the following traceback:

#0  0x000000003442bf80 in ?? ()
#1  0x00007fff9f0f0472 in mxnet::resource::ResourceManagerImpl::~ResourceManagerImpl() () from /cvmfs/cms-ib.cern.ch/week0/cc8_amd64_gcc8/cms/cmssw-patch/CMSSW_11_2_X_2020-06-26-1100/external/cc8_amd64_gcc8/lib/libmxnet.so
#2  0x00007fff9f0f0ca5 in dmlc::ThreadLocalStore<mxnet::resource::ResourceManagerImpl>::~ThreadLocalStore() ()
   from /cvmfs/cms-ib.cern.ch/week0/cc8_amd64_gcc8/cms/cmssw-patch/CMSSW_11_2_X_2020-06-26-1100/external/cc8_amd64_gcc8/lib/libmxnet.so
#3  0x00007ffff552406c in __run_exit_handlers () from /lib64/libc.so.6
#4  0x00007ffff55241a0 in exit () from /lib64/libc.so.6
#5  0x00007ffff550d87a in __libc_start_main () from /lib64/libc.so.6
#6  0x000000000041145e in _start ()

One question: had the libmxnet.so shared library already been unloaded from the process when this happened?
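
One way to probe that question is to list the shared objects still mapped into the process at the point of interest. The following is only a minimal sketch, assuming Linux with glibc, where `dl_iterate_phdr` from `<link.h>` is available; the helper names `loaded_objects` and `is_loaded` are hypothetical and are not part of CMSSW or mxnet.

```cpp
#include <link.h>   // dl_iterate_phdr, struct dl_phdr_info (Linux/glibc)

#include <cstddef>
#include <string>
#include <vector>

namespace {
// Callback invoked once per object currently loaded in the process.
// It appends the object's path to the vector passed through `data`.
int collect_name(struct dl_phdr_info* info, std::size_t /*size*/, void* data) {
  auto* names = static_cast<std::vector<std::string>*>(data);
  names->emplace_back(info->dlpi_name ? info->dlpi_name : "");
  return 0;  // returning 0 tells dl_iterate_phdr to keep iterating
}
}  // namespace

// Hypothetical helper: paths of all objects currently loaded.
// The main executable is typically reported with an empty name.
std::vector<std::string> loaded_objects() {
  std::vector<std::string> names;
  dl_iterate_phdr(collect_name, &names);
  return names;
}

// Hypothetical helper: true if any loaded object's path contains `fragment`,
// for example is_loaded("libmxnet.so").
bool is_loaded(const std::string& fragment) {
  for (const auto& name : loaded_objects()) {
    if (name.find(fragment) != std::string::npos) return true;
  }
  return false;
}
```

Calling `is_loaded("libmxnet.so")` shortly before the job exits would show whether the library is still mapped at that moment. A still-mapped library does not by itself rule out a destruction-order problem among static or thread-local objects, so this only narrows down the question.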

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 49 (49 by maintainers)

Most upvoted comments

@Dr15Jones (et al.) after the merge of #41377 this can probably be considered fixed, and closed.