tensorflow: tf 1.8.0 with horovod hangs in the middle of training


System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
  • TensorFlow installed from (source or binary): pip install tensorflow_gpu==1.8.0
  • TensorFlow version (use command below):
work@job1b-pub-v100-5wh5z:~$ python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
v1.8.0-0-g93bc2e2072 1.8.0
  • Python version: Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
  • Bazel version (if compiling from source): None
  • GCC/Compiler version (if compiling from source): None
  • CUDA/cuDNN version: cuda 9.0.176-1
  • GPU model and memory: V100 and 32G mem


Describe the current behavior

I ran the WebVision training code with TF 1.8.0 and Horovod (NCCL) on a cluster of 4 nodes × 8 NVIDIA V100 GPUs, but the training job hangs in the middle of training. The process info on one of the nodes is:

work       1955      0  0 May09 ?        00:00:00 /bin/sh -c     PATH=/usr/local/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib:$L
work       1961   1955  0 May09 ?        00:00:08 /usr/local/openmpi/bin/orted -mca ess env -mca ess_base_jobid 2660237312 -mca ess_base_vpid 1 -mca ess_bas
work       1965   1961 99 May09 ?        3-20:29:04 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3
work       1966   1961 99 May09 ?        3-18:25:41 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3
work       1967   1961 99 May09 ?        5-18:09:20 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3
work       1968   1961 99 May09 ?        5-12:52:59 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3
work       1969   1961 99 May09 ?        3-18:07:41 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3
work       1970   1961 99 May09 ?        3-18:04:06 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3
work       1971   1961 99 May09 ?        3-18:19:37 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3
work       1972   1961 99 May09 ?        3-17:27:18 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3
work       2665   1965  0 May09 ?        00:00:12 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3:/
work       2759   1972  0 May09 ?        00:00:14 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3:/
work       2813   1971  0 May09 ?        00:00:12 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3:/
work       2839   1966  0 May09 ?        00:00:14 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3:/
work       2853   1969  0 May09 ?        00:00:12 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3:/
work       2861   1967  0 May09 ?        00:00:12 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3:/
work       2876   1968  0 May09 ?        00:00:14 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3:/
work       2882   1970  0 May09 ?        00:00:14 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3:/
work     165333   1965  0 May10 ?        00:00:10 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3:/

ps-ef-info.txt
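For context, the training script follows the usual Horovod TF 1.x pattern. The sketch below is my reconstruction of that pattern, not the actual webvision script; `build_model()` is a hypothetical placeholder for the real model function.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each process to a single GPU (8 processes per node, 4 nodes => 32 ranks).
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

loss = build_model()  # hypothetical placeholder for the real model/loss
opt = tf.train.MomentumOptimizer(0.1 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)  # gradients are averaged with NCCL allreduce
train_op = opt.minimize(loss, global_step=tf.train.get_or_create_global_step())

hooks = [hvd.BroadcastGlobalVariablesHook(0)]  # sync initial weights from rank 0
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```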

I printed the Python call stack of process 1965; it shows:

Thread 0x7f50cdffb700
  File "/home/work/anaconda3/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/home/work/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/work/anaconda3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py", line 391, in _run
    enqueue_callable()
  File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1244, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)
  File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)

Thread 0x7f5a3f48c700
  File "webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py", line 1065, in <module>
    tf.app.run(main=main)
  File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py", line 1036, in main
    save_model_secs=flags.save_model_secs)
  File "/home/work/user-job-dir/webvision-train-code/moxing/tensorflow/executor/learning_builder.py", line 473, in run
    save_model_secs, export_model, use_trt, fetch_strategy_fn, save_model_steps)
  File "/home/work/user-job-dir/webvision-train-code/moxing/tensorflow/executor/learning_wrapper.py", line 270, in run
    self._run()
  File "/home/work/user-job-dir/webvision-train-code/moxing/tensorflow/executor/learning_wrapper.py", line 540, in _run
    self._save_model_steps)
  File "/home/work/user-job-dir/webvision-train-code/moxing/tensorflow/executor/learning.py", line 878, in run
    self.training()
  File "/home/work/user-job-dir/webvision-train-code/moxing/tensorflow/executor/learning.py", line 1530, in training
    self.train_step(sess)
  File "/home/work/user-job-dir/webvision-train-code/moxing/tensorflow/executor/learning.py", line 1274, in train_step
    feed_dict=feed_dict)
  File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 567, in run
    run_metadata=run_metadata)
  File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1043, in run
    run_metadata=run_metadata)
  File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1119, in run
    return self._sess.run(*args, **kwargs)
  File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1191, in run
    run_metadata=run_metadata)
  File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 971, in run
    return self._sess.run(*args, **kwargs)
  File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)

python-call-stack.txt
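(For reference, one way to obtain such per-thread Python stacks from a live process, if it is registered in the script ahead of time, is the standard faulthandler module; this is just one option, not necessarily how the dump above was produced.)

```python
import faulthandler
import signal

# After this call, `kill -USR1 <pid>` makes the process dump the Python stack
# of every thread to stderr and keep running.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```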

The gdb backtrace of this process is:

(gdb) bt full
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
No locals.
#1  0x00007f5971f7da54 in nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec) ()
   from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#2  0x00007f5971f7d221 in nsync::nsync_sem_wait_with_cancel_(nsync::waiter*, timespec, nsync::nsync_note_s_*) ()
   from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#3  0x00007f5971f7a764 in nsync::nsync_cv_wait_with_deadline_generic(nsync::nsync_cv_s_*, void*, void (*)(void*), void (*)(void*), timespec, nsync::nsync_note_s_*) () from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#4  0x00007f5971f7ac85 in nsync::nsync_cv_wait_with_deadline(nsync::nsync_cv_s_*, nsync::nsync_mu_s_*, timespec, nsync::nsync_note_s_*) ()
   from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#5  0x00007f5971f80f2b in tensorflow::DirectSession::WaitForNotification(tensorflow::DirectSession::RunState*, tensorflow::CancellationManager*, long long)
    () from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#6  0x00007f5971f85010 in tensorflow::DirectSession::RunInternal(long long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*) ()
   from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#7  0x00007f5971f8e3d5 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) ()
   from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#8  0x00007f596f424c8a in TF_Run_Helper(tensorflow::Session*, char const*, TF_Buffer const*, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, TF_Tensor**, std::vector<std::string, std::allocator<std::string> > const&, TF_Buffer*, TF_Status*) ()
   from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#9  0x00007f596f425886 in TF_SessionRun () from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#10 0x00007f596f0d4186 in tensorflow::TF_SessionRun_wrapper_helper(TF_Session*, char const*, TF_Buffer const*, std::vector<TF_Output, std::allocator<TF_Outp---Type <return> to continue, or q <return> to quit---
ut> > const&, std::vector<_object*, std::allocator<_object*> > const&, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<TF_Operation*, std::allocator<TF_Operation*> > const&, TF_Buffer*, TF_Status*, std::vector<_object*, std::allocator<_object*> >*) ()
   from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#11 0x00007f596f0d42ca in tensorflow::TF_SessionRun_wrapper(TF_Session*, TF_Buffer const*, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<_object*, std::allocator<_object*> > const&, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<TF_Operation*, std::allocator<TF_Operation*> > const&, TF_Buffer*, TF_Status*, std::vector<_object*, std::allocator<_object*> >*) ()
   from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#12 0x00007f596f090a6e in _wrap_TF_SessionRun_wrapper ()
   from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so

gdb-backtrace.txt

It seems to be waiting in direct_session.cc, at L552:

  WaitForNotification(&run_state, &step_cancellation_manager,
                      run_options.timeout_in_ms() > 0
                          ? run_options.timeout_in_ms()
                          : operation_timeout_in_ms_);

But I never set run_options.timeout_in_ms or config.operation_timeout_in_ms, and in the gdb backtrace it appears to be waiting on a deadline. So the question is: how can this hang last forever when nsync::nsync_cv_wait_with_deadline is being called? (At least for now, it has been stuck for almost three days …)

Oh, I see why … gdb is missing part of the output (since it lacks symbols …), ref https://github.com/tensorflow/tensorflow/issues/26559:

nsync::nsync_cv_wait -> nsync::nsync_cv_wait_with_deadline

i.e. the no-timeout nsync_cv_wait just calls nsync_cv_wait_with_deadline with an infinite deadline, so the frame in the backtrace does not imply a finite timeout.
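For anyone hitting the same symptom, one way to turn an indefinite Session.run wait into an explicit error is to set an operation timeout on the session config. This is a hedged workaround sketch, not something the original script does:

```python
import tensorflow as tf

# With a timeout set, a step that exceeds the deadline raises
# errors.DeadlineExceededError instead of blocking forever.
config = tf.ConfigProto()
config.operation_timeout_in_ms = 600000  # the default 0 means "wait forever"

sess = tf.Session(config=config)  # or pass config= to MonitoredTrainingSession
```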

Describe the expected behavior

The job should not hang in the middle of training. The final training log is:

INFO:tensorflow:step: 162300(global step: 162300)       sample/sec: 368.464     ent_loss: 4.997 top-1: 0.375    top-5: 0.547    reg_loss: 0.309 total_loss: 5.306
INFO:tensorflow:step: 162300(global step: 162300)       sample/sec: 368.328     ent_loss: 4.942 top-1: 0.344    top-5: 0.531    reg_loss: 0.309 total_loss: 5.251
INFO:tensorflow:step: 162300(global step: 162300)       sample/sec: 368.118     ent_loss: 4.036 top-1: 0.578    top-5: 0.672    reg_loss: 0.309 total_loss: 4.344
INFO:tensorflow:step: 162300(global step: 162300)       sample/sec: 367.937     ent_loss: 4.518 top-1: 0.453    top-5: 0.578    reg_loss: 0.309 total_loss: 4.827
INFO:tensorflow:global_step/sec: 1.46428
INFO:tensorflow:step: 162300(global step: 162300)       sample/sec: 367.743     ent_loss: 4.083 top-1: 0.500    top-5: 0.703    reg_loss: 0.309 total_loss: 4.392
INFO:tensorflow:global_step/sec: 1.46428
INFO:tensorflow:step: 162300(global step: 162300)       sample/sec: 286.646     ent_loss: 3.888 top-1: 0.484    top-5: 0.641    reg_loss: 0.309 total_loss: 4.197
INFO:tensorflow:step: 162300(global step: 162300)       sample/sec: 286.646     ent_loss: 3.888 top-1: 0.484    top-5: 0.641    reg_loss: 0.309 total_loss: 4.197

Code to reproduce the issue

not sure …

Other info / logs

GPU utilization on all nodes is 100%:

[root@job1b-pub-v100-5wh5z ~]# /var/paas/nvidia/bin/nvidia-smi 
Sun May 12 20:19:41 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.93       Driver Version: 410.93       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:2D:00.0 Off |                    0 |
| N/A   41C    P0    65W / 300W |  29952MiB / 32480MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:32:00.0 Off |                    0 |
| N/A   40C    P0    71W / 300W |  29954MiB / 32480MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:5B:00.0 Off |                    0 |
| N/A   42C    P0    69W / 300W |  29942MiB / 32480MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:5F:00.0 Off |                    0 |
| N/A   37C    P0    64W / 300W |  29942MiB / 32480MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  Off  | 00000000:B5:00.0 Off |                    0 |
| N/A   40C    P0    67W / 300W |  29954MiB / 32480MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  Off  | 00000000:BE:00.0 Off |                    0 |
| N/A   39C    P0    66W / 300W |  29954MiB / 32480MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  Off  | 00000000:E1:00.0 Off |                    0 |
| N/A   41C    P0    67W / 300W |  29954MiB / 32480MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  Off  | 00000000:E9:00.0 Off |                    0 |
| N/A   41C    P0    67W / 300W |  29954MiB / 32480MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     71807      C   /home/work/anaconda3/bin/python            29939MiB |
|    1     71808      C   /home/work/anaconda3/bin/python            29939MiB |
|    2     71809      C   /home/work/anaconda3/bin/python            29927MiB |
|    3     71810      C   /home/work/anaconda3/bin/python            29927MiB |
|    4     71811      C   /home/work/anaconda3/bin/python            29939MiB |
|    5     71812      C   /home/work/anaconda3/bin/python            29939MiB |
|    6     71813      C   /home/work/anaconda3/bin/python            29939MiB |
|    7     71814      C   /home/work/anaconda3/bin/python            29939MiB |
+-----------------------------------------------------------------------------+

and dmesg shows the following segfault messages:

[Sun May 12 19:53:37 2019] sh[429661]: segfault at 7fff7f09f428 ip 00007fff7f09f428 sp 00007fff7f09e078 error 15
[Sun May 12 19:54:37 2019] sh[433888]: segfault at 7ffe30bcccd8 ip 00007ffe30bcccd8 sp 00007ffe30bcb928 error 15
[Sun May 12 19:55:37 2019] sh[437936]: segfault at 7ffc5e4a0958 ip 00007ffc5e4a0958 sp 00007ffc5e49f5a8 error 15
[Sun May 12 19:56:37 2019] sh[442090]: segfault at 7ffc4ad375b8 ip 00007ffc4ad375b8 sp 00007ffc4ad36208 error 15
[Sun May 12 19:57:37 2019] sh[446313]: segfault at 7ffcba547148 ip 00007ffcba547148 sp 00007ffcba545d98 error 15
[Sun May 12 19:57:37 2019] sh[446409]: segfault at 7ffca4e2dc18 ip 00007ffca4e2dc18 sp 00007ffca4e2c868 error 15
[Sun May 12 19:58:37 2019] sh[450425]: segfault at 7ffd09ab86d8 ip 00007ffd09ab86d8 sp 00007ffd09ab7328 error 15
[Sun May 12 19:59:37 2019] sh[454603]: segfault at 7ffca20adf08 ip 00007ffca20adf08 sp 00007ffca20acb58 error 15
[Sun May 12 20:00:37 2019] sh[499]: segfault at 7ffc588dd7e8 ip 00007ffc588dd7e8 sp 00007ffc588dc438 error 15
[Sun May 12 20:01:37 2019] sh[4810]: segfault at 7ffe67cdfbd8 ip 00007ffe67cdfbd8 sp 00007ffe67cde828 error 15
[Sun May 12 20:02:37 2019] sh[8913]: segfault at 7ffc6fd6a3b8 ip 00007ffc6fd6a3b8 sp 00007ffc6fd69008 error 15
[Sun May 12 20:02:37 2019] sh[8973]: segfault at 7ffe613b4388 ip 00007ffe613b4388 sp 00007ffe613b2fd8 error 15
[Sun May 12 20:03:37 2019] sh[13161]: segfault at 7fff17ac1bd8 ip 00007fff17ac1bd8 sp 00007fff17ac0828 error 15
[Sun May 12 20:04:37 2019] sh[17308]: segfault at 7fff5fe44448 ip 00007fff5fe44448 sp 00007fff5fe43098 error 15
[Sun May 12 20:05:37 2019] sh[21475]: segfault at 7ffeca86ff58 ip 00007ffeca86ff58 sp 00007ffeca86eba8 error 15
[Sun May 12 20:06:37 2019] sh[25887]: segfault at 7ffea93cd2a8 ip 00007ffea93cd2a8 sp 00007ffea93cbef8 error 15
[Sun May 12 20:07:37 2019] sh[30090]: segfault at 7ffe9a32ba78 ip 00007ffe9a32ba78 sp 00007ffe9a32a6c8 error 15
[Sun May 12 20:07:37 2019] sh[30151]: segfault at 7ffe0e8b00e8 ip 00007ffe0e8b00e8 sp 00007ffe0e8aed38 error 15
[Sun May 12 20:08:37 2019] sh[34257]: segfault at 7fff46231e38 ip 00007fff46231e38 sp 00007fff46230a88 error 15
[Sun May 12 20:09:37 2019] sh[38356]: segfault at 7ffc8b9d2ff8 ip 00007ffc8b9d2ff8 sp 00007ffc8b9d1c48 error 15
[Sun May 12 20:10:37 2019] sh[42685]: segfault at 7ffd19c3c068 ip 00007ffd19c3c068 sp 00007ffd19c3acb8 error 15
[Sun May 12 20:11:37 2019] sh[46730]: segfault at 7ffc11dcc218 ip 00007ffc11dcc218 sp 00007ffc11dcae68 error 15
[Sun May 12 20:12:37 2019] sh[50885]: segfault at 7ffde26e73c8 ip 00007ffde26e73c8 sp 00007ffde26e6018 error 15
[Sun May 12 20:12:37 2019] sh[51091]: segfault at 7ffcaaf42788 ip 00007ffcaaf42788 sp 00007ffcaaf413d8 error 15
[Sun May 12 20:13:37 2019] sh[55116]: segfault at 7ffc9faf70a8 ip 00007ffc9faf70a8 sp 00007ffc9faf5cf8 error 15
[Sun May 12 20:14:37 2019] sh[59159]: segfault at 7fff7ed38518 ip 00007fff7ed38518 sp 00007fff7ed37168 error 15
[Sun May 12 20:15:37 2019] sh[63389]: segfault at 7ffcf8b8d068 ip 00007ffcf8b8d068 sp 00007ffcf8b8bcb8 error 15
[Sun May 12 20:16:37 2019] sh[67564]: segfault at 7fffc0e890d8 ip 00007fffc0e890d8 sp 00007fffc0e87d28 error 15
[Sun May 12 20:17:37 2019] sh[71615]: segfault at 7ffd22222168 ip 00007ffd22222168 sp 00007ffd22220db8 error 15
[Sun May 12 20:17:37 2019] sh[71666]: segfault at 7ffc62aff5e8 ip 00007ffc62aff5e8 sp 00007ffc62afe238 error 15
[Sun May 12 20:18:37 2019] sh[75987]: segfault at 7ffdf614bd88 ip 00007ffdf614bd88 sp 00007ffdf614a9d8 error 15
[Sun May 12 20:19:37 2019] sh[80035]: segfault at 7ffffcbe7c38 ip 00007ffffcbe7c38 sp 00007ffffcbe6888 error 15


Most upvoted comments

@jvishnuvardhan, sorry for the late reply. We believe it is resolved now, after upgrading NCCL to 2.4.7.
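(If anyone wants to verify which NCCL version their processes actually load, a quick check is possible from Python via ctypes. A hedged sketch, assuming libnccl.so.2 is on the loader path:)

```python
import ctypes

nccl = ctypes.CDLL("libnccl.so.2")
version = ctypes.c_int()
nccl.ncclGetVersion(ctypes.byref(version))
# Older NCCL 2.x encodes the version as major*1000 + minor*100 + patch,
# so 2407 corresponds to 2.4.7.
print(version.value)
```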

Thanks @alsrgv ~ I will pass this on to my colleagues; they have been stuck on this problem for too long since I posted this issue … -_-# I didn't have much time to follow up on it.

@alsrgv thanks for the reply 😃, I will try it. Horovod is quite an interesting framework; I would like to dive into it ~

And, to clarify … the segfaults in my previous post are unrelated …

[Sun May 12 20:17:37 2019] sh[71666]: segfault at 7ffc62aff5e8 ip 00007ffc62aff5e8 sp 00007ffc62afe238 error 15
[Sun May 12 20:18:37 2019] sh[75987]: segfault at 7ffdf614bd88 ip 00007ffdf614bd88 sp 00007ffdf614a9d8 error 15
[Sun May 12 20:19:37 2019] sh[80035]: segfault at 7ffffcbe7c38 ip 00007ffffcbe7c38 sp 00007ffffcbe6888 error 15

We found that one of our agents periodically calls a monitoring shell script (sh), but the script fails to execute because it is missing some system libraries …