tensorflow: TF 1.8.0 with Horovod hangs in the middle of training
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
- TensorFlow installed from (source or binary): pip install tensorflow_gpu==1.8.0
- TensorFlow version (use command below):
  work@job1b-pub-v100-5wh5z:~$ python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
  v1.8.0-0-g93bc2e2072 1.8.0
- Python version: Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
- Bazel version (if compiling from source): None
- GCC/Compiler version (if compiling from source): None
- CUDA/cuDNN version: CUDA 9.0.176-1
- GPU model and memory: V100, 32 GB memory
Describe the current behavior
I run the WebVision training code with TF 1.8.0 and Horovod (NCCL) on a cluster of 4 nodes × 8 NVIDIA V100 GPUs, but the training job hangs in the middle of training. The process info on one of the nodes is:
work 1955 0 0 May09 ? 00:00:00 /bin/sh -c PATH=/usr/local/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib:$L
work 1961 1955 0 May09 ? 00:00:08 /usr/local/openmpi/bin/orted -mca ess env -mca ess_base_jobid 2660237312 -mca ess_base_vpid 1 -mca ess_bas
work 1965 1961 99 May09 ? 3-20:29:04 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3
work 1966 1961 99 May09 ? 3-18:25:41 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3
work 1967 1961 99 May09 ? 5-18:09:20 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3
work 1968 1961 99 May09 ? 5-12:52:59 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3
work 1969 1961 99 May09 ? 3-18:07:41 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3
work 1970 1961 99 May09 ? 3-18:04:06 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3
work 1971 1961 99 May09 ? 3-18:19:37 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3
work 1972 1961 99 May09 ? 3-17:27:18 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3
work 2665 1965 0 May09 ? 00:00:12 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3:/
work 2759 1972 0 May09 ? 00:00:14 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3:/
work 2813 1971 0 May09 ? 00:00:12 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3:/
work 2839 1966 0 May09 ? 00:00:14 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3:/
work 2853 1969 0 May09 ? 00:00:12 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3:/
work 2861 1967 0 May09 ? 00:00:12 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3:/
work 2876 1968 0 May09 ? 00:00:14 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3:/
work 2882 1970 0 May09 ? 00:00:14 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3:/
work 165333 1965 0 May10 ? 00:00:10 /home/work/anaconda3/bin/python webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py --data_url=s3:/
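For context, the training script follows the usual Horovod data-parallel pattern. Below is only a rough sketch of that pattern under my assumptions (the model and names are placeholders), not the actual train_all_data_fixed_lr_64_gpu_5-8.py:

```python
# Rough sketch of the assumed Horovod + MonitoredTrainingSession setup,
# not the real training script.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each process to a single GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Placeholder model so the sketch is runnable.
x = tf.random_normal([64, 1000])
y = tf.random_uniform([64], maxval=10, dtype=tf.int32)
logits = tf.layers.dense(x, 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)

# NCCL allreduce of the gradients happens inside the DistributedOptimizer wrapper.
opt = hvd.DistributedOptimizer(
    tf.train.MomentumOptimizer(learning_rate=0.1 * hvd.size(), momentum=0.9))
train_op = opt.minimize(loss, global_step=tf.train.get_or_create_global_step())

hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=1000)]
with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```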
I printed the Python call stack of process 1965, and it shows:
Thread 0x7f50cdffb700
File "/home/work/anaconda3/lib/python3.6/threading.py", line 884, in _bootstrap
self._bootstrap_inner()
File "/home/work/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/work/anaconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py", line 391, in _run
enqueue_callable()
File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1244, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
Thread 0x7f5a3f48c700
File "webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py", line 1065, in <module>
tf.app.run(main=main)
File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "webvision-train-code/train_all_data_fixed_lr_64_gpu_5-8.py", line 1036, in main
save_model_secs=flags.save_model_secs)
File "/home/work/user-job-dir/webvision-train-code/moxing/tensorflow/executor/learning_builder.py", line 473, in run
save_model_secs, export_model, use_trt, fetch_strategy_fn, save_model_steps)
File "/home/work/user-job-dir/webvision-train-code/moxing/tensorflow/executor/learning_wrapper.py", line 270, in run
self._run()
File "/home/work/user-job-dir/webvision-train-code/moxing/tensorflow/executor/learning_wrapper.py", line 540, in _run
self._save_model_steps)
File "/home/work/user-job-dir/webvision-train-code/moxing/tensorflow/executor/learning.py", line 878, in run
self.training()
File "/home/work/user-job-dir/webvision-train-code/moxing/tensorflow/executor/learning.py", line 1530, in training
self.train_step(sess)
File "/home/work/user-job-dir/webvision-train-code/moxing/tensorflow/executor/learning.py", line 1274, in train_step
feed_dict=feed_dict)
File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 567, in run
run_metadata=run_metadata)
File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1043, in run
run_metadata=run_metadata)
File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1119, in run
return self._sess.run(*args, **kwargs)
File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1191, in run
run_metadata=run_metadata)
File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 971, in run
return self._sess.run(*args, **kwargs)
File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
return fn(*args)
File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
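(Side note for anyone who wants to grab a per-thread dump like the one above from a live process: one option, assuming you can add a couple of lines before launching the job — this is only a sketch, not how I captured the dump above — is the stdlib faulthandler module:)

```python
import faulthandler
import signal

# Put this near the top of the training script; afterwards `kill -USR1 <pid>`
# makes the process print every thread's Python stack to stderr without stopping it.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```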
The gdb backtrace of this process is:
(gdb) bt full
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
No locals.
#1 0x00007f5971f7da54 in nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec) ()
from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#2 0x00007f5971f7d221 in nsync::nsync_sem_wait_with_cancel_(nsync::waiter*, timespec, nsync::nsync_note_s_*) ()
from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#3 0x00007f5971f7a764 in nsync::nsync_cv_wait_with_deadline_generic(nsync::nsync_cv_s_*, void*, void (*)(void*), void (*)(void*), timespec, nsync::nsync_note_s_*) () from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#4 0x00007f5971f7ac85 in nsync::nsync_cv_wait_with_deadline(nsync::nsync_cv_s_*, nsync::nsync_mu_s_*, timespec, nsync::nsync_note_s_*) ()
from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#5 0x00007f5971f80f2b in tensorflow::DirectSession::WaitForNotification(tensorflow::DirectSession::RunState*, tensorflow::CancellationManager*, long long)
() from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#6 0x00007f5971f85010 in tensorflow::DirectSession::RunInternal(long long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*) ()
from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#7 0x00007f5971f8e3d5 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) ()
from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#8 0x00007f596f424c8a in TF_Run_Helper(tensorflow::Session*, char const*, TF_Buffer const*, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, TF_Tensor**, std::vector<std::string, std::allocator<std::string> > const&, TF_Buffer*, TF_Status*) ()
from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#9 0x00007f596f425886 in TF_SessionRun () from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#10 0x00007f596f0d4186 in tensorflow::TF_SessionRun_wrapper_helper(TF_Session*, char const*, TF_Buffer const*, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<_object*, std::allocator<_object*> > const&, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<TF_Operation*, std::allocator<TF_Operation*> > const&, TF_Buffer*, TF_Status*, std::vector<_object*, std::allocator<_object*> >*) ()
from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#11 0x00007f596f0d42ca in tensorflow::TF_SessionRun_wrapper(TF_Session*, TF_Buffer const*, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<_object*, std::allocator<_object*> > const&, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<TF_Operation*, std::allocator<TF_Operation*> > const&, TF_Buffer*, TF_Status*, std::vector<_object*, std::allocator<_object*> >*) ()
from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
No symbol table info available.
#12 0x00007f596f090a6e in _wrap_TF_SessionRun_wrapper ()
from /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
It seems to be waiting at direct_session.cc, L552:
WaitForNotification(&run_state, &step_cancellation_manager,
run_options.timeout_in_ms() > 0
? run_options.timeout_in_ms()
: operation_timeout_in_ms_);
But I never set run_options.timeout_in_ms or config.operation_timeout_in_ms, and from the gdb backtrace it appears to be waiting on a deadline. So the question is: how can this hang forever when nsync::nsync_cv_wait_with_deadline is called? (It has been running for almost three days now …)
Oh, I see why … gdb is missing some frames (because it lacks symbols …), ref https://github.com/tensorflow/tensorflow/issues/26559 — the actual call path is nsync::nsync_cv_wait -> nsync::nsync_cv_wait_with_deadline, so there is no deadline.
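As a mitigation while debugging, the deadline can be set explicitly so a stuck step raises tf.errors.DeadlineExceededError instead of blocking forever; a minimal sketch (my script did not set either of these):

```python
import tensorflow as tf

# Per-session default: applies to every Session.run() on this session.
config = tf.ConfigProto()
config.operation_timeout_in_ms = 10 * 60 * 1000  # fail after 10 minutes

# Per-call override: applies only to this particular run().
run_options = tf.RunOptions(timeout_in_ms=10 * 60 * 1000)

train_op = tf.no_op()  # placeholder for the real training op
with tf.Session(config=config) as sess:
    sess.run(train_op, options=run_options)
```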
Describe the expected behavior
The job should not hang in the middle of training. The final training log before the hang is:
INFO:tensorflow:step: 162300(global step: 162300) sample/sec: 368.464 ent_loss: 4.997 top-1: 0.375 top-5: 0.547 reg_loss: 0.309 total_loss: 5.306
INFO:tensorflow:step: 162300(global step: 162300) sample/sec: 368.328 ent_loss: 4.942 top-1: 0.344 top-5: 0.531 reg_loss: 0.309 total_loss: 5.251
INFO:tensorflow:step: 162300(global step: 162300) sample/sec: 368.118 ent_loss: 4.036 top-1: 0.578 top-5: 0.672 reg_loss: 0.309 total_loss: 4.344
INFO:tensorflow:step: 162300(global step: 162300) sample/sec: 367.937 ent_loss: 4.518 top-1: 0.453 top-5: 0.578 reg_loss: 0.309 total_loss: 4.827
INFO:tensorflow:global_step/sec: 1.46428
INFO:tensorflow:step: 162300(global step: 162300) sample/sec: 367.743 ent_loss: 4.083 top-1: 0.500 top-5: 0.703 reg_loss: 0.309 total_loss: 4.392
INFO:tensorflow:global_step/sec: 1.46428
INFO:tensorflow:step: 162300(global step: 162300) sample/sec: 286.646 ent_loss: 3.888 top-1: 0.484 top-5: 0.641 reg_loss: 0.309 total_loss: 4.197
INFO:tensorflow:step: 162300(global step: 162300) sample/sec: 286.646 ent_loss: 3.888 top-1: 0.484 top-5: 0.641 reg_loss: 0.309 total_loss: 4.197
Code to reproduce the issue
Not sure …
Other info / logs
GPU utilization on all nodes is 100%:
[root@job1b-pub-v100-5wh5z ~]# /var/paas/nvidia/bin/nvidia-smi
Sun May 12 20:19:41 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.93 Driver Version: 410.93 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:2D:00.0 Off | 0 |
| N/A 41C P0 65W / 300W | 29952MiB / 32480MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:32:00.0 Off | 0 |
| N/A 40C P0 71W / 300W | 29954MiB / 32480MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:5B:00.0 Off | 0 |
| N/A 42C P0 69W / 300W | 29942MiB / 32480MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:5F:00.0 Off | 0 |
| N/A 37C P0 64W / 300W | 29942MiB / 32480MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... Off | 00000000:B5:00.0 Off | 0 |
| N/A 40C P0 67W / 300W | 29954MiB / 32480MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... Off | 00000000:BE:00.0 Off | 0 |
| N/A 39C P0 66W / 300W | 29954MiB / 32480MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... Off | 00000000:E1:00.0 Off | 0 |
| N/A 41C P0 67W / 300W | 29954MiB / 32480MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... Off | 00000000:E9:00.0 Off | 0 |
| N/A 41C P0 67W / 300W | 29954MiB / 32480MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 71807 C /home/work/anaconda3/bin/python 29939MiB |
| 1 71808 C /home/work/anaconda3/bin/python 29939MiB |
| 2 71809 C /home/work/anaconda3/bin/python 29927MiB |
| 3 71810 C /home/work/anaconda3/bin/python 29927MiB |
| 4 71811 C /home/work/anaconda3/bin/python 29939MiB |
| 5 71812 C /home/work/anaconda3/bin/python 29939MiB |
| 6 71813 C /home/work/anaconda3/bin/python 29939MiB |
| 7 71814 C /home/work/anaconda3/bin/python 29939MiB |
+-----------------------------------------------------------------------------+
and dmesg shows segfault messages:
[Sun May 12 19:53:37 2019] sh[429661]: segfault at 7fff7f09f428 ip 00007fff7f09f428 sp 00007fff7f09e078 error 15
[Sun May 12 19:54:37 2019] sh[433888]: segfault at 7ffe30bcccd8 ip 00007ffe30bcccd8 sp 00007ffe30bcb928 error 15
[Sun May 12 19:55:37 2019] sh[437936]: segfault at 7ffc5e4a0958 ip 00007ffc5e4a0958 sp 00007ffc5e49f5a8 error 15
[Sun May 12 19:56:37 2019] sh[442090]: segfault at 7ffc4ad375b8 ip 00007ffc4ad375b8 sp 00007ffc4ad36208 error 15
[Sun May 12 19:57:37 2019] sh[446313]: segfault at 7ffcba547148 ip 00007ffcba547148 sp 00007ffcba545d98 error 15
[Sun May 12 19:57:37 2019] sh[446409]: segfault at 7ffca4e2dc18 ip 00007ffca4e2dc18 sp 00007ffca4e2c868 error 15
[Sun May 12 19:58:37 2019] sh[450425]: segfault at 7ffd09ab86d8 ip 00007ffd09ab86d8 sp 00007ffd09ab7328 error 15
[Sun May 12 19:59:37 2019] sh[454603]: segfault at 7ffca20adf08 ip 00007ffca20adf08 sp 00007ffca20acb58 error 15
[Sun May 12 20:00:37 2019] sh[499]: segfault at 7ffc588dd7e8 ip 00007ffc588dd7e8 sp 00007ffc588dc438 error 15
[Sun May 12 20:01:37 2019] sh[4810]: segfault at 7ffe67cdfbd8 ip 00007ffe67cdfbd8 sp 00007ffe67cde828 error 15
[Sun May 12 20:02:37 2019] sh[8913]: segfault at 7ffc6fd6a3b8 ip 00007ffc6fd6a3b8 sp 00007ffc6fd69008 error 15
[Sun May 12 20:02:37 2019] sh[8973]: segfault at 7ffe613b4388 ip 00007ffe613b4388 sp 00007ffe613b2fd8 error 15
[Sun May 12 20:03:37 2019] sh[13161]: segfault at 7fff17ac1bd8 ip 00007fff17ac1bd8 sp 00007fff17ac0828 error 15
[Sun May 12 20:04:37 2019] sh[17308]: segfault at 7fff5fe44448 ip 00007fff5fe44448 sp 00007fff5fe43098 error 15
[Sun May 12 20:05:37 2019] sh[21475]: segfault at 7ffeca86ff58 ip 00007ffeca86ff58 sp 00007ffeca86eba8 error 15
[Sun May 12 20:06:37 2019] sh[25887]: segfault at 7ffea93cd2a8 ip 00007ffea93cd2a8 sp 00007ffea93cbef8 error 15
[Sun May 12 20:07:37 2019] sh[30090]: segfault at 7ffe9a32ba78 ip 00007ffe9a32ba78 sp 00007ffe9a32a6c8 error 15
[Sun May 12 20:07:37 2019] sh[30151]: segfault at 7ffe0e8b00e8 ip 00007ffe0e8b00e8 sp 00007ffe0e8aed38 error 15
[Sun May 12 20:08:37 2019] sh[34257]: segfault at 7fff46231e38 ip 00007fff46231e38 sp 00007fff46230a88 error 15
[Sun May 12 20:09:37 2019] sh[38356]: segfault at 7ffc8b9d2ff8 ip 00007ffc8b9d2ff8 sp 00007ffc8b9d1c48 error 15
[Sun May 12 20:10:37 2019] sh[42685]: segfault at 7ffd19c3c068 ip 00007ffd19c3c068 sp 00007ffd19c3acb8 error 15
[Sun May 12 20:11:37 2019] sh[46730]: segfault at 7ffc11dcc218 ip 00007ffc11dcc218 sp 00007ffc11dcae68 error 15
[Sun May 12 20:12:37 2019] sh[50885]: segfault at 7ffde26e73c8 ip 00007ffde26e73c8 sp 00007ffde26e6018 error 15
[Sun May 12 20:12:37 2019] sh[51091]: segfault at 7ffcaaf42788 ip 00007ffcaaf42788 sp 00007ffcaaf413d8 error 15
[Sun May 12 20:13:37 2019] sh[55116]: segfault at 7ffc9faf70a8 ip 00007ffc9faf70a8 sp 00007ffc9faf5cf8 error 15
[Sun May 12 20:14:37 2019] sh[59159]: segfault at 7fff7ed38518 ip 00007fff7ed38518 sp 00007fff7ed37168 error 15
[Sun May 12 20:15:37 2019] sh[63389]: segfault at 7ffcf8b8d068 ip 00007ffcf8b8d068 sp 00007ffcf8b8bcb8 error 15
[Sun May 12 20:16:37 2019] sh[67564]: segfault at 7fffc0e890d8 ip 00007fffc0e890d8 sp 00007fffc0e87d28 error 15
[Sun May 12 20:17:37 2019] sh[71615]: segfault at 7ffd22222168 ip 00007ffd22222168 sp 00007ffd22220db8 error 15
[Sun May 12 20:17:37 2019] sh[71666]: segfault at 7ffc62aff5e8 ip 00007ffc62aff5e8 sp 00007ffc62afe238 error 15
[Sun May 12 20:18:37 2019] sh[75987]: segfault at 7ffdf614bd88 ip 00007ffdf614bd88 sp 00007ffdf614a9d8 error 15
[Sun May 12 20:19:37 2019] sh[80035]: segfault at 7ffffcbe7c38 ip 00007ffffcbe7c38 sp 00007ffffcbe6888 error 15
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 16 (9 by maintainers)
@jvishnuvardhan, sorry for the late reply. We believe it is resolved now after upgrading NCCL to 2.4.7.
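For others hitting this: one way to confirm which NCCL build the job actually uses is to turn on NCCL's own logging before Horovod initializes (a sketch; the variable can just as well be exported in the mpirun command line):

```python
import os

# NCCL prints its version (e.g. "NCCL version 2.4.7+cuda...") and the chosen
# rings/topology to stdout once the first communicator is created.
os.environ["NCCL_DEBUG"] = "INFO"

import horovod.tensorflow as hvd
hvd.init()
```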
Thanks @alsrgv ~ I will pass this on to my colleagues; they have been stuck on this problem for too long since I posted this issue … -_-# I didn't have much time to follow up on it.
@alsrgv thanks for the reply 😃, I will try.
Horovod is quite an interesting framework, I would like to dive into it ~ And, to clarify … the segfaults in my previous post are unrelated:
We found that one of our agents periodically calls a monitoring shell script (sh), but that script fails to execute because it lacks some system libraries …