HugeCTR: [BUG] SparseOperationKit hangs on initialization
Describe the bug
sok.Init() hangs when called inside a tf.distribute.MirroredStrategy scope on a node with 16 A100 GPUs; the process never gets past initialization.
To Reproduce Steps to reproduce the behavior:
- Build an image based on gcr.io/deeplearning-platform-release/tf2-gpu.2-5 and install SOK:
FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-5
COPY ./hugectr /usr/src/app/
WORKDIR /usr/src/app/sparse_operation_kit
# The PYTHONPATH setting in ./install.sh fails, so we just exit 0 even if the install fails
RUN ./install.sh --SM="70;75;80" --USE_NVTX=OFF; exit 0
ENV PYTHONPATH "/usr/local/lib/:${PYTHONPATH}"
WORKDIR /usr/src/app
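Since the RUN step above is allowed to fail silently (exit 0), it is worth confirming inside the built image that SOK is actually importable before running the repro script. A minimal sanity-check sketch (not part of the original Dockerfile, added here for illustration):

# If this import hangs or raises, the install step rather than the repro script is at fault.
import sparse_operation_kit as sok
print(sok.__file__)  # should resolve under /usr/local/lib/ given the PYTHONPATH set above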
- Run this code
import os
from absl import flags, app, logging
import tensorflow as tf
import numpy as np
import sparse_operation_kit as sok
flags.DEFINE_integer('num_items', 1024, 'Number of items in embedding.')
FLAGS = flags.FLAGS
batch_size = 4096
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, range(16)))
def gen():
    while True:
        users = tf.random.uniform([batch_size], 0, FLAGS.num_items, tf.int32)
        items = tf.random.uniform([batch_size], 0, FLAGS.num_items, tf.int32)
        yield users, items, tf.random.normal([batch_size])

class Model(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self._embeddings = sok.DistributedEmbedding(
            combiner='mean',
            max_vocabulary_size_per_gpu=1024,
            embedding_vec_size=256,
            slot_num=2,
            max_nnz=2,
        )

    def call(self, inputs, training=False, mask=None):
        # Whatever the lookup is.
        logging.info(f'user: {inputs[0].shape}, {inputs[0].device}')
        return self._embeddings(tf.concat([inputs[0], inputs[1]], axis=1))

def main(_):
    gpus = tf.config.list_physical_devices('GPU')
    logging.info(f'Found {gpus}')
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)

    # strategy = tf.distribute.MirroredStrategy(['gpu:0'], cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
    # strategy = tf.distribute.MirroredStrategy(['gpu:0'])
    # strategy = tf.distribute.MirroredStrategy()
    strategy = tf.distribute.MirroredStrategy(['gpu:0'], cross_device_ops=tf.distribute.NcclAllReduce())

    ds = tf.data.Dataset.from_generator(gen, (tf.int32, tf.int32, tf.float32))
    ds = ds.prefetch(10)

    with strategy.scope():
        logging.info(f'Initializing sok.')
        result = sok.Init(global_batch_size=1024)

        model = Model()

        emb_opt = tf.keras.optimizers.SGD(0.001)
        dense_opt = tf.keras.optimizers.SGD(0.001)
        # more code that is never reached

if __name__ == '__main__':
    app.run(main)
- Run this on a node with 16 A100 GPUs.
Expected behavior
Execution should proceed past sok.Init, but it does not. I get the usual TensorFlow initialization and GPU discovery logs, then the "Initializing sok." message, and nothing beyond the logs shown here:
I1109 10:07:18.764635 139666590381888 model.py:56] Initializing sok.
2021-11-09 10:07:20.041804: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-11-09 10:07:20.042466: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2200210000 Hz
You are using the plugin with MirroredStrategy.
hugectr-chief-0:1:1 [0] NCCL INFO Bootstrap : Using eth0:7.12.81.17<0>
hugectr-chief-0:1:1 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket : Tx CPU start: -2
hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket : Rx CPU start: -2
hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket : Flow placement enabled.
hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket : queue skip: 0
hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket : Using [0]eth0:7.12.81.17<0>
hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket plugin initialized
hugectr-chief-0:1:1 [0] NCCL INFO Using network FastSocket
2021-11-09 10:07:20.452440: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:81] Global seed is 314905248
2021-11-09 10:07:20.452440: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:82] Local GPU Count: 16
2021-11-09 10:07:20.452440: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:83] Global GPU Count: 1
2021-11-09 10:07:20.452440: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:97] Global Replica Id: 0; Local Replica Id: 0
NCCL version 2.10.3+cuda11.0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 00/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 01/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 02/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 03/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 04/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 05/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 06/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 07/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 08/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 09/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 10/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 11/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 12/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 13/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 14/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 15/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 16/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 17/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 18/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 19/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 20/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 21/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 22/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 23/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 24/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 25/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 26/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 27/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 28/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 29/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 30/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 31/32 : 0
hugectr-chief-0:1:251 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
hugectr-chief-0:1:251 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
hugectr-chief-0:1:251 [0] NCCL INFO Connected all rings
hugectr-chief-0:1:251 [0] NCCL INFO Connected all trees
hugectr-chief-0:1:251 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
hugectr-chief-0:1:251 [0] NCCL INFO comm 0x7efdbc00da30 rank 0 nranks 1 cudaDev 0 busId 40 - Init COMPLETE
2021-11-09 10:07:20.979918: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
None of the strategy instantiations shown above proceed beyond this point (consolidated in the snippet after this list):
- all visible GPUs vs. a single GPU
- HierarchicalCopyAllReduce vs. NcclAllReduce
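For reference, these are the strategy constructions that were attempted (they appear commented out in the script above); every one of them hangs at the same point:

strategy = tf.distribute.MirroredStrategy()  # all visible GPUs
strategy = tf.distribute.MirroredStrategy(['gpu:0'])  # single GPU
strategy = tf.distribute.MirroredStrategy(['gpu:0'], cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
strategy = tf.distribute.MirroredStrategy(['gpu:0'], cross_device_ops=tf.distribute.NcclAllReduce())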
Environment
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.04 Driver Version: 450.119.04 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB Off | 00000000:00:04.0 Off | 0 |
| N/A 34C P0 58W / 400W | 714MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
The node has 16 of these GPUs. I also made sure to give the container a /dev/shm of 16Gi.
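(How the 16Gi /dev/shm was provided is not shown in the report; with plain Docker that is typically a --shm-size=16g flag on docker run, and on Kubernetes a Memory-medium emptyDir mounted at /dev/shm. Either way it is well above Docker's 64MiB default, which NCCL's intra-node SHM transport often outgrows; the exact mechanism used here is an assumption.)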
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 53 (19 by maintainers)
I got past this by explicitly passing the 16 GPUs into the strategy; the default, unspecified device list did not work.
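A minimal sketch of that workaround, assuming TensorFlow's default GPU device naming:

# Enumerate all 16 GPUs explicitly instead of relying on the default device list.
devices = [f'/gpu:{i}' for i in range(16)]
strategy = tf.distribute.MirroredStrategy(devices=devices)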
Hey Randall, the 21.12 container is now available here: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow-training
Let us know if this resolved your issue.
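(If pulling from NGC, the image path for that catalog entry should be nvcr.io/nvidia/merlin/merlin-tensorflow-training with a 21.12 tag, following the usual nvcr.io/<org>/<team>/<container> naming; the exact tag string is an assumption based on the "21.12 container" mentioned above.)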
We can reproduce it and we are trying to figure out why. Will come back to you when we find a solution.