HugeCTR: [BUG] SparseOperationKit hangs on initialization

Describe the bug: SparseOperationKit hangs during sok.Init() when it is called inside a tf.distribute.MirroredStrategy scope on a machine with 16 A100 GPUs. Execution never returns from sok.Init(), regardless of which strategy configuration is used.

To Reproduce: Steps to reproduce the behavior:

  1. Build an image based on gcr.io/deeplearning-platform-release/tf2-gpu.2-5 and install SOK:
FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-5

COPY ./hugectr /usr/src/app/
WORKDIR /usr/src/app/sparse_operation_kit
# The PYTHONPATH setting in ./install.sh fails, so we exit 0 even if the install step fails
RUN ./install.sh --SM="70;75;80" --USE_NVTX=OFF; exit 0
ENV PYTHONPATH "/usr/local/lib/:${PYTHONPATH}"
WORKDIR /usr/src/app
  2. Run this code:
import os
from absl import flags, app, logging
import tensorflow as tf

import numpy as np

import sparse_operation_kit as sok

flags.DEFINE_integer('num_items', 1024, 'Number of items in embedding.')

FLAGS = flags.FLAGS

batch_size = 4096

os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, range(16)))

def gen():
  while True:
    users = tf.random.uniform([batch_size], 0, FLAGS.num_items, tf.int32)
    items = tf.random.uniform([batch_size], 0, FLAGS.num_items, tf.int32)
    yield users, items, tf.random.normal([batch_size])


class Model(tf.keras.Model):

  def __init__(self):
    super().__init__()

    self._embeddings = sok.DistributedEmbedding(
      combiner='mean',
      max_vocabulary_size_per_gpu=1024,
      embedding_vec_size=256,
      slot_num=2,
      max_nnz=2,
    )

  def call(self, inputs, training=False, mask=None):
    # Whatever the lookup is.
    logging.info(f'user: {inputs[0].shape}, {inputs[0].device}')
    return self._embeddings(tf.concat([inputs[0], inputs[1]], axis=1))


def main(_):
  gpus = tf.config.list_physical_devices('GPU')
  logging.info(f'Found {gpus}')
  for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

  # strategy = tf.distribute.MirroredStrategy(['gpu:0'], cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
  # strategy = tf.distribute.MirroredStrategy(['gpu:0'])
  # strategy = tf.distribute.MirroredStrategy()
  strategy = tf.distribute.MirroredStrategy(['gpu:0'], cross_device_ops=tf.distribute.NcclAllReduce())

  ds = tf.data.Dataset.from_generator(gen, (tf.int32, tf.int32, tf.float32))
  ds = ds.prefetch(10)

  with strategy.scope():
    logging.info(f'Initializing sok.')
    result = sok.Init(global_batch_size=1024)
    model = Model()
    emb_opt = tf.keras.optimizers.SGD(0.001)
    dense_opt = tf.keras.optimizers.SGD(0.001)

  # more code that is never reached
 
if __name__ == '__main__':
  app.run(main)
  3. Run this on a machine with 16 A100 GPUs.
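
Before step 3, a minimal sanity-check sketch (assuming the container built in step 1) can confirm that SOK is importable and that TensorFlow sees all 16 GPUs:

import tensorflow as tf
import sparse_operation_kit as sok  # confirms the SOK build installed by the Dockerfile is on PYTHONPATH

# On the machine described under "Environment" below, TensorFlow should report 16 GPUs.
gpus = tf.config.list_physical_devices('GPU')
print(f'SOK loaded from {sok.__file__}; TensorFlow sees {len(gpus)} GPU(s)')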

Expected behavior: Execution should proceed past sok.Init(), but it does not. I get the usual TensorFlow initialization and GPU-discovery logs, then the script prints "Initializing sok." and hangs; it never proceeds beyond the logs shown here.

I1109 10:07:18.764635 139666590381888 model.py:56] Initializing sok.
2021-11-09 10:07:20.041804: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-11-09 10:07:20.042466: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2200210000 Hz
You are using the plugin with MirroredStrategy.
hugectr-chief-0:1:1 [0] NCCL INFO Bootstrap : Using eth0:7.12.81.17<0>
hugectr-chief-0:1:1 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket : Tx CPU start: -2
hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket : Rx CPU start: -2
hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket : Flow placement enabled.
hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket : queue skip: 0
hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket : Using [0]eth0:7.12.81.17<0>
hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket plugin initialized
hugectr-chief-0:1:1 [0] NCCL INFO Using network FastSocket
2021-11-09 10:07:20.452440: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:81] Global seed is 314905248
2021-11-09 10:07:20.452440: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:82] Local GPU Count: 16
2021-11-09 10:07:20.452440: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:83] Global GPU Count: 1
2021-11-09 10:07:20.452440: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:97] Global Replica Id: 0; Local Replica Id: 0
NCCL version 2.10.3+cuda11.0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 00/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 01/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 02/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 03/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 04/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 05/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 06/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 07/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 08/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 09/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 10/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 11/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 12/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 13/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 14/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 15/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 16/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 17/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 18/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 19/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 20/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 21/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 22/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 23/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 24/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 25/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 26/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 27/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 28/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 29/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 30/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 31/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
hugectr-chief-0:1:251 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
hugectr-chief-0:1:251 [0] NCCL INFO Connected all rings
hugectr-chief-0:1:251 [0] NCCL INFO Connected all trees
hugectr-chief-0:1:251 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
hugectr-chief-0:1:251 [0] NCCL INFO comm 0x7efdbc00da30 rank 0 nranks 1 cudaDev 0 busId 40 - Init COMPLETE
2021-11-09 10:07:20.979918: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10

None of the strategy instantiations shown above (including the commented-out variants in main()) proceeds beyond this point; the exact constructions tried are collected in the sketch after this list:

  • all visible GPUs vs. a single GPU
  • HierarchicalCopyAllReduce vs. NcclAllReduce
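
For reference, these are the strategy constructions tried (collected from the commented-out lines in main() above); each one was used in place of the others, and all of them hang at sok.Init():

# tf imported as in the script above; every variant below hangs at sok.Init().
strategy = tf.distribute.MirroredStrategy()           # all visible GPUs, default cross-device ops
strategy = tf.distribute.MirroredStrategy(['gpu:0'])  # a single GPU
strategy = tf.distribute.MirroredStrategy(
    ['gpu:0'], cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
strategy = tf.distribute.MirroredStrategy(
    ['gpu:0'], cross_device_ops=tf.distribute.NcclAllReduce())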


Environment:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.04   Driver Version: 450.119.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    58W / 400W |    714MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

The machine has 16 of these GPUs in total; only the first GPU's row of nvidia-smi output is shown above.

I also made sure to give the container a /dev/shm of 16Gi.


About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 53 (19 by maintainers)

Most upvoted comments

I got past this by explicitly passing the 16 GPUs into the strategy; the default (unspecified) device list did not work.
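
A minimal sketch of that workaround, assuming the 16-GPU machine from this issue (the device-list construction and batch size are illustrative, not from the original comment):

import tensorflow as tf
import sparse_operation_kit as sok

# Workaround: enumerate all 16 GPUs explicitly instead of relying on
# MirroredStrategy's default device discovery.
devices = [f'/gpu:{i}' for i in range(16)]
strategy = tf.distribute.MirroredStrategy(devices=devices)

with strategy.scope():
    sok.Init(global_batch_size=4096)  # the same call that previously hung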

Hey Randall, the 21.12 container is now available here: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow-training

Let us know if it resolves your issue.

We can reproduce it and are trying to figure out why. We will get back to you when we find a solution.