HugeCTR: [BUG] SparseOperationKit hangs on initialization

Describe the bug: SparseOperationKit hangs during sok.Init() when it is called inside a tf.distribute.MirroredStrategy scope on a machine with 16 A100 GPUs. Execution never returns from sok.Init(), regardless of which strategy configuration is used.

To Reproduce: Steps to reproduce the behavior:

  1. Build an image based on gcr.io/deeplearning-platform-release/tf2-gpu.2-5 and install SOK:
FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-5

COPY ./hugectr /usr/src/app/
WORKDIR /usr/src/app/sparse_operation_kit
# The PYTHONPATH setting in ./install.sh fails, so we exit 0 even if the install step fails
RUN ./install.sh --SM="70;75;80" --USE_NVTX=OFF; exit 0
ENV PYTHONPATH "/usr/local/lib/:${PYTHONPATH}"
WORKDIR /usr/src/app
  2. Run this code:
import os
from absl import flags, app, logging
import tensorflow as tf

import numpy as np

import sparse_operation_kit as sok

flags.DEFINE_integer('num_items', 1024, 'Number of items in embedding.')

FLAGS = flags.FLAGS

batch_size = 4096

os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, range(16)))

def gen():
  while True:
    users = tf.random.uniform([batch_size], 0, FLAGS.num_items, tf.int32)
    items = tf.random.uniform([batch_size], 0, FLAGS.num_items, tf.int32)
    yield users, items, tf.random.normal([batch_size])


class Model(tf.keras.Model):

  def __init__(self):
    super().__init__()

    self._embeddings = sok.DistributedEmbedding(
      combiner='mean',
      max_vocabulary_size_per_gpu=1024,
      embedding_vec_size=256,
      slot_num=2,
      max_nnz=2,
    )

  def call(self, inputs, training=False, mask=None):
    # Whatever the lookup is.
    logging.info(f'user: {inputs[0].shape}, {inputs[0].device}')
    return self._embeddings(tf.concat([inputs[0], inputs[1]], axis=1))


def main(_):
  gpus = tf.config.list_physical_devices('GPU')
  logging.info(f'Found {gpus}')
  for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

  # strategy = tf.distribute.MirroredStrategy(['gpu:0'], cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
  # strategy = tf.distribute.MirroredStrategy(['gpu:0'])
  # strategy = tf.distribute.MirroredStrategy()
  strategy = tf.distribute.MirroredStrategy(['gpu:0'], cross_device_ops=tf.distribute.NcclAllReduce())

  ds = tf.data.Dataset.from_generator(gen, (tf.int32, tf.int32, tf.float32))
  ds = ds.prefetch(10)

  with strategy.scope():
    logging.info(f'Initializing sok.')
    result = sok.Init(global_batch_size=1024)
    model = Model()
    emb_opt = tf.keras.optimizers.SGD(0.001)
    dense_opt = tf.keras.optimizers.SGD(0.001)

  # more code that is never reached
 
if __name__ == '__main__':
  app.run(main)
  3. Run this on a machine with 16 A100 GPUs.
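
Before step 3, a minimal sanity-check sketch (assuming the container built in step 1) can confirm that SOK is importable and that TensorFlow sees all 16 GPUs:

import tensorflow as tf
import sparse_operation_kit as sok  # confirms the SOK build installed by the Dockerfile is on PYTHONPATH

# On the machine described under "Environment" below, TensorFlow should report 16 GPUs.
gpus = tf.config.list_physical_devices('GPU')
print(f'SOK loaded from {sok.__file__}; TensorFlow sees {len(gpus)} GPU(s)')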

Expected behavior: Execution should proceed past sok.Init(), but it does not. I get the usual TensorFlow initialization and GPU-discovery logs, then the script prints "Initializing sok." and hangs; it never proceeds beyond the logs shown here.

I1109 10:07:18.764635 139666590381888 model.py:56] Initializing sok.
2021-11-09 10:07:20.041804: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-11-09 10:07:20.042466: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2200210000 Hz
You are using the plugin with MirroredStrategy.
hugectr-chief-0:1:1 [0] NCCL INFO Bootstrap : Using eth0:7.12.81.17<0>
hugectr-chief-0:1:1 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket : Tx CPU start: -2
hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket : Rx CPU start: -2
hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket : Flow placement enabled.
hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket : queue skip: 0
hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket : Using [0]eth0:7.12.81.17<0>
hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket plugin initialized
hugectr-chief-0:1:1 [0] NCCL INFO Using network FastSocket
2021-11-09 10:07:20.452440: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:81] Global seed is 314905248
2021-11-09 10:07:20.452440: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:82] Local GPU Count: 16
2021-11-09 10:07:20.452440: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:83] Global GPU Count: 1
2021-11-09 10:07:20.452440: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:97] Global Replica Id: 0; Local Replica Id: 0
NCCL version 2.10.3+cuda11.0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 00/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 01/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 02/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 03/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 04/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 05/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 06/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 07/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 08/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 09/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 10/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 11/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 12/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 13/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 14/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 15/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 16/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 17/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 18/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 19/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 20/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 21/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 22/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 23/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 24/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 25/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 26/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 27/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 28/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 29/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 30/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Channel 31/32 :    0
hugectr-chief-0:1:251 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
hugectr-chief-0:1:251 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
hugectr-chief-0:1:251 [0] NCCL INFO Connected all rings
hugectr-chief-0:1:251 [0] NCCL INFO Connected all trees
hugectr-chief-0:1:251 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
hugectr-chief-0:1:251 [0] NCCL INFO comm 0x7efdbc00da30 rank 0 nranks 1 cudaDev 0 busId 40 - Init COMPLETE
2021-11-09 10:07:20.979918: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10

None of the strategy instantiations shown above (including the commented-out variants in main()) proceeds beyond this point; the exact constructions tried are collected in the sketch after this list:

  • all visible GPUs vs. a single GPU
  • HierarchicalCopyAllReduce vs. NcclAllReduce
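
For reference, these are the strategy constructions tried (collected from the commented-out lines in main() above); each one was used in place of the others, and all of them hang at sok.Init():

# tf imported as in the script above; every variant below hangs at sok.Init().
strategy = tf.distribute.MirroredStrategy()           # all visible GPUs, default cross-device ops
strategy = tf.distribute.MirroredStrategy(['gpu:0'])  # a single GPU
strategy = tf.distribute.MirroredStrategy(
    ['gpu:0'], cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
strategy = tf.distribute.MirroredStrategy(
    ['gpu:0'], cross_device_ops=tf.distribute.NcclAllReduce())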


Environment:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.04   Driver Version: 450.119.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    58W / 400W |    714MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

The machine has 16 of these GPUs in total; only the first GPU's row of nvidia-smi output is shown above.

I also made sure to give the container a /dev/shm of 16Gi.


About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 53 (19 by maintainers)

Most upvoted comments

I got past this by explicitly passing the 16 GPUs into the strategy; the default (unspecified) device list did not work.
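
A minimal sketch of that workaround, assuming the 16-GPU machine from this issue (the device-list construction and batch size are illustrative, not from the original comment):

import tensorflow as tf
import sparse_operation_kit as sok

# Workaround: enumerate all 16 GPUs explicitly instead of relying on
# MirroredStrategy's default device discovery.
devices = [f'/gpu:{i}' for i in range(16)]
strategy = tf.distribute.MirroredStrategy(devices=devices)

with strategy.scope():
    sok.Init(global_batch_size=4096)  # the same call that previously hung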

Hey Randall, the 21.12 container is now available here: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow-training

Let us know if it resolves your issue.

We can reproduce it and are trying to figure out why. We will get back to you when we find a solution.