tensorflow: multi-device function optimization failure

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 19.10
  • TensorFlow installed from (source or binary): GitHub release
  • TensorFlow version (use command below): 2.0
  • Python version: 3.7.5
  • Bazel version (if compiling from source): 0.26.1
  • GCC/Compiler version (if compiling from source): 9.2.1 20191008
  • CUDA/cuDNN version: 10.1, 7.6.4
  • GPU model and memory: RTX2080Ti x2 NVLink

Describe the current behavior It takes too much time to start training and shows the warning “multi-device function optimization failure”.

Describe the expected behavior Training should start quickly after the model is compiled, without this warning.

Code to reproduce the issue

main:

import os
os.environ['TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_IGNORE_PERFORMANCE'] = '1'
import tensorflow as tf
from Models import MCN
from DataSets import ImageNet

for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
logical_gpus = tf.config.experimental.list_logical_devices('GPU')

BATCH_SIZE=20
BATCHES_PER_APPLY_GRADIENTS=1000//BATCH_SIZE

ds=ImageNet.ImageNetP()
strategy=tf.distribute.MirroredStrategy()
with strategy.scope():
    model=MCN.mcn_520(2,24)
    model.summary()
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        optimizer=tf.keras.optimizers.SGD(),
        metrics=[tf.keras.metrics.TopKCategoricalAccuracy(1,'top1'),tf.keras.metrics.TopKCategoricalAccuracy(5,'top5')]
        )
    fit_ds,val_ds=ds(BATCH_SIZE)
    model.fit(
        fit_ds,
        epochs=1000000,
        steps_per_epoch=BATCHES_PER_APPLY_GRADIENTS*200,
        validation_data=val_ds,
        validation_steps=ds.val_images//BATCH_SIZE,
    )

MCN.py:

import math
import tensorflow as tf
from tensorflow import keras

class Swish(keras.layers.Layer):
    def __init__(self):
        super(Swish, self).__init__()
        self.weight = self.add_weight(initializer='uniform',trainable=True)

    def __call__(self, inputs):
        return inputs+tf.sigmoid(self.weight*inputs)


class Conv(keras.Model):
    def __init__(self,filters,kernel_size=1,strides=1,padding='valid'):
        super(Conv, self).__init__()
        self.conv = keras.layers.Conv2D(filters,kernel_size,strides,padding)
        self.bn = keras.layers.BatchNormalization()
        self.ac = Swish()

    def __call__(self,inputs):
        return self.ac(self.bn(self.conv(inputs)))


class SEBlock(keras.Model):
    def __init__(self, filters):
        super(SEBlock, self).__init__()
        self.conv0 = keras.layers.Conv2D(filters//4,1,1)
        self.drop = keras.layers.Dropout(0.25)
        self.conv1 = keras.layers.Conv2D(filters,1,1)
        self.bn = keras.layers.BatchNormalization()
        self.ac = Swish()

    def __call__(self,inputs):
        x = self.conv1(self.drop(self.conv0(tf.reduce_mean(inputs,[1,2],keepdims=True))))
        return self.ac(self.bn(tf.sigmoid(x)*inputs))


class ResBlock(keras.Model):
    def __init__(self, filters):
        super(ResBlock, self).__init__()
        self.conv0 = keras.layers.Conv2D(filters//4,1,1)
        self.drop = keras.layers.Dropout(0.25)
        self.conv1 = keras.layers.Conv2D(filters,3,1,'same')
        self.bn = keras.layers.BatchNormalization()
        self.ac = Swish()

    def __call__(self,inputs):
        x = self.conv1(self.drop(self.conv0(inputs)))
        return self.ac(self.bn(inputs+x))

def mcn_520(width, growth,input_shape=[256,256,3]):
    fs = int(width*growth)
    inputs=keras.layers.Input(input_shape)
    x=keras.layers.Conv2D(fs,8,2)(inputs)
    x=keras.layers.MaxPool2D(2)(x)
    x1=Conv(fs//width)(SEBlock(fs)(x))
    x2=Conv(fs//width)(ResBlock(fs)(x))
    for i, depth in enumerate([2, 3, 5, 4]):
        for _ in range(int(6*depth)):
            fs+=int(math.sqrt(fs*width))
            t=keras.layers.Concatenate()([x,x1,x2])
            t=keras.layers.Dropout(0.25)(t)
            t=Conv(fs//width, 1, 1)(t)
            t=keras.layers.Dropout(0.25)(t)
            x1=SEBlock(fs//width)(t)
            x2=ResBlock(fs//width)(t)
            t=keras.layers.Concatenate()([t,x1,x2])
            t=keras.layers.Dropout(0.25)(t)
            t=Conv(growth,1,1)(t)
            x=keras.layers.Concatenate()([x,t])
        if i != 3:
            fs //= 2
            x=keras.layers.MaxPool2D(2)(Conv(fs)(x))
            x1=keras.layers.MaxPool2D(2)(Conv(fs//width)(x1))
            x2=keras.layers.MaxPool2D(2)(Conv(fs//width)(x2))
    x=keras.layers.GlobalMaxPool2D()(x)
    x=keras.layers.Dropout(0.25)(x)
    outputs=keras.layers.Dense(1000,activation='softmax')(x)
    return keras.Model(inputs=inputs,outputs=outputs,name='MCN520')
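
As a rough aside (my addition, not part of the original report): the nested loops above instantiate 6*(2+3+5+4) = 84 blocks, and each block expands into many Keras layers (Conv2D, BatchNormalization, Dropout, Concatenate, the custom Swish, etc.), so the functional graph Grappler has to optimize is very large. A small sketch to gauge that size, assuming the definitions in MCN.py above:

# Sketch (assumption, not from the original issue): print a rough measure of
# the graph size that the meta-optimizer has to process.
model = mcn_520(2, 24)
print('number of layers:', len(model.layers))
print('number of trainable variables:', len(model.trainable_variables))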

Other info / logs

Train for 10000 steps, validate for 2500 steps
Epoch 1/1000000
2019-11-27 07:06:16.642667: W tensorflow/core/common_runtime/process_function_library_runtime.cc:675] Ignoring multi-device function optimization failure: Deadline exceeded: meta_optimizer exceeded deadline.
2019-11-27 07:06:27.657015: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2019-11-27 07:06:28.144415: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 17 (1 by maintainers)

Most upvoted comments

I have the same issue with TF 2.2.0, both with and without MirroredStrategy. If anyone has any ideas on how to fix the problem I would be very grateful (my model has a few million inputs).

I am having the same issue. If anyone has any ideas on how to go about diagnosing (or fixing) the problem, I would be extremely grateful.

I have two Gigabyte servers with 8 K20X GPUs each. They are identical in pretty much every way. With the same model launched on both, one of them executes fine with distribution enabled (~1.5 s per epoch), while the other hits the optimization timeout (~6.5 s per epoch).

System Details:

  • 2 X Intel E5-2650
  • 8 X NVIDIA K20X
  • 2 X 1TB SATA SSD configured in RAID 0
  • 128GB DDR3 memory
  • 256GB Swap
  • Ubuntu 18.04.4

Software Details

  • Python 3.6
  • TF 2.1.0-2.2.0

I have attempted the following actions with no success:

  • Re-install CUDA 10.1 from scratch
  • Re-install TF 2.2.0 from binary
  • Re-install TF 2.1.0 from binary
  • Build TF 2.2.0 from source

Yeah, it looks like Grappler is taking too long and timing out on the big graph, and is likely skipping some graph optimizations. @HLSS-Hen, when you say it is slow, can you provide some numbers? Is the time to start the first step slow, or is the actual training slow? Can you provide numbers for 1 GPU (no distribution) vs. 2 GPUs with MirroredStrategy?
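
A minimal diagnostic sketch along those lines (my assumption, not something posted in the thread): measure the time to reach the first training step, once with Grappler's meta-optimizer disabled via tf.config.optimizer.set_experimental_options and once without, and for 1 GPU vs. MirroredStrategy. It reuses the model and dataset from the reproduction script above; disabling the meta-optimizer only confirms where the startup time goes, it is not a production fix.

# Diagnostic sketch (assumption, not from the thread): time-to-first-step with
# Grappler's meta-optimizer disabled. Reuses MCN.mcn_520 / ImageNet.ImageNetP
# from the reproduction script above.
import time
import tensorflow as tf
from Models import MCN
from DataSets import ImageNet

# Skip all Grappler passes, so the "meta_optimizer exceeded deadline" path
# cannot be hit; remove this line for the baseline measurement.
tf.config.optimizer.set_experimental_options({'disable_meta_optimizer': True})

# Drop the MirroredStrategy block entirely for the 1-GPU (no distribution) case.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = MCN.mcn_520(2, 24)
    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                  optimizer=tf.keras.optimizers.SGD())

fit_ds, _ = ImageNet.ImageNetP()(20)

start = time.time()
model.fit(fit_ds, epochs=1, steps_per_epoch=1)   # time to reach the first step
print('time to first step: %.1fs' % (time.time() - start))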