tensorflow: Training with GPU on TF 2.0 is much slower than on TF 1.14 when `input_dim` of `tf.keras.layers.Embedding` is large

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux-3.10.0-957.21.3.el7.x86_64 CentOS-7.3.1611-Core
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: None
  • TensorFlow installed from (source or binary): binary, pip install tensorflow-gpu
  • TensorFlow version (use command below): 2.0.0-rc0(v2.0.0-beta1-5101-gc75bb66), 1.14.0(v1.14.0-rc1-22-gaf24dc91b5)
  • Python version: 3.6.8
  • Bazel version (if compiling from source): None
  • GCC/Compiler version (if compiling from source): None
  • CUDA/cuDNN version: CUDA 10.0.130, cuDNN 7.6.3.30
  • GPU model and memory: RTX 2070 Super, 8GB

Describe the current behavior
I converted the Keras implementation of Neural Matrix Factorization (NeuMF) to tf.keras, and it works well on TF 1.14.
But when I run it on TF 2.0.0-rc0, training is much slower than on TF 1.14.
I used the profiling tools to check the timing and found that ReadVariableOp takes far too much time when the input_dim of tf.keras.layers.Embedding is set to a large number.
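
For reference, here is a minimal sketch of how such a trace can be captured with the TensorBoard callback (the profile_batch value here is an assumption; it selects which batch gets profiled):

import tensorflow as tf

# Profile a single training batch and write the trace to the log directory;
# the trace can then be inspected in TensorBoard's Profile tab, where
# ReadVariableOp shows up as the dominant op on TF 2.0.
tb_callback = tf.keras.callbacks.TensorBoard(log_dir='tb-logs', profile_batch=2)
# model.fit(..., callbacks=[tb_callback])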

Tensorflow version:  2.0.0-rc0
Epoch 1/3
10000/10000 [==============================] - 5s 532us/sample - loss: 0.6935
Epoch 2/3
10000/10000 [==============================] - 4s 436us/sample - loss: 0.6903
Epoch 3/3
10000/10000 [==============================] - 4s 431us/sample - loss: 0.6851
Tensorflow version:  1.14.0
Epoch 1/3
10000/10000 [==============================] - 2s 212us/sample - loss: 0.7035
Epoch 2/3
10000/10000 [==============================] - 0s 28us/sample - loss: 0.6981
Epoch 3/3
10000/10000 [==============================] - 0s 29us/sample - loss: 0.6909

Describe the expected behavior
Training on TF 2.0 with a large Embedding input_dim should be as fast as on TF 1.14, or faster.

Code to reproduce the issue
I have shared the code on Colab; alternatively, see the code below.

# -*- coding:utf-8 -*-

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.regularizers import l1, l2
from tensorflow.keras.layers import Embedding, Input, Dense, Lambda, Flatten

def get_model(num_users, num_items, mf_dim=10, layers=[10], reg_layers=[0], reg_mf=0, alpha=0.5):
    assert len(layers) == len(reg_layers)
    num_layer = len(layers)  # number of layers in the MLP

    # Input variables
    user_input = Input(shape=(1,), dtype='int32', name='user_input')
    item_input = Input(shape=(1,), dtype='int32', name='item_input')

    # Embedding layers (all use the same initializer configuration)
    init = keras.initializers.RandomNormal(mean=0.0, stddev=0.01, seed=None)
    MF_Embedding_User = Embedding(input_dim=num_users, output_dim=mf_dim, name='mf_embedding_user',
                                  embeddings_initializer=init,
                                  embeddings_regularizer=l2(reg_mf), input_length=1)
    MF_Embedding_Item = Embedding(input_dim=num_items, output_dim=mf_dim, name='mf_embedding_item',
                                  embeddings_initializer=init,
                                  embeddings_regularizer=l2(reg_mf), input_length=1)

    MLP_Embedding_User = Embedding(input_dim=num_users, output_dim=layers[0] // 2, name='mlp_embedding_user',
                                   embeddings_initializer=init,
                                   embeddings_regularizer=l2(reg_layers[0]), input_length=1)
    MLP_Embedding_Item = Embedding(input_dim=num_items, output_dim=layers[0] // 2, name='mlp_embedding_item',
                                   embeddings_initializer=init,
                                   embeddings_regularizer=l2(reg_layers[0]), input_length=1)

    # MF part: element-wise product of the user and item latent vectors
    mf_user_latent = Flatten()(MF_Embedding_User(user_input))
    mf_item_latent = Flatten()(MF_Embedding_Item(item_input))
    mf_vector = keras.layers.Multiply()([mf_user_latent, mf_item_latent])

    # MLP part: concatenation of the user and item latent vectors
    mlp_user_latent = Flatten()(MLP_Embedding_User(user_input))
    mlp_item_latent = Flatten()(MLP_Embedding_Item(item_input))
    mlp_vector = keras.layers.Concatenate(axis=-1)([mlp_user_latent, mlp_item_latent])

    for idx in range(1, num_layer):
        mlp_vector = Dense(layers[idx],
                           activation='relu',
                           kernel_regularizer=l2(reg_layers[idx]),
                           bias_regularizer=l2(reg_layers[idx]),
                           name="layer%d" % idx)(mlp_vector)

    # Concatenate the weighted MF and MLP parts
    mf_vector = Lambda(lambda x: x * alpha)(mf_vector)
    mlp_vector = Lambda(lambda x: x * (1 - alpha))(mlp_vector)
    predict_vector = keras.layers.Concatenate(axis=-1)([mf_vector, mlp_vector])

    # Final prediction layer
    prediction = Dense(1,
                       activation='sigmoid',
                       kernel_initializer='lecun_uniform',
                       bias_initializer='lecun_uniform',
                       name="prediction")(predict_vector)

    model = keras.Model(inputs=[user_input, item_input], outputs=[prediction])
    return model

def generate_data(num_user, num_item, count=100):
    # Random (user, item) pairs with random binary labels
    user_input = np.random.randint(0, num_user, size=count)
    item_input = np.random.randint(0, num_item, size=count)
    labels = np.random.randint(0, 2, size=count)
    return user_input, item_input, labels

def test_model():
    num_user = 1000000
    num_item = 100000
    count = 10000
    user_input, item_input, labels = generate_data(num_user, num_item, count)

    model = get_model(num_user, num_item)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss=tf.keras.losses.BinaryCrossentropy()
    )

    # Callbacks
    callbacks = [tf.keras.callbacks.TensorBoard(log_dir='tb-logs')]
    model.fit([user_input, item_input], labels, batch_size=256, epochs=3, callbacks=callbacks)

if __name__ == "__main__":
    print("Tensorflow version: ", tf.__version__)
    test_model()

Other info / logs
The attached ‘tb-logs.zip’ contains the TensorBoard logs.

Profiling screenshot of the training on TF 2.0.0-rc0: [image: tf2-profile]

Profiling screenshot of the training on TF 1.14: [image: tf114-profile]

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 27 (7 by maintainers)

Most upvoted comments

There seems to be a significant slowdown in general when using TF2 fit_generator; in my own code it is around a 3x slowdown between TF1 and TF2. It is easy to reproduce using the “Transfer Learning with TFHub” example from the TF2 official tutorials on Colab: https://www.tensorflow.org/beta/tutorials/images/hub_with_keras

To reproduce it, all I did was change model.fit to model.fit_generator. I ran these cases for both TF2 and TF1; TF1 was selected via this change to the first code cell: %tensorflow_version 1.x
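
For reference, a self-contained toy showing the exact swap (the tiny model and random data are assumptions standing in for the tutorial's model and data iterator):

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer='adam', loss='mse')

def gen(batch_size=32):
    # Endless stream of random (x, y) batches
    while True:
        yield np.random.rand(batch_size, 4), np.random.rand(batch_size, 1)

model.fit(np.random.rand(256, 4), np.random.rand(256, 1), epochs=1)  # fast path
model.fit_generator(gen(), steps_per_epoch=8, epochs=1)              # ~3x slower on TF2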

Here are the training runs for the four cases:

TF2 fit
Epoch 1/2
115/115 [======] - 24s 212ms/step - loss: 0.6619 - acc: 0.9062
Epoch 2/2
115/115 [======] - 20s 178ms/step - loss: 0.3309 - acc: 0.8125

TF2 fit_generator
115/115 [======] - 56s 485ms/step - loss: 0.6666 - acc: 0.9375
Epoch 2/2
115/115 [======] - 49s 424ms/step - loss: 0.3345 - acc: 0.9688

TF1 fit
Epoch 1/2
115/115 [======] - 16s 136ms/step - loss: 0.6406 - acc: 0.8750
Epoch 2/2
115/115 [======] - 15s 129ms/step - loss: 0.3279 - acc: 0.8750

TF1 fit_generator
Epoch 1/2
115/115 [======] - 16s 139ms/step - loss: 0.7300 - acc: 0.8125
Epoch 2/2
115/115 [======] - 15s 132ms/step - loss: 0.3492 - acc: 0.9062

With TF1, there is no difference between fit and fit_generator, as you might hope. TF2 seems slower in general, and fit_generator in particular is 3x slower than TF1, at least for this tutorial and my own code.

BTW, Colab is using TF2 RC1 at the moment:

tf.__version__
'2.0.0-rc1'

Thanks to all.

I’ve tried tf.compat.v1.disable_eager_execution() and model.fit(x=generator, ...), with and without tf.distribute.MirroredStrategy, but neither helped.
I think the key problem is the combination of a large input_dim in tf.keras.layers.Embedding and training with a generator.

The following cases were all tested with a GPU on TF 2.0.0-rc2 and compared against TF 1.14.

  1. Small input_dim, model.fit without generator, without tf.distribute.MirroredStrategy. [Fast]
  2. Large input_dim, model.fit without generator, without tf.distribute.MirroredStrategy. [Slow]
  3. Large input_dim, model.fit without generator, with tf.distribute.MirroredStrategy. [Fast]
  4. Large input_dim, model.fit with generator, without tf.distribute.MirroredStrategy. [Slow]
  5. Large input_dim, model.fit with generator, with tf.distribute.MirroredStrategy. [Slow]

Here is the pseudocode I have tried (a runnable variant follows below).

import tensorflow as tf

# `disable_eager_execution` conflicts with `tf.distribute.MirroredStrategy`:
# "AssertionError: assert isinstance(x, dataset_ops.DatasetV2)" will be raised.
# tf.compat.v1.disable_eager_execution()

from tensorflow.keras.utils import Sequence

class MyGenerator(Sequence):
    def __init__(self, ...):
        # do something (e.g. store the data, batch size, and number of batches)

    def __iter__(self):
        return self

    def __len__(self):
        # number of batches per epoch, as stored in __init__
        return self.num_batches

    def __getitem__(self, index):
        # do something (build the index-th batch)
        return tuple([x1, x2, ...]), y

def train():
    strategy = tf.distribute.MirroredStrategy()
    print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
    with strategy.scope():
        model = ... # Create model
        model.compile(...)
        
        my_gen = MyGenerator(...) # Create generator

        model.fit(
            x=my_gen,
            # Do not specify `y` and `batch_size` if x is a dataset, generator, or keras.utils.Sequence instance
            # Other arguments are the same as `fit_generator`'s
            ...
        )

if __name__ == '__main__':
    train()
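
For completeness, a runnable variant of the sketch above (the random two-input batches mirror the NeuMF inputs and are an assumption; the num_batches attribute is the hypothetical counterpart of what __init__ stores above):

import numpy as np
import tensorflow as tf

class RandomPairSequence(tf.keras.utils.Sequence):
    # Yields random (user, item) batches with binary labels.
    def __init__(self, num_user, num_item, batch_size, num_batches):
        self.num_user = num_user
        self.num_item = num_item
        self.batch_size = batch_size
        self.num_batches = num_batches

    def __len__(self):
        return self.num_batches

    def __getitem__(self, index):
        user = np.random.randint(0, self.num_user, size=self.batch_size)
        item = np.random.randint(0, self.num_item, size=self.batch_size)
        y = np.random.randint(0, 2, size=self.batch_size)
        return (user, item), y

# Usage: model.fit(x=RandomPairSequence(1000000, 100000, 256, 40), epochs=3)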

@tabacof Hi, I created the other issue linked.

There are two current resolutions to the issues people were having there: adding the line tf.compat.v1.disable_eager_execution() right after importing tf, or switching to model.fit() when using TF 2.0.

You say that you don’t believe the issue is caused by eager execution; the easiest way to test that is to add the first fix, tf.compat.v1.disable_eager_execution(), right after importing TF. If this does not improve performance, that should put the eager-execution argument to rest.
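
Concretely, the check is just this (a sketch; the call has to run before any model or graph is created):

import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # immediately after the import

# ... build, compile, and fit the model as usual; if training speed
# recovers, eager execution was the bottleneck.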

The issue I encountered is that fit_generator kicks execution into eager mode no matter what, as robeita says, and in my experience eager execution is much slower in every case.

I hope that this can help you isolate the issue so that the cause can be identified.

Having the same problem. I have upgraded to TF 2.2, and I run the code on the CPU. Analyzing the time profile, I also find that the embedding_lookup ReadVariableOp takes too much time.

I have the same problem on TensorFlow 2.5.0: ReadVariableOp takes a long time, and memory usage is unstable.

Just upgraded to TF 2.2. Having huge performance issues with Keras models.

Update: Disabled eager execution and changed all the inputs to model.predict() from tensors to numpy arrays. It looks much faster now.
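
A minimal self-contained sketch of that combination (the toy model and data are assumptions, just to show where the calls go):

import numpy as np
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # must run before the model is built

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])

# Feed plain numpy arrays to predict() instead of tf.Tensor inputs.
x = np.random.rand(8, 4).astype('float32')
preds = model.predict(x)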

Update 2: Never mind, I’m switching back to 1.14. My summary writer doesn’t work if I use tf.compat.v1.disable_eager_execution(), and I have no time to deal with all these bugs.

Can confirm with @cupdike. I had a similar issue in my own project when switching to TF2 (stable; I waited for the official release a couple of days ago), with training 2x to 3x slower for the same data and code compared to TF1. After some Google searching and reading, I then implemented the code using tf.data.Dataset.from_generator() instead, which allows me to use model.fit().

Unfortunately, there was zero performance benefit either way.

As for some pseudocode (posting here just in case someone can point out something fundamentally wrong with my setup), the fit_generator version of my code went something like the sketch below. All my code uses the internal tf.keras instead of the external one:

def datagen(args):
    while True:
        # some code here to load and manipulate data into x and y, mostly numpy functions
        yield x, y

# some code here to create and compile the model

model.fit_generator(datagen(args), . . . )

For the pseudocode using tf.data.Dataset.from_generator():

from tensorflow.compat.v2.data import Dataset

def datagen(args):
    while True:
        # some code here to load and manipulate data into x and y, mostly numpy functions
        yield x, y

# some code here to create and compile the model

train_data = Dataset.from_generator(generator=lambda: datagen(args), . . . )
model.fit(train_data, . . . )
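
A runnable version of that sketch, for concreteness (the random two-input batches are an assumption standing in for the real data pipeline; output_types and output_shapes are required by from_generator):

import numpy as np
import tensorflow as tf

def datagen(num_user, num_item, batch_size):
    # Stand-in for the real data loading: random (user, item) pairs and labels.
    while True:
        user = np.random.randint(0, num_user, size=batch_size).astype('int64')
        item = np.random.randint(0, num_item, size=batch_size).astype('int64')
        y = np.random.randint(0, 2, size=batch_size).astype('float32')
        yield (user, item), y

train_data = tf.data.Dataset.from_generator(
    generator=lambda: datagen(1000000, 100000, 256),
    output_types=((tf.int64, tf.int64), tf.float32),
    output_shapes=((tf.TensorShape([None]), tf.TensorShape([None])),
                   tf.TensorShape([None])),
)
# model.fit(train_data, steps_per_epoch=40, epochs=3)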