tensorflow-upstream: Slowdown on LSTM since ROCm 2.3

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian 9
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): tensorflow-rocm (pip)
  • TensorFlow version (use command below): b'v1.13.1-691-gf092438' 1.13.1 (see the version snippet after this list)
  • Python version: 3.6.7
  • GCC/Compiler version (if compiling from source): 6.3
  • ROCm/MIOpen version: 2.2 (before) vs latest, 2.3+ (after)
  • GPU model and memory: Dual VEGA FE 16GB
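
The version string above was produced with the version command from the issue template; for reference, a minimal equivalent (the exact template command is an assumption based on the standard TF 1.x issue template):

import tensorflow as tf
# Prints the build git tag and the release version, e.g. b'v1.13.1-691-gf092438' 1.13.1
print(tf.GIT_VERSION, tf.VERSION)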

Describe the current behavior: after upgrading ROCm from 2.2 to 2.3+, the LSTM model below takes roughly 369 s per training epoch instead of roughly 300 s (about a 20% slowdown; full logs below).

Describe the expected behavior: per-epoch training time should be no worse than it was on ROCm 2.2.

import keras
import numpy as np
import pandas as pd
import time
import tensorflow as tf

from keras.models import Model
from keras.layers import Dense, Activation, Input, LSTM, Embedding, Lambda
from keras.utils import plot_model

from sklearn.utils import class_weight
from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve, average_precision_score

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # restrict the run to one of the two Vega FE GPUs

X_train_sequence = pd.read_pickle('X_train_sequence_10percent.pkl')
y_train_sequence = pd.read_pickle('y_train_sequence_10percent.pkl')
X_test_sequence = pd.read_pickle('X_test_sequence_10percent.pkl')
y_test_sequence = pd.read_pickle('y_test_sequence_10percent.pkl')

sequence_len = X_train_sequence.shape[1] # 2384
vocabulary_size = 23472        # 23472
embedding_dim = 32
hidden_size = 100

nb_epoch = 1
batch_size = 128

# Keras expects class_weight as a {class_index: weight} mapping
class_weights = dict(enumerate(class_weight.compute_class_weight(
    'balanced', np.unique(y_train_sequence), y_train_sequence)))

def baseline_LSTMregul_model(sequence_len, vocabulary_size, embedding_dim, hidden_size):
    inputs     = Input(shape=(sequence_len,), name='input')
    embedding  = Embedding(vocabulary_size, embedding_dim, mask_zero=True,
                           input_length=sequence_len, name='embedding',
                           embeddings_regularizer = keras.regularizers.l2(0.01))(inputs)
    lstm1      = LSTM(hidden_size, name='LSTM1')(embedding)
    output = Dense(1, activation='sigmoid', name='output')(lstm1)
    return Model(inputs=[inputs], outputs=output)

def auroc(y_true, y_pred):
    return tf.py_func(roc_auc_score, (y_true, y_pred), tf.double)

model = baseline_LSTMregul_model(sequence_len, vocabulary_size,
                                 embedding_dim, hidden_size)

adam = keras.optimizers.Adam(lr=0.0001, beta_1=0.9, beta_2=0.999)
# compile() returns None, so there is nothing useful to assign here
model.compile(loss='binary_crossentropy',
              optimizer=adam,
              metrics=['accuracy', auroc])

model.summary()

model_history = model.fit(X_train_sequence, y_train_sequence, validation_split=0.1,
                          shuffle=True, class_weight=class_weights, epochs=nb_epoch,
                          batch_size=batch_size, verbose=2)


y_test_predict = model.predict(X_test_sequence, verbose=2)
print('Embedding + LSTM On test dataset:')
print('ROC AUC score: ', roc_auc_score(y_test_sequence, y_test_predict))
print('AP score: ', average_precision_score(y_test_sequence, y_test_predict))
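
Per-epoch wall time is the number compared in the two runs below. For a more direct measurement than the single summary line printed by fit(), a per-epoch timing callback could be added (a minimal sketch; EpochTimer is a hypothetical helper and was not part of the runs that produced the logs):

import time
from keras.callbacks import Callback

class EpochTimer(Callback):
    """Records wall-clock time for each training epoch."""
    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.time()
    def on_epoch_end(self, epoch, logs=None):
        print('epoch %d took %.1f s' % (epoch + 1, time.time() - self._start))

# e.g. model.fit(..., callbacks=[EpochTimer()])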

Before the upgrade to ROCm 2.3+ (about 300 s per epoch):

#before 
Using TensorFlow backend.
WARNING:tensorflow:From /home/philix/.local/lib/python3.6/site-packages/tensorflow/python/ops/distributions/distribution.py:265: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /home/philix/.local/lib/python3.6/site-packages/tensorflow/python/ops/distributions/bernoulli.py:169: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input (InputLayer)           (None, 2384)              0         
_________________________________________________________________
embedding (Embedding)        (None, 2384, 32)          751104    
_________________________________________________________________
LSTM1 (LSTM)                 (None, 100)               53200     
_________________________________________________________________
output (Dense)               (None, 1)                 101       
=================================================================
Total params: 804,405
Trainable params: 804,405
Non-trainable params: 0
_________________________________________________________________
Train on 11872 samples, validate on 1320 samples
Epoch 1/1
2019-02-15 23:33:28.128459: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-02-15 23:33:28.130587: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1530] Found device 0 with properties: 
name: Vega 10 XTX [Radeon Vega Frontier Edition]
AMDGPU ISA: gfx900
memoryClockRate (GHz) 1.6
pciBusID 0000:03:00.0
Total memory: 15.98GiB
Free memory: 15.73GiB
2019-02-15 23:33:28.130602: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Adding visible gpu devices: 0
2019-02-15 23:33:28.130616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-15 23:33:28.130621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1057]      0 
2019-02-15 23:33:28.130626: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1070] 0:   N 
2019-02-15 23:33:28.130651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1189] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15306 MB memory) -> physical GPU (device: 0, name: Vega 10 XTX [Radeon Vega Frontier Edition], pci bus id: 0000:03:00.0)
 - 300s - loss: 5.3948 - acc: 0.8455 - auroc: 0.4750 - val_loss: 3.9581 - val_acc: 0.8432 - val_auroc: 0.4409
Embedding + LSTM On test dataset:
ROC AUC score:  0.46297931034482753
AP score:  0.13855102539359127

After the upgrade to ROCm 2.3+ (about 369 s per epoch, with the same model, data, and batch size):

# after 
Using TensorFlow backend.
WARNING:tensorflow:From /home/philix/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From benchmark_model_test_GPU.py:71: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use
    tf.py_function, which takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.
    
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input (InputLayer)           (None, 2384)              0         
_________________________________________________________________
embedding (Embedding)        (None, 2384, 32)          751104    
_________________________________________________________________
LSTM1 (LSTM)                 (None, 100)               53200     
_________________________________________________________________
output (Dense)               (None, 1)                 101       
=================================================================
Total params: 804,405
Trainable params: 804,405
Non-trainable params: 0
_________________________________________________________________
WARNING:tensorflow:From /home/philix/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Train on 11872 samples, validate on 1320 samples
Epoch 1/1
2019-05-13 09:52:39.757238: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-05-13 09:52:39.758483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1531] Found device 0 with properties: 
name: Vega 10 XTX [Radeon Vega Frontier Edition]
AMDGPU ISA: gfx900
memoryClockRate (GHz) 1.6
pciBusID 0000:03:00.0
Total memory: 15.98GiB
Free memory: 15.73GiB
2019-05-13 09:52:39.758506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1642] Adding visible gpu devices: 0
2019-05-13 09:52:39.758517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-13 09:52:39.758535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1059]      0 
2019-05-13 09:52:39.758541: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1072] 0:   N 
2019-05-13 09:52:39.758582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1189] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15306 MB memory) -> physical GPU (device: 0, name: Vega 10 XTX [Radeon Vega Frontier Edition], pci bus id: 0000:03:00.0)
 - 369s - loss: 5.3969 - acc: 0.8375 - auroc: 0.4667 - val_loss: 3.9643 - val_acc: 0.8432 - val_auroc: 0.4411
Embedding + LSTM On test dataset:
ROC AUC score:  0.4534620689655172
AP score:  0.1374587944882264
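
Unrelated to the slowdown itself, the new deprecation warning above suggests moving the auroc metric from tf.py_func to tf.py_function. A hedged sketch of that change (not verified against this exact Keras/TF-ROCm combination):

import tensorflow as tf
from sklearn.metrics import roc_auc_score

def auroc(y_true, y_pred):
    # tf.py_function hands the wrapped function eager tensors; sklearn converts them to numpy arrays
    return tf.py_function(roc_auc_score, (y_true, y_pred), tf.double)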


About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 16

Most upvoted comments

Hi @Dekken , please refer to the following section that describes the ROCm upstream kernel support status: https://github.com/RadeonOpenCompute/ROCm#rocm-support-in-upstream-linux-kernels Especially the following two “Cons” when using upstream kernel:

  • Not tested by AMD to the same level as rock-dkms package
  • Does not include most up-to-date firmware

If the goal is optimal performance with better ROCm QA test coverage, we recommend using the 4.15 or 4.18 kernels with the up-to-date rock-dkms packages. I'll reopen this issue in case you have further questions; please kindly close the new one, #462

Hi @Dekken, is it okay to close this issue? The TF-ROCm repo is not the right place to track progress or issues with upstream kernel support. The following rock-dkms driver repo might be a more proper place: https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver

Hi @Dekken, I'll leave it to @jlgreathouse to comment on your questions. The link I referred to in the last comment would be the best place to track the upstream kernel support status.

I'm pretty sure I had Ubuntu 18 on both boxes back then, so I'm going to install it again and test.

I can’t say anything about the compilation issue, but the ~20% slowdown seems in line with what I experienced on an upstream kernel. There was an improvement of more than 25% on one of the benchmarks in my case (https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/173#issuecomment-465940289 and https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/173#issuecomment-466230523). Iirc, the reply I received somewhere else (can’t find the comment) was that the driver in the upstream kernel is not reviewed/tested as thoroughly as dkms.
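
For anyone comparing the dkms driver with the upstream kernel driver on their own machine, one common heuristic is that the out-of-tree (rock-dkms) amdgpu module exposes a version attribute in sysfs while the in-tree driver usually does not. A hedged sketch (the sysfs path and the heuristic itself are assumptions, not something confirmed in this thread):

import os
import platform

print('kernel:', platform.release())
version_file = '/sys/module/amdgpu/version'  # assumed to exist only for the out-of-tree (dkms) module
if os.path.exists(version_file):
    with open(version_file) as f:
        print('amdgpu module version (likely dkms):', f.read().strip())
else:
    print('no amdgpu version attribute; likely the in-tree (upstream) driver')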