tensorflow-upstream: Slowdown on LSTM since ROCm 2.3
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian 9
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary): tensorflow-rocm (pip)
- TensorFlow version (use command below): b'v1.13.1-691-gf092438' 1.13.1
- Python version: 3.6.7
- GCC/Compiler version (if compiling from source): 6.3
- ROCm/MIOpen version: 2.2 vs latest
- GPU model and memory: Dual VEGA FE 16GB
Describe the current behavior: after upgrading ROCm, training the same model is noticeably slower; one epoch goes from ~300 s to ~369 s in the logs below, roughly a 20% regression.
Describe the expected behavior: training speed should not regress after the upgrade.
import keras
import numpy as np
import pandas as pd
import time
import tensorflow as tf
from keras.models import Model
from keras.layers import Dense, Activation, Input, LSTM, Embedding, Lambda
from keras.utils import plot_model
from sklearn.utils import class_weight
from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve, average_precision_score
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
X_train_sequence = pd.read_pickle('X_train_sequence_10percent.pkl')
y_train_sequence = pd.read_pickle('y_train_sequence_10percent.pkl')
X_test_sequence = pd.read_pickle('X_test_sequence_10percent.pkl')
y_test_sequence = pd.read_pickle('y_test_sequence_10percent.pkl')
sequence_len = X_train_sequence.shape[1] # 2384
vocabulary_size = 23472 # 23472
embedding_dim = 32
hidden_size = 100
nb_epoch = 1
batch_size = 128
class_weights = class_weight.compute_class_weight('balanced', np.unique(y_train_sequence), y_train_sequence)
def baseline_LSTMregul_model(sequence_len, vocabulary_size, embedding_dim, hidden_size):
    inputs = Input(shape=(sequence_len,), name='input')
    embedding = Embedding(vocabulary_size, embedding_dim, mask_zero=True,
                          input_length=sequence_len, name='embedding',
                          embeddings_regularizer=keras.regularizers.l2(0.01))(inputs)
    lstm1 = LSTM(hidden_size, name='LSTM1')(embedding)
    output = Dense(1, activation='sigmoid', name='output')(lstm1)
    return Model(inputs=[inputs], outputs=output)

def auroc(y_true, y_pred):
    return tf.py_func(roc_auc_score, (y_true, y_pred), tf.double)
model = baseline_LSTMregul_model(sequence_len, vocabulary_size,
                                 embedding_dim, hidden_size)
adam = keras.optimizers.Adam(lr=0.0001, beta_1=0.9, beta_2=0.999)
model.compile(loss='binary_crossentropy',
              optimizer=adam,
              metrics=['accuracy', auroc])
model.summary()
model_history = model.fit(X_train_sequence, y_train_sequence, validation_split=0.1,
                          shuffle=True, class_weight=class_weights, epochs=nb_epoch,
                          batch_size=batch_size, verbose=2)
y_test_predict = model.predict(X_test_sequence, verbose=2)
print('Embedding + LSTM On test dataset:')
print('ROC AUC score: ', roc_auc_score(y_test_sequence, y_test_predict))
print('AP score: ', average_precision_score(y_test_sequence, y_test_predict))
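The epoch times quoted below come from Keras' own verbose output. As a minimal sketch, the per-epoch wall-clock time could also be measured explicitly with a callback; the `EpochTimer` name is illustrative and not part of the original script:

```python
import time
from keras.callbacks import Callback

class EpochTimer(Callback):
    """Record wall-clock seconds per epoch so runs on different ROCm versions can be compared."""
    def on_train_begin(self, logs=None):
        self.epoch_times = []
    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.time()
    def on_epoch_end(self, epoch, logs=None):
        self.epoch_times.append(time.time() - self._start)

timer = EpochTimer()
# Same fit call as above, with the timing callback attached.
model.fit(X_train_sequence, y_train_sequence, validation_split=0.1,
          shuffle=True, class_weight=class_weights, epochs=nb_epoch,
          batch_size=batch_size, verbose=2, callbacks=[timer])
print('seconds per epoch:', timer.epoch_times)
```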
Output before the upgrade (ROCm 2.2):
Using TensorFlow backend.
WARNING:tensorflow:From /home/philix/.local/lib/python3.6/site-packages/tensorflow/python/ops/distributions/distribution.py:265: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /home/philix/.local/lib/python3.6/site-packages/tensorflow/python/ops/distributions/bernoulli.py:169: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input (InputLayer)           (None, 2384)              0
_________________________________________________________________
embedding (Embedding)        (None, 2384, 32)           751104
_________________________________________________________________
LSTM1 (LSTM)                 (None, 100)                53200
_________________________________________________________________
output (Dense)               (None, 1)                  101
=================================================================
Total params: 804,405
Trainable params: 804,405
Non-trainable params: 0
_________________________________________________________________
Train on 11872 samples, validate on 1320 samples
Epoch 1/1
2019-02-15 23:33:28.128459: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-02-15 23:33:28.130587: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1530] Found device 0 with properties:
name: Vega 10 XTX [Radeon Vega Frontier Edition]
AMDGPU ISA: gfx900
memoryClockRate (GHz) 1.6
pciBusID 0000:03:00.0
Total memory: 15.98GiB
Free memory: 15.73GiB
2019-02-15 23:33:28.130602: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Adding visible gpu devices: 0
2019-02-15 23:33:28.130616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-15 23:33:28.130621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1057] 0
2019-02-15 23:33:28.130626: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1070] 0: N
2019-02-15 23:33:28.130651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1189] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15306 MB memory) -> physical GPU (device: 0, name: Vega 10 XTX [Radeon Vega Frontier Edition], pci bus id: 0000:03:00.0)
- 300s - loss: 5.3948 - acc: 0.8455 - auroc: 0.4750 - val_loss: 3.9581 - val_acc: 0.8432 - val_auroc: 0.4409
Embedding + LSTM On test dataset:
ROC AUC score: 0.46297931034482753
AP score: 0.13855102539359127
Output after the upgrade (ROCm 2.3+):
Using TensorFlow backend.
WARNING:tensorflow:From /home/philix/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From benchmark_model_test_GPU.py:71: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use
tf.py_function, which takes a python function which manipulates tf eager
tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
an ndarray (just call tensor.numpy()) but having access to eager tensors
means `tf.py_function`s can use accelerators such as GPUs as well as
being differentiable using a gradient tape.
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input (InputLayer)           (None, 2384)              0
_________________________________________________________________
embedding (Embedding)        (None, 2384, 32)           751104
_________________________________________________________________
LSTM1 (LSTM)                 (None, 100)                53200
_________________________________________________________________
output (Dense)               (None, 1)                  101
=================================================================
Total params: 804,405
Trainable params: 804,405
Non-trainable params: 0
_________________________________________________________________
WARNING:tensorflow:From /home/philix/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Train on 11872 samples, validate on 1320 samples
Epoch 1/1
2019-05-13 09:52:39.757238: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-05-13 09:52:39.758483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1531] Found device 0 with properties:
name: Vega 10 XTX [Radeon Vega Frontier Edition]
AMDGPU ISA: gfx900
memoryClockRate (GHz) 1.6
pciBusID 0000:03:00.0
Total memory: 15.98GiB
Free memory: 15.73GiB
2019-05-13 09:52:39.758506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1642] Adding visible gpu devices: 0
2019-05-13 09:52:39.758517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-13 09:52:39.758535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1059] 0
2019-05-13 09:52:39.758541: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1072] 0: N
2019-05-13 09:52:39.758582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1189] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15306 MB memory) -> physical GPU (device: 0, name: Vega 10 XTX [Radeon Vega Frontier Edition], pci bus id: 0000:03:00.0)
- 369s - loss: 5.3969 - acc: 0.8375 - auroc: 0.4667 - val_loss: 3.9643 - val_acc: 0.8432 - val_auroc: 0.4411
Embedding + LSTM On test dataset:
ROC AUC score: 0.4534620689655172
AP score: 0.1374587944882264
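The tf.py_func deprecation warning in the newer run suggests tf.py_function instead. A hedged sketch of how the auroc metric could be migrated, assuming a TF version where tf.py_function is available (this is not something the original report did):

```python
import tensorflow as tf
from sklearn.metrics import roc_auc_score

def auroc(y_true, y_pred):
    # tf.py_function hands the wrapped function eager tensors rather than
    # numpy arrays; .numpy() converts them back for scikit-learn.
    return tf.py_function(lambda t, p: roc_auc_score(t.numpy(), p.numpy()),
                          (y_true, y_pred), tf.double)
```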
About this issue
- State: closed
- Created 5 years ago
- Comments: 16
Hi @Dekken, please refer to the section that describes the ROCm upstream kernel support status: https://github.com/RadeonOpenCompute/ROCm#rocm-support-in-upstream-linux-kernels. In particular, note the two "Cons" listed there for using the upstream kernel.
If the goal is optimal performance with better ROCm QA test coverage, we recommend using the 4.15 or 4.18 kernels with up-to-date rock-dkms packages. I'll reopen this issue in case you have further questions; please kindly close the new one, #462.
Hi @Dekken, is it okay to close this issue? The TF-ROCm repo is not meant for tracking the progress of, or issues with, upstream kernel support. The rock-dkms driver repo might be a more appropriate place: https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver
Hi @Dekken, I'll leave it to @jlgreathouse to comment on your questions. The link I referred to in my last comment would be the best place to track the upstream kernel support status.
I'm pretty sure I had Ubuntu 18 on both boxes back then, so I'm going to install it again and test.
I can't say anything about the compilation issue, but the ~20% slowdown seems in line with what I experienced on an upstream kernel. In my case there was an improvement of more than 25% on one of the benchmarks (https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/173#issuecomment-465940289 and https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/173#issuecomment-466230523). IIRC, the reply I received elsewhere (I can't find the comment) was that the driver in the upstream kernel is not reviewed/tested as thoroughly as the dkms one.
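Since the discussion revolves around which kernel/driver combination was in use, a small helper like the following could make future timing comparisons easier to interpret. It is only a sketch, and the /opt/rocm/.info/version path is an assumption that may not hold on every install:

```python
import platform
import tensorflow as tf

def environment_report():
    """Print the version information relevant to ROCm performance regressions."""
    print('TensorFlow:', tf.__version__)
    print('Kernel:', platform.release())
    try:
        # Path is an assumption; many ROCm packages install a version file here.
        with open('/opt/rocm/.info/version') as f:
            print('ROCm:', f.read().strip())
    except OSError:
        print('ROCm: version file not found; check the package manager instead')

environment_report()
```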