tensorflow-upstream: Simple convolution network fails to converge when trained on GPU

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution: Linux Ubuntu 18.04
  • Mobile device: gfx803
  • Docker image: rocm/tensorflow:rocm2.3-tf1.13-python3-dev-v2

Describe the current behavior: The network doesn’t converge at all (accuracy stays at ~70%) when trained on the GPU; it converges as expected when trained on the CPU.

Describe the expected behavior: The network should converge, with test accuracy reaching ~99.9%.

Code to reproduce the issue: The following is the bare minimum necessary to reproduce the problem.

   docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/tensorflow:rocm2.3-tf1.13-python3-dev-v2
cat <<EOF >test.py
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.utils import to_categorical
batch_size = 128
num_classes = 10
epochs = 10
# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))
EOF
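
Before training, it can help to confirm which devices TensorFlow actually sees inside the container. A minimal check using the stock device_lib API (this snippet is not part of the original report) is:

    from tensorflow.python.client import device_lib

    # One entry per visible device; the ROCm GPU should appear as /device:GPU:0
    # when HIP_VISIBLE_DEVICES is unset, and be absent when it is set to empty.
    for dev in device_lib.list_local_devices():
        print(dev.name, dev.device_type)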

Train it on the CPU; it quickly converges:

root@7c9f165921cd:/root# env HIP_VISIBLE_DEVICES= python3 test.py
Epoch 1/10
60000/60000 [==============================] - 5s 82us/sample - loss: 0.2448 - acc: 0.9249 - val_loss: 0.1098 - val_acc: 0.9647
Epoch 2/10
60000/60000 [==============================] - 5s 79us/sample - loss: 0.1016 - acc: 0.9686 - val_loss: 0.0766 - val_acc: 0.9760
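
An empty HIP_VISIBLE_DEVICES hides the ROCm device, which is why the run above falls back to the CPU. An equivalent in-script variant (an assumption, not something used in the report) sets the variable before TensorFlow is imported:

    import os

    # Hide all ROCm devices so TensorFlow falls back to the CPU;
    # this must run before the first `import tensorflow`.
    os.environ['HIP_VISIBLE_DEVICES'] = ''

    import tensorflow as tf  # only /device:CPU:0 should now be available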

Train the same network on the GPU; it never converges and accuracy stays at ~70%:

root@7c9f165921cd:/root# python3 test.py
...
Epoch 1/10
60000/60000 [==============================] - 7s 116us/sample - loss: 1.2364 - acc: 0.5956 - val_loss: 0.7367 - val_acc: 0.7895
Epoch 2/10
60000/60000 [==============================] - 3s 49us/sample - loss: 1.2925 - acc: 0.6091 - val_loss: 0.7422 - val_acc: 0.7685
...

Other info / logs: Full results of CPU and GPU training

  • CPU
2019-05-03 17:05:38.317485: E tensorflow/stream_executor/rocm/rocm_driver.cc:965] could not retrieve ROCM device count: HIP_ERROR_NoDevice
Epoch 1/10
60000/60000 [==============================] - 4s 72us/sample - loss: 0.2490 - acc: 0.9238 - val_loss: 0.1167 - val_acc: 0.9649
Epoch 2/10
60000/60000 [==============================] - 4s 69us/sample - loss: 0.1027 - acc: 0.9690 - val_loss: 0.0872 - val_acc: 0.9744
Epoch 3/10
60000/60000 [==============================] - 4s 69us/sample - loss: 0.0768 - acc: 0.9769 - val_loss: 0.0889 - val_acc: 0.9733
Epoch 4/10
60000/60000 [==============================] - 4s 69us/sample - loss: 0.0602 - acc: 0.9812 - val_loss: 0.0688 - val_acc: 0.9807
Epoch 5/10
60000/60000 [==============================] - 4s 69us/sample - loss: 0.0501 - acc: 0.9844 - val_loss: 0.0753 - val_acc: 0.9802
Epoch 6/10
60000/60000 [==============================] - 4s 69us/sample - loss: 0.0432 - acc: 0.9869 - val_loss: 0.0815 - val_acc: 0.9805
Epoch 7/10
60000/60000 [==============================] - 4s 69us/sample - loss: 0.0418 - acc: 0.9876 - val_loss: 0.0864 - val_acc: 0.9793
Epoch 8/10
60000/60000 [==============================] - 4s 69us/sample - loss: 0.0345 - acc: 0.9901 - val_loss: 0.0796 - val_acc: 0.9808
Epoch 9/10
60000/60000 [==============================] - 4s 70us/sample - loss: 0.0296 - acc: 0.9912 - val_loss: 0.0854 - val_acc: 0.9809
Epoch 10/10
60000/60000 [==============================] - 4s 69us/sample - loss: 0.0300 - acc: 0.9918 - val_loss: 0.0894 - val_acc: 0.9830
  • GPU
2019-05-03 17:12:31.273289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1189] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3540 MB memory) -> physical GPU (device: 0, name: Device 67df, pci bus id: 0000:09:00.0)
Epoch 1/10
60000/60000 [==============================] - 6s 102us/sample - loss: 1.2426 - acc: 0.5898 - val_loss: 0.7654 - val_acc: 0.7677
Epoch 2/10
60000/60000 [==============================] - 3s 45us/sample - loss: 1.3728 - acc: 0.5934 - val_loss: 0.8342 - val_acc: 0.7302
Epoch 3/10
60000/60000 [==============================] - 3s 44us/sample - loss: 1.7013 - acc: 0.5693 - val_loss: 1.2858 - val_acc: 0.6787
Epoch 4/10
60000/60000 [==============================] - 3s 45us/sample - loss: 2.2682 - acc: 0.5310 - val_loss: 1.0485 - val_acc: 0.6881
Epoch 5/10
60000/60000 [==============================] - 3s 45us/sample - loss: 2.1386 - acc: 0.5458 - val_loss: 1.0938 - val_acc: 0.6810
Epoch 6/10
60000/60000 [==============================] - 3s 45us/sample - loss: 1.9120 - acc: 0.5681 - val_loss: 1.1036 - val_acc: 0.6752
Epoch 7/10
60000/60000 [==============================] - 3s 45us/sample - loss: 1.7704 - acc: 0.5798 - val_loss: 1.2772 - val_acc: 0.6148
Epoch 8/10
60000/60000 [==============================] - 3s 44us/sample - loss: 1.5617 - acc: 0.5994 - val_loss: 0.9766 - val_acc: 0.7225
Epoch 9/10
60000/60000 [==============================] - 3s 44us/sample - loss: 1.5330 - acc: 0.6077 - val_loss: 0.9767 - val_acc: 0.7398
Epoch 10/10
60000/60000 [==============================] - 3s 44us/sample - loss: 1.4275 - acc: 0.6277 - val_loss: 0.9561 - val_acc: 0.6873

    /opt/rocm/bin/rocminfo
   =====================
   HSA System Attributes
   =====================
   Runtime Version:         1.1
   System Timestamp Freq.:  1000.000000MHz
   Sig. Max Wait Duration:  18446744073709551615 (number of timestamp)   
   Machine Model:           LARGE
   System Endianness:       LITTLE

   ==========
   HSA Agents
   ==========
   *******
   Agent 1
   *******
     Name:                    AMD Ryzen 7 1700 Eight-Core Processor
     Vendor Name:             CPU
   *******
   Agent 2
   *******
     Name:                    gfx803
     Vendor Name:             AMD
     Feature:                 KERNEL_DISPATCH
     Profile:                 BASE_PROFILE
     Float Round Mode:        NEAR
     Max Queue Number:        128
     Queue Min Size:          4096
     Queue Max Size:          131072
     Queue Type:              MULTI
     Node:                    1
     Device Type:             GPU
     Cache Info:
       L1:                      16KB
     Chip ID:                 26591
     Cacheline Size:          64
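
Since the same model converges on the CPU, one way to narrow the problem down is to compare a single float32 GEMM on both devices; a large discrepancy would point at the ROCm compute path on gfx803 rather than at the Keras model. A minimal sketch using the TF 1.13 graph API (a debugging suggestion, not taken from this thread) could look like:

    import numpy as np
    import tensorflow as tf

    np.random.seed(0)
    a = np.random.randn(128, 784).astype('float32')
    b = np.random.randn(784, 512).astype('float32')

    # Run the same matmul on the CPU and on the GPU.
    with tf.device('/cpu:0'):
        cpu_out = tf.matmul(tf.constant(a), tf.constant(b))
    with tf.device('/gpu:0'):
        gpu_out = tf.matmul(tf.constant(a), tf.constant(b))

    with tf.Session() as sess:
        cpu_val, gpu_val = sess.run([cpu_out, gpu_out])

    # The two results should agree to within normal float32 rounding error.
    print('max abs diff:', np.abs(cpu_val - gpu_val).max())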

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 19

Most upvoted comments

@TheCBaH could you check this issue with the ROCm 2.4-based TF 1.13 release?

ROCm 2.5 with TF 1.13 fails the test on gfx803 and passes on gfx900.