plaidml: Stripe problems with AMD RX580

I just tested out the Stripe backend with an AMD card (Radeon RX580). MNIST ran without problems (albeit a bit slower than without Stripe), but I ran into a crash with the ResNet50 model.

Traceback (most recent call last):
  File "resnet50.py", line 14, in <module>
    preprocess_input(img), np.zeros((SAMPLES, 1000), dtype='float32'), batch_size=BS, epochs=5)
  File "/home/nope/venvs/fs/lib/python3.7/site-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/home/nope/venvs/fs/lib/python3.7/site-packages/keras/engine/training_arrays.py", line 199, in fit_loop
    outs = f(ins_batch)
  File "/home/nope/venvs/fs/lib/python3.7/site-packages/plaidml/keras/backend.py", line 176, in __call__
    self._invoker.invoke()
  File "/home/nope/venvs/fs/lib/python3.7/site-packages/plaidml/__init__.py", line 1440, in invoke
    return Invocation(self._ctx, self)
  File "/home/nope/venvs/fs/lib/python3.7/site-packages/plaidml/__init__.py", line 1449, in __init__
    self._as_parameter_ = _lib().plaidml_schedule_invocation(ctx, invoker)
  File "/home/nope/venvs/fs/lib/python3.7/site-packages/plaidml/__init__.py", line 764, in _check_err
    self.raise_last_status()
  File "/home/nope/venvs/fs/lib/python3.7/site-packages/plaidml/library.py", line 131, in raise_last_status
    raise self.last_status()
plaidml.exceptions.Unknown: AliasMap::AliasMap: Mismatched access dimensions on refinement: d1:X_T18 X_T18

code:

from keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
import numpy as np

if __name__ == '__main__':
    BS = 8
    SAMPLES = BS * 50
    model = ResNet50(weights='imagenet')
    img = np.random.rand(SAMPLES, 224, 224, 3)

    # Inference works fine:
    preds = model.predict(preprocess_input(img)[:BS])
    print('Predicted:', decode_predictions(preds, top=3)[0])

    # Training crashes with the stacktrace above:
    model.compile("SGD", loss='categorical_crossentropy')
    preds = model.fit(
        preprocess_input(img), np.zeros((SAMPLES, 1000), dtype='float32'),
        batch_size=BS, epochs=5)

Prediction works fine, but training crashes with the above stacktrace. I am using plaidml 0.6.0.
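For completeness: I select the PlaidML backend via the standard KERAS_BACKEND mechanism. A minimal sketch (the assignment has to happen before the first keras import):

import os
# Point Keras at PlaidML before importing keras itself;
# keras reads KERAS_BACKEND at import time.
os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"
import keras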

About this issue

  • State: open
  • Created 5 years ago
  • Comments: 16 (11 by maintainers)

Most upvoted comments

I minimized the above code a bit more:

#!/usr/bin/env python3
from keras.layers import Input, Conv2D, Lambda
from keras.models import Model as KerasModel
import keras.backend as K
import numpy as np
from time import time as now

if K.backend() == "plaidml.keras.backend":
    import plaidml
    import plaidml.op
    def pad(data, paddings, mode="CONSTANT", name=None, constant_value=0):
        """tf.pad()-compatible wrapper; only REFLECT mode is implemented here."""
        if mode.upper() != "REFLECT" or constant_value != 0:
            raise NotImplementedError("Unsupported arguments.")
        return plaidml.op.reflection_padding(data, paddings)
else:
    from tensorflow import pad

USE_REFLECTIVE_PADDING = True
INPUT_SHAPE = (64, 64, 3)
BS = 8

_i = 0
def ReflectionPadding(x, pad_t, pad_b, pad_l, pad_r):
    global _i
    n_shape = list(K.int_shape(x)[1:])
    n_shape[0] += pad_t + pad_b
    n_shape[1] += pad_l + pad_r
    layer = Lambda(
        lambda t: pad(t, [[0, 0], [pad_t, pad_b], [pad_l, pad_r], [0, 0]], "REFLECT"),
        name="ReflectionPadding_%i" % _i,
        output_shape=n_shape)
    _i += 1
    return layer(x)


if __name__ == '__main__':
    padding = "same"
    if USE_REFLECTIVE_PADDING:
        padding = "valid"

    x = inp = Input(INPUT_SHAPE)
    if USE_REFLECTIVE_PADDING:
        x = ReflectionPadding(x, 2, 2, 1, 2)
    x = Conv2D(128, kernel_size=5, strides=2, padding=padding)(x)
    if USE_REFLECTIVE_PADDING:
        x = ReflectionPadding(x, 2, 2, 1, 2)
    x = Conv2D(256, kernel_size=5, strides=2, padding=padding)(x)
    model = KerasModel(inp, x)
    model.summary()

    train_x = np.ones((BS,) + INPUT_SHAPE)
    train_y = np.ones((BS,) + tuple(K.int_shape(x)[1:]))
    model.compile("Adam", "mse")

    # Time 20 training steps; the numbers are seconds per batch and samples/sec.
    for i in range(20):
        stime = now()
        model.train_on_batch(train_x, train_y)
        stime = now() - stime
        print("Train batch %i in %.4f (%.2f)" % (i, stime, BS / stime))

With PLAIDML_USE_STRIPE=1 this leads to the above stacktrace…
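Equivalently (a sketch, assuming the flag only needs to be in the process environment before plaidml/keras is imported), Stripe can be enabled from inside the script:

import os
os.environ["PLAIDML_USE_STRIPE"] = "1"  # assumption: same effect as the shell prefix
import keras  # must happen after the flag is set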

With USE_REFLECTIVE_PADDING set to False in the snippet, Stripe hangs forever with one CPU core maxed out. Whether it hangs depends on the batch size: it works fine with BS < 8, hangs at BS == 8, works again for 8 < BS < 13, and hangs again for BS >= 13.
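To probe those boundaries without one hang blocking everything, here is a hypothetical sweep (not part of the original repro): it assumes the snippet above is saved as repro.py and modified to read BS from the environment, and runs each batch size in a subprocess with a timeout so a hang shows up as a timeout:

#!/usr/bin/env python3
import os
import subprocess

# Hypothetical sweep around the reported boundaries (4 ok, 8 hang, 10 ok, 13 hang).
for bs in (4, 8, 10, 13):
    env = dict(os.environ, BS=str(bs), PLAIDML_USE_STRIPE="1")
    try:
        subprocess.run(["python3", "repro.py"], env=env, timeout=120, check=True)
        print("BS=%d: ok" % bs)
    except subprocess.TimeoutExpired:
        print("BS=%d: hang (killed after 120s)" % bs)
    except subprocess.CalledProcessError as e:
        print("BS=%d: exit code %d" % (bs, e.returncode))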

Tested with the current master branch.