tensorflow: TPU train_on_batch stride size error
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian 9
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 1.13.1
- Python version: 3.5
- GPU model and memory: N/A (running on a TPU v3-8)
Describe the current behavior
Code and data that run fine on CPU throw an error on TPU. This happens only when I use train_on_batch instead of fit.
I have two versions of the same model: one trained with fit using two loops, and one trained with train_on_batch using three loops (epoch, a day's worth of data, batch within the day).
train_on_batch throws this error: slice index 0 of dimension 0 out of bounds. for 'strided_slice' (op: 'StridedSlice') with input shapes: [0], [1], [1], [1] and with computed input tensors: input[1] = <0>, input[2] = <1>, input[3] = <1>.
The label provided in y2 is 0111, and its size of 4 is correct. I don't understand why the computed input tensors come out as <0>, <1>, <1> (in the error, input[0], the tensor being sliced, has shape [0], while input[1]–input[3] are the slice's begin, end, and stride). It looks very much like a bug.
Model

```python
import tensorflow as tf
from tensorflow.keras import layers

# neurons, window_size, inputs_n, outputs_n and TPU_ADDRESS1 are defined elsewhere
model = tf.keras.Sequential()
model.add(layers.LSTM(neurons, input_shape=(window_size, inputs_n), return_sequences=True))
model.add(layers.LSTM(neurons))
model.add(layers.Dense(outputs_n, activation='sigmoid'))

opt = tf.train.AdamOptimizer(0.001)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['categorical_accuracy'])

tpu_model = tf.contrib.tpu.keras_to_tpu_model(
    model,
    strategy=tf.contrib.tpu.TPUDistributionStrategy(
        tf.contrib.cluster_resolver.TPUClusterResolver(tpu=[TPU_ADDRESS1])))
```
Shapes

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm_input (InputLayer)      (None, 1024, 7)           0
_________________________________________________________________
lstm (LSTM)                  (None, 1024, 128)         69632
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               131584
_________________________________________________________________
dense (Dense)                (None, 4)                 516
=================================================================
```
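(As a sanity check on the shapes, arithmetic added here rather than taken from the original report: the first LSTM has 4·((7+128)·128+128) = 69632 parameters, the second 4·((128+128)·128+128) = 131584, and the Dense layer 128·4+4 = 516, all matching the summary.)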
Training:

```python
import numpy as np

for epoch in epochs:
    for d in days:
        # get the feature and label arrays for the day
        features = np.asarray(d[1])[:, 2:9].astype(dtype='float32')
        labels = np.asarray(d[1])[:, 9:13].astype(dtype='int32')
        X, y = split_sequence(features, labels, window_size)
        # train one window (a batch of 1) at a time
        for slide in range(window_size):
            try:
                x1, y1 = X[slide], y[slide]
                x2, y2 = x1.reshape(1, 1024, 7), y1.reshape(1, 4)
                H = tpu_model.train_on_batch(x2, y2)
            except Exception as e:
                print('** train exception **', e)
                continue
```
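split_sequence is not shown in the report; below is a minimal sketch of the sliding-window helper it appears to be, an assumption consistent with the reshapes above rather than the reporter's actual code:

```python
import numpy as np

def split_sequence(features, labels, window_size):
    # Pair each window_size-long window of features with the label at its
    # last step. A guessed reconstruction; the original helper is not shown.
    X, y = [], []
    for end in range(window_size, len(features) + 1):
        X.append(features[end - window_size:end])
        y.append(labels[end - 1])
    return np.asarray(X), np.asarray(y)
```

One observation, not something the report confirms: keras_to_tpu_model shards each batch across the 8 TPU cores, so a batch of a single sample can leave some cores with an empty, shape-[0] shard, which would match the strided_slice error above. Batching several windows together so the batch size is a multiple of 8 may avoid it:

```python
# Hypothetical workaround: train on all of the day's windows at once,
# trimmed so the batch divides evenly across the 8 TPU cores.
n = (len(X) // 8) * 8
H = tpu_model.train_on_batch(X[:n], y[:n])
```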
Describe the expected behavior
train_on_batch trains without raising an exception.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 17 (8 by maintainers)
@maxima120 I’ve reassigned it to folks more familiar with the code, but with TensorFlow 1.14, can you try distribution strategy instead? That is a more complete implementation: https://www.tensorflow.org/guide/distribute_strategy#using_tfdistributestrategy_with_keras
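For reference, a minimal sketch of the suggested tf.distribute route under TF 1.14, following the linked guide; the model and hyperparameters are carried over from the report, and TPU_ADDRESS1, X, and y are the same placeholders:

```python
import tensorflow as tf
from tensorflow.keras import layers

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=TPU_ADDRESS1)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

# Build and compile the model inside the strategy scope so its variables
# are created on the TPU.
with strategy.scope():
    model = tf.keras.Sequential([
        layers.LSTM(neurons, input_shape=(window_size, inputs_n),
                    return_sequences=True),
        layers.LSTM(neurons),
        layers.Dense(outputs_n, activation='sigmoid'),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['categorical_accuracy'])

# The global batch is split across the 8 TPU cores, so keep batch_size a
# multiple of 8.
model.fit(X, y, batch_size=64, epochs=10)
```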
This gist reproduces the issue: https://gist.github.com/maxima120/d1057a0e4bbf2ae2a1434dad57999a3f