tensorflow: TPU train_on_batch strided_slice error

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian 9
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.13.1
  • Python version: 3.5
  • GPU model and memory: TPU v3-8

Describe the current behavior

Code and data that run fine on CPU throw an error on TPU. This only happens if I use train_on_batch instead of fit.

I have two versions of the same model: one uses fit with 2 loops, and the other uses train_on_batch with 3 loops (epoch, a day's worth of data, batch within the day).

train_on_batch throws the error: slice index 0 of dimension 0 out of bounds. for 'strided_slice' (op: 'StridedSlice') with input shapes: [0], [1], [1], [1] and with computed input tensors: input[1] = <0>, input[2] = <1>, input[3] = <1>.

0111 is the label provided in y2, and its size of 4 is correct. Why the computed input tensor is size 3, I don't understand. It looks very much like a bug.
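For reference, the identical error text can be reproduced outside the model by slicing element 0 out of a tensor whose leading dimension is 0. This is a minimal sketch (not the model code), suggesting that some shard of the batch ends up empty:

import tensorflow as tf

# Static shape inference rejects taking element 0 of an empty tensor with
# the same "slice index 0 of dimension 0 out of bounds" message seen above.
empty = tf.zeros([0])
first = empty[0]  # ValueError from 'strided_slice' with input shapes: [0], [1], [1], [1]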

Model

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential()
model.add(layers.LSTM(neurons, input_shape=(window_size, inputs_n), return_sequences=True))
model.add(layers.LSTM(neurons))
model.add(layers.Dense(outputs_n, activation='sigmoid'))

opt = tf.train.AdamOptimizer(0.001)

model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['categorical_accuracy'])

# convert the Keras model to a TPU model
tpu_model = tf.contrib.tpu.keras_to_tpu_model(
    model,
    strategy=tf.contrib.tpu.TPUDistributionStrategy(
        tf.contrib.cluster_resolver.TPUClusterResolver(tpu=[TPU_ADDRESS1])))

Shapes

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_input (InputLayer)      (None, 1024, 7)           0         
_________________________________________________________________
lstm (LSTM)                  (None, 1024, 128)         69632     
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense (Dense)                (None, 4)                 516       
=================================================================
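The reported parameter counts match the standard Keras LSTM formula, so the model definition itself looks consistent. A quick sanity check using the values from the model above:

# Keras LSTM parameter count: 4 * ((input_dim + units) * units + units)
inputs_n, neurons, outputs_n = 7, 128, 4
lstm_params   = 4 * ((inputs_n + neurons) * neurons + neurons)  # 69632
lstm_1_params = 4 * ((neurons + neurons) * neurons + neurons)   # 131584
dense_params  = neurons * outputs_n + outputs_n                 # 516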

Training

import numpy as np

for epoch in range(epochs):
    for d in days:
        # get the feature and label arrays for the day
        features = np.asarray(d[1])[:, 2:9].astype(dtype='float32')
        labels = np.asarray(d[1])[:, 9:13].astype(dtype='int32')

        X, y = split_sequence(features, labels, window_size)

        # train one window at a time (batch size 1)
        for slide in range(window_size):
            try:
                x1, y1 = X[slide], y[slide]
                x2, y2 = x1.reshape(1, 1024, 7), y1.reshape(1, 4)
                H = tpu_model.train_on_batch(x2, y2)
            except Exception as e:
                print('** train exception **', e)
                continue
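One possible explanation (my assumption, not confirmed in this thread): keras_to_tpu_model shards each batch across the 8 TPU cores, so a batch of size 1 would leave 7 cores holding an empty shape-[0] slice, which is consistent with the strided_slice error above. A sketch of a workaround under that assumption, feeding batches sized as a multiple of the core count:

# Workaround sketch, assuming the global batch is split across 8 cores:
# feed train_on_batch a batch whose size is a multiple of 8 so that no
# core receives an empty slice.
cores = 8
for start in range(0, len(X) - cores + 1, cores):
    xb = X[start:start + cores].reshape(cores, 1024, 7)
    yb = y[start:start + cores].reshape(cores, 4)
    H = tpu_model.train_on_batch(xb, yb)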

Describe the expected behavior

train_on_batch trains without throwing an exception.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 17 (8 by maintainers)

Most upvoted comments

@maxima120 I’ve reassigned it to folks more familiar with the code, but with TensorFlow 1.14, can you try distribution strategy instead? That is a more complete implementation: https://www.tensorflow.org/guide/distribute_strategy#using_tfdistributestrategy_with_keras
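For reference, a minimal sketch of that suggestion under TensorFlow 1.14, assuming the tf.distribute.experimental.TPUStrategy API from the linked guide (variable names reuse the ones from the issue):

import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=TPU_ADDRESS1)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

# Build and compile the model inside the strategy scope, then call fit on
# the model directly instead of going through keras_to_tpu_model.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(neurons, input_shape=(window_size, inputs_n),
                             return_sequences=True),
        tf.keras.layers.LSTM(neurons),
        tf.keras.layers.Dense(outputs_n, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['categorical_accuracy'])

model.fit(X, y, batch_size=64, epochs=10)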