keras: multi_gpu_model fails with timeseries data
It appears that `multi_gpu_model` does not adequately support time series datasets. The cause is most likely the `get_slice` function.
As seen in issue #11941, when multiple GPUs are used the predicted results follow the same pattern as those produced with a single GPU, but with a much smaller range. That issue demonstrates the effect on a sinusoidal dataset (chosen for clarity), where the predicted waveform is an attenuated version of the original whenever multiple GPUs are engaged, and it includes working code that illustrates the problem.
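For context, a minimal sketch of the kind of comparison described (my own reconstruction, not the actual code from issue #11941) might look like this, assuming Keras 2.2.4 on a 4-GPU machine:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.utils import multi_gpu_model

# Windowed sine-wave dataset: predict the next sample from the previous 20.
t = np.arange(0, 100, 0.1)
wave = np.sin(t)
window = 20
x = np.array([wave[i:i + window] for i in range(len(wave) - window)])[..., None]
y = wave[window:]

def build_model():
    return Sequential([LSTM(32, input_shape=(window, 1)), Dense(1)])

single = build_model()
single.compile(optimizer="adam", loss="mse")
single.fit(x, y, epochs=10, batch_size=64, verbose=0)

parallel = multi_gpu_model(build_model(), gpus=4)  # assumes 4 GPUs are available
parallel.compile(optimizer="adam", loss="mse")
parallel.fit(x, y, epochs=10, batch_size=64, verbose=0)

# Reported symptom: parallel.predict(x) follows the sine shape but with a
# noticeably smaller amplitude (and worse RMSE) than single.predict(x).
```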
I suspect the cause for this is the `get_slice` function in https://github.com/keras-team/keras/blob/master/keras/utils/multi_gpu_utils.py
The `get_slice` function splits each batch of data into `n` segments and allocates one segment to each of the `n` GPUs for processing. From my understanding of LSTM / GRU models, it is important to maintain an ongoing feedback loop for predictions to be meaningful. Because the data is split into `n` segments, each GPU works on only a 1/n-th portion of it, and each GPU starts work on its segment without knowing the end state of the results of the previous segment.

In other words, the weight calculations are interrupted `n` times over the dataset, starting over from scratch each time. The merging of the weights over the entire dataset is therefore affected, as each GPU has only worked on a subset of the data, reducing the effectiveness of the weight calculation. This probably explains the large difference in the RMSE values of single versus multi-GPU predictions and the attenuation in the charts.
```python
def get_slice(data, i, parts):
    shape = K.shape(data)
    batch_size = shape[:1]
    input_shape = shape[1:]
    step = batch_size // parts
    if i == parts - 1:
        size = batch_size - step * i
    else:
        size = step
    size = K.concatenate([size, input_shape], axis=0)
    stride = K.concatenate([step, input_shape * 0], axis=0)
    start = stride * i
    return K.slice(data, start, size)
```
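To make the splitting behaviour concrete, here is a small NumPy re-implementation of the same slicing arithmetic (my own sketch, not part of Keras). Each GPU receives a contiguous 1/n-th of the batch along axis 0, so for consecutive timeseries windows no GPU ever sees the samples that precede its slice:

```python
import numpy as np

def get_slice_np(data, i, parts):
    """NumPy re-implementation (for illustration only) of the slicing
    arithmetic above, applied to the batch dimension."""
    batch_size = data.shape[0]
    step = batch_size // parts
    start = step * i
    size = batch_size - start if i == parts - 1 else step
    return data[start:start + size]

# 8 consecutive timeseries samples, shape (batch, timesteps, features)
batch = np.arange(8).reshape(8, 1, 1)
for gpu in range(4):
    print(gpu, get_slice_np(batch, gpu, 4).ravel())
# 0 [0 1]
# 1 [2 3]
# 2 [4 5]
# 3 [6 7]
```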
If my suspicion is true, then at a minimum the current `multi_gpu_model` documentation should advise against using it with timeseries data.
I’d also suggest adding a flag to change the behavior for timeseries data. One approach could be to make each GPU work in parallel on the entire dataset for each epoch; the parallel run with the best RMSE (or a similar criterion) would then supply the weights used for the subsequent epoch. It doesn’t speed up the calculations though, and I’d welcome an approach that does. A rough sketch of the idea follows.
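This is only a hedged sketch of that suggestion; `build_model`, the replica count, and the RMSE-based selection are my own assumptions, not an existing Keras API:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

def build_model():
    # Placeholder single-GPU model builder (an assumption for this sketch).
    model = Sequential([LSTM(32, input_shape=(20, 1)), Dense(1)])
    model.compile(optimizer="adam", loss="mse")
    return model

def parallel_epoch_fit(x, y, x_val, y_val, n_replicas=4, epochs=10):
    """Each replica trains on the full dataset for one epoch; the replica
    with the lowest validation RMSE seeds all replicas for the next epoch."""
    replicas = [build_model() for _ in range(n_replicas)]
    best = replicas[0]
    for _ in range(epochs):
        rmses = []
        for m in replicas:
            m.fit(x, y, epochs=1, batch_size=64, verbose=0)
            rmses.append(np.sqrt(np.mean((m.predict(x_val)[:, 0] - y_val) ** 2)))
        best = replicas[int(np.argmin(rmses))]
        for m in replicas:
            m.set_weights(best.get_weights())  # next epoch starts from the best run
    return best
```

In practice each replica would need to be placed on its own GPU (e.g. via `tf.device`), which this sketch omits for brevity; as noted above, it also trades away the speed-up that `multi_gpu_model` is meant to provide.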
My concern is that this failure generates no errors, with `val_loss` values improving with each epoch. This lulls users into thinking that there are no issues when in fact there are.
Ubuntu 18.04.1 LTS
Keras 2.2.4
Keras-Applications 1.0.6
Keras-Preprocessing 1.0.5
tensorboard 1.12.0
tensorflow-gpu 1.12.0
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 2
- Comments: 19
Thanks @gibrano, I finally got back to looking at this. I made some modifications to your code example. The charts look good for up to 3 of 4 GPUs. (I have a 4 GPU system, so this could be an n-1 issue). Read more below.
However, it fails when running on 4 GPUs (the max on my system). I used the batch-size suggestion from @dolaamon2, but it fails for a variety of sizes.
The failure appears to be on the very first epoch run.
I’m running:
Comparing the logs from 3 vs 4 GPUs, the crash occurs right after all the gradients are calculated on all GPUs.
I know CUDA 10.2 isn’t recommended, so I’ll downgrade, see whether the issue persists, and update this thread. If it does persist, I’ll open another issue, as TensorFlow 2.2 seems to be working otherwise.
@palisadoes (fellow machine learning hobbyist), unfortunately I haven’t acquired a multi-GPU setup yet (as I mentioned above, I was kind of waiting for this issue to be resolved first, but at the moment my single watercooled 1080 Ti works just fine), so I cannot test code for a >1 GPU setup.
In the example from @sparkingarthur the serial model is not trained; I don’t see any `.fit` or `.compile` for the serial model. However, I assume @sparkingarthur means to study the differences between making predictions on the serial model and on the parallel model (and for that comparison it’s OK), and I think these lines do exactly that.

I don’t think it is possible to obtain a “correct” prediction using a multi-GPU setup, for the same reason that it is not possible to obtain a “correct” prediction when using a batch size > 1 for a stateful LSTM (time series) model (this regardless of the number of GPUs, of course). However, this doesn’t mean that there isn’t a more sophisticated way of splitting the data for `multi_gpu_model`. It should be possible to improve the `get_slice()` function, or what do you think? A small illustration of the batch-size point is below.

No, I didn’t use NVLink.
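As a small sketch of my own (not code from this thread) of the batch-size point above: a stateful LSTM keeps one hidden state per batch row, so the rows of each batch evolve as independent sequences, analogous to the independent slices that `get_slice` hands to each GPU.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras import backend as K

model = Sequential([
    LSTM(8, stateful=True, batch_input_shape=(4, 1, 1)),  # 4 state slots
    Dense(1),
])

model.predict(np.zeros((4, 1, 1)))            # advance all 4 states by one step
h = K.get_value(model.layers[0].states[0])
print(h.shape)  # (4, 8): one independent hidden state per batch row
```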
@sparkingarthur ,
Well put.
I understand that the result is degraded when using more than 1 GPU; however, I don’t see why this has to be the case. When training on a single GPU, parallelization is already used (correct me if I’m wrong here). With that said, one could hope that parallelization over several GPUs should not be an issue, like in the example I showed above in which sequence bucketing was used (see Figure 5). I agree that scaling (referring to computational resources) may be affected in a multi-GPU setup; however, it should be possible to obtain the same loss for a multi-GPU setup as for a single-GPU setup.
BTW, what did you think of sequence bucketing? Something worth putting a bit of my time on?
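For reference, this is roughly what I mean by sequence bucketing (a generic sketch of my own, not the code behind Figure 5): variable-length sequences are grouped by length so each bucket can be padded and batched with little wasted computation.

```python
from collections import defaultdict
import numpy as np

def bucket_by_length(sequences, bucket_width=10):
    """Group 1-D sequences into buckets keyed by their rounded-up length."""
    buckets = defaultdict(list)
    for seq in sequences:
        key = ((len(seq) + bucket_width - 1) // bucket_width) * bucket_width
        buckets[key].append(seq)
    # Pad every sequence in a bucket to that bucket's key length.
    return {
        key: np.array([np.pad(s, (0, key - len(s)), mode="constant") for s in seqs])
        for key, seqs in buckets.items()
    }

batches = bucket_by_length([np.ones(3), np.ones(7), np.ones(12)])
for length, batch in sorted(batches.items()):
    print(length, batch.shape)  # 10 (2, 10), then 20 (1, 20)
```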