tensorflow: Random slowdowns during training.
Usually the time per training batch is fairly consistent for me when using TensorFlow. However, I am currently training an RNN, and while each batch of 128 samples took about 1 second at first, partway into the third epoch the times have become inconsistent: batches will randomly become very slow. I can hear the fan on my GTX Titan spin down, which suggests that during these slowdowns the GPU's compute resources are sitting idle.
I am timing each batch like this, so I don't think anything else in my code could be affecting the measurement:
import time

begin_time = time.time()
loss, ts = sess.run([cost, train_step],
                    feed_dict={input_tensor: x_train, expected_output: y_train, keep_prob: 0.8})
end_time = time.time()  # sess.run blocks until the step finishes, so this covers the whole batch
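For anyone trying to narrow this down, here is a minimal sketch (not from the original setup) of wrapping the same sess.run call with TF 1.x run metadata so a single slow step can be traced; the tensor names are the ones used above and the output filename is just an example. Opening the resulting JSON in chrome://tracing shows whether a slow batch is spending its time in GPU kernels or stalled waiting on something else.

import tensorflow as tf
from tensorflow.python.client import timeline

# Request a full trace for this one step (it adds overhead, so only do it occasionally).
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
loss, ts = sess.run([cost, train_step],
                    feed_dict={input_tensor: x_train, expected_output: y_train, keep_prob: 0.8},
                    options=run_options, run_metadata=run_metadata)

# Write a Chrome trace of where the step spent its time.
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline_step.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())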
Here is the output from my training loop with the timing information.
Epoch 3 Batch 858 Loss 172.072438250 Last Loss 206.985626221 Time 1.432
Epoch 3 Batch 859 Loss 172.067967827 Last Loss 168.227874756 Time 1.419
Epoch 3 Batch 860 Loss 172.057925642 Last Loss 163.421646118 Time 1.447
Epoch 3 Batch 861 Loss 172.056937587 Last Loss 171.206222534 Time 1.339
Epoch 3 Batch 862 Loss 172.051488565 Last Loss 167.354431152 Time 1.285
Epoch 3 Batch 863 Loss 172.016926642 Last Loss 142.189987183 Time 1.310
Epoch 3 Batch 864 Loss 172.011091517 Last Loss 166.969543457 Time 1.291
Epoch 3 Batch 865 Loss 172.011779468 Last Loss 172.606857300 Time 1.317
Epoch 3 Batch 866 Loss 172.004100742 Last Loss 165.354324341 Time 1.307
Epoch 3 Batch 867 Loss 172.008801860 Last Loss 176.084671021 Time 1.372
Epoch 3 Batch 868 Loss 172.032961320 Last Loss 193.003372192 Time 1.298
Epoch 3 Batch 869 Loss 172.036121868 Last Loss 174.782638550 Time 1.310
Epoch 3 Batch 870 Loss 172.044807513 Last Loss 179.601318359 Time 1.429
Epoch 3 Batch 871 Loss 172.066208551 Last Loss 190.706512451 Time 1.311
Epoch 3 Batch 872 Loss 172.052568940 Last Loss 160.158828735 Time 2.614
Epoch 3 Batch 873 Loss 172.032694941 Last Loss 154.682693481 Time 2.208
Epoch 3 Batch 874 Loss 172.040674805 Last Loss 179.015075684 Time 3.138
Epoch 3 Batch 875 Loss 172.034015673 Last Loss 166.207275391 Time 1.750
Epoch 3 Batch 876 Loss 172.029174909 Last Loss 167.788665771 Time 1.955
Epoch 3 Batch 877 Loss 172.042762183 Last Loss 183.958801270 Time 2.576
Epoch 3 Batch 878 Loss 172.028727179 Last Loss 159.705993652 Time 3.130
Epoch 3 Batch 879 Loss 172.035255710 Last Loss 177.773834229 Time 2.622
Epoch 3 Batch 880 Loss 172.043415273 Last Loss 179.223831177 Time 1.581
Epoch 3 Batch 881 Loss 172.040243793 Last Loss 169.246170044 Time 3.093
Epoch 3 Batch 882 Loss 172.017778805 Last Loss 152.203659058 Time 2.470
Epoch 3 Batch 883 Loss 172.018503957 Last Loss 172.658813477 Time 2.540
Epoch 3 Batch 884 Loss 172.054275064 Last Loss 203.675933838 Time 2.597
Epoch 3 Batch 885 Loss 172.029123448 Last Loss 149.769943237 Time 2.915
Epoch 3 Batch 886 Loss 172.013957027 Last Loss 158.576507568 Time 3.095
Epoch 3 Batch 887 Loss 171.994215424 Last Loss 154.483413696 Time 2.250
Epoch 3 Batch 888 Loss 171.998252946 Last Loss 175.583572388 Time 2.997
Epoch 3 Batch 889 Loss 171.974352959 Last Loss 150.727264404 Time 3.417
Epoch 3 Batch 890 Loss 171.968166587 Last Loss 166.462295532 Time 2.290
Epoch 3 Batch 891 Loss 172.011782095 Last Loss 210.873199463 Time 1.358
Epoch 3 Batch 892 Loss 172.013166695 Last Loss 173.248229980 Time 0.910
Epoch 3 Batch 893 Loss 172.016952293 Last Loss 175.397491455 Time 0.893
Epoch 3 Batch 894 Loss 172.015184030 Last Loss 170.434356689 Time 0.986
Epoch 3 Batch 895 Loss 172.001527531 Last Loss 159.778961182 Time 1.000
Epoch 3 Batch 896 Loss 172.008338426 Last Loss 178.110900879 Time 0.999
Epoch 3 Batch 897 Loss 172.015577083 Last Loss 178.508651733 Time 1.699
Epoch 3 Batch 898 Loss 172.076713689 Last Loss 226.977386475 Time 1.570
Epoch 3 Batch 899 Loss 172.043461711 Last Loss 142.149932861 Time 1.699
Epoch 3 Batch 900 Loss 172.069320933 Last Loss 195.342620850 Time 1.678
Epoch 3 Batch 901 Loss 172.063626562 Last Loss 166.932998657 Time 1.685
Epoch 3 Batch 902 Loss 172.060249506 Last Loss 169.014144897 Time 1.467
Epoch 3 Batch 903 Loss 172.068654811 Last Loss 179.658645630 Time 3.737
Epoch 3 Batch 904 Loss 172.061790508 Last Loss 165.856460571 Time 3.705
Epoch 3 Batch 905 Loss 172.078748707 Last Loss 187.425918579 Time 3.376
Epoch 3 Batch 906 Loss 172.065538518 Last Loss 160.097106934 Time 4.606
Epoch 3 Batch 907 Loss 172.040349112 Last Loss 149.193557739 Time 4.513
Epoch 3 Batch 908 Loss 172.078197416 Last Loss 206.444458008 Time 2.491
Epoch 3 Batch 909 Loss 172.081030156 Last Loss 174.655990601 Time 4.244
For comparison, here is what the output normally looks like when this issue is not occurring.
Epoch 1 Batch 213 Loss 262.820208541 Last Loss 189.761398315 Time 0.978
Epoch 1 Batch 214 Loss 262.570083973 Last Loss 209.043426514 Time 0.984
Epoch 1 Batch 215 Loss 262.265294534 Last Loss 196.735565186 Time 0.985
Epoch 1 Batch 216 Loss 261.973181588 Last Loss 198.876785278 Time 0.988
Epoch 1 Batch 217 Loss 261.685996729 Last Loss 199.366882324 Time 0.989
Epoch 1 Batch 218 Loss 261.385472058 Last Loss 195.871093750 Time 0.981
Epoch 1 Batch 219 Loss 261.200773135 Last Loss 220.751708984 Time 0.983
Epoch 1 Batch 220 Loss 260.912834202 Last Loss 197.566268921 Time 0.982
Epoch 1 Batch 221 Loss 260.757691254 Last Loss 226.471099854 Time 0.981
Epoch 1 Batch 222 Loss 260.588221186 Last Loss 222.965866089 Time 0.984
Epoch 1 Batch 223 Loss 260.272108146 Last Loss 189.778900146 Time 0.992
Epoch 1 Batch 224 Loss 259.962260607 Last Loss 190.556411743 Time 0.981
Epoch 1 Batch 225 Loss 259.509933337 Last Loss 157.736297607 Time 0.985
Epoch 1 Batch 226 Loss 259.273305347 Last Loss 205.795379639 Time 0.982
Epoch 1 Batch 227 Loss 259.062388972 Last Loss 211.184371948 Time 0.991
Epoch 1 Batch 228 Loss 258.901376783 Last Loss 222.190597534 Time 0.984
Epoch 1 Batch 229 Loss 258.643836975 Last Loss 199.667221069 Time 0.986
Epoch 1 Batch 230 Loss 258.344224079 Last Loss 189.433258057 Time 0.979
After a while it makes it through the slow phase and returns to about 1 second per batch. I am using Ubuntu 16.04 with cuDNN 5 and CUDA 8. The problem is intermittent and I have no reliable way to reproduce it.
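One way to confirm that the GPU really is idle during the slow stretches, rather than inferring it from the fan, is to log utilization next to the batch timings. A rough sketch, assuming nvidia-smi is on the PATH (the query flags below are standard nvidia-smi options):

import subprocess

def gpu_utilization():
    # Current utilization (%) of every visible GPU, as reported by nvidia-smi.
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=utilization.gpu', '--format=csv,noheader,nounits'])
    return [int(x) for x in out.decode().strip().splitlines()]

# Next to the existing timing print inside the training loop:
print('Time %.3f GPU util %s' % (end_time - begin_time, gpu_utilization()))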
About this issue
- State: closed
- Created 8 years ago
- Reactions: 2
- Comments: 16 (7 by maintainers)
Just to add my 2 cents: I'm running into a similar issue. My setup has 4 Titan X GPUs and plenty of CPU and RAM, and I train two models at a time (each using two GPUs). One model had been training for a long time with consistent step times. I then start up the second model, and for about 1500-3000 steps both models train at consistent speeds. After a while, both models slow down considerably with sporadic step times. If I then kill the second model, the first one goes back to fast, consistent steps. I can't figure out what the problem is; neither my CPU nor my RAM is fully utilized.
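For anyone hitting this with multiple training processes on one machine, one thing worth ruling out is contention between the jobs for the same devices or GPU memory. A rough sketch of isolating each process (the device indices are just an example, and allow_growth simply keeps TensorFlow from grabbing all GPU memory up front):

import os
import tensorflow as tf

# Pin this process to two specific GPUs; the second training process would get '2,3'.
# This has to be set before TensorFlow initializes the GPUs.
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

# Allocate GPU memory on demand instead of claiming it all at session creation.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)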