tensorflow: Random slowdowns during training.

Usually the speed of each training batch is fairly consistent for me when using TensorFlow. However, I am currently training an RNN. For the first 3 epochs each batch of 128 samples took about 1 second. Now, in the third epoch, my times have become inconsistent: batches randomly become very slow. I can hear the fan on my GTX Titan spin down, which suggests that during these slowdowns the GPU's compute resources are not being used.
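
To confirm the GPU really is idle during these stalls (rather than going by the fan), utilization can be polled from a second terminal. A minimal sketch, assuming `nvidia-smi` is on the PATH:

    # Poll GPU utilization and memory use once a second while training runs elsewhere.
    import subprocess
    import time

    while True:
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=utilization.gpu,memory.used",
             "--format=csv,noheader"])
        print(time.strftime("%H:%M:%S"), out.decode().strip())
        time.sleep(1)

During a slow batch the reported utilization should drop close to 0% if the GPU is in fact waiting on something else.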

I am timing it like this, so I don't think anything else in my code could be affecting the measurement:

            begin_time = time.time()
            loss, ts = sess.run([cost, train_step],
                                feed_dict={input_tensor: x_train,
                                           expected_output: y_train,
                                           keep_prob: 0.8})
            end_time = time.time()
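
Only the `sess.run` call sits inside the timed region, so Python-side batch preparation is excluded. A sketch of how the surrounding loop might look (names such as `batches` and `epoch` are illustrative, not taken from the actual training script):

    import time

    for batch, (x_train, y_train) in enumerate(batches):  # hypothetical batch iterator
        begin_time = time.time()
        loss, ts = sess.run([cost, train_step],
                            feed_dict={input_tensor: x_train,
                                       expected_output: y_train,
                                       keep_prob: 0.8})
        end_time = time.time()
        print("Epoch %d   Batch %d   Last Loss %.9f   Time %.3f"
              % (epoch, batch, loss, end_time - begin_time))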

Here is the output with timing information.

Epoch 3        Batch 858      Loss 172.072438250   Last Loss 206.985626221   Time 1.432 
Epoch 3        Batch 859      Loss 172.067967827   Last Loss 168.227874756   Time 1.419 
Epoch 3        Batch 860      Loss 172.057925642   Last Loss 163.421646118   Time 1.447 
Epoch 3        Batch 861      Loss 172.056937587   Last Loss 171.206222534   Time 1.339 
Epoch 3        Batch 862      Loss 172.051488565   Last Loss 167.354431152   Time 1.285 
Epoch 3        Batch 863      Loss 172.016926642   Last Loss 142.189987183   Time 1.310 
Epoch 3        Batch 864      Loss 172.011091517   Last Loss 166.969543457   Time 1.291 
Epoch 3        Batch 865      Loss 172.011779468   Last Loss 172.606857300   Time 1.317 
Epoch 3        Batch 866      Loss 172.004100742   Last Loss 165.354324341   Time 1.307 
Epoch 3        Batch 867      Loss 172.008801860   Last Loss 176.084671021   Time 1.372 
Epoch 3        Batch 868      Loss 172.032961320   Last Loss 193.003372192   Time 1.298 
Epoch 3        Batch 869      Loss 172.036121868   Last Loss 174.782638550   Time 1.310 
Epoch 3        Batch 870      Loss 172.044807513   Last Loss 179.601318359   Time 1.429 
Epoch 3        Batch 871      Loss 172.066208551   Last Loss 190.706512451   Time 1.311 
Epoch 3        Batch 872      Loss 172.052568940   Last Loss 160.158828735   Time 2.614 
Epoch 3        Batch 873      Loss 172.032694941   Last Loss 154.682693481   Time 2.208 
Epoch 3        Batch 874      Loss 172.040674805   Last Loss 179.015075684   Time 3.138 
Epoch 3        Batch 875      Loss 172.034015673   Last Loss 166.207275391   Time 1.750 
Epoch 3        Batch 876      Loss 172.029174909   Last Loss 167.788665771   Time 1.955 
Epoch 3        Batch 877      Loss 172.042762183   Last Loss 183.958801270   Time 2.576 
Epoch 3        Batch 878      Loss 172.028727179   Last Loss 159.705993652   Time 3.130 
Epoch 3        Batch 879      Loss 172.035255710   Last Loss 177.773834229   Time 2.622 
Epoch 3        Batch 880      Loss 172.043415273   Last Loss 179.223831177   Time 1.581 
Epoch 3        Batch 881      Loss 172.040243793   Last Loss 169.246170044   Time 3.093 
Epoch 3        Batch 882      Loss 172.017778805   Last Loss 152.203659058   Time 2.470 
Epoch 3        Batch 883      Loss 172.018503957   Last Loss 172.658813477   Time 2.540 
Epoch 3        Batch 884      Loss 172.054275064   Last Loss 203.675933838   Time 2.597 
Epoch 3        Batch 885      Loss 172.029123448   Last Loss 149.769943237   Time 2.915 
Epoch 3        Batch 886      Loss 172.013957027   Last Loss 158.576507568   Time 3.095 
Epoch 3        Batch 887      Loss 171.994215424   Last Loss 154.483413696   Time 2.250 
Epoch 3        Batch 888      Loss 171.998252946   Last Loss 175.583572388   Time 2.997 
Epoch 3        Batch 889      Loss 171.974352959   Last Loss 150.727264404   Time 3.417 
Epoch 3        Batch 890      Loss 171.968166587   Last Loss 166.462295532   Time 2.290 
Epoch 3        Batch 891      Loss 172.011782095   Last Loss 210.873199463   Time 1.358 
Epoch 3        Batch 892      Loss 172.013166695   Last Loss 173.248229980   Time 0.910 
Epoch 3        Batch 893      Loss 172.016952293   Last Loss 175.397491455   Time 0.893 
Epoch 3        Batch 894      Loss 172.015184030   Last Loss 170.434356689   Time 0.986 
Epoch 3        Batch 895      Loss 172.001527531   Last Loss 159.778961182   Time 1.000 
Epoch 3        Batch 896      Loss 172.008338426   Last Loss 178.110900879   Time 0.999 
Epoch 3        Batch 897      Loss 172.015577083   Last Loss 178.508651733   Time 1.699 
Epoch 3        Batch 898      Loss 172.076713689   Last Loss 226.977386475   Time 1.570 
Epoch 3        Batch 899      Loss 172.043461711   Last Loss 142.149932861   Time 1.699 
Epoch 3        Batch 900      Loss 172.069320933   Last Loss 195.342620850   Time 1.678 
Epoch 3        Batch 901      Loss 172.063626562   Last Loss 166.932998657   Time 1.685 
Epoch 3        Batch 902      Loss 172.060249506   Last Loss 169.014144897   Time 1.467 
Epoch 3        Batch 903      Loss 172.068654811   Last Loss 179.658645630   Time 3.737 
Epoch 3        Batch 904      Loss 172.061790508   Last Loss 165.856460571   Time 3.705 
Epoch 3        Batch 905      Loss 172.078748707   Last Loss 187.425918579   Time 3.376 
Epoch 3        Batch 906      Loss 172.065538518   Last Loss 160.097106934   Time 4.606 
Epoch 3        Batch 907      Loss 172.040349112   Last Loss 149.193557739   Time 4.513 
Epoch 3        Batch 908      Loss 172.078197416   Last Loss 206.444458008   Time 2.491 
Epoch 3        Batch 909      Loss 172.081030156   Last Loss 174.655990601   Time 4.244 

Here is what the output normally looks like when this issue is not occurring.

Epoch 1        Batch 213      Loss 262.820208541   Last Loss 189.761398315   Time 0.978 
Epoch 1        Batch 214      Loss 262.570083973   Last Loss 209.043426514   Time 0.984 
Epoch 1        Batch 215      Loss 262.265294534   Last Loss 196.735565186   Time 0.985 
Epoch 1        Batch 216      Loss 261.973181588   Last Loss 198.876785278   Time 0.988 
Epoch 1        Batch 217      Loss 261.685996729   Last Loss 199.366882324   Time 0.989 
Epoch 1        Batch 218      Loss 261.385472058   Last Loss 195.871093750   Time 0.981 
Epoch 1        Batch 219      Loss 261.200773135   Last Loss 220.751708984   Time 0.983 
Epoch 1        Batch 220      Loss 260.912834202   Last Loss 197.566268921   Time 0.982 
Epoch 1        Batch 221      Loss 260.757691254   Last Loss 226.471099854   Time 0.981 
Epoch 1        Batch 222      Loss 260.588221186   Last Loss 222.965866089   Time 0.984 
Epoch 1        Batch 223      Loss 260.272108146   Last Loss 189.778900146   Time 0.992 
Epoch 1        Batch 224      Loss 259.962260607   Last Loss 190.556411743   Time 0.981 
Epoch 1        Batch 225      Loss 259.509933337   Last Loss 157.736297607   Time 0.985 
Epoch 1        Batch 226      Loss 259.273305347   Last Loss 205.795379639   Time 0.982 
Epoch 1        Batch 227      Loss 259.062388972   Last Loss 211.184371948   Time 0.991 
Epoch 1        Batch 228      Loss 258.901376783   Last Loss 222.190597534   Time 0.984 
Epoch 1        Batch 229      Loss 258.643836975   Last Loss 199.667221069   Time 0.986 
Epoch 1        Batch 230      Loss 258.344224079   Last Loss 189.433258057   Time 0.979 

After a while it makes it through the slow phase and returns to about 1 s per batch. I am using Ubuntu 16.04 with cuDNN 5 and CUDA 8. The problem is intermittent and I have no reliable way to reproduce it.
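
One way to see where the time actually goes during a slow step is to capture a Chrome trace for an individual `sess.run` call with the TF 1.x timeline API. A sketch (the 2-second threshold for dumping a trace is my own choice; full tracing adds some overhead, so it is only worth enabling while debugging):

    import time
    import tensorflow as tf
    from tensorflow.python.client import timeline

    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()

    begin_time = time.time()
    loss, ts = sess.run([cost, train_step],
                        feed_dict={input_tensor: x_train,
                                   expected_output: y_train,
                                   keep_prob: 0.8},
                        options=run_options,
                        run_metadata=run_metadata)
    elapsed = time.time() - begin_time

    if elapsed > 2.0:  # only dump traces for batches that are clearly slow
        trace = timeline.Timeline(run_metadata.step_stats)
        with open("timeline_slow_batch.json", "w") as f:
            f.write(trace.generate_chrome_trace_format())

The resulting JSON can be opened in chrome://tracing to show whether a slow batch is spent in GPU kernels, in host-to-device copies, or waiting on the input feed.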

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 2
  • Comments: 16 (7 by maintainers)

Most upvoted comments

Just to add my 2 cents: I'm running into a similar issue. My setup has 4 Titan X GPUs and plenty of CPU/RAM. I'm training two models at a time (each using two GPUs). One model had been training for a long time with consistent times per step. I then start up the second model, and things look good for about 1500-3000 steps, with both models training at consistent step times. After a while both models slow down considerably, with sporadic training times. If I kill the second model, the first one goes back to fast, consistent time per step. I can't figure out what the problem is; neither my CPU nor my RAM is fully utilized.
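
One thing worth ruling out in a setup like this is the two processes contending for the same GPUs or for GPU memory. A minimal sketch of isolating each job with the standard TF 1.x session config (the GPU indices are illustrative):

    import os
    import tensorflow as tf

    # Pin this process to its own pair of GPUs before any CUDA initialization;
    # the second training job would use "2,3" instead.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True  # allocate GPU memory on demand
    sess = tf.Session(config=config)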