tensorflow: TensorFlow nearly 2x slower than PyTorch in training!

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: -
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.7.0 and 1.10.1
  • Python version: 3.6
  • Bazel version (if compiling from source):-
  • GCC/Compiler version (if compiling from source):-
  • CUDA/cuDNN version: 9.0 / 7.0 and 7.1
  • GPU model and memory: GTX 1080 / 8 GB
  • Exact command to reproduce:

Describe the problem

I trained the same network in both TensorFlow and PyTorch and noticed that TensorFlow is nearly 2x slower than PyTorch: each epoch takes roughly 50 seconds in PyTorch but about 90 seconds in TensorFlow.
I initially considered asking on Stack Overflow, but after finding a similar issue from about a year ago (https://github.com/tensorflow/tensorflow/issues/7187), which reported the same performance gap and was ultimately resolved, I decided this is likely a bug.
I uploaded the TensorFlow code snippets, which are taken from the official TF model repository (resnet) and only minimally modified to use the new architecture (simpleNet, a very simple convolutional architecture); nearly everything else is intact.
Note that the fixes that resolved the earlier similar issue do not apply here, since the code base has since been updated to follow the latest TF improvements. Avoiding feed_dict, keeping the input pipeline on the CPU instead of the GPU, and setting the TF_ENABLE_WINOGRAD_NONFUSED flag are all already handled in the example used here, yet the performance gap persists.
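For reference, the Winograd flag mentioned above is an environment variable; it is typically exported in the shell before launching training, and setting it from Python before any TensorFlow import is the equivalent. A minimal sketch (the value "1" enables the non-fused Winograd convolution kernels):

```python
import os

# TF_ENABLE_WINOGRAD_NONFUSED is read from the environment by TensorFlow's
# convolution autotuner; set it before TensorFlow is imported so it is
# guaranteed to be visible when kernels are selected.
os.environ["TF_ENABLE_WINOGRAD_NONFUSED"] = "1"

# import tensorflow as tf  # imported only after the flag is set
```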

Source code / logs

Here are the code snippets :

simple_model.py
cifar10_main.py
simplenet_run_loop.py

Here are the logs for PyTorch and TensorFlow, respectively:

PyTorch (v0.4):

==>>[2018-08-19 00:00:26] [Epoch=000/450] [Need: 00:00:00] [learning_rate=0.100000] [Best : Accuracy=0.00, Error=100.00]
  Epoch: [000][000/500]   Time 1.345 (1.345)   Data 0.089 (0.089)   Loss 4.8846 (4.8846)   Prec@1 0.000 (0.000)   Prec@5 4.000 (4.000)   [2018-08-19 00:00:28]
  Epoch: [000][200/500]   Time 0.089 (0.096)   Data 0.000 (0.001)   Loss 4.0047 (4.3586)   Prec@1 6.000 (3.771)   Prec@5 28.000 (14.930)   [2018-08-19 00:00:46]
  Epoch: [000][400/500]   Time 0.089 (0.093)   Data 0.000 (0.000)   Loss 3.9328 (4.1781)   Prec@1 9.000 (5.519)   Prec@5 26.000 (20.142)   [2018-08-19 00:01:04]
  **Train** Prec@1 6.352 Prec@5 22.334 Error@1 93.648
  **Test** Prec@1 8.520 Prec@5 31.600 Error@1 91.480

==>>[2018-08-19 00:01:17] [Epoch=001/450] [Need: 06:07:54] [learning_rate=0.100000] [Best : Accuracy=8.52, Error=91.48]
  Epoch: [001][000/500]   Time 0.128 (0.128)   Data 0.086 (0.086)   Loss 3.7810 (3.7810)   Prec@1 9.000 (9.000)   Prec@5 34.000 (34.000)   [2018-08-19 00:01:17]
  Epoch: [001][200/500]   Time 0.090 (0.090)   Data 0.000 (0.001)   Loss 3.5385 (3.7109)   Prec@1 18.000 (11.517)   Prec@5 39.000 (34.861)   [2018-08-19 00:01:35]
  Epoch: [001][400/500]   Time 0.088 (0.090)   Data 0.000 (0.000)   Loss 3.6088 (3.6151)   Prec@1 11.000 (13.274)   Prec@5 34.000 (38.102)   [2018-08-19 00:01:54]
  **Train** Prec@1 14.048 Prec@5 39.416 Error@1 85.952
  **Test** Prec@1 19.110 Prec@5 45.950 Error@1 80.890

==>>[2018-08-19 00:02:07] [Epoch=002/450] [Need: 06:10:38] [learning_rate=0.100000] [Best : Accuracy=19.11, Error=80.89]
  Epoch: [002][000/500]   Time 0.133 (0.133)   Data 0.086 (0.086)   Loss 3.4438 (3.4438)   Prec@1 17.000 (17.000)   Prec@5 45.000 (45.000)   [2018-08-19 00:02:07]
  Epoch: [002][200/500]   Time 0.089 (0.091)   Data 0.000 (0.001)   Loss 3.1025 (3.2688)   Prec@1 26.000 (19.721)   Prec@5 56.000 (48.085)   [2018-08-19 00:02:26]
  Epoch: [002][400/500]   Time 0.092 (0.091)   Data 0.000 (0.000)   Loss 2.9271 (3.1983)   Prec@1 24.000 (20.998)   Prec@5 57.000 (50.125)   [2018-08-19 00:02:44]
  **Train** Prec@1 21.658 Prec@5 50.980 Error@1 78.342
  **Test** Prec@1 26.430 Prec@5 56.030 Error@1 73.570

==>>[2018-08-19 00:02:57] [Epoch=003/450] [Need: 06:10:40] [learning_rate=0.100000] [Best : Accuracy=26.43, Error=73.57]
  Epoch: [003][000/500]   Time 0.136 (0.136)   Data 0.087 (0.087)   Loss 2.8432 (2.8432)   Prec@1 23.000 (23.000)   Prec@5 55.000 (55.000)   [2018-08-19 00:02:57]
  Epoch: [003][200/500]   Time 0.092 (0.091)   Data 0.000 (0.001)   Loss 2.8233 (2.8715)   Prec@1 33.000 (26.567)   Prec@5 62.000 (58.433)   [2018-08-19 00:03:16]
  Epoch: [003][400/500]   Time 0.092 (0.091)   Data 0.000 (0.000)   Loss 2.6975 (2.8040)   Prec@1 29.000 (28.047)   Prec@5 58.000 (60.125)   [2018-08-19 00:03:34]
  **Train** Prec@1 28.784 Prec@5 60.932 Error@1 71.216
  **Test** Prec@1 30.960 Prec@5 61.810 Error@1 69.040

TensorFlow (v1.7.0):

totalMemory: 7.93GiB freeMemory: 7.38GiB
2018-08-26 09:06:38.766280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-08-26 09:06:38.929447: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-26 09:06:38.929480: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2018-08-26 09:06:38.929484: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2018-08-26 09:06:38.929650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/device:GPU:0 with 7131 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Benchmark run: {'model_name': 'simpnet', 'machine_config': {'cpu_info': {'num_cores': 8, 'cpu_info': 'Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz', 'mhz_per_cpu': 4000.0}, 'gpu_info': {'count': 1, 'model': 'GeForce GTX 1080'}, 'memory_total': 20986626048, 'memory_available': 14860541952}, 'run_date': '2018-08-26T04:36:38.596702Z', 'tensorflow_version': {'version': '1.7.0', 'git_hash': 'v1.7.0-3-g024aecf414'}, 'tensorflow_environment_variables': [{'name': 'TF_ENABLE_WINOGRAD_NONFUSED', 'value': '1'}]}
INFO:tensorflow:Starting a training cycle: 0/250
INFO:tensorflow:Calling model_fn.
data_format:  channels_first
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2018-08-26 09:06:40.685552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-08-26 09:06:40.685610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-26 09:06:40.685617: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2018-08-26 09:06:40.685630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2018-08-26 09:06:40.685744: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7131 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/cifar10_model/model.ckpt.
INFO:tensorflow:learning_rate = 0.1, cross_entropy = 7.7181897, train_accuracy = 0.09
INFO:tensorflow:loss = 20.531982, step = 0
INFO:tensorflow:global_step/sec: 5.5106
INFO:tensorflow:learning_rate = 0.1, cross_entropy = 2.26069, train_accuracy = 0.105 (18.147 sec)
INFO:tensorflow:loss = 13.960868, step = 100 (18.147 sec)
INFO:tensorflow:global_step/sec: 5.42846
INFO:tensorflow:learning_rate = 0.1, cross_entropy = 2.291137, train_accuracy = 0.10666667 (18.421 sec)
INFO:tensorflow:loss = 12.874262, step = 200 (18.421 sec)
INFO:tensorflow:global_step/sec: 5.62853
INFO:tensorflow:learning_rate = 0.1, cross_entropy = 2.1555672, train_accuracy = 0.1175 (17.767 sec)
INFO:tensorflow:loss = 11.73109, step = 300 (17.767 sec)
INFO:tensorflow:global_step/sec: 5.45977
INFO:tensorflow:learning_rate = 0.1, cross_entropy = 2.0706115, train_accuracy = 0.122 (18.316 sec)
INFO:tensorflow:loss = 10.739094, step = 400 (18.316 sec)
INFO:tensorflow:Saving checkpoints for 500 into /tmp/cifar10_model/model.ckpt.
INFO:tensorflow:Loss for final step: 9.72249.
INFO:tensorflow:Starting to evaluate.
INFO:tensorflow:Calling model_fn.
data_format:  channels_first
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-08-26-04:38:16
INFO:tensorflow:Graph was finalized.
2018-08-26 09:08:16.838937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-08-26 09:08:16.838968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-26 09:08:16.838986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2018-08-26 09:08:16.838989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2018-08-26 09:08:16.839099: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7131 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /tmp/cifar10_model/model.ckpt-500
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-08-26-04:38:22
INFO:tensorflow:Saving dict for global step 500: accuracy = 0.2181, global_step = 500, loss = 9.815871
INFO:tensorflow:Benchmark metric: Name accuracy, value 0, unit None, global_step 500, extras []
INFO:tensorflow:Benchmark metric: Name loss, value 9, unit None, global_step 500, extras []
INFO:tensorflow:Starting a training cycle: 1/250
INFO:tensorflow:Calling model_fn.
data_format:  channels_first
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2018-08-26 09:08:23.994577: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-08-26 09:08:23.994610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-26 09:08:23.994625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2018-08-26 09:08:23.994630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2018-08-26 09:08:23.994726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7131 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /tmp/cifar10_model/model.ckpt-500
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 501 into /tmp/cifar10_model/model.ckpt.
INFO:tensorflow:learning_rate = 0.1, cross_entropy = 1.831348, train_accuracy = 0.23
INFO:tensorflow:loss = 9.679985, step = 500
INFO:tensorflow:global_step/sec: 5.3539
INFO:tensorflow:learning_rate = 0.1, cross_entropy = 1.8214886, train_accuracy = 0.25 (18.678 sec)
INFO:tensorflow:loss = 8.9299, step = 600 (18.678 sec)
INFO:tensorflow:global_step/sec: 5.49204
INFO:tensorflow:learning_rate = 0.1, cross_entropy = 1.7747709, train_accuracy = 0.26666668 (18.208 sec)
INFO:tensorflow:loss = 8.215595, step = 700 (18.208 sec)
INFO:tensorflow:global_step/sec: 5.48954
INFO:tensorflow:learning_rate = 0.1, cross_entropy = 1.7076583, train_accuracy = 0.275 (18.216 sec)
INFO:tensorflow:loss = 7.5460052, step = 800 (18.216 sec)
INFO:tensorflow:global_step/sec: 5.65214
INFO:tensorflow:learning_rate = 0.1, cross_entropy = 1.6106825, train_accuracy = 0.286 (17.692 sec)
INFO:tensorflow:loss = 6.904671, step = 900 (17.692 sec)
INFO:tensorflow:Saving checkpoints for 1000 into /tmp/cifar10_model/model.ckpt.
INFO:tensorflow:Loss for final step: 6.5854077.
INFO:tensorflow:Starting to evaluate.
INFO:tensorflow:Calling model_fn.
data_format:  channels_first
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-08-26-04:39:57
INFO:tensorflow:Graph was finalized.
2018-08-26 09:09:57.431671: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-08-26 09:09:57.431702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-26 09:09:57.431718: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2018-08-26 09:09:57.431722: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2018-08-26 09:09:57.431814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7131 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /tmp/cifar10_model/model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-08-26-04:40:02
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.264, global_step = 1000, loss = 6.684446
INFO:tensorflow:Benchmark metric: Name accuracy, value 0, unit None, global_step 1000, extras []
INFO:tensorflow:Benchmark metric: Name loss, value 6, unit None, global_step 1000, extras []
INFO:tensorflow:Starting a training cycle: 2/250
INFO:tensorflow:Calling model_fn.
data_format:  channels_first
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2018-08-26 09:10:04.247883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-08-26 09:10:04.247914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-26 09:10:04.247930: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2018-08-26 09:10:04.247934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2018-08-26 09:10:04.248028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7131 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /tmp/cifar10_model/model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1001 into /tmp/cifar10_model/model.ckpt.
INFO:tensorflow:learning_rate = 0.1, cross_entropy = 1.7071365, train_accuracy = 0.32
INFO:tensorflow:loss = 6.5100193, step = 1000
INFO:tensorflow:global_step/sec: 5.54579
INFO:tensorflow:learning_rate = 0.1, cross_entropy = 1.5869155, train_accuracy = 0.335 (18.032 sec)
INFO:tensorflow:loss = 5.9459677, step = 1100 (18.032 sec)
INFO:tensorflow:global_step/sec: 5.69552
INFO:tensorflow:learning_rate = 0.1, cross_entropy = 1.754588, train_accuracy = 0.31 (17.558 sec)
INFO:tensorflow:loss = 5.7136836, step = 1200 (17.558 sec)
INFO:tensorflow:global_step/sec: 5.69235
INFO:tensorflow:learning_rate = 0.1, cross_entropy = 1.6495434, train_accuracy = 0.3225 (17.568 sec)
INFO:tensorflow:loss = 5.2463627, step = 1300 (17.568 sec)
INFO:tensorflow:global_step/sec: 5.60754
INFO:tensorflow:learning_rate = 0.1, cross_entropy = 1.6227012, train_accuracy = 0.326 (17.833 sec)
INFO:tensorflow:loss = 4.8923674, step = 1400 (17.833 sec)
INFO:tensorflow:Saving checkpoints for 1500 into /tmp/cifar10_model/model.ckpt.
INFO:tensorflow:Loss for final step: 4.7713437.
INFO:tensorflow:Starting to evaluate.
INFO:tensorflow:Calling model_fn.
data_format:  channels_first
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-08-26-04:41:36
INFO:tensorflow:Graph was finalized.
2018-08-26 09:11:36.521245: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-08-26 09:11:36.521277: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-26 09:11:36.521285: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2018-08-26 09:11:36.521290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2018-08-26 09:11:36.521385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7131 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /tmp/cifar10_model/model.ckpt-1500
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-08-26-04:41:41
INFO:tensorflow:Saving dict for global step 1500: accuracy = 0.3168, global_step = 1500, loss = 4.7169304
INFO:tensorflow:Benchmark metric: Name accuracy, value 0, unit None, global_step 1500, extras []
INFO:tensorflow:Benchmark metric: Name loss, value 4, unit None, global_step 1500, extras []
INFO:tensorflow:Starting a training cycle: 3/250
INFO:tensorflow:Calling model_fn.
data_format:  channels_first
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.

Extra info:
OS: Ubuntu 16.04
GPU: GTX 1080
CPU: Intel i7-4790K
RAM: 20 GB
GPU utilization: 99%
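The 50 s vs. 90 s claim can be sanity-checked directly from the two logs above. A small back-of-the-envelope script (the batch time and step rate are copied from the log lines, not measured here; both runs use 500 batches of 100 images per epoch):

```python
# PyTorch: the running-average time per batch (the value in parentheses
# in "Time 0.089 (0.090)") settles around 0.090 s per batch.
pytorch_batch_time = 0.090               # seconds per batch, from the log
pytorch_epoch = pytorch_batch_time * 500  # 500 batches per epoch -> ~45 s

# TensorFlow: the Estimator reports "global_step/sec" around 5.5,
# and one epoch is likewise 500 steps.
tf_steps_per_sec = 5.5                   # from "global_step/sec: 5.5106" etc.
tf_epoch = 500 / tf_steps_per_sec        # -> ~91 s

print("PyTorch epoch  ~ %.0f s" % pytorch_epoch)
print("TensorFlow epoch ~ %.0f s" % tf_epoch)
print("ratio ~ %.1fx" % (tf_epoch / pytorch_epoch))
```

This matches the reported numbers: roughly 45 s of pure training per epoch in PyTorch versus roughly 91 s in TensorFlow, a factor of about 2x before any per-epoch checkpoint/eval overhead is counted.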

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 23 (12 by maintainers)

Most upvoted comments

@Coderx7 would you double-check that the script below has the same architecture as your “simpnet” (https://github.com/Coderx7/TF_Pytorch_testbed/blob/master/Pytorch/models/simpnet.py), and uses the same input size of 32 and batch size of 100?

code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import tensorflow as tf
import argparse
import os

from tensorpack import *
from tensorpack.tfutils.summary import *
from tensorpack.dataflow import dataset


class Model(ModelDesc):
    def __init__(self, cifar_classnum):
        super(Model, self).__init__()
        self.cifar_classnum = cifar_classnum

    def inputs(self):
        return [tf.placeholder(tf.float32, (None, 32, 32, 3), 'input'),
                tf.placeholder(tf.int32, (None,), 'label')]

    def build_graph(self, image, label):
        is_training = get_current_tower_context().is_training

        image = tf.transpose(image, [0, 3, 1, 2])
        data_format = 'channels_first'

        with argscope(Conv2D, activation=BNReLU, use_bias=False, kernel_size=3), \
                argscope([Conv2D, MaxPooling, BatchNorm, GlobalAvgPooling], data_format=data_format):
            logits = LinearWrap(image) \
                .Conv2D('conv1.1', filters=66) \
                .Conv2D('conv1.2', filters=128) \
                .Conv2D('conv2.1', filters=128) \
                .Conv2D('conv2.2', filters=128) \
                .Conv2D('conv2.3', filters=192) \
                .MaxPooling('pool2', 2, stride=2, padding='SAME') \
                .tf.nn.dropout(0.95) \
                .Conv2D('conv4.1', filters=192) \
                .Conv2D('conv4.2', filters=192) \
                .Conv2D('conv4.3', filters=192) \
                .Conv2D('conv4.4', filters=192) \
                .Conv2D('conv4.5', filters=288) \
                .MaxPooling('pool4', 2, stride=2, padding='SAME') \
                .tf.nn.dropout(0.95) \
                .Conv2D('conv5.1', filters=288) \
                .Conv2D('conv5.2', filters=355) \
                .Conv2D('conv5.3', filters=432) \
                .GlobalAvgPooling('gap') \
                .FullyConnected('linear', out_dim=self.cifar_classnum)()

        cost = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label)
        cost = tf.reduce_mean(cost, name='cross_entropy_loss')

        correct = tf.to_float(tf.nn.in_top_k(logits, label, 1), name='correct')
        # monitor training error
        add_moving_summary(tf.reduce_mean(correct, name='accuracy'))

        # weight decay on all W of fc layers
        wd_cost = regularize_cost('fc.*/W', l2_regularizer(4e-4), name='regularize_loss')
        add_moving_summary(cost, wd_cost)

        add_param_summary(('.*/W', ['histogram']))   # monitor W
        return tf.add_n([cost, wd_cost], name='cost')

    def optimizer(self):
        lr = tf.get_variable('learning_rate', initializer=1e-2, trainable=False)
        tf.summary.scalar('lr', lr)
        return tf.train.AdamOptimizer(lr, epsilon=1e-3)


def get_data(train_or_test, cifar_classnum):
    isTrain = train_or_test == 'train'
    if cifar_classnum == 10:
        ds = dataset.Cifar10(train_or_test)
    else:
        ds = dataset.Cifar100(train_or_test)
    if isTrain:
        augmentors = [
            imgaug.RandomCrop((32, 32)),
            imgaug.Flip(horiz=True),
            imgaug.Brightness(63),
            imgaug.Contrast((0.2, 1.8)),
            imgaug.MeanVarianceNormalize(all_channel=True)
        ]
    else:
        augmentors = [
            imgaug.CenterCrop((32, 32)),
            imgaug.MeanVarianceNormalize(all_channel=True)
        ]
    ds = AugmentImageComponent(ds, augmentors)
    ds = BatchData(ds, 100, remainder=not isTrain)
    if isTrain:
        ds = PrefetchDataZMQ(ds, 5)
    return ds


def get_config(cifar_classnum):
    # prepare dataset
    dataset_train = get_data('train', cifar_classnum)
    dataset_test = get_data('test', cifar_classnum)
    return TrainConfig(
        model=Model(cifar_classnum),
        data=QueueInput(dataset_train),
        callbacks=[
            ModelSaver(),
            InferenceRunner(dataset_test,
                            ScalarStats(['accuracy', 'cost'])),
        ],
        max_epoch=150,
    )


if __name__ == '__main__':
    with tf.Graph().as_default():
        logger.set_logger_dir(os.path.join('train_log', 'cifar' + str(10)))
        config = get_config(10)

        trainer = SimpleTrainer()
        launch_train_with_config(config, trainer)

I only copied an existing cifar10 training script and modified the relevant parts. It will take you some time, but at least it’s only 100 lines to read. https://github.com/Coderx7/TF_Pytorch_testbed/tree/master/TF/simpnet has many more lines of code for me to read, on the other hand.

You can run it with `pip install git+https://github.com/tensorpack/tensorpack.git; python thisfile.py`. On my machine it does run faster (30 seconds per epoch) than your PyTorch code (37 seconds with `bash train_cifar10.sh`), ignoring the first epoch.

Hi there, I just found that using use_bias=False in a very small LSTM network causes SIGNIFICANT slowdowns. Possibly related? (TF2; a 2-layer LSTM went from 10 s to 9 minutes per epoch.)

@ikostrikov Can you please open a new issue? It is frustrating that you are treating this issue like an open forum. It is also somewhat rude to Coderx7, as it takes away from the discussion of his issue.

@Coderx7 I monitor the ImageNet pipeline in this example, but I do not do much with the CIFAR-10 aspect. At 50 seconds or even 2 minutes per epoch it is interesting, but with a large number of things to sort out it does not bubble to the top; digging into this deeply means not doing something else. We are working on the pipeline, and I will try to start monitoring both CIFAR-10 and ImageNet performance going forward. It also may not be the input pipeline at all, but I like to start there as it has a big impact.

I will try to dig in, but my guess is that checkpoints are a big part of it. For each eval, Estimator writes a checkpoint and then reads it back in; this would be really noticeable for fast epochs. Just a guess, and I hope I find time to add some print statements and timings to break the overhead out.

And if that is true, I also find it a bit frustrating, which is why I am interested: if I don’t want that behavior, I would like the option to turn it off. 😃 With pure session.run (or with Eager) you would not have that.
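The checkpoint hypothesis can be roughly quantified from the timestamps in the log above. Each "training cycle" runs 500 steps, then saves a checkpoint, rebuilds the graph for eval, restores the checkpoint, evaluates, and rebuilds again for training. Comparing consecutive cycle-start timestamps against the pure training time (assuming the ~5.5 global_step/sec step rate reported in the log):

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S.%f"
# Start-of-cycle timestamps copied from the "Adding visible gpu devices"
# log lines at the beginning of training cycles 1 and 2.
cycle1_start = datetime.strptime("2018-08-26 09:08:23.994577", fmt)
cycle2_start = datetime.strptime("2018-08-26 09:10:04.247883", fmt)
cycle_len = (cycle2_start - cycle1_start).total_seconds()   # ~100 s

# Pure training time: 500 steps at roughly 5.5 global_step/sec.
train_time = 500 / 5.5                                      # ~91 s

# The remainder is graph rebuilds, checkpoint save/restore, and eval.
overhead = cycle_len - train_time                           # roughly 9-10 s
print("cycle ~ %.0f s, training ~ %.0f s, overhead ~ %.0f s"
      % (cycle_len, train_time, overhead))
```

So on these logs the per-cycle checkpoint/eval machinery costs on the order of 10 seconds on top of ~91 s of training, i.e. it adds to the gap but the bulk of the difference is in the training steps themselves.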

And thank you for the nice sandbox. I know it is work, but it gives me a clean place to experiment, knowing it is what you are using.