tensorflow: Running tensorflow on GPU is far slower than on CPU

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 1809 & Windows Server 2016
  • TensorFlow installed from (source or binary): pip install tensorflow-gpu==2.0.0-beta1, as well as tensorflow-gpu, compared to tensorflow & tensorflow==2.0.0-beta1
  • Python version: 3.6
  • Bazel version (if compiling from source): -
  • GCC/Compiler version (if compiling from source): -
  • CUDA/cuDNN version: CUDA 10.0 & cuDNN 7.6.2 for CUDA 10.0

Current behavior:

I'm getting a 50%+ performance loss with the GPU! In the example below, the CPU version even trains considerably faster on a bigger model with slightly larger epochs. (screenshot: timing)

I'm training on two different systems.

My server, without a GPU:

  • Intel Core i5 6500T (4x @ 2.5 GHz, notebook processor)
  • 16 GB DDR4 RAM
  • Network-attached storage for training data and output

My desktop PC:

  • Intel Core i7 3770K (4x @ 3.5-4 GHz)
  • Nvidia GTX 970 @ 4 GB
  • 32 GB DDR3 RAM
  • Training data on local SSD, output to NAS

Now, interestingly, 3000 epochs of 100,000 records each take roughly 3 h on the server using TF 1.14, while the same run on my desktop with the GPU takes 8 h with TF 2.0. The GPU sits with its video RAM full but at only 3% graphics processor utilization. The CPU is sometimes at 30% utilization with the tensorflow-gpu build, but at 100% the whole time with any CPU build.
The hard disk is utilized at a whopping 0%. (screenshot: system utilisation)
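
To make sure the GPU is at least visible and that ops actually land on it, a quick check (a diagnostic sketch, not part of my training script) is:

import tensorflow as tf

# List the GPUs TensorFlow can see; an empty list means the tensorflow-gpu
# build is not picking up the card at all (driver/CUDA/cuDNN mismatch).
print(tf.config.experimental.list_physical_devices('GPU'))

# Log the device every op is placed on; if most ops land on /CPU:0 even
# though a GPU is listed, the model is not really running on the GPU.
tf.debugging.set_log_device_placement(True)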

Expected behavior:

tensorflow-gpu trains faster than the CPU-only tensorflow build.

Code to reproduce the issue:

My model is a fairly simple Keras Sequential network of dense layers:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import TensorBoard

for learningrate in learningrates:
    for layerdensity in layerdensitys:
        for layer in amount_of_layers:
            ################################
            # generate model               #
            ################################
            modelname = f"{layer}-layer_{layerdensity}-nodes_selu-adam_{learningrate}-learningrate_{records_per_epoch}-epochsize_{appendix}"
            model = keras.Sequential()
            # input layer: 15 features per record
            model.add(Dense(layerdensity, activation=tf.nn.selu, input_dim=15))
            # hidden layers
            for i in range(layer - 1):
                model.add(Dense(layerdensity, activation=tf.nn.selu))
            # output layer: 9 classes
            model.add(Dense(9, activation=tf.nn.softmax, name="Output"))
            # Compile
            optimizer = tf.keras.optimizers.Adam(lr=learningrate)
            model.compile(
                optimizer=optimizer,
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])
            model.summary()
            tensorboard = TensorBoard(
                log_dir="\\\\drg-fs01\\BigData\\Projects\\Notebooks\\PokerBot\\log\\" + modelname,
                histogram_freq=100, write_graph=False)
            #cp_callback = tf.keras.callbacks.ModelCheckpoint("\\\\drg-fs01\\BigData\\Projects\\Notebooks\\PokerBot\\checkpoints\\" + modelname, verbose=0)
            ################################
            # train model                  #
            ################################
            model.fit(trainSet,
                epochs=epochs,
                steps_per_epoch=trainSteps,
                shuffle=True,
                validation_data=testSet,
                validation_steps=testSteps,
                validation_freq=int(epochs / maxTestEpochs),
                verbose=verbose,
                callbacks=[tensorboard])  # ,cp_callback
            model.save(basePath + 'saved_models/' + modelname + '.h5')

I have a lot of training and test data that does not fit in memory, so I use interleaved datasets:

import os
import tensorflow as tf

# dataset modeler; badgesize is the batch size
def modelDataset(sourcepath, badgesize, repeat=False, repetitions=10):
    # get all files
    files = os.listdir(sourcepath)
    pathfiles = [sourcepath + x for x in files]

    # get metrics (every file is assumed to hold the same number of rows)
    rows_per_file = count_lines(sourcepath + "0.csv")
    number_of_files = len(files)
    total_rows = rows_per_file * number_of_files
    print(f"records: {total_rows}")
    # get number of steps per epoch
    steps_per_epoch = int(rows_per_file / badgesize)  # 2000 batches per epoch
    epochs = number_of_files
    if badgesize == 1:
        epochs = 1
    print(f"number epochs: {epochs}")
    # model interleaved dataset: read 4 files in parallel, 16 lines at a time
    dataset = (tf.data.Dataset.from_tensor_slices(pathfiles).interleave(lambda x:
        tf.data.TextLineDataset(x).map(parse_csv, num_parallel_calls=4),
        cycle_length=4, block_length=16))
    dataset.columns = CSV_COLUMNS

    if badgesize != 1:
        dataset = dataset.shuffle(buffer_size=badgesize)
    if repeat:
        dataset = dataset.repeat(repetitions)
        epochs = epochs * repetitions
    dataset = dataset.batch(badgesize)
    dataset = dataset.prefetch(2)  # prefetch two batches
    return dataset, steps_per_epoch, epochs, badgesize
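
For reference, this is roughly how the helper gets wired into the fit() call above; the paths and batch size below are placeholder values for illustration, not my real ones:

# placeholders, just to show how the pieces connect
trainSet, trainSteps, epochs, batchsize = modelDataset("train_data/", 50)
testSet, testSteps, _, _ = modelDataset("test_data/", 50)
maxTestEpochs = 10   # run validation at most this many times over the whole training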

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 4
  • Comments: 15 (3 by maintainers)

Most upvoted comments

I've just spent half a day trying to get through the flimsy installation of GPU support on Windows 10, and when I finally got it all working, my GTX 780 performed about 10 times worse than my CPU. That's bad.

I think your CPU performance is better than your GPU performance because you have a relatively small model. You should try a test run with amount_of_layers = [5, 50, 150] and see if the GPU is still slower than the CPU.

As a poster on Stack Overflow says, "the overhead of invoking GPU kernels, and copying data to and from GPU, is very high. For operations on models with very little parameters it is not worth of using GPU" (Related StackOverflow Post).
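
If you want to see that overhead directly, here is a rough sketch (arbitrary sizes, TF 2.x eager mode) that times the same matrix multiply on CPU and GPU:

import time
import tensorflow as tf

def time_matmul(device, n, reps=50):
    # time `reps` matmuls of an n x n matrix on the given device
    with tf.device(device):
        x = tf.random.uniform((n, n))
        tf.linalg.matmul(x, x)              # warm-up (kernel launch, transfer)
        start = time.time()
        result = None
        for _ in range(reps):
            result = tf.linalg.matmul(x, x)
        _ = result.numpy()                  # force any pending GPU work to finish
        return time.time() - start

for n in (64, 4096):                        # tiny workload vs. big workload
    print(n, "CPU:", time_matmul("/CPU:0", n), "GPU:", time_matmul("/GPU:0", n))

For the tiny case the launch/copy overhead usually dominates and the CPU wins; for the big one the GPU should pull clearly ahead.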

Here you can find a very useful tutorial. I faced the same problem, and that post really helped. The issue was an incompatibility between the cuDNN and CUDA versions installed by conda. Hope this helps.
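
If you want to rule that out quickly, a small sanity check (just a sketch to run inside the training environment) is:

import tensorflow as tf

print(tf.__version__)                  # the TF release actually being imported
print(tf.test.is_built_with_cuda())    # False means a CPU-only build got installed
print(tf.test.is_gpu_available())      # False usually means a CUDA/cuDNN/driver mismatch

Then compare the driver and CUDA version reported by nvidia-smi against the versions that TF release was built for.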

@esslushy Really, nothing jumps to mind. First I thought it might be the I/O over the network, so I pulled 50 GB of data onto the local disk. That didn't help. Having disk and network at literally 0% utilization is a head-scratcher to me. Next I thought it might be the parse function:

import tensorflow as tf

# csv to tensor parser: 15 float feature columns followed by 1 integer label column
def parse_csv(line):
    parsed_line = tf.io.decode_csv(line, [[0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.],
                                          [0.], [0.], [0.], [0.], [0.], [0.], [0.], [0]])
    label = parsed_line[-1:]
    del parsed_line[-1]
    features = parsed_line
    return tf.stack(features), label

When preparing the data with my own C# program, I can shuffle, augment, and write one epoch (100,000 records) to disk in roughly 3 seconds, but during training I only get one step every 2.45 s. So unless this parse function is way, way slower than my simple C# code, it should not be too much of an issue. It would be great if there were a way to time this part in isolation, maybe something like the rough sketch below, but I'm very new to Python.
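
For what it's worth, a rough way to time just the input pipeline, independent of the model (path and batch size are placeholders), would be:

import time

# placeholders: point this at the real CSV folder and batch size
dataset, steps, _, _ = modelDataset("train_data/", 50)

start = time.time()
for i, (features, label) in enumerate(dataset):
    if i >= 100:          # pull 100 batches of pure input work, no model involved
        break
elapsed = time.time() - start
print(f"{elapsed / 100:.3f} s per batch from the pipeline alone")

If that number is anywhere near the 2.45 s per training step, the bottleneck is the CSV parsing/interleaving, not the GPU.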

So I'm left with: "Why would it be faster on a slower CPU, where it has to parse and train on the same chip, than on CPU + GPU?"