tensorflow: Boolean operations on GPU are extremely slow

Arch Linux. CUDA 7.5. Nightly TF built for Python 2.

import tensorflow as tf
import time

v = tf.get_variable('test', shape=[100, 100, 100])
vb = tf.get_variable('test2', shape=[100, 100, 100], dtype=tf.bool,
        initializer=tf.constant_initializer(False))

b1 = tf.reduce_sum(v)                    # float32 reduction
b2 = tf.reduce_all(vb)                   # boolean reduction over a bool variable
b3 = tf.reduce_all(tf.cast(v, tf.bool))  # cast float32 to bool, then reduce

sess = tf.Session()
sess.run(tf.initialize_all_variables())
with sess.as_default():
    start = time.time()
    for k in range(100):
        sess.run(b1)
    print time.time() - start   # 0.02s

    start = time.time()
    for k in range(100):
        sess.run(b2)
    print time.time() - start   # 7s!

    start = time.time()
    for k in range(100):
        sess.run(b3)
    print time.time() - start   # 17s!

The CPU version of the same operations is also much faster than this.
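For comparison, here is a minimal sketch (variable names here are illustrative, not from the report above) that pins the same boolean reduction to the CPU with tf.device:

import tensorflow as tf
import time

with tf.device('/cpu:0'):
    # Same shape and dtype as the GPU-placed variable above, pinned to the CPU.
    vb_cpu = tf.get_variable('test2_cpu', shape=[100, 100, 100], dtype=tf.bool,
            initializer=tf.constant_initializer(False))
    b2_cpu = tf.reduce_all(vb_cpu)    # boolean reduction, CPU kernel

sess = tf.Session()
sess.run(tf.initialize_all_variables())

start = time.time()
for k in range(100):
    sess.run(b2_cpu)
print(time.time() - start)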

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 3
  • Comments: 15 (14 by maintainers)

Most upvoted comments

This is now fixed internally. Boolean reductions will be approximately 1000x faster now.

First, thanks for reporting this issue. There is something going on that looks like a possible bug.

Second, writing good performance benchmarks that measure what you intend, and not something else, is hard. I’m still not completely confident I’ve gotten it right. However, I’m going to suggest an alternative benchmark program, then explain why. It’s not precisely what I ran on my system, and you may need to tweak the Python a bit to get it to run on yours.

benchmark.txt

My CPU is a dual-socket Haswell, and my GPU is a GTX Titan X. Here’s the output I get:

dev cpu size 1024 logical and: 0.002729
dev cpu size 1024 integer add: 0.002192
dev cpu size 1024 float32 add: 0.002473
dev cpu size 1024 logical red: 0.004275
dev cpu size 1024 integer red: 0.005276
dev cpu size 1024 float32 red: 0.004385

dev gpu size 1024 logical and: 0.068111
dev gpu size 1024 integer add: 0.004291
dev gpu size 1024 float32 add: 0.003737
dev gpu size 1024 logical red: 1.761069   <<<< ANOMALY
dev gpu size 1024 integer red: 0.007658
dev gpu size 1024 float32 red: 0.006458

dev cpu size 1048576 logical and: 0.018126
dev cpu size 1048576 integer add: 0.035838
dev cpu size 1048576 float32 add: 0.039299
dev cpu size 1048576 logical red: 0.027802
dev cpu size 1048576 integer red: 0.048078
dev cpu size 1048576 float32 red: 0.053308

dev gpu size 1048576 logical and: 0.007452
dev gpu size 1048576 integer add: 0.015339
dev gpu size 1048576 float32 add: 0.010211
dev gpu size 1048576 logical red: 0.009259
dev gpu size 1048576 integer red: 0.021310
dev gpu size 1048576 float32 red: 0.011330

dev cpu size 10485760 logical and: 0.089549
dev cpu size 10485760 integer add: 0.251854
dev cpu size 10485760 float32 add: 0.292280
dev cpu size 10485760 logical red: 0.089407
dev cpu size 10485760 integer red: 0.270315
dev cpu size 10485760 float32 red: 0.283756

dev gpu size 10485760 logical and: 0.073739
dev gpu size 10485760 integer add: 0.148312
dev gpu size 10485760 float32 add: 0.051901
dev gpu size 10485760 logical red: 0.026299
dev gpu size 10485760 integer red: 0.169700
dev gpu size 10485760 float32 red: 0.053469

The main issues I found with your program are:

  1. Beware of session.run() overhead. Instead of looping many times over session.run(), it’s better to run one graph that’s expensive enough to get a significant time measurement.
  2. Beware of data copying. If you run a single Op on a GPU, you may be mostly measuring the time it takes to copy inputs to the GPU, and maybe the result back off.
  3. Beware of cache effects. For some Ops, effective choreography of data through the cache is an important aspect of efficiency, so we really need to test cold cache performance, not just loop over execution while all data stays in cache.

So my approach is to build a test graph for each Op consisting of many similar instances of the Op application, where the inputs are nearly all device-local. We still get Op-dispatch overhead in the measured time, which is going to be relatively more significant for small tensors.
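As an illustration of that approach (a rough sketch, not the attached benchmark.txt; the names and sizes below are made up), one can chain many applications of the op through a device-resident variable so that a single session.run() amortizes dispatch overhead and no host-to-device copies land in the timed region:

import tensorflow as tf
import time

N_OPS = 200    # number of chained op applications per timed run (illustrative)
SIZE = 1024    # flat tensor size (illustrative)

with tf.device('/gpu:0'):
    # Device-resident input: the data never crosses the PCIe bus while timing.
    x = tf.get_variable('bench_bool', shape=[SIZE], dtype=tf.bool,
            initializer=tf.constant_initializer(True))
    acc = tf.reduce_all(x)
    for i in range(N_OPS - 1):
        # Each link is one broadcast logical_and plus one full reduction;
        # threading `acc` through keeps every application on the critical path.
        acc = tf.reduce_all(tf.logical_and(x, acc))

sess = tf.Session()
sess.run(tf.initialize_all_variables())
sess.run(acc)    # warm-up: keep one-time setup costs out of the measurement

start = time.time()
sess.run(acc)    # a single run executes all N_OPS chained reductions
print((time.time() - start) / N_OPS)    # approximate per-application time

Dividing the elapsed time by the number of chained applications gives a per-op estimate that still includes op-dispatch overhead, which is why that overhead is relatively more significant for small tensors.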

My program compares elementwise Add and And, in addition to full-tensor reductions over boolean, int32, and float32 types, for both CPU and GPU. What I’m seeing is that all these GPU operations are a bit faster than the CPU operations on large tensors, as expected, but on small tensors they are strangely slower, with a very large performance anomaly for boolean logical reduction of a small tensor.

I’m going to open up an internal bug ticket on this issue.

I would be interested if you see something significantly different on your system, or if you can find some flaw in the design of my benchmark program.

investigating