tensorflow: Boolean operations on GPU are extremely slow
Arch Linux. CUDA 7.5. Nightly TF built for Python 2.
import tensorflow as tf
import time
v = tf.get_variable('test', shape=[100, 100, 100])
vb = tf.get_variable('test2', shape=[100, 100, 100], dtype=tf.bool,
                     initializer=tf.constant_initializer(False))
b1 = tf.reduce_sum(v)
b2 = tf.reduce_all(vb)
b3 = tf.reduce_all(tf.cast(v, tf.bool))
sess = tf.Session()
sess.run(tf.initialize_all_variables())
with sess.as_default():
    start = time.time()
    for k in range(100):
        sess.run(b1)
    print time.time() - start  # 0.02s
    start = time.time()
    for k in range(100):
        sess.run(b2)
    print time.time() - start  # 7s!
    start = time.time()
    for k in range(100):
        sess.run(b3)
    print time.time() - start  # 17s!
The CPU version of the same operation is also much faster than this.
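One way to take the corresponding CPU timing, continuing the snippet above, is to pin the same reduction to /cpu:0 (a sketch only; the `_cpu` names are just for illustration):

with tf.device('/cpu:0'):
    vb_cpu = tf.get_variable('test2_cpu', shape=[100, 100, 100], dtype=tf.bool,
                             initializer=tf.constant_initializer(False))
    b2_cpu = tf.reduce_all(vb_cpu)

sess.run(tf.initialize_all_variables())  # re-run init to cover the new variable
start = time.time()
for k in range(100):
    sess.run(b2_cpu)
print time.time() - start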
About this issue
- State: closed
- Created 8 years ago
- Reactions: 3
- Comments: 15 (14 by maintainers)
This is now fixed internally. Boolean reductions will be approximately 1000x faster now.
First, thanks for reporting this issue. There is something going on that looks like a possible bug.
Second, writing good performance benchmarks that measure what you intend, and not something else, is hard. I’m still not completely confident I’ve gotten it right. However, I’m going to suggest an alternative benchmark program, then explain why. It’s not precisely what I run on my system, and you may need to mess with the Python a bit to get it to run on yours.
benchmark.txt
My CPU is a dual-socket Haswell, and my GPU is a GTX Titan X. Here’s the output I get:
dev cpu size 1024 logical and: 0.002729
dev cpu size 1024 integer add: 0.002192
dev cpu size 1024 float32 add: 0.002473
dev cpu size 1024 logical red: 0.004275
dev cpu size 1024 integer red: 0.005276
dev cpu size 1024 float32 red: 0.004385

dev gpu size 1024 logical and: 0.068111
dev gpu size 1024 integer add: 0.004291
dev gpu size 1024 float32 add: 0.003737
dev gpu size 1024 logical red: 1.761069 <<<< ANOMALY
dev gpu size 1024 integer red: 0.007658
dev gpu size 1024 float32 red: 0.006458

dev cpu size 1048576 logical and: 0.018126
dev cpu size 1048576 integer add: 0.035838
dev cpu size 1048576 float32 add: 0.039299
dev cpu size 1048576 logical red: 0.027802
dev cpu size 1048576 integer red: 0.048078
dev cpu size 1048576 float32 red: 0.053308

dev gpu size 1048576 logical and: 0.007452
dev gpu size 1048576 integer add: 0.015339
dev gpu size 1048576 float32 add: 0.010211
dev gpu size 1048576 logical red: 0.009259
dev gpu size 1048576 integer red: 0.021310
dev gpu size 1048576 float32 red: 0.011330

dev cpu size 10485760 logical and: 0.089549
dev cpu size 10485760 integer add: 0.251854
dev cpu size 10485760 float32 add: 0.292280
dev cpu size 10485760 logical red: 0.089407
dev cpu size 10485760 integer red: 0.270315
dev cpu size 10485760 float32 red: 0.283756

dev gpu size 10485760 logical and: 0.073739
dev gpu size 10485760 integer add: 0.148312
dev gpu size 10485760 float32 add: 0.051901
dev gpu size 10485760 logical red: 0.026299
dev gpu size 10485760 integer red: 0.169700
dev gpu size 10485760 float32 red: 0.053469
The main issues I found with your program come down to measuring per-step overhead, Op dispatch plus fetching the result back to the host on every sess.run(), rather than the cost of the Ops themselves.
So my approach is to build a test graph for each Op consisting of many similar instances of the Op application, where the inputs are nearly all device-local. We still get Op-dispatch overhead in the measured time, which is going to be relatively more significant for small tensors.
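For illustration, here is a minimal sketch of that pattern (this is not the attached benchmark.txt; it reuses the same TF-0.x era API as the report, and `n_iters`, `size`, and the variable name are arbitrary). One Session.run() executes many grouped instances of the reduction over a device-resident input, and nothing is fetched back to the host:

import time
import tensorflow as tf

n_iters = 100
size = 1024

with tf.device('/gpu:0'):
    vb = tf.get_variable('bench_bool', shape=[size], dtype=tf.bool,
                         initializer=tf.constant_initializer(True))
    # Many distinct applications of the Op over the device-resident
    # variable; tf.group runs them all in one step without fetching.
    bench_op = tf.group(*[tf.reduce_all(vb) for _ in range(n_iters)])

sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
sess.run(tf.initialize_all_variables())
sess.run(bench_op)   # warm-up: first-run setup is excluded from the timing
start = time.time()
sess.run(bench_op)   # timed step executes n_iters boolean reductions
print 'gpu logical red:', time.time() - start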
My program compares elementwise Add and And, in addition to full-tensor reductions over boolean, int32, and float32 types, for both CPU and GPU. What I’m seeing is that all these GPU operations are a bit faster than the CPU operations on large tensors, as expected, but on small tensors they are strangely slower, with a very large performance anomaly for boolean logical reduction of a small tensor.
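A sketch of how the same grouped-graph idea extends across devices and the three reduction dtypes compared above (again an illustration, not benchmark.txt; the helper name `reduction_bench` is invented here):

import time
import tensorflow as tf

def reduction_bench(device, dtype, size, n_iters=100):
    # Grouped full-tensor reductions on one device: reduce_all for bool,
    # reduce_sum for the numeric types, with a device-local variable input.
    scope = '%s_%s' % (device.strip('/').replace(':', ''), dtype.name)
    with tf.device(device), tf.variable_scope(scope):
        init = tf.constant_initializer(True if dtype == tf.bool else 1)
        x = tf.get_variable('x', shape=[size], dtype=dtype, initializer=init)
        reduce_fn = tf.reduce_all if dtype == tf.bool else tf.reduce_sum
        return tf.group(*[reduce_fn(x) for _ in range(n_iters)])

benches = [(dev, dt, reduction_bench(dev, dt, size=1024))
           for dev in ['/cpu:0', '/gpu:0']
           for dt in [tf.bool, tf.int32, tf.float32]]

sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
sess.run(tf.initialize_all_variables())
for dev, dt, op in benches:
    sess.run(op)              # warm-up
    start = time.time()
    sess.run(op)
    print 'dev %s %s red: %f' % (dev, dt.name, time.time() - start)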
I’m going to open up an internal bug ticket on this issue.
I would be interested if you see something significantly different on your system, or if you can find some flaw in the design of my benchmark program.
investigating