tensorflow: grpc RecvTensor is slow
I ran benchmark tests for a distributed setup over the loopback network, profiled it, and found excessive memory copying on the client side of the RecvTensor call, which is one of the bottlenecks.
Here is the code, mainly borrowed from @yaroslavvb here:
```python
import tensorflow as tf

# Placeholder setup (assumed values) so the snippet is self-contained.
device1 = "/job:worker/task:0"
device2 = "/job:worker/task:1"
params_size = 25 * 1000 * 1000  # ~100MB of float32
dtype = tf.float32

with tf.device(device1):
    params = tf.get_variable("params", shape=[params_size], dtype=dtype,
                             initializer=tf.zeros_initializer())
with tf.device(device2):
    # constant node gets placed on device1 because of simple_placer
    # update = tf.constant(1, shape=[params_size], dtype=dtype)
    update = tf.get_variable("update", shape=[params_size], dtype=dtype,
                             initializer=tf.ones_initializer())
    add_op = params.assign(update)
```
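For context, here is a minimal sketch of how a graph like the one above can be driven for a loopback benchmark. The cluster layout, ports, and timing loop are my assumptions, not the exact script behind the numbers below; it reuses `add_op` and `params_size` from the snippet above.

```python
import time
import tensorflow as tf

# Two in-process servers on localhost stand in for the two tasks; in a real
# setup each task would normally run in its own process.
cluster = tf.train.ClusterSpec({"worker": ["localhost:2222", "localhost:2223"]})
server0 = tf.train.Server(cluster, job_name="worker", task_index=0)
server1 = tf.train.Server(cluster, job_name="worker", task_index=1)

with tf.Session(server0.target) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(add_op)                      # warm-up run
    rounds = 100
    start = time.time()
    for _ in range(rounds):
        sess.run(add_op)                  # each run pulls `update` over gRPC
    elapsed = time.time() - start
    mb_per_round = params_size * 4 / 1e6  # float32 tensor size in MB
    print("throughput: %.1f MB/s" % (rounds * mb_per_round / elapsed))
```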
Here is the profiling result (gperftools) for a 100MB tensor (one observation: throughput degrades as the tensor size increases):
From the result, the sending side (device2) looks fine, but the receiving side (device1, the gRPC client) spends far too many CPU cycles on the data transfer.
By the way, I gathered rough statistics for this memmove call. For one round of a 100MB tensor assignment, roughly 2GB of data is moved (counting the read and the write inside memmove, a naive implementation touches about 4GB), which is a 20x+ amplification over the payload actually transferred. (The numbers are averaged over 100 rounds, so they may not be precise, but the order of magnitude should be right.)
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Comments: 42 (39 by maintainers)
Commits related to this issue
- Integrate gRPC changes to fix #6116 — committed to llhe/tensorflow by llhe 7 years ago
- Integrate gRPC changes to fix #6116 — committed to vjpai/tensorflow by llhe 7 years ago
To make things clearer, I collected more detailed data for the memmove calls:
A typical `move_size`, `slice_size` sequence makes the problem obvious (the `slice_size` values sum to 100MB per run). The root cause appears to be that gRPC's buffer management does not handle large messages well, which also explains why throughput drops as the tensor size increases.
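To see why this pattern hurts more as the message grows, here is a back-of-the-envelope model (my own illustration, not gRPC source). The slice size and per-slice header size below are assumptions, not measurements from this issue:

```python
# Rough model: if the receive path repeatedly takes the first slice and
# memmoves the remaining slice headers down by one slot, the total bytes
# touched by memmove grow with the square of the slice count.
SLICE_BYTES = 8 * 1024        # assumed network read size per slice
SLICE_HEADER_BYTES = 32       # assumed per-slice header size

def memmove_bytes(message_bytes):
    n = message_bytes // SLICE_BYTES          # number of slices in the buffer
    # taking slice i shifts the (n - 1 - i) remaining headers down by one slot
    return sum((n - 1 - i) * SLICE_HEADER_BYTES for i in range(n))

for mb in (10, 50, 100):
    msg = mb * 2**20
    moved = memmove_bytes(msg)
    print("%3d MB message -> %5.2f GB memmoved (%4.0fx amplification)"
          % (mb, moved / 2**30, moved / msg))
```

Under these assumed numbers the memmoved bytes grow roughly quadratically with the message size, so the amplification factor grows with the tensor, which would match the observed trend of throughput dropping for larger tensors.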
I'm not very familiar with the gRPC code, but could adding a gRPC option to switch from `gpr_slice_buffer_take_first` to `gpr_slice_buffer_take_all` remove the unnecessary memory copies? Tuning the slice size can also help reduce the overhead, but cannot eliminate it.
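To illustrate the difference being proposed, here is a toy Python analogue (not the actual gRPC C API or its exact semantics):

```python
class SliceBuffer:
    """Toy stand-in for a gRPC slice buffer."""

    def __init__(self, slices):
        self.slices = list(slices)

    def take_first(self):
        first = self.slices[0]
        # this per-call shift of the remaining slices is the copy that
        # shows up as memmove in the profile
        self.slices[:] = self.slices[1:]
        return first

    def take_all(self):
        # hand the caller the whole slice list in one O(1) swap
        all_slices, self.slices = self.slices, []
        return all_slices
```

With a take-all style call, the per-slice shifting disappears entirely, at the cost of handing the caller the whole buffer at once.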
I used this script to test tensor transmission across 3 different machines; note that task 2 is a client that is responsible for submitting the job.
`grpc` is about 500MB/s, `grpc+gdr` is about 3400MB/s. Interesting!
The correct fix is probably to have `grpc_chttp2_incoming_byte_stream` become a ring-like buffer, so instead of doing a move down the slice array we just increment an index. When we reach the end of `slices`, we can reset the counter to zero. I'll make sure someone takes a look soon.
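For illustration, the proposed ring-style behavior could look like this in the same toy model (a Python sketch, not the actual `grpc_chttp2_incoming_byte_stream` C code):

```python
class RingSliceBuffer:
    """Toy ring-style slice buffer: advance an index instead of shifting."""

    def __init__(self, slices):
        self.slices = list(slices)
        self.head = 0                 # index of the next unread slice

    def take_first(self):
        s = self.slices[self.head]
        self.head += 1                # O(1): no memmove of the remaining slices
        if self.head == len(self.slices):
            # reached the end of `slices`; reset so the storage can be reused
            self.slices.clear()
            self.head = 0
        return s
```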