tensorflow: grpc RecvTensor is slow

I ran benchmark tests for a distributed setup over the loopback network, profiled it, and found excessive memory copying on the client side of the RecvTensor call, which is actually one of the bottlenecks.

Here is the code, which is mostly borrowed from @yaroslavvb here:

  import tensorflow as tf

  # device1/device2 are the two worker device strings (e.g. "/job:worker/task:N");
  # params_size and dtype are set elsewhere in the benchmark script.
  with tf.device(device1):
    params = tf.get_variable("params", shape=[params_size], dtype=dtype,
                             initializer=tf.zeros_initializer())
  with tf.device(device2):
    # A constant node would get placed on device1 by the simple_placer, so use
    # a variable instead:
    #   update = tf.constant(1, shape=[params_size], dtype=dtype)
    update = tf.get_variable("update", shape=[params_size], dtype=dtype,
                             initializer=tf.ones_initializer())
    add_op = params.assign(update)

Here is the profiling result (google perftools) with a 100MB tensor (one observation: throughput degrades as the tensor size increases):

From the result, the sending side (device2) looks fine, but the receiving side (device1, the grpc client) spends far too many CPU cycles on the data transfer.

By the way, I collected rough stats for this memmove call. For one round of a 100MB tensor assignment, roughly 2GB of data is moved (counting both the read and the write inside a naive memmove, it is about 4GB of memory traffic), which is a 20+ times RAM-bandwidth amplification. The figures are averaged over a 100-round run, so they may not be precise, but the scale should be right.

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 42 (39 by maintainers)

Most upvoted comments

To make things clearer, I collected more detailed data for the memmove:

  // The memmove inside grpc's gpr_slice_buffer_take_first: after popping the
  // first slice, the remaining slice headers are shifted down by one.
  // Instrumentation used to collect the (move_size, slice_size) numbers below:
  //   int move_size  = (sb->count - 1) * sizeof(gpr_slice);  // header bytes shifted
  //   int slice_size = GPR_SLICE_LENGTH(slice);              // payload of the popped slice
  memmove(&sb->slices[0], &sb->slices[1], (sb->count - 1) * sizeof(gpr_slice));

A typical move_size, slice_size sequence:

  • move_size: 6096, slice_size: 608
  • move_size: 6072, slice_size: 8192
  • move_size: 6048, slice_size: 7583
  • move_size: 6024, slice_size: 600
  • move_size: 6000, slice_size: 8192
  • move_size: 5976, slice_size: 7591
  • move_size: 5952, slice_size: 592
  • move_size: 5928, slice_size: 8192
  • move_size: 5904, slice_size: 7599
  • move_size: 5880, slice_size: 584
  • move_size: 5856, slice_size: 8192
  • move_size: 5832, slice_size: 7607
  • move_size: 5808, slice_size: 576
  • move_size: 5784, slice_size: 8192
  • move_size: 5760, slice_size: 7615
  • move_size: 5736, slice_size: 568
  • move_size: 5712, slice_size: 8192
  • move_size: 5688, slice_size: 7623
  • move_size: 5664, slice_size: 560
  • move_size: 5640, slice_size: 8192
  • move_size: 5616, slice_size: 7631
  • move_size: 5592, slice_size: 552
  • move_size: 5568, slice_size: 8192
  • move_size: 5544, slice_size: 7639
  • move_size: 5520, slice_size: 544
  • move_size: 5496, slice_size: 8192
  • move_size: 5472, slice_size: 7647
  • move_size: 5448, slice_size: 536
  • move_size: 5424, slice_size: 8192
  • move_size: 5400, slice_size: 7655
  • move_size: 5376, slice_size: 528
  • move_size: 5352, slice_size: 8192
  • move_size: 5328, slice_size: 7663
  • move_size: 5304, slice_size: 520

So the problem is obvious (the slice_size values sum to 100MB per run). The root cause appears to be that grpc's slice-buffer management does not handle large messages well: every time one slice is consumed, all of the remaining slice headers are shifted down by one, so the total bytes moved grows with the square of the number of buffered slices (see the sketch below). This also explains why throughput decreases as the tensor size increases.
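
To make the scaling concrete, here is a back-of-the-envelope simulation (illustrative only, not grpc code). The 24-byte header size is an assumption, though every move_size above is a multiple of 24; how deep the buffer really gets at runtime will vary, the loop just shows how quickly the cost grows with depth when M buffered slices are drained one at a time and every take shifts the remaining headers down:

  /* shift_cost.c: count header bytes moved when an M-slice buffer is drained
   * one slice at a time and every take shifts the remaining headers down. */
  #include <stdio.h>
  #include <stddef.h>

  int main(void) {
    const size_t header_size = 24;  /* assumed slice-header size */
    for (size_t m = 1600; m <= 12800; m *= 2) {
      size_t total_moved = 0;
      for (size_t count = m; count > 1; --count) {
        total_moved += (count - 1) * header_size;  /* one shift per take */
      }
      printf("%6zu buffered slices -> %7.1f MB of slice headers moved\n",
             m, total_moved / (1024.0 * 1024.0));
    }
    return 0;
  }

At ~8KB per slice, a 100MB tensor is on the order of 12,800 slices, and the last row lands in the same ~2GB ballpark as the measurement above. In this model, doubling the message roughly quadruples the header copying, which would match the observed degradation with tensor size.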

I'm not very familiar with the grpc code, but could adding a grpc option to use 'gpr_slice_buffer_take_all' instead of 'gpr_slice_buffer_take_first' remove the unnecessary memory copy? Tuning the slice size could also help reduce the overhead, but it can't eliminate it.
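
A minimal sketch of what "take all" semantics could look like, using made-up types rather than grpc's actual slice-buffer API: the consumer takes ownership of the whole slice array in one swap, so no per-slice header shifting happens at all.

  /* Illustrative only (made-up types, not grpc's API). */
  #include <stddef.h>

  typedef struct { void *data; size_t length; } slice_t;
  typedef struct { slice_t *slices; size_t count; size_t capacity; } slice_buf;

  /* Move every buffered slice from src into dst in O(1); src ends up empty
   * and reuses dst's old storage.  Assumes dst->count == 0 on entry. */
  static void buf_take_all(slice_buf *src, slice_buf *dst) {
    slice_t *old_slices = dst->slices;
    size_t old_capacity = dst->capacity;
    dst->slices = src->slices;
    dst->count = src->count;
    dst->capacity = src->capacity;
    src->slices = old_slices;
    src->capacity = old_capacity;
    src->count = 0;
  }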

I used this script to test tensor transmission across 3 different machines; note that task 2 is the client responsible for submitting the job. grpc reaches about 500MB/s, while grpc+gdr reaches about 3400MB/s.

python3 tensor_transmission.py --host=xxxx1 --port1=xx1 --host_2=xxxx2 --port2=xx2 --task=0
python3 tensor_transmission.py --host=xxxx1 --port1=xx1 --host_2=xxxx2 --port2=xx2 --task=1
python3 tensor_transmission.py --host=xxxx1 --port1=xx1 --host_2=xxxx2 --port2=xx2 --task=2

Interesting!

The correct fix is probably to have grpc_chttp2_incoming_byte_stream become a ring-like buffer, so instead of doing a move down the slice array, we just increment an index. When we reach the end of slices, we can reset the counter to zero.
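
A minimal sketch of that ring-like approach (made-up types, not the actual grpc_chttp2_incoming_byte_stream): taking the first slice just advances a head index and wraps at the end of the array, instead of memmove'ing the remaining headers down.

  #include <stddef.h>

  typedef struct { void *data; size_t length; } slice_t;

  typedef struct {
    slice_t *slices;   /* fixed-capacity array used as a ring */
    size_t capacity;
    size_t head;       /* index of the first buffered slice */
    size_t count;      /* number of buffered slices */
  } slice_ring;

  /* O(1): pop the first slice by advancing the head index, wrapping at the
   * end of the array.  Caller must ensure count > 0. */
  static slice_t ring_take_first(slice_ring *r) {
    slice_t s = r->slices[r->head];
    r->head = (r->head + 1) % r->capacity;
    r->count--;
    return s;
  }

  /* O(1) while there is room: append a newly received slice at the tail. */
  static int ring_add(slice_ring *r, slice_t s) {
    if (r->count == r->capacity) return 0;  /* real code would grow/compact here */
    r->slices[(r->head + r->count) % r->capacity] = s;
    r->count++;
    return 1;
  }

The take-first semantics seen by callers stay the same; only the internal bookkeeping changes from shifting headers to index arithmetic, and growth/compaction has to handle the wrapped case.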

I’ll make sure someone takes a look soon.