tensorflow: grpc RecvTensor is slow

I ran benchmark tests for a distributed setup over the loopback network, profiled it, and found excessive memory copying on the client side of the RecvTensor call, which is actually one of the bottlenecks.

Here is the code, which is mostly borrowed from @yaroslavvb here:

  import tensorflow as tf

  # device1/device2 are the two worker device strings (e.g. "/job:worker/task:N");
  # params_size and dtype are set elsewhere in the benchmark script.
  with tf.device(device1):
    params = tf.get_variable("params", shape=[params_size], dtype=dtype,
                             initializer=tf.zeros_initializer())
  with tf.device(device2):
    # A constant node would get placed on device1 by the simple_placer, so use
    # a variable instead:
    #   update = tf.constant(1, shape=[params_size], dtype=dtype)
    update = tf.get_variable("update", shape=[params_size], dtype=dtype,
                             initializer=tf.ones_initializer())
    add_op = params.assign(update)

Here is the profiling result (google perftools) with a 100MB tensor (one observation: throughput degrades as the tensor size increases):

From the result, the sending side (device2) looks fine, but the receiving side (device1, the grpc client) spends far too many CPU cycles on the data transfer.

By the way, I collected rough stats for this memmove call. For one round of a 100MB tensor assignment, roughly 2GB of data is moved (counting both the read and the write inside a naive memmove, it is about 4GB of memory traffic), which is a 20+ times RAM-bandwidth amplification. The figures are averaged over a 100-round run, so they may not be precise, but the scale should be right.

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 42 (39 by maintainers)

Most upvoted comments

To make things clearer, I collected more detailed data for the memmove:

  // The memmove inside grpc's gpr_slice_buffer_take_first: after popping the
  // first slice, the remaining slice headers are shifted down by one.
  // Instrumentation used to collect the (move_size, slice_size) numbers below:
  //   int move_size  = (sb->count - 1) * sizeof(gpr_slice);  // header bytes shifted
  //   int slice_size = GPR_SLICE_LENGTH(slice);              // payload of the popped slice
  memmove(&sb->slices[0], &sb->slices[1], (sb->count - 1) * sizeof(gpr_slice));

A typical move_size, slice_size sequence:

  • move_size: 6096, slice_size: 608
  • move_size: 6072, slice_size: 8192
  • move_size: 6048, slice_size: 7583
  • move_size: 6024, slice_size: 600
  • move_size: 6000, slice_size: 8192
  • move_size: 5976, slice_size: 7591
  • move_size: 5952, slice_size: 592
  • move_size: 5928, slice_size: 8192
  • move_size: 5904, slice_size: 7599
  • move_size: 5880, slice_size: 584
  • move_size: 5856, slice_size: 8192
  • move_size: 5832, slice_size: 7607
  • move_size: 5808, slice_size: 576
  • move_size: 5784, slice_size: 8192
  • move_size: 5760, slice_size: 7615
  • move_size: 5736, slice_size: 568
  • move_size: 5712, slice_size: 8192
  • move_size: 5688, slice_size: 7623
  • move_size: 5664, slice_size: 560
  • move_size: 5640, slice_size: 8192
  • move_size: 5616, slice_size: 7631
  • move_size: 5592, slice_size: 552
  • move_size: 5568, slice_size: 8192
  • move_size: 5544, slice_size: 7639
  • move_size: 5520, slice_size: 544
  • move_size: 5496, slice_size: 8192
  • move_size: 5472, slice_size: 7647
  • move_size: 5448, slice_size: 536
  • move_size: 5424, slice_size: 8192
  • move_size: 5400, slice_size: 7655
  • move_size: 5376, slice_size: 528
  • move_size: 5352, slice_size: 8192
  • move_size: 5328, slice_size: 7663
  • move_size: 5304, slice_size: 520

So the problem is obvious (the slice_size values sum to 100MB per run). The root cause appears to be that grpc's slice-buffer management does not handle large messages well: every time one slice is consumed, all of the remaining slice headers are shifted down by one, so the total bytes moved grows with the square of the number of buffered slices (see the sketch below). This also explains why throughput decreases as the tensor size increases.
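
To make the scaling concrete, here is a back-of-the-envelope simulation (illustrative only, not grpc code). The 24-byte header size is an assumption, though every move_size above is a multiple of 24; how deep the buffer really gets at runtime will vary, the loop just shows how quickly the cost grows with depth when M buffered slices are drained one at a time and every take shifts the remaining headers down:

  /* shift_cost.c: count header bytes moved when an M-slice buffer is drained
   * one slice at a time and every take shifts the remaining headers down. */
  #include <stdio.h>
  #include <stddef.h>

  int main(void) {
    const size_t header_size = 24;  /* assumed slice-header size */
    for (size_t m = 1600; m <= 12800; m *= 2) {
      size_t total_moved = 0;
      for (size_t count = m; count > 1; --count) {
        total_moved += (count - 1) * header_size;  /* one shift per take */
      }
      printf("%6zu buffered slices -> %7.1f MB of slice headers moved\n",
             m, total_moved / (1024.0 * 1024.0));
    }
    return 0;
  }

At ~8KB per slice, a 100MB tensor is on the order of 12,800 slices, and the last row lands in the same ~2GB ballpark as the measurement above. In this model, doubling the message roughly quadruples the header copying, which would match the observed degradation with tensor size.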

I'm not very familiar with the grpc code, but could adding a grpc option to use 'gpr_slice_buffer_take_all' instead of 'gpr_slice_buffer_take_first' remove the unnecessary memory copy? Tuning the slice size could also help reduce the overhead, but it can't eliminate it.
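
A minimal sketch of what "take all" semantics could look like, using made-up types rather than grpc's actual slice-buffer API: the consumer takes ownership of the whole slice array in one swap, so no per-slice header shifting happens at all.

  /* Illustrative only (made-up types, not grpc's API). */
  #include <stddef.h>

  typedef struct { void *data; size_t length; } slice_t;
  typedef struct { slice_t *slices; size_t count; size_t capacity; } slice_buf;

  /* Move every buffered slice from src into dst in O(1); src ends up empty
   * and reuses dst's old storage.  Assumes dst->count == 0 on entry. */
  static void buf_take_all(slice_buf *src, slice_buf *dst) {
    slice_t *old_slices = dst->slices;
    size_t old_capacity = dst->capacity;
    dst->slices = src->slices;
    dst->count = src->count;
    dst->capacity = src->capacity;
    src->slices = old_slices;
    src->capacity = old_capacity;
    src->count = 0;
  }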

I used this script to test tensor transmission across 3 different machines; note that task 2 is the client responsible for submitting the job. grpc reaches about 500MB/s, while grpc+gdr reaches about 3400MB/s.

python3 tensor_transmission.py --host=xxxx1 --port1=xx1 --host_2=xxxx2 --port2=xx2 --task=0
python3 tensor_transmission.py --host=xxxx1 --port1=xx1 --host_2=xxxx2 --port2=xx2 --task=1
python3 tensor_transmission.py --host=xxxx1 --port1=xx1 --host_2=xxxx2 --port2=xx2 --task=2

Interesting!

The correct fix is probably to have grpc_chttp2_incoming_byte_stream become a ring-like buffer, so instead of doing a move down the slice array, we just increment an index. When we reach the end of slices, we can reset the counter to zero.
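
A minimal sketch of that ring-like approach (made-up types, not the actual grpc_chttp2_incoming_byte_stream): taking the first slice just advances a head index and wraps at the end of the array, instead of memmove'ing the remaining headers down.

  #include <stddef.h>

  typedef struct { void *data; size_t length; } slice_t;

  typedef struct {
    slice_t *slices;   /* fixed-capacity array used as a ring */
    size_t capacity;
    size_t head;       /* index of the first buffered slice */
    size_t count;      /* number of buffered slices */
  } slice_ring;

  /* O(1): pop the first slice by advancing the head index, wrapping at the
   * end of the array.  Caller must ensure count > 0. */
  static slice_t ring_take_first(slice_ring *r) {
    slice_t s = r->slices[r->head];
    r->head = (r->head + 1) % r->capacity;
    r->count--;
    return s;
  }

  /* O(1) while there is room: append a newly received slice at the tail. */
  static int ring_add(slice_ring *r, slice_t s) {
    if (r->count == r->capacity) return 0;  /* real code would grow/compact here */
    r->slices[(r->head + r->count) % r->capacity] = s;
    r->count++;
    return 1;
  }

The take-first semantics seen by callers stay the same; only the internal bookkeeping changes from shifting headers to index arithmetic, and growth/compaction has to handle the wrapped case.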

I’ll make sure someone takes a look soon.