serving: serving server does not infer a batch
OS: Ubuntu 18.04
Docker image: tensorflow/serving:latest-gpu
TensorFlow: 2.0.0a0
I’m using tensorflow/serving:latest-gpu, but I don’t think it is using the GPU. When I send a request with input of shape [48, 256, 256, 3] it takes about 12 seconds, but a request with shape [1, 256, 256, 3] takes only about 0.3 seconds.
This is the request code:
import json

import numpy as np
import requests

# grids shape: [4, 48, 256, 256, 3]
grids, positions, inds = infer_preprocess(img, mask, FLAGS.n_grids)
headers = {'content-type': 'application/json'}
predictions = []
for grid in grids:
    # A grid shape: [48, 256, 256, 3]
    grid = grid / 127.5 - 1  # scale pixels to [-1, 1]
    data = json.dumps({
        'signature_name': 'serving_default', 'instances': grid.tolist()
    })
    json_response = requests.post('http://10.113.66.143:30256/v1/models/new_test:predict',
                                  data=data, headers=headers)
    prediction = json.loads(json_response.text)
    print('DONE')
    try:
        prediction = np.array(prediction['predictions'])
        predictions.append(prediction)
    except KeyError:
        print(prediction['error'])
And this is batching_parameters.conf:
num_batch_threads { value: 48 }
batch_timeout_micros { value: 5000 }
max_batch_size { value: 20000001 }
I ran server.sh:
sudo nvidia-docker run -t --rm -p 8501:8501 -v ~/models:/root/models --name serve tensorflow/serving:latest-gpu --enable_batching=true --batching_parameters_file=/root/models/batching_parameters.txt --model_config_file=/root/models/model_specific.conf
I guess the serving server runs inference on the inputs one by one. How can I make the server infer a whole batch at once?
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 15 (7 by maintainers)
What @rmothukuru posted is the library-level batching guide - if that’s too low-level, please take a look at the config-level batching guide.
However, please do keep in mind that TF Serving batching is inter-request: it creates a latency-aware queue that batches together requests arriving independently before running them through the graph. This is useful if you have many independent clients calling TF Serving that cannot coordinate with one another.
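Purely to illustrate that inter-request case, here is a rough client-side sketch (reusing the REST endpoint from the question and random inputs) in which concurrent single-example requests stand in for independent clients; with --enable_batching, the server can merge whatever arrives within batch_timeout_micros into one graph execution:

import json
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import requests

URL = 'http://10.113.66.143:30256/v1/models/new_test:predict'  # endpoint from the question

def predict_one(example):
    # One [1, 256, 256, 3] example per request; the server-side batcher may
    # merge it with other requests that arrive within batch_timeout_micros.
    data = json.dumps({'signature_name': 'serving_default',
                       'instances': example.tolist()})
    resp = requests.post(URL, data=data,
                         headers={'content-type': 'application/json'})
    return np.array(resp.json()['predictions'])

# Simulate 48 independent clients, each sending one example concurrently.
examples = [np.random.uniform(-1.0, 1.0, (1, 256, 256, 3)) for _ in range(48)]
with ThreadPoolExecutor(max_workers=48) as pool:
    results = list(pool.map(predict_one, examples))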
If you’re sending multiple requests from the same client, then it makes little sense to configure batching on TF Serving. What you instead want is to stack the different examples together along the zeroth dimension and send all of them in a single request. This is what you appear to have done: stacking 48 examples together and sending them with a single call to TF Serving, at which point nothing about the batching configuration is relevant (you only have a single request). The entire thing gets fed into session.run at once, and at that point it’s executing your graph on your hardware just as it would if you called session.run() in Python (i.e. it’s out of Serving’s domain).
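For reference, a minimal sketch of that approach, reusing the grids array and endpoint from the question: collapse all four grids into one [192, 256, 256, 3] batch and send it as a single request, so the whole batch reaches the graph in one call:

import json

import numpy as np
import requests

# Stack every example along the zeroth dimension:
# [4, 48, 256, 256, 3] -> [192, 256, 256, 3].
batch = np.concatenate([grid / 127.5 - 1 for grid in grids], axis=0)

data = json.dumps({
    'signature_name': 'serving_default',
    'instances': batch.tolist(),
})
response = requests.post(
    'http://10.113.66.143:30256/v1/models/new_test:predict',
    data=data,
    headers={'content-type': 'application/json'},
)
predictions = np.array(response.json()['predictions'])

Keep in mind the JSON payload grows with the batch size, so for batches this large the gRPC API is typically more efficient than REST.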
If you’re observing the latency climb linearly with the number of examples you’re batching, that’s a sign that some portion of your graph execution is not vectorized - it could be I/O, pre/post-processing, or other parts that are forced onto the CPU. If you’d like us to debug and help, please provide your model and example requests and we’ll take a look.
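One quick client-side check (a sketch, using the endpoint from the question and random inputs): time a single request at a few batch sizes. Roughly flat timings suggest the graph itself is vectorized on the accelerator, while near-linear growth points at a per-example bottleneck - keeping in mind that JSON serialization and transfer also grow with batch size.

import json
import time

import numpy as np
import requests

URL = 'http://10.113.66.143:30256/v1/models/new_test:predict'  # endpoint from the question
HEADERS = {'content-type': 'application/json'}

for batch_size in (1, 4, 16, 48):
    batch = np.random.uniform(-1.0, 1.0, (batch_size, 256, 256, 3))
    data = json.dumps({'signature_name': 'serving_default',
                       'instances': batch.tolist()})
    start = time.time()
    requests.post(URL, data=data, headers=HEADERS)
    # Timing includes JSON payload transfer, which also scales with batch size.
    print(f'batch_size={batch_size}: {time.time() - start:.2f}s')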
There’s a performance guide I’m merging very soon that should help with this.
@jusonn, can you please refer to this link and confirm whether it helps? Thanks.