rusoto: HttpDispatchError - consistently happening in the 0.43 beta

Testing out the new async/await support, on 0.43.0-beta.1.

I have a service running on Lambda that initializes a DynamoDB client and then reuses it across invocations of the Lambda job. This service is fairly active, with dozens of Lambda jobs handling hundreds of requests every few seconds.
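
For context, a minimal sketch of that pattern (not the actual service code; it assumes the rusoto_dynamodb 0.43 API plus the once_cell crate, and list_tables is just a stand-in operation):

```rust
use once_cell::sync::Lazy;
use rusoto_core::Region;
use rusoto_dynamodb::{DynamoDb, DynamoDbClient, ListTablesInput};

// One client for the lifetime of the Lambda container; every invocation
// reuses it, and therefore reuses its pooled HTTP connection.
static DYNAMO: Lazy<DynamoDbClient> = Lazy::new(|| DynamoDbClient::new(Region::UsEast1));

// Called once per invocation by the Lambda runtime wrapper (omitted here).
async fn handle_event() -> Result<(), Box<dyn std::error::Error>> {
    let tables = DYNAMO.list_tables(ListTablesInput::default()).await?;
    println!("tables: {:?}", tables.table_names);
    Ok(())
}
```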

What I am seeing, and have not seen before, is the following error:

Error during dispatch: connection closed before message completed

The error is handled and the client is reused on the next call; it then works normally for a while before the error happens again. I have not pinned down any specific interval yet.
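
A hedged sketch of what that handling can look like with the 0.43 async API (get_item is just an example operation; the error surfaces as the HttpDispatch variant of RusotoError):

```rust
use rusoto_core::RusotoError;
use rusoto_dynamodb::{DynamoDb, DynamoDbClient, GetItemError, GetItemInput, GetItemOutput};

// Retry once on the dispatch error described above; every other outcome is
// passed back to the caller unchanged.
async fn get_item_with_retry(
    client: &DynamoDbClient,
    input: GetItemInput,
) -> Result<GetItemOutput, RusotoError<GetItemError>> {
    match client.get_item(input.clone()).await {
        // "connection closed before message completed" shows up here
        Err(RusotoError::HttpDispatch(_)) => client.get_item(input).await,
        other => other,
    }
}
```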

It’s obvious from the timings that in the “normal” case the client is reusing an open HTTP connection. When this error happens and the client has to re-establish the connection, there is a spike in execution time (100 ms or so of overhead).

The question is: for a long-running job with a cached and reused client, should this be expected, or is there some missing error handling, or some timeout on the hyper HTTP client that needs to be increased?

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 16
  • Comments: 15 (1 by maintainers)

Most upvoted comments

@jatsrt Thanks for the insight! We’re considering implementing client-side retries to deal with the occasional error as well.

@silverjam The problem with a new client for every operation is that you get a fairly large overhead to set up the connection. Reusing a client that keeps the connection open can save 100-200 ms per call.

We still use the workaround above and deal with the occasional error it throws, using either Lambda retries or retries on the client side (if it’s a client-server call).

I feel like there has to be a better way to manage this; I just haven’t had time to look into it much.

Also worth noting: the majority of our use is through Lambda, and we want those connections kept alive, since the overhead of a new connection translates directly into money paid. Most of our Lambdas run very frequently, so we do not see timeouts very often, but we keep the code above in production for quiet times, to reduce the errors it causes.
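
The workaround code isn’t shown above, so this is only a hypothetical sketch of the general idea, rebuilding the client after a quiet period instead of reusing a connection the far end has probably closed (the type name and structure are made up for the example):

```rust
use std::time::{Duration, Instant};
use rusoto_core::Region;
use rusoto_dynamodb::DynamoDbClient;

// Hypothetical sketch: track when the cached client was last used and
// rebuild it after a quiet period, so the next call pays the
// connection-setup cost once instead of failing on a stale connection.
struct RefreshingClient {
    client: DynamoDbClient,
    last_used: Instant,
    max_idle: Duration,
}

impl RefreshingClient {
    fn new(max_idle: Duration) -> Self {
        RefreshingClient {
            client: DynamoDbClient::new(Region::UsEast1),
            last_used: Instant::now(),
            max_idle,
        }
    }

    // Call before each operation: returns a client that is either fresh or
    // was used recently enough that its connection should still be open.
    fn get(&mut self) -> &DynamoDbClient {
        if self.last_used.elapsed() > self.max_idle {
            self.client = DynamoDbClient::new(Region::UsEast1);
        }
        self.last_used = Instant::now();
        &self.client
    }
}
```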

The default idle timeout for ALBs is 60 seconds, so if AWS are dogfooding their own load balancers, it’s not surprising that they use this as the timeout.

But shouldn’t we set this to something like 55 seconds to be safe? (VMs tend to have imprecise clocks, etc.)
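
If the goal is to keep hyper’s idle timeout under the load balancer’s 60 seconds, something along these lines should do it. Treat the rusoto side as an assumption: hyper 0.13’s Builder::pool_idle_timeout exists, but check that your rusoto_core version actually exposes HttpClient::from_builder, and add hyper/hyper-tls as direct dependencies matching rusoto’s hyper version:

```rust
use std::time::Duration;
use rusoto_core::{credential::ChainProvider, HttpClient, Region};
use rusoto_dynamodb::DynamoDbClient;

// Close idle connections on our side after 55 s, i.e. before the ALB's
// 60 s idle timeout can close them out from under us.
fn dynamo_client_with_short_idle_timeout() -> DynamoDbClient {
    let mut builder = hyper::Client::builder();
    builder.pool_idle_timeout(Duration::from_secs(55));
    // HttpClient::from_builder is assumed here; if your rusoto_core version
    // lacks it, a custom connector/dispatcher is needed instead.
    let dispatcher = HttpClient::from_builder(builder, hyper_tls::HttpsConnector::new());
    DynamoDbClient::new_with(dispatcher, ChainProvider::new(), Region::UsEast1)
}
```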