actix-web: client timeout starts too early causing "408" errors
I see occasional “408 Request Timeout” errors when the system is busy. Tracing it shows e.g.
13:43:46.715691 accept4(17, {sa_family=AF_INET, sin_port=htons(48642), sin_addr=inet_addr("127.0.0.1")}, [128->16], SOCK_CLOEXEC) = 86
13:43:52.199250 ioctl(86, FIONBIO, [1]) = 0
13:43:52.199293 epoll_ctl(24, EPOLL_CTL_ADD, 86, {EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLET, {u32=7, u64=64768106823687}}) = 0
13:43:52.199324 getpeername(86, {sa_family=AF_INET, sin_port=htons(48642), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 0
13:43:52.201563 sendto(86, "HTTP/1.1 408 Request Timeout\r\ncontent-length: 0\r\nconnection: close\r\ndate: Sat, 11 Jan 2020 12:43:46 GMT\r\n\r\n", 107, MSG_NOSIGNAL, NULL, 0) = 107
13:43:52.201657 shutdown(86, SHUT_WR) = 0
on the actix-web server and
13:43:46.715522 socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 60
13:43:46.715560 ioctl(60, FIONBIO, [1]) = 0
13:43:46.715584 epoll_ctl(9, EPOLL_CTL_ADD, 60, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3586986337, u64=139705988223329}}) = 0
13:43:46.715607 connect(60, {sa_family=AF_INET, sin_port=htons(3001), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
13:43:46.715688 getsockopt(60, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
13:43:46.715714 writev(60, [{iov_base="HEAD /api/v1/download/...iov_", len=372}], 1) = 372
13:43:52.201614 recvfrom(60, "HTTP/1.1 408 Request Timeout\r\ncontent-length: 0\r\nconnection: close\r\ndate: Sat, 11 Jan 2020 12:43:46 GMT\r\n\r\n", 4096, 0, NULL, NULL) = 107
13:43:52.201735 close(60) = 0
on the client (nginx proxy).
E.g. it appears that actix-web does the following:
1. accepts the connection
2. starts the timeout timer
3. starts listening for events
4. throws a "408" when the timeout expires
Here, there are more than 5 seconds between 2. and 3., so the timeout triggers.
I am not sure why there are more than 5 seconds between 1. and 3.; usually it is <10 ms, but in cases like the above it affects a cluster of connections.
But imo it would be better to start the timer only when we are ready to handle the request (e.g. at the moment of the first EPOLL_CTL_ADD).
actix-web 2.0.0 actix-rt 1.0.0 Linux 5.4.7
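For context, the ~5 second gap appears to line up with actix-web's default client timeout of 5000 ms. Below is a minimal sketch (actix-web 2.x / actix-rt 1.x API; handler, route, and address are placeholders) of where that knob is configured; raising the value, or passing 0 to disable the timeout, only works around the delayed EPOLL_CTL_ADD rather than changing when the timer starts:

```rust
use actix_web::{web, App, HttpResponse, HttpServer, Responder};

async fn index() -> impl Responder {
    HttpResponse::Ok().body("ok")
}

#[actix_rt::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| App::new().route("/", web::get().to(index)))
        // Milliseconds the server waits for the request head after accepting
        // the connection (default 5000, 0 disables it). Per the report above,
        // the timer starts before the socket is registered with epoll.
        .client_timeout(30_000)
        .bind("127.0.0.1:3001")?
        .run()
        .await
}
```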
I believe there are other problems with this issue. The server sends a 408 status code to the client when it is busy. The RFC defines 408 as the server giving up waiting for the client to finish sending its request, and it describes the 4xx class as indicating that the client seems to have erred. I believe in this case the client has definitely not erred: it sent the request in time and waited for the answer to come back. I also believe the correct status code would probably be 503, which signals that the server is temporarily unable to handle the request, e.g. due to overload. Therefore I believe the correct fix would be to separate the former case, where the client erred, from the latter, where the developer(s) erred in the way they implemented the handlers.
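To make the distinction concrete, here is a hedged, application-level sketch (hypothetical names, not framework code) of answering 503 on overload. It cannot prevent the 408s discussed in this issue, since those appear to be generated by the HTTP dispatcher before any handler runs; it only illustrates the status code that seems appropriate for an overloaded server:

```rust
use actix_web::{HttpResponse, Responder};
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical in-flight counter, purely illustrative.
static IN_FLIGHT: AtomicUsize = AtomicUsize::new(0);
const MAX_IN_FLIGHT: usize = 64;

async fn guarded_handler() -> impl Responder {
    // If the application knows it is saturated, answer 503 so load balancers
    // treat this as server-side overload rather than a client error.
    if IN_FLIGHT.fetch_add(1, Ordering::SeqCst) >= MAX_IN_FLIGHT {
        IN_FLIGHT.fetch_sub(1, Ordering::SeqCst);
        return HttpResponse::ServiceUnavailable().finish();
    }

    // ... the actual request handling would go here ...
    let response = HttpResponse::Ok().body("done");

    IN_FLIGHT.fetch_sub(1, Ordering::SeqCst);
    response
}
```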
Maybe this is just a theory, but a server under high pressure - even with a correct handler implementation - could start throwing 408 errors because the two timers (client timeout vs. server timeout) are not separated. These errors would then be classified as client errors by the load balancers, which therefore would not remove the affected instance(s) from the fleet to give them time to recover.
This status code sent us in a completely wrong direction when we tried to track down an issue with MeiliSearch: since we thought it was a client error, we assumed the client codebase was wrong somewhere and not finishing sending the request in time, or that there was an issue with the load balancer, when actually it was an issue on the server side.
Hello 👋,
Meilisearch is affected by this issue, using actix-web 4.3.1, as reported by users in https://github.com/meilisearch/meilisearch/issues/3653.
For convenience, I put together a “minimal” reproducer for the HTTP 408 that uses only actix-web and ureq to reproduce in a single binary, in https://github.com/dureuill/issue-3653-reproducer. By artificially blocking the worker threads, I notice that HTTP 408 statuses are returned to clients, even though I believe the clients did nothing incorrectly on their end.
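The gist of such a reproducer might look like the sketch below (this is not the code from the linked repo; actix-web 4.x API assumed, names and durations illustrative): a handler that blocks its worker thread, so that once all workers are stuck, further clients receive the dispatcher's 408.

```rust
use actix_web::{web, App, HttpResponse, HttpServer, Responder};
use std::time::Duration;

async fn blocking_handler() -> impl Responder {
    // Deliberately block the async worker thread (an anti-pattern, used here
    // only to simulate a saturated server).
    std::thread::sleep(Duration::from_secs(10));
    HttpResponse::Ok().body("done")
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| App::new().route("/block", web::get().to(blocking_handler)))
        .workers(2) // few workers so saturation is easy to hit
        .bind(("127.0.0.1", 8080))?
        .run()
        .await
}
```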
My uninformed understanding of the expected behavior would be that incoming connections should be kept in a pending state when all worker threads are busy, or, in the worst case, if the server is overloaded, it should answer with an HTTP 503. This is what https://github.com/actix/actix-web/issues/1269#issuecomment-764618150 upthread suggests as well, but I have no personal experience writing an HTTP server, so it could be different in practice.
I noticed that this issue is closed, so maybe there’s a simple configuration fix I’m missing though. If that is the case, could you point me in the correct direction for what configuration to apply?
Thank you for reading and maintaining actix-web ❤️
It is back. But I doubt it is really helpful.
fakeshadow on the actix discord said that keep_alive is misused as a first-request timeout per connection. I don't know whether that is still the case for actix-web 3.3.2.
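For what it's worth, the two settings being conflated are configured separately on the server builder. A rough sketch on the actix-web 3.x API (values are placeholders) just to show where they live; the report above is that the keep-alive timer is internally reused for the first-request timeout:

```rust
use actix_web::{web, App, HttpResponse, HttpServer, Responder};

async fn index() -> impl Responder {
    HttpResponse::Ok().finish()
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| App::new().route("/", web::get().to(index)))
        .keep_alive(75)        // seconds a connection may idle between requests
        .client_timeout(5_000) // milliseconds to receive the request head
        .bind("127.0.0.1:3001")?
        .run()
        .await
}
```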
Here you go: https://github.com/typetetris/rust-web-reproducer
The START flag can be inserted earlier. That is, don't set it on successful decode of the request, but rather when the first read of the stream is polled, and before that happens use the client timeout as the ka_timer.
hi @ensc, I'm not sure I understood the point about 50 parallel requests and why it would be expensive to convert to a future. Processing parallel requests is exactly the case where you need async; otherwise you might just run 1000 workers in parallel, but that would be more expensive. Running blocking work on a separate thread pool is a common pattern across the whole async Rust ecosystem; AFAIK there is no alternative to that. It is cheaper to run blocking code from a dedicated thread pool than to start a new thread for every blocking request. It is also better for performance than waiting until your blocking request completes. Sending data between threads also takes significantly less time than transmitting a network packet. That's the right way to do it, and it would free workers to read data from incoming requests while waiting for a response from an external resource.
Consider that you have external queries which take 10 ms on average. While the worker is blocked, all other connection requests just queue up, whereas if you unblock the worker it will be able to poll the other requests and send those queries in parallel. Now, if you get one exceptional request which blocks for 5 seconds, all the other requests just halt, leaving the client requests hanging.
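A sketch of the offloading pattern described above, assuming actix-web 4's web::block signature (older versions expect the closure to return a Result); the 10 ms external query is simulated with a sleep:

```rust
use actix_web::{web, HttpResponse, Responder};
use std::time::Duration;

async fn handler() -> impl Responder {
    let result = web::block(|| {
        // Blocking call, e.g. a synchronous DB or HTTP client.
        std::thread::sleep(Duration::from_millis(10));
        "payload".to_string()
    })
    .await;

    match result {
        // The async worker stayed free to poll other connections meanwhile.
        Ok(body) => HttpResponse::Ok().body(body),
        Err(_) => HttpResponse::InternalServerError().finish(),
    }
}
```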
As correct usage is documented now, can we close the issue?
Same issue here! If all the workers are busy, all other incoming requests receive a 408.
This is a big issue for me because I generate images that can take up to 10 s, so with the default settings I can only serve 8 clients (the default is 8 workers); all the other requests get a timeout. Setting client_timeout(0) doesn't fix the issue 😢 I created a repo where you can find an example with a benchmark -> https://github.com/Farkal/actix-test-timeout