actix-net: Connection not closed properly

I’ve been using the 1.x version of actix-web for months and have had to restart my app every now and then (sometimes after minutes, sometimes after days) because a lot of ESTABLISHED connections are left hanging, eventually causing a “too many open files” error (even after I increased the limit drastically). I’m running my server with keep-alive disabled; the rest of the settings are the defaults. I have since upgraded to 2.0.0 to see whether it solves the problem, but the behavior is the same.

The service itself gets around 500-1000 requests per second in production currently.

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 5
  • Comments: 29 (14 by maintainers)

Most upvoted comments

Here is a reproducible procedure and (nearly) minimal example: https://github.com/finnbear/actix-tcp-leak

The key to reproducing the bug is to establish an HTTPS connection from another host, omit the SSL handshake, and then terminate the underlying internet connection (e.g. Wi-Fi in the case of a laptop). This can be simulated by leaving a TcpStream open to an SSL port without ever performing the SSL handshake. I observed that such a connection is leaked for hours (maybe forever), and in any case far longer than the client_timeout and keep-alive period I set.

I will now work to further minimize and containerize the example. ✔️

The example now features:

  1. A Dockerfile
  2. A mode that bypasses actix-web in favor of actix-server (although I couldn’t find a way to enable TCP keep-alive for actix-server, so leaks are to be expected)
  3. Three separate functions, each producing a different type of TCP connection leak (2 of them use bind_rustls), in addition to the disable-WiFi method

Update: Upon further investigation, the leak affecting actix-web is specific to bind_rustls sockets that never attempt an SSL handshake. The example now includes a function to reproduce this. By contrast, the leak affecting actix-server is arguably to be expected, since TCP keep-alive doesn’t seem to be in use there; the example includes a function to reproduce that leak as well.

Update 2: I have created a new issue #392 that is specific to the highly reproducible, bind_rustls-based leak. It would be interesting to know what percentage of people with this issue are using bind_rustls (or bind_openssl, which may also be affected) versus plain bind.

Update 3: Apparently the TLS handshake leak isn’t the only leak, as my server still leaks connections from actual users at a similar rate.

Update 4: That other leak is mostly due to HTTP/2, see https://github.com/actix/actix-web/issues/2419. If you are exposing actix-web’s HttpServer directly to real users over the internet, there is a good chance this issue affects you.

Update 5: The HTTP/2 leak has been plugged (hooray!). I have validated that this significantly reduces the rate at which connections are leaked (it is about 1/3 of what it was).

I have the same issue: an application that receives around 10 requests per minute fails over the span of a couple of hours because it runs out of available file descriptors.

4.4.0-165-generic #193-Ubuntu SMP Tue Sep 17 17:42:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux (Ubuntu 16.04.7)

Interestingly, if I look in /proc/<pid>/fd, it looks like almost all the sockets were created at the same time. However, that time correlates with when I first checked there, so it might be a kernel artifact.

Edit: I have tried reproducing the issue on another computer using ab (Apache Bench) with 10k requests, 1k concurrent, and found no issues. Information on that system: 5.8.10-arch1-1 #1 SMP PREEMPT Thu, 17 Sep 2020 18:01:06 +0000 x86_64 GNU/Linux (Arch Linux)

@orangesoup, what is ulimit -n in the shell where you are running the actix server?

Mine was 1024, which is the default, and I spotted ~24 errors in the log and ~10 ESTABLISHED connections left hanging:

[2020-01-22T18:46:52Z ERROR actix_server::accept] Error accepting connection: Too many open files (os error 24)

I increased the limit with ulimit -n 65535; after restarting, I saw no errors in the log and all connections closed after the test. Can you please check and confirm that you see the same on your side?

@tyranron if that is confirmed, raising the limit would be a workaround via proper server setup; it should probably be documented, since these errors do not show up unless the logger is enabled, as you advised. But we should still look for the reason why connections sometimes hang (I suspect some unsafe code in an error handler, etc.).

I’m experiencing the same issue, with this set of versions:

actix = "0.12"
actix-codec = "0.4.0"
actix-web = {version="4.0.0-beta.9", features=["rustls"]}
actix-web-actors = "4.0.0-beta.7"

My actix-web server runs on stock Debian 11 and hosts both REST and WebSocket endpoints using rustls-based TLS. It leaks about 100 ESTABLISHED TCP connections every day (out of an estimated ~10,000 roughly uniformly distributed connections), and has already crashed once because of this. When I run netstat -nat, the vast majority of connections show a send-q and recv-q of 0; a small minority have a send-q in the hundreds to low thousands.

I doubt this is relevant but I have my keep_alive set to KeepAlive::Timeout(30).
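For context, this is roughly where those settings live, assuming the 4.0.0-beta.9 API from the version list above. This is a configuration sketch, not my exact code: client_timeout was later renamed client_request_timeout in the 4.0 release, and the timeout values shown are placeholders.

```rust
// Sketch of the relevant HttpServer settings (actix-web 4.0.0-beta.9-era API).
HttpServer::new(|| App::new())
    .keep_alive(KeepAlive::Timeout(30)) // close idle keep-alive connections after 30 s
    .client_timeout(5_000)              // drop clients that never finish the request (ms)
    .backlog(512)                       // kernel pending-connection queue
    .bind_rustls("0.0.0.0:443", rustls_config)?
    .run()
    .await
```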

What can I do to help troubleshoot?

  • For various reasons, including this, I must restart my server regularly. What code should I add/change to fix/debug this issue?
  • I would try the potential fix at https://github.com/actix/actix-net/tree/fix/os_error_23, but that link is now broken.
  • FWIW, my server and terraform configuration are open source, so anyone could theoretically reproduce the issue by deploying identical infrastructure if the simple example proposed above does not yield a reproduction (it would take a few hours, though)
  • @fakeshadow, you suggested reducing the connection backlog, which I currently have set to 512. I could reduce it further, but I wonder whether that would amount to a self-inflicted denial of service if all of the backlog connections were leaked. For example, if I set it to 16 and 16 connections leaked while still in the backlog, could my server still accept connections?