actix-net: Connection not closed properly
I’ve been using the 1.x version of actix-web for months and have had to restart my app every now and then (sometimes after minutes, sometimes after days), since a lot of ESTABLISHED connections are left hanging, eventually causing a “too many open files” error (I’ve increased the limit drastically). I’m running my server with keep-alive disabled; the rest of the settings are the defaults. I have since tried upgrading to 2.0.0 to see if it solves the problem, but it’s the same thing.
The service itself gets around 500-1000 requests per second in production currently.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 5
- Comments: 29 (14 by maintainers)
Here is a reproducible procedure and (nearly) minimal example: https://github.com/finnbear/actix-tcp-leak

The key to reproducing the bug is to establish an HTTPS connection from another host, omit the SSL handshake, and then terminate the underlying internet connection (i.e. WiFi, in the case of a laptop). This can be simulated by leaving open a `TcpStream` to an SSL port without establishing SSL, as in the sketch below. I observed that such a connection is leaked for hours (maybe forever), far longer than the `client_timeout` and keep-alive period I set.
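A minimal sketch of that simulation (the address and port are placeholders; point it at any `bind_rustls` listener and leave it running):

```rust
use std::io::Read;
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    // Connect to the TLS port but never send a ClientHello.
    let mut stream = TcpStream::connect("127.0.0.1:8443")?;

    // Block until the server closes the connection; a handshake timeout on
    // the server side should surface here as EOF. If this read never
    // returns, the server is holding the connection open indefinitely.
    let mut buf = [0u8; 1];
    match stream.read(&mut buf)? {
        0 => println!("server closed the connection"),
        n => println!("server sent {n} bytes without a handshake?"),
    }
    Ok(())
}
```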
I will now work to further minimize and containerize the example. ✔️ The example now features:

- a `Dockerfile`
- `actix-web` in favor of `actix-server` (although I couldn’t find a way to enable TCP keep-alive for `actix-server`, so leaks are to be expected)
- three separate functions, each producing a different type of TCP connection leak (two of them use `bind_rustls`), in addition to the disable-WiFi method

Update: Upon further investigation, the leak affecting `actix-web` is specific to `bind_rustls` sockets that do not attempt an SSL handshake; the example now includes one function to reproduce this. By contrast, the leak affecting `actix-server` is arguably to be expected, since TCP keep-alive doesn’t seem to be in use. There is also a function in the example to reproduce this leak (one possible OS-level workaround is sketched below).
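Not an actix API, just a hypothetical workaround sketch under stated assumptions: the `socket2` crate can enable `SO_KEEPALIVE` on the listening socket before it is handed to the server, and on Linux accepted sockets generally inherit that option.

```rust
use std::net::{SocketAddr, TcpListener};
use std::time::Duration;
use socket2::{Domain, Protocol, Socket, TcpKeepalive, Type};

// Build a listener with OS-level TCP keep-alive enabled; the resulting
// std listener can then be passed to e.g. `HttpServer::listen`.
fn keepalive_listener(addr: SocketAddr) -> std::io::Result<TcpListener> {
    let socket = Socket::new(Domain::IPV4, Type::STREAM, Some(Protocol::TCP))?;
    // Start probing after 60 seconds of idleness, so the kernel
    // eventually detects and closes connections to dead peers.
    socket.set_tcp_keepalive(&TcpKeepalive::new().with_time(Duration::from_secs(60)))?;
    socket.set_reuse_address(true)?;
    socket.bind(&addr.into())?;
    socket.listen(1024)?;
    Ok(socket.into())
}
```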
Update 2: I have created a new issue #392 that is specific to the highly reproducible, `bind_rustls`-based leak. It would be interesting to know what percentage of people with this issue are using `bind_rustls` (or `bind_openssl`, which may also be affected) versus regular `bind`.

Update 3: Apparently the TLS handshake leak isn’t the only leak, as my server still leaks connections from actual users at a similar rate.
Update 4: That other leak is mostly due to HTTP/2; see https://github.com/actix/actix-web/issues/2419. If you are exposing `actix-web`’s `HttpServer` directly to real users over the internet, there is a good chance this issue affects you.

Update 5: The HTTP/2 leak has been plugged (hooray!). I have validated that this significantly reduces the rate at which connections are leaked (it is about 1/3 of what it was).
I have the same issue: an application that receives around 10 requests per minute fails over the span of a couple of hours because it runs out of available file descriptors.

`4.4.0-165-generic #193-Ubuntu SMP Tue Sep 17 17:42:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux` (Ubuntu 16.04.7)

Interestingly, if I look in `/proc/<pid>/fd`, it looks like almost all the sockets were created at the same time. However, that time correlates with when I first checked there, so it might be a kernel artifact. (A small helper for tracking the count over time is sketched below.)
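For watching the descriptor count grow, a Linux-only sketch (the helper name is mine):

```rust
use std::fs;

/// Count this process's open file descriptors by listing /proc/self/fd;
/// subtract one for the descriptor that read_dir itself holds open.
fn open_fd_count() -> std::io::Result<usize> {
    Ok(fs::read_dir("/proc/self/fd")?.count().saturating_sub(1))
}

fn main() -> std::io::Result<()> {
    println!("open file descriptors: {}", open_fd_count()?);
    Ok(())
}
```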
Edit: I have tried reproducing the issue on another computer using `ab` (Apache Benchmark) with 10k requests, 1k concurrent, and found no issues. Information on that system:

`5.8.10-arch1-1 #1 SMP PREEMPT Thu, 17 Sep 2020 18:01:06 +0000 x86_64 GNU/Linux` (Arch Linux)
(Arch Linux)@orangesoup what is
ulimit -n
in the shell where you running actix server?Mine had 1024 which is default and I spot ~24 errors in the log and ~10 hang up ESTABLISHED connections:
After increasing the limit with `ulimit -n 65535` and restarting, I did not see any errors in the log, and all connections closed after the test. Can you please check and confirm you see the same on your side?

@tyranron if that is confirmed, the workaround would be proper server setup; maybe it should be documented, since those errors do not show up unless the logger is enabled, as you advised. But we should still look for the reason why connections sometimes hang (I suspect some unsafe code in an error handler, etc.). The limit can also be raised from inside the process, as sketched below.
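A hedged sketch of that in-process alternative to `ulimit -n`, using the `libc` crate (Unix-only; the hard limit still caps what an unprivileged process may request):

```rust
/// Raise the soft RLIMIT_NOFILE limit toward `target`, clamped to the
/// hard limit — the in-process equivalent of the shell's `ulimit -n`.
fn raise_nofile_limit(target: libc::rlim_t) -> std::io::Result<()> {
    unsafe {
        let mut lim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
        if libc::getrlimit(libc::RLIMIT_NOFILE, &mut lim) != 0 {
            return Err(std::io::Error::last_os_error());
        }
        lim.rlim_cur = target.min(lim.rlim_max);
        if libc::setrlimit(libc::RLIMIT_NOFILE, &lim) != 0 {
            return Err(std::io::Error::last_os_error());
        }
    }
    Ok(())
}
```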
I’m experiencing the same issue, with this set of versions:

My `actix-web` server runs on stock Debian 11 and hosts both REST and WebSocket endpoints using `rustls`-based TLS. It leaks about 100 ESTABLISHED TCP connections every day (out of an estimated ~10,000 roughly uniformly distributed connections), and has already crashed once because of this. When I run `netstat -nat`, I see that the vast majority of connections have a Send-Q and Recv-Q of 0; a small minority have a Send-Q of a few hundred to a thousand.

I doubt this is relevant, but I have my `keep_alive` set to `KeepAlive::Timeout(30)`.
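In actix-web 4.x terms (where the timeout variant takes a `Duration`; the handler and bind address here are placeholders), that setting looks roughly like:

```rust
use std::time::Duration;
use actix_web::http::KeepAlive;
use actix_web::{web, App, HttpServer};

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| App::new().route("/", web::get().to(|| async { "ok" })))
        // Close idle keep-alive connections after 30 seconds.
        .keep_alive(KeepAlive::Timeout(Duration::from_secs(30)))
        .bind(("0.0.0.0", 8080))?
        .run()
        .await
}
```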
What can I do to help troubleshoot?