runtime: Every connection to Kestrel suddenly timing out, part of the application seems "frozen"
Description
Hello,
The issue manifests, seemingly at random, on a number of Ubuntu 18.04 servers running our Kestrel-based .NET application. The servers in question have not installed any package updates recently that might have contributed to this behavior.
Kestrel stops responding to requests - all of them time out.
```
$ curl -vvv -k https://server.com/alive
Trying 1.2.3.4...
TCP_NODELAY set
```
The machine does not experience high load at the time of the incident. CPU, memory and IO usage is at normal or even low rates.
We captured .NET counters in the hope of finding signs of some kind of thread starvation, but we did not see anything out of the ordinary:
```
Status: Running

[System.Runtime]
% Time in GC since last GC (%)                 0
Allocation Rate (B / 1 sec)                    529,872
CPU Usage (%)                                  3
Exception Count (Count / 1 sec)                0
GC Committed Bytes (MB)                        1,511
GC Fragmentation (%)                           66.514
GC Heap Size (MB)                              527
Gen 0 GC Count (Count / 1 sec)                 0
Gen 0 Size (B)                                 65,608,328
Gen 1 GC Count (Count / 1 sec)                 0
Gen 1 Size (B)                                 16,600,328
Gen 2 GC Count (Count / 1 sec)                 0
Gen 2 Size (B)                                 1.0128e+09
IL Bytes Jitted (B)                            9,668,963
LOH Size (B)                                   3.2404e+08
Monitor Lock Contention Count (Count / 1 sec)  25
Number of Active Timers                        526
Number of Assemblies Loaded                    580
Number of Methods Jitted                       114,669
POH (Pinned Object Heap) Size (B)              2,648,776
ThreadPool Completed Work Item Count (Count / 1 sec)  701
ThreadPool Queue Length                        0
ThreadPool Thread Count                        20
Time spent in JIT (ms / 1 sec)                 0
Working Set (MB)                               4,785
```
I attach stack traces taken with `dotnet-stack`.
Now the surprising part. Things that unblock it are:

- an `strace` call:

  ```
  sudo strace -T -t -f -p $PID
  ```

- taking a minidump with `dotnet-dump`
- restarting the service (not surprising)

After performing any of the above actions, Kestrel responds to requests again.
We would be grateful for any advice on where to look further for the root cause, or any additional diagnostics tips.
Reproduction Steps
Don’t know yet
Expected behavior
Kestrel responds to requests.
Actual behavior
Requests time out.
Regression?
Not sure.
Known Workarounds
- an `strace` call: `sudo strace -T -t -f -p $PID`
- taking a minidump with `dotnet-dump`
- restarting the service (not surprising)
Configuration
Which version of .NET is the code running on?
.NET 6.0.11
OS: Ubuntu 18.04
Architecture: x64
Config-specific: don't know
Other information
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 2
- Comments: 71 (43 by maintainers)
We have deployed Microsoft.Extensions.Logging and an additional TCP socket on a few instances where this issue occurred, but the issue has not reproduced there since. I also noticed that, at the moment, it reproduces only on a few machines with higher uptime, but that might be a coincidence.
However, I dug into the `epoll` docs, and the manual has a section on the edge-triggered mode of epoll, which can lead to starvation if used improperly. See https://man7.org/linux/man-pages/man7/epoll.7.html and look for "starvation" or `EPOLLET`. Edge-triggered mode is used in dotnet/runtime: https://github.com/dotnet/runtime/blob/5d1b7e77e054f74de05d6cd34de11c55ffbd125f/src/native/libs/System.Native/pal_networking.c#L2693

So the thesis at this point was "something is wrong with the epoll code in dotnet, which leads to the starvation referenced above". But I guess if that were the case it would have been found earlier, so I really doubt it.
So I looked around for epoll issues against the kernel version installed there (`5.4.0-azure-1095`). A number of links I found suggest this might be a kernel regression caused by an optimization in epoll:

- https://github.com/opencontainers/runc/issues/3641
- https://github.com/prometheus/node_exporter/issues/2500
- https://bugs.launchpad.net/ubuntu/+source/containerd/+bug/1996678 (the fix is mentioned here)
direct link to fix description: https://bugs.launchpad.net/ubuntu/+source/containerd/+bug/1996678/comments/28
@andrewdike I'd start by checking the kernel version on affected systems.
Maybe related to this https://github.com/dotnet/aspnetcore/issues/41556?