runtime: Every connection to Kestrel suddenly timing out, part of the application seems "frozen"

Description

Hello,

The issue manifests seemingly at random on a number of Ubuntu 18.04 servers running our Kestrel-based .NET application. The servers in question have not installed any package updates recently that might have contributed to this behavior.

Kestrel stops responding to requests - all of them time out.

$ curl -vvv -k https://server.com/alive
Trying 1.2.3.4...
TCP_NODELAY set

The machine does not experience high load at the time of the incident. CPU, memory, and IO usage are at normal or even low levels.

We captured .NET counters in the hope of finding clues of some kind of thread-pool starvation, but we did not see anything out of the ordinary:

    Status: Running

[System.Runtime]
    % Time in GC since last GC (%)                                         0
    Allocation Rate (B / 1 sec)                                      529,872
    CPU Usage (%)                                                          3
    Exception Count (Count / 1 sec)                                        0
    GC Committed Bytes (MB)                                            1,511
    GC Fragmentation (%)                                                  66.514
    GC Heap Size (MB)                                                    527
    Gen 0 GC Count (Count / 1 sec)                                         0
    Gen 0 Size (B)                                                65,608,328
    Gen 1 GC Count (Count / 1 sec)                                         0
    Gen 1 Size (B)                                                16,600,328
    Gen 2 GC Count (Count / 1 sec)                                         0
    Gen 2 Size (B)                                                    1.0128e+09
    IL Bytes Jitted (B)                                            9,668,963
    LOH Size (B)                                                      3.2404e+08
    Monitor Lock Contention Count (Count / 1 sec)                         25
    Number of Active Timers                                              526
    Number of Assemblies Loaded                                          580
    Number of Methods Jitted                                         114,669
    POH (Pinned Object Heap) Size (B)                              2,648,776
    ThreadPool Completed Work Item Count (Count / 1 sec)                 701
    ThreadPool Queue Length                                                0
    ThreadPool Thread Count                                               20
    Time spent in JIT (ms / 1 sec)                                         0
    Working Set (MB)                                                   4,785

I attach stack traces taken with dotnet-stack.

Now the surprising part: the things that “unblock” it are:

  • running strace: sudo strace -T -t -f -p $PID

  • taking a minidump with dotnet-dump

  • restarting the service (not surprising)

After performing any of the above actions, Kestrel responds to requests again.

We would be grateful for any advice on where to look further for the root cause, or for any additional diagnostics tips.

Reproduction Steps

Don’t know yet

Expected behavior

Kestrel responds to requests.

Actual behavior

Requests time out.

Regression?

Not sure.

Known Workarounds

  • running strace: sudo strace -T -t -f -p $PID

  • taking a minidump with dotnet-dump

  • restarting the service (not surprising)

Configuration

Which version of .NET is the code running on?

.NET 6.0.11

OS: Ubuntu 18.04
Architecture: x64
Config-specific: don’t know

Other information

stacks.txt

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 71 (43 by maintainers)

Most upvoted comments

We have deployed Microsoft.Extensions.Logging and an additional TCP socket on a few instances where this issue occurred, but the issue has not reproduced there since. At that point I noticed that it currently reproduces only on a few machines with higher uptime, but that might have been a coincidence.

However, I dug into the epoll documentation, and the manual has a section on the edge-triggered mode of epoll, which can lead to starvation if used improperly. See here: https://man7.org/linux/man-pages/man7/epoll.7.html

Search for “starvation” or EPOLLET. Edge-triggered mode is used in dotnet/runtime: https://github.com/dotnet/runtime/blob/5d1b7e77e054f74de05d6cd34de11c55ffbd125f/src/native/libs/System.Native/pal_networking.c#L2693
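To illustrate the edge-triggered semantics the manual warns about, here is a minimal Python sketch using the stdlib select module (an illustration of the kernel behavior, not the .NET code). With EPOLLET, readiness is reported only on a state transition: if a reader drains only part of the buffered data, no further event arrives, and the remaining bytes are stranded until something forces a re-read.

```python
import os
import select

r, w = os.pipe()
ep = select.epoll()
ep.register(r, select.EPOLLIN | select.EPOLLET)  # edge-triggered mode

os.write(w, b"hello")
first = ep.poll(timeout=0.1)    # the write is an edge: one event is reported
os.read(r, 2)                   # drain only part of the buffered data
second = ep.poll(timeout=0.1)   # no new edge: no event; 3 bytes are stranded
print(len(first), len(second))  # 1 0

ep.close()
os.close(r)
os.close(w)
```

This is why edge-triggered consumers must read until EAGAIN; it also hints at why any action that interrupts and re-enters the wait (attaching strace, taking a dump) could paper over a lost wakeup.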

So the thesis at this point was “something is wrong with the epoll code in dotnet, which leads to the starvation referenced above”. But I would guess that if that were the case it would have been found earlier, so I really doubt it.

So I searched for epoll together with the kernel version installed there (5.4.0-azure-1095). A number of links I found suggest that this might be a kernel regression caused by an optimization in epoll:

https://github.com/opencontainers/runc/issues/3641
https://github.com/prometheus/node_exporter/issues/2500
https://bugs.launchpad.net/ubuntu/+source/containerd/+bug/1996678 (the fix is mentioned here)

Direct link to the fix description: https://bugs.launchpad.net/ubuntu/+source/containerd/+bug/1996678/comments/28

@andrewdike I’d start by checking the kernel version on affected systems.
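A quick way to pull the kernel version for comparison against the Launchpad bug above; the parsing below is a generic sketch, and the exact fixed version numbers are in the linked comment, not assumed here.

```python
import platform
import re

release = platform.release()  # e.g. "5.4.0-1095-azure" on the affected hosts
m = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
version = tuple(map(int, m.groups())) if m else None
print(release, version)
```

Comparing the resulting tuple against the first fixed kernel release tells you whether a given machine is in the suspect range.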