runtime: Socket Dispose during synchronous operation hangs on RedHat 7

Disposing a Socket that has a pending synchronous Connect or Accept operation intermittently hangs in CI on RedHat 7 and CentOS 7 (x64).

using var client = new Socket(SocketType.Stream, ProtocolType.Tcp);

Exception ex = await Assert.ThrowsAnyAsync<Exception>(async () =>
{
    Task connectTask = Task.Run(() => client.Connect(endPoint));
    await WaitForEventAsync(events, "ConnectStart"); // Wait until the operation starts
    Task disposeTask = Task.Run(() => client.Dispose());
    await new[] { connectTask, disposeTask }.WhenAllOrAnyFailed(); // Hang
});

Existing Connect and Accept tests are potentially masking this failure with the use of a timeout/retry combo.
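
To illustrate how a timeout/retry combo can hide such a hang, here is a minimal sketch. It is not taken from the actual test suite: RunWithTimeoutAndRetryAsync and SimulatedOperation are hypothetical names, and the hang is simulated with a task that never completes rather than with a real socket.

```csharp
using System;
using System.Threading.Tasks;

public static class MaskingDemo
{
    // Hypothetical helper: starts an attempt and, if it does not finish within
    // the timeout, abandons it and starts a new one.
    public static async Task<int> RunWithTimeoutAndRetryAsync(
        Func<int, Task> attempt, TimeSpan timeout, int maxAttempts)
    {
        for (int i = 1; i <= maxAttempts; i++)
        {
            Task work = attempt(i);
            Task finished = await Task.WhenAny(work, Task.Delay(timeout));
            if (finished == work)
            {
                await work; // rethrow if the attempt failed
                return i;   // number of attempts that were needed
            }
            // Timed out: the hung attempt is silently dropped and never reported.
        }
        throw new TimeoutException($"No attempt finished within {timeout}.");
    }

    // Simulated operation: the first attempt hangs forever, later ones succeed.
    public static Task SimulatedOperation(int attemptNumber) =>
        attemptNumber == 1
            ? new TaskCompletionSource<bool>().Task
            : Task.CompletedTask;

    public static async Task Main()
    {
        int attemptsUsed = await RunWithTimeoutAndRetryAsync(
            SimulatedOperation, TimeSpan.FromMilliseconds(100), maxAttempts: 3);
        Console.WriteLine($"passed after {attemptsUsed} attempt(s)");
    }
}
```

The run prints "passed after 2 attempt(s)": the first attempt hung, but because the retry succeeded, nothing in the result points at the hang.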

Example CI Console log

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 49 (49 by maintainers)

Most upvoted comments

The bugzilla issue is now public: https://bugzilla.redhat.com/show_bug.cgi?id=1886305.

I’m using dotnet dump to analyze the dumps: https://github.com/dotnet/diagnostics/blob/master/documentation/dotnet-dump-instructions.md

When you download the coredump, you’ll also need the runtime bits with which it failed:

  1. Download JobList.json https://helix.dot.net/api/jobs/<job-id>/details?api-version=2019-06-17. Replace job-id with yours, e.g.: https://helix.dot.net/api/jobs/4eab1239-6bc1-4177-bd43-5b7af0c2725b/details?api-version=2019-06-17, find jobList: and download it.
  2. Find your test bits in the job list: "WorkItemId": "System.Net.Sockets.Tests". There’s CorrelationPayloadUrisWithDestinations containing runtime bits (.../test-runtime-net5.0-Linux-Debug-x64.zip?...) and "PayloadUri": containing test dll bits. Download them both.

Extract both into their own directories. The extracted files all have their permissions set to 0, so you need to give them at least read permission (chmod +r *). We’re interested in the contents of test-runtime-net5.0-Linux-Debug-x64/shared/Microsoft.NETCore.App/6.0.0 and System.Net.Sockets.Tests.

  1. Load the dump: dotnet dump analyze core.1000.14431
  2. Load runtime (setclrpath <your-full-path-runtime-bits>): setclrpath /home/manicka/Downloads/test-runtime-net5.0-Linux-Debug-x64/shared/Microsoft.NETCore.App/6.0.0
  3. Test that runtime is loaded: clrmodules --> should print list of dlls
  4. Load PDBs by running both setsymbolserver -directory /home/manicka/Downloads/test-runtime-net5.0-Linux-Debug-x64/shared/Microsoft.NETCore.App/6.0.0 and setsymbolserver -directory /home/manicka/Downloads/System.Net.Sockets.Tests --> this should give you file names and line numbers in clrstack.
  5. Use clrthreads, clrstack, dumpasync etc. to analyze the problem

Some more docs: https://github.com/dotnet/diagnostics/blob/master/documentation/debugging-coredump.md

This is intentionally written in greater detail for a wider audience.

So far, it is running without getting stuck.

After 50k+ iterations, it still isn’t blocked. Does the issue happen often in CI?

I will take a look at the coredumps. Thank you @ManickaP for these clear instructions!

Is connect(AF_UNSPEC) supposed to unblock pending accept and connect calls? Isn’t it just a way of disconnecting an “already connected TCP socket” (sic, from the PR description)?

It causes operations to abort too. If it didn’t, the xxxCanceledByDispose tests would fail.
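
The claim that disposal aborts a pending operation can be observed with a sketch like the one below. It is an illustration only, not one of the xxxCanceledByDispose tests: the class name, the 500 ms delay and the 10 s timeout are arbitrary assumptions, and the exception type is printed rather than asserted because it may differ by platform and runtime version.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class AcceptAbortSketch
{
    public static async Task<string> RunAsync()
    {
        var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        listener.Bind(new IPEndPoint(IPAddress.Loopback, 0)); // port 0: the OS picks a free port
        listener.Listen(1);

        // Nobody connects, so this blocking Accept stays pending.
        Task acceptTask = Task.Run(() => listener.Accept().Dispose());

        await Task.Delay(500); // crude stand-in for waiting until Accept has started
        Task disposeTask = Task.Run(() => listener.Dispose());

        // Guard with a timeout so the sketch itself cannot hang.
        Task both = Task.WhenAll(acceptTask, disposeTask);
        Task finished = await Task.WhenAny(both, Task.Delay(TimeSpan.FromSeconds(10)));
        if (finished != both)
        {
            return "still pending 10 s after Dispose (the hang described in this issue)";
        }

        try
        {
            await both;
            return "Accept completed without an exception (not expected)";
        }
        catch (Exception ex)
        {
            return $"Accept aborted with {ex.GetType().Name}";
        }
    }

    public static async Task Main() => Console.WriteLine(await RunAsync());
}
```

On a system without the kernel bug, the expectation is that the pending Accept ends with an exception shortly after Dispose; the timeout branch corresponds to the behavior reported here.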

But I think we first need to prove with 100% certainty that the hang is caused by this specific kernel bug.

If you run this code, the bug should cause it to print errno EAFNOSUPPORT (97) as mentioned in the fix.

using System;
using System.Net;
using System.Net.Sockets;
using System.Runtime.InteropServices;

namespace console
{
    unsafe class Program
    {
        static void Main(string[] args)
        {
            Socket s = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
            s.Connect("www.microsoft.com", 443);

            // Connect to AF_UNSPEC
            const int AddressLength = 16;
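            // AF_UNSPEC is zero, so a zero-filled buffer is an AF_UNSPEC sockaddr.
            // In practice stackalloc memory is zeroed unless SkipLocalsInit is applied.
            byte* address = stackalloc byte[AddressLength];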
            int rv = connect((int)s.Handle, address, AddressLength);
            if (rv == -1)
            {
                int errno = Marshal.GetLastWin32Error();
                Console.WriteLine($"fail, errno is {errno}");
            }
            else
            {
                Console.WriteLine("success");
            }
        }

        [DllImport("libc", SetLastError = true)]
        public static extern int connect(int socket, byte* address, uint address_len);
    }
}

5.0 supports RHEL 7+ (whatever is in support by RedHat)

The latest minor is the supported version.

Ok @MihaZupan was faster 😄