runtime: Socket Dispose during synchronous operation hangs on RedHat 7
Disposing a Socket that has a pending synchronous Connect/Accept operation intermittently hangs on RedHat7 and CentOS7 x64 in CI.
using var client = new Socket(SocketType.Stream, ProtocolType.Tcp);
Exception ex = await Assert.ThrowsAnyAsync<Exception>(async () =>
{
    Task connectTask = Task.Run(() => client.Connect(endPoint));
    await WaitForEventAsync(events, "ConnectStart"); // Wait until the operation starts
    Task disposeTask = Task.Run(() => client.Dispose());
    await new[] { connectTask, disposeTask }.WhenAllOrAnyFailed(); // Hang
});
Existing Connect and Accept tests are potentially masking this failure with the use of a timeout/retry combo.
About this issue
- State: closed
- Created 4 years ago
- Comments: 49 (49 by maintainers)
Commits related to this issue
- Fix RHEL7 socket dispose hang, and extend coverage (#43409) Fix #42686 by doing a graceful close in case if the abortive connect(AF_UNSPEC) call fails on Linux, and improve couple of related tests: ... — committed to dotnet/runtime by antonfirsov 4 years ago
- Sync JsonCodeGen branch with runtime-master (#286) * Arm32 Crossgen2 initial support (#43243) - Fix type layout bugs - Sequential or Explicit layout classes without explicit field offsets on ar... — committed to layomia/dotnet_runtime by layomia 4 years ago
The bugzilla issue is now public: https://bugzilla.redhat.com/show_bug.cgi?id=1886305.
I’m using dotnet dump to analyze the dumps: https://github.com/dotnet/diagnostics/blob/master/documentation/dotnet-dump-instructions.md
When you download the coredump, you’ll also need the runtime bits with which it failed:
- Go to https://helix.dot.net/api/jobs/<job-id>/details?api-version=2019-06-17. Replace job-id with yours, e.g.: https://helix.dot.net/api/jobs/4eab1239-6bc1-4177-bd43-5b7af0c2725b/details?api-version=2019-06-17, find jobList: and download it.
- In the downloaded job list, find "WorkItemId": "System.Net.Sockets.Tests". There’s CorrelationPayloadUrisWithDestinations containing runtime bits (.../test-runtime-net5.0-Linux-Debug-x64.zip?...) and "PayloadUri": containing test dll bits. Download them both.
- Extract both somewhere in their own directories. The extracted files all have 0 permissions, so you need to give them at least read permission (chmod +r *). We’re interested in the contents of test-runtime-net5.0-Linux-Debug-x64/shared/Microsoft.NETCore.App/6.0.0 and System.Net.Sockets.Tests.
- Open the dump: dotnet dump analyze core.1000.14431
- Point it at the runtime bits (setclrpath <your-full-path-runtime-bits>): setclrpath /home/manicka/Downloads/test-runtime-net5.0-Linux-Debug-x64/shared/Microsoft.NETCore.App/6.0.0
- clrmodules --> should print a list of dlls
- setsymbolserver -directory /home/manicka/Downloads/test-runtime-net5.0-Linux-Debug-x64/shared/Microsoft.NETCore.App/6.0.0
- setsymbolserver -directory /home/manicka/Downloads/System.Net.Sockets.Tests --> this should give you file names and line numbers in clrstack.
- Use clrthreads, clrstack, dumpasync etc. to analyze the problem.
Some more docs: https://github.com/dotnet/diagnostics/blob/master/documentation/debugging-coredump.md
This is intentionally written in greater detail for a wider audience.
After 50k+ iterations, it still isn’t blocked. Does the issue happen often in CI?
I will take a look at the coredumps. Thank you @ManickaP for these clear instructions!
It causes operations to abort too. If it didn’t, the xxxCanceledByDispose tests would fail.
If you run this code, the bug should cause it to print errno EAFNOSUPPORT (97) as mentioned in the fix.
The latest minor is the supported version.
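The snippet that the errno comment refers to is not preserved in this thread. As a rough stand-in, here is a minimal C sketch of the pattern the fix commit describes: try the abortive connect(AF_UNSPEC) first, and fall back to a graceful shutdown if that call fails. The helper name disconnect_with_fallback, its out-parameter, and the logging are hypothetical and are not the runtime's actual code. The expectation that an affected RHEL 7 / CentOS 7 kernel reports EAFNOSUPPORT (97) comes from the comment above and is not something this sketch can confirm on other kernels.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (illustration only, not the dotnet/runtime source).
 * Step 1: attempt the abortive disconnect, i.e. connect() with an
 *         AF_UNSPEC address.
 * Step 2: if that fails, record the errno in *abort_errno and fall back
 *         to a graceful shutdown(SHUT_RDWR).
 * Returns 0 if either call succeeded, otherwise the errno of shutdown(). */
int disconnect_with_fallback(int fd, int *abort_errno)
{
    struct sockaddr addr;

    assert(abort_errno != NULL);

    memset(&addr, 0, sizeof(addr));
    addr.sa_family = AF_UNSPEC;

    *abort_errno = 0;
    if (connect(fd, &addr, sizeof(addr)) == 0)
        return 0; /* abortive path worked */

    /* Per the thread, the affected kernels report EAFNOSUPPORT (97) here. */
    *abort_errno = errno;
    fprintf(stderr, "connect(AF_UNSPEC) failed: errno %d (%s)\n",
            *abort_errno, strerror(*abort_errno));

    if (shutdown(fd, SHUT_RDWR) == 0)
        return 0; /* graceful fallback worked */

    return errno;
}
```

A caller would pass the descriptor of the socket being disposed. abort_errno stays 0 when the abortive path succeeded, so a non-zero value indicates that the fallback was taken.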
Ok @MihaZupan was faster 😄