runtime: Address "System.Net.Sockets.SocketException: Address already in use" on K8S/Linux using HttpClient/TCP

~Assumption: Duplicate of dotnet/runtime#27274 which was fixed by dotnet/corefx#32046 - goal: Port it (once confirmed it is truly duplicate).~ This is the HttpClient/TCP spin-off. UdpClient is covered fully by dotnet/runtime#27274.

Issue Title

“System.Net.Sockets.SocketException: Address already in use” on Linux

General

Our .NET Core (v2.2.0) services are running on an Azure Kubernetes Linux environment. Recently we experienced a lot of “System.Net.Http.HttpRequestException: Address already in use” errors while calling dependencies, e.g. Active Directory, Cosmos DB and other services. Once the issue started, we kept getting the same errors and had to restart the service to get rid of them. Our HTTP clients use DNS addresses, not specific IPs and ports. The following is the call stack from one example. What can cause such issues and how do we fix them?

System.Net.Http.HttpRequestException: Address already in use ---> System.Net.Sockets.SocketException: Address already in use
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.WaitForCreatedConnectionAsync(ValueTask`1 creationTask)
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.DiagnosticsHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Http.HttpClientWrapper.GetResponseAsync()
   at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Http.AdalHttpClient.GetResponseAsync[T](Boolean respondToDeviceAuthChallenge)
   at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Http.AdalHttpClient.GetResponseAsync[T]()
   at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Flows.AcquireTokenHandlerBase.SendHttpMessageAsync(IRequestParameters requestParameters)
   at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Flows.AcquireTokenHandlerBase.SendTokenRequestAsync()
   at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Flows.AcquireTokenHandlerBase.CheckAndAcquireTokenUsingBrokerAsync()
   at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Flows.AcquireTokenHandlerBase.RunAsync()
   at Microsoft.IdentityModel.Clients.ActiveDirectory.AuthenticationContext.AcquireTokenForClientCommonAsync(String resource, ClientKey clientKey)
   at Microsoft.IdentityModel.Clients.ActiveDirectory.AuthenticationContext.AcquireTokenAsync(String resource, ClientCredential clientCredential)

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 10
  • Comments: 138 (67 by maintainers)

Most upvoted comments

Having the same issue on microsoft/dotnet:2.2-runtime-deps using the Elasticsearch NEST 5.6.6 client. Very annoying issue. Can’t go back to 2.1 since we invested a lot of time upgrading from 2.1 to 2.2. Upgrading to the 3.0 Preview is not an option.

+1 to include this fix into next 2.2 release.

@wfurt has been doing a good job digging into this, and shared with me that he noticed something suspicious, that in a repro when analyzing it with SOS there ended up being a small number of Sockets on the heap but a large number of SafeSocketHandles. Based on that, I have a theory that this is due to https://github.com/dotnet/corefx/pull/32845 / https://github.com/dotnet/corefx/pull/32793. I don’t think it actually caused the problem so much as the bug it was fixing was actually masking this problem that’s existed for a long time.

SocketsHttpHandler creates a Socket for each connection. Each Socket creates a SafeSocketHandle (that’s its name in 3.0; prior to that it was internal and named SafeCloseSocket), a SafeHandle that wraps the underlying file descriptor (there’s actually a secondary SafeHandle in the middle, but that’s not relevant). On Unix, when the Socket is connected, it’s registered with the SocketAsyncEngine, which is the code responsible for running the event loop interacting with the epoll handle. Whenever the epoll wait shows that there’s work available to be done, the event loop maps the relevant file descriptor back to the appropriate SafeSocketHandle so that the relevant work can be performed and callbacks invoked. In order to do that mapping, the SocketAsyncEngine stores a ConcurrentDictionary<IntPtr, SafeSocketHandle>, and the engines themselves are stored in a static readonly SocketAsyncEngine[] array… the punchline here is that these SafeSocketHandles end up being strongly rooted by a static array.

The other important piece of information is that there’s a Timer inside SocketsHttpHandler that runs periodically to check whether connections in the connection pool are still viable, and if they’re not, Dispose’s of them. The bug that the aforementioned issues fixed was that there was an unexpected cycle formed between the timer and the connection pool that ended up keeping everything alive indefinitely, resulting in a non-trivial memory leak. However, as a side effect of that leak, it meant that the timer would continue to run, and every time it fired, it would loop through all of the open connections and Dispose of the ones that were no longer viable. In the fullness of time, all of them would get Dispose’d. Disposing of the connection would dispose of the Socket which would Dispose of the SafeSocketHandle and remove it from the SocketAsyncEngine’s dictionary.

Now, with the aforementioned fixes, if code fails to Dispose of the HttpClient/SocketsHttpHandler when done with them and drops the references to them, the timer gets collected, as does the connection pool, as do all of the HttpConnection objects in the pool. None of those have finalizers, nor should they need them. But here’s the rub. Socket does have a finalizer, yet its finalizer ends up being a nop. Since the storing of the SafeSocketHandle into the static dictionary isn’t something that can be undone automatically by GC, we actually need a finalizer to remove that registration should everything get dropped. Since all those objects don’t have finalizers, and since Socket’s finalizer isn’t doing the unregistration, everything gets collected above the SafeSocketHandle, which then remains registered effectively forever, never being disposed of, and never closing its file descriptor.
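
To make the failure mode concrete, here is a minimal sketch of the pattern described above (illustrative only, not code from the issue; it assumes something is listening on the chosen loopback port so Connect succeeds):

using System;
using System.Net;
using System.Net.Sockets;

class LeakSketch
{
    // Creates and connects a Socket, then drops the reference without Dispose().
    // Connecting registers the handle with the socket event loop; on the affected
    // builds nothing ever unregisters it, so the file descriptor stays open.
    static void LeakOne(EndPoint endpoint)
    {
        var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        socket.Connect(endpoint);
        // Intentionally no Dispose()/Close() here.
    }

    static void Main()
    {
        // Assumption: something is listening on 127.0.0.1:8080.
        var target = new IPEndPoint(IPAddress.Loopback, 8080);
        for (int i = 0; i < 100; i++) LeakOne(target);
        GC.Collect();
        GC.WaitForPendingFinalizers();
        // On an affected runtime the leaked sockets (and their local ports) remain
        // allocated, which eventually surfaces as "Address already in use".
        Console.WriteLine("done");
    }
}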

I don’t know for certain whether this is the cause of this issue. It’s just a theory, and @wfurt is working through the repro, debugging, and testing out theories. If this doesn’t turn out to be the root cause here, I suspect it’s still a bug we need to fix. If it does turn out to be the root cause, I don’t think the fix is to revert the aforementioned fixes: they were valid, they just revealed this existing problem they had been masking by creating a different leak that in turn allowed the timer to dispose of these resources… plus, this issue would apply to all uses of Sockets that weren’t disposed of, not just those used from SocketsHttpHandler. The actual fix would likely be to either use a weak reference when storing the SafeSocketHandle into the dictionary (which might be the right fix but could also potentially cause perf or otherwise unforeseen problems), or to ensure that a finalizer is put in place to undo that registration (most likely changing Socket’s finalizer accordingly on Unix).
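
For illustration only, a hypothetical sketch of the weak-reference idea (none of these names come from the real SocketAsyncEngine code):

using System;
using System.Collections.Concurrent;
using System.Runtime.InteropServices;

// Hypothetical registry: holding a WeakReference<T> instead of the SafeHandle
// itself means the static map no longer roots the handle, so an undisposed
// handle can still be finalized and its file descriptor closed.
internal sealed class WeakHandleRegistry
{
    private readonly ConcurrentDictionary<IntPtr, WeakReference<SafeHandle>> _map =
        new ConcurrentDictionary<IntPtr, WeakReference<SafeHandle>>();

    public void Register(IntPtr fd, SafeHandle handle) =>
        _map[fd] = new WeakReference<SafeHandle>(handle);

    public bool TryGet(IntPtr fd, out SafeHandle handle)
    {
        handle = null;
        return _map.TryGetValue(fd, out var weak) && weak.TryGetTarget(out handle);
    }

    public void Unregister(IntPtr fd) => _map.TryRemove(fd, out _);
}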

In the meantime, assuming this is the cause, in addition to fixing it in System.Net.Sockets, code using HttpClient/HttpClientHandler/SocketsHttpHandler should also be Dispose’ing of those instances when done with them. If you just create a single HttpClient instance that’s stored in a static, there’s no real need to dispose of it, as everything will go away when the process ends. But if you’re doing something that creates an instance, uses it for one or more requests, and then get rid of it, when getting rid of it it should be disposed.
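
In code, the two usage patterns described above look roughly like this (a sketch, assuming nothing beyond HttpClient itself):

using System.Net.Http;
using System.Threading.Tasks;

class HttpClientUsagePatterns
{
    // Pattern 1: one long-lived client stored in a static field.
    // There is no real need to dispose it; it lives for the life of the process.
    private static readonly HttpClient Shared = new HttpClient();

    public static Task<string> WithSharedClientAsync(string url) =>
        Shared.GetStringAsync(url);

    // Pattern 2: a short-lived client created for a burst of requests.
    // If you create it and then get rid of it, dispose it, so the connection
    // Sockets are not dropped without ever being disposed.
    public static async Task<string> WithShortLivedClientAsync(string url)
    {
        using (var client = new HttpClient())
        {
            return await client.GetStringAsync(url);
        }
    }
}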

cc: @geoffkizer, @tmds

@sapleu do not update to 3.0 to fix this problem; as https://github.com/dotnet/core/issues/2253#issuecomment-482918706 states, this still happens in Core 3.

I submitted a fix to 3.0 master. It would be great if anybody can grab a daily build and verify that it solves their issue.

Since this is a somewhat generic error, there may be more than one issue under the covers. In either case any feedback would be useful.

Kudos to @antonioortizpola, who was able to isolate a repro.

This is AWESOME! Thanks a lot @antonioortizpola, fingers crossed that we will be now able to quickly root-cause it and fix it in 3.0/2.2! 🙏

@yuezengms @yanrez @arsenhovhannisyan @antoinne85 @blurhkh @EvilBeaver @antonioortizpola @LukePulverenti @sapleu @rrudduck @rbrugnollo @robjwalker @OpenSourceAries I’d like to ask you for 2 favors:

  1. Can you please confirm if your repro is truly on HttpClient/TCP and NOT UdpClient? (please confirm you’re on HttpClient/TCP by upvoting this reply)
  2. Is any one of you in a position to collect additional logs and work with us to root-cause this problem? We would love to address it, but we have nothing actionable at this moment without help from someone who can hit the problem and collect additional info. Thanks!

@karelz do you plan to backport this fix to 2.2?

@karelz Ok, after some hard work I was able to update to Core 3. I was excited since our project is a gRPC server, and I tried the gRPC template. Sadly, after just around 8 hours of running, we hit the same issue.

The project is very simple: it just receives the gRPC request and makes an HTTP call or a WSDL call to an external service (these services have various response times, from 200 milliseconds to timeouts after one minute), then returns the response object as is. No complex processing, no database connections or anything weird.

When the error starts happening, all the HTTP clients start showing the errors, both the direct ones and the ones coming from a WSDL definition.

HttpClient throwing the exception (screenshot)

WSDL client throwing the same exception (screenshot)

The csproj is

<Project Sdk="Microsoft.NET.Sdk.Web">

    <PropertyGroup>
        <TargetFramework>netcoreapp3.0</TargetFramework>
        <DockerDefaultTargetOS>Linux</DockerDefaultTargetOS>
    </PropertyGroup>

    <ItemGroup>
        <PackageReference Include="Grpc.AspNetCore.Server" Version="0.1.19-pre1" />
        <PackageReference Include="Microsoft.AspNet.WebApi.Client" Version="5.2.7" />
        <PackageReference Include="Microsoft.VisualStudio.Azure.Containers.Tools.Targets" Version="1.4.10" />
        <PackageReference Include="System.ServiceModel.Http" Version="4.5.3" />
    </ItemGroup>
    
</Project>

If it helps, we are running the project on Amazon Linux on an EC2 instance with Docker; the Dockerfile is:

FROM mcr.microsoft.com/dotnet/core/aspnet:3.0-stretch-slim AS base
WORKDIR /app
EXPOSE 80

FROM mcr.microsoft.com/dotnet/core/sdk:3.0-stretch AS build
WORKDIR /src
COPY ["vtae.myProject.gateway/vtae.myProject.gateway.csproj", "vtae.myProject.gateway/"]
COPY ["vtae.myProject.gateway.proto/vtae.myProject.gateway.proto.csproj", "vtae.myProject.gateway.proto/"]
RUN dotnet restore "vtae.myProject.gateway/vtae.myProject.gateway.csproj"
COPY . .
WORKDIR "/src/vtae.myProject.gateway"
RUN dotnet build "vtae.myProject.gateway.csproj" -c Release -o /app

FROM build AS publish
RUN dotnet publish "vtae.myProject.gateway.csproj" -c Release -o /app

FROM base AS final
WORKDIR /app
COPY --from=publish /app .
ENTRYPOINT ["dotnet", "vtae.myProject.gateway.dll"]

There was no increase in CPU after the error, but no request succeeded after the first error showed up. Again, this was not happening in 2.1, but it is happening in 2.2 and 3.

All my HTTP clients are typed clients; I do not know whether this dependency affects anything:

<PackageReference Include="System.ServiceModel.Http" Version="4.5.3" />

But I am using response.Content.ReadAsAsync<SomeClass>() and _httpClient.PostAsJsonAsync(_serviceUrl, someRequestObject).

I would also like to know a way to stop the app from the app itself, so I can catch the exception and stop the server to let Docker restart the container. I do not like the idea of doing just a System.Exit, but I could not find a way to do it in Core 3.

EDIT

Ok, I ended up restarting the app, first adding a reference in Program.cs (a little dirty, but I guess it is temporary until a fix is found).

public class Program
{
    public static IHost SystemHost { get; private set; }

    public static void Main(string[] args)
    {
        SystemHost = CreateHostBuilder(args).Build();
        SystemHost.Run();
    }

    public static IHostBuilder CreateHostBuilder(string[] args) =>
        Host.CreateDefaultBuilder(args)
            .ConfigureWebHostDefaults(webBuilder =>
            {
                webBuilder
                    .UseStartup<Startup>()
                    .ConfigureKestrel((context, options) => { options.Limits.MinRequestBodyDataRate = null; });
            });
}

Then in my interceptor I catch the exception with a Contains check on the message. This is because if the error comes from a plain HttpClient it is thrown as an HttpRequestException, but if it comes from a WSDL service it is thrown as a CommunicationException.

public async Task<T> ScopedLoggingExceptionWsdlActionService<T>(Func<TService, Task<T>> action)
{
    try
    {
        return await _scopedExecutorService.ScopedActionService(async service => await action(service));
    }
    catch (CommunicationException e)
    {
        await HandleAddressAlreadyInUseBug(e);
        var errorMessage = $"There was a communication error calling the wsdl service in '{typeof(TService)}' action '{action}'";
        _logger.LogError(e, errorMessage);
        throw new RpcException(new Status(StatusCode.Unavailable, errorMessage + ". Error message: " + e.Message));
    }
    catch (Exception e)
    {
        var errorMessage = $"There was an error calling the service '{typeof(TService)}' action '{action}'";
        _logger.LogError(e, errorMessage);
        throw new RpcException(new Status(StatusCode.Unknown, errorMessage + ". Error message: " + e.Message));
    }
}

// TODO: Remove this after https://github.com/dotnet/core/issues/2253 is fixed    
private async Task HandleAddressAlreadyInUseBug(Exception e)
{
    if (string.IsNullOrWhiteSpace(e.Message) || !e.Message.Contains("Address already in use"))
        return;
    var errorMessage = "Hitting bug 'Address already in use', stopping server to force restart. More info at https://github.com/dotnet/core/issues/2253";
    _logger.LogCritical(e, errorMessage);
    await Program.SystemHost.StopAsync();
    throw new RpcException(new Status(StatusCode.ResourceExhausted, errorMessage + ". Error message: " + e.Message));
}
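
An alternative that avoids the static Program.SystemHost reference, if it fits your setup, is to inject IHostApplicationLifetime (available in .NET Core 3.0) and ask the host to stop. A rough sketch (the class and method names here are made up):

using System;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

// Register this as a singleton and inject it into the interceptor instead of
// reaching for Program.SystemHost.
public class AddressInUseMonitor
{
    private readonly IHostApplicationLifetime _lifetime;
    private readonly ILogger<AddressInUseMonitor> _logger;

    public AddressInUseMonitor(IHostApplicationLifetime lifetime, ILogger<AddressInUseMonitor> logger)
    {
        _lifetime = lifetime;
        _logger = logger;
    }

    public void StopIfAddressAlreadyInUse(Exception e)
    {
        if (string.IsNullOrWhiteSpace(e.Message) || !e.Message.Contains("Address already in use"))
            return;
        _logger.LogCritical(e, "Hitting 'Address already in use'; stopping the host so the container gets restarted.");
        _lifetime.StopApplication(); // triggers a graceful shutdown; Docker/k8s restarts the container
    }
}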

Can you please try .NET Core 3.0? It was fixed there …

Possibly. It will be easier to get approval if we can confirm that this fix solves the observed issues, e.g. try 3.0 before and after.

Assumption: Duplicate of dotnet/runtime#27274 which was fixed by dotnet/corefx#32046 - goal: Port it (once confirmed it is truly duplicate).

This assumption is not correct. The fix is for UDP, the issues reported here are for HTTP (which is TCP).

Getting “Address already in use” on a TCP connect is weird. If the local end isn’t bound, it should pick a port that is not in use. You may be running out of port numbers. Running netstat can help you find out what sockets are around and who owns them.

I will follow up offline, but in some of the regions we see it happening more often - taking down several pods in our k8s cluster per day. It’s very annoying at the moment, costing us a few dev-hours a day to act on it and mitigate it. We are also looking into automated mitigation using a liveness probe wired in to check whether we start getting these exceptions and signal k8s to kill the pod. Unfortunately, it’s also a non-trivial amount of dev work to build and deploy. Considering we can’t exactly predict the frequency of the issue, the risk is that the liveness probe might still impact our availability and cause us to miss our SLA.

The 2.2 port is waiting for verification, @alxbog. We need to get enough evidence that dotnet/corefx#38499 fixes it, or we need a separate repro for 2.2. Until then it is unlikely we will get permission for 2.x changes.

Will it be possible to get this fix in 2.2?

@wfurt, thanks a lot for your investigation! I will change the code so the service closes the connection.

I am glad the repository could help to replicate the problem and I hope it could help others.

Please correct me if I am wrong, but I think the main problems are:

  • People who are not using HttpClient correctly.
  • People who use libraries that use sockets or HttpClient and make assumptions based on previous behavior (like WCF and me).

For the first group, please make sure you are using HttpClient correctly; most probably it will fix the problem and improve your system.

For the second group, search for methods that can close the connection or for IDisposable implementations, and run tests while monitoring your sockets (for example with netstat -natu) to check whether that fixes the problem. Or check whether you can reuse your connections.
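
For WCF-generated clients specifically, the usual close pattern looks roughly like this (a sketch; SomeServiceClient, GetSomethingAsync and SomeResult are placeholders for your generated ClientBase<TChannel> proxy and its types):

using System;
using System.ServiceModel;
using System.Threading.Tasks;

static class WsdlCallHelper
{
    // Close the proxy on success, Abort it on failure, so no channel (and no
    // underlying socket) is left half-open.
    public static async Task<SomeResult> CallWsdlServiceAsync()
    {
        var client = new SomeServiceClient();
        try
        {
            SomeResult result = await client.GetSomethingAsync();
            client.Close();   // graceful close releases the underlying channel and socket
            return result;
        }
        catch (CommunicationException)
        {
            client.Abort();   // on failure, Abort() so the channel is not left half-open
            throw;
        }
        catch (TimeoutException)
        {
            client.Abort();
            throw;
        }
    }
}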

If the problem persists, tell us how you are using the client, socket, or library, and if possible create a simple repository with a reproduction case.

Ok, I have the repo and I invited @karelz and @wfurt. I hope this can help; please let me know if I can help with anything else.

@softworkz thank you !

@karelz Yes, it would be great to get this backported, because ever since the 2.1 release we’ve had to tell users to shut down all other UPnP or DLNA software on the machine in order to prevent this from happening.

It seems like we are mixing multiple issues here. Part of the discussion is about UDP and part is about HttpClient.

Same problem with the NEST Elasticsearch client on Linux under Core 2.2. Backporting the fix to 2.2 would be nice.

does not dispose even though it fits your example of where it should be disposed

Yes, the client in the GetPage method in that sample should be disposed. Thanks for pointing that out. cc: @glennc, @rynowak

I think the main problems are

There are two issues here:

  1. There’s a bug in Sockets on Unix where if you allow a connected Socket to be collected without it having been disposed, it’ll leak the underlying handle.
  2. Consumers of HttpClient are sometimes not using it correctly, leading to the above bug getting triggered.

Fixing either of those is technically sufficient to address this issue, although even when (1) is fixed, it’s important for (2) to be done, as the fix for (1) is still non-deterministic and could take an unbounded amount of time to kick in.

The repro was extremely useful, thanks @antonioortizpola.

In either case we should not leak OS resources, and we do right now in some cases with the 2.2 code. Disposing explicitly is best, as everything is released when no longer needed. Otherwise the socket can stay open until the GC kicks in, and that may take some time depending on many variables.

I’m making some progress. On the note above: if you add cenamOperationClient.Close() to GetSubscriberDetailsF() after the response is received, to close the WCF client, there are no lingering sockets at all, @antonioortizpola. With the old platform handlers the socket could be closed independently, but now the usual reference counting applies, and disposing the HttpClient when it is no longer used can lead to delayed release of resources, since any reference to the HttpResponseStream will keep the underlying socket open. I think there is definitely an issue with 2.2+, but there can be more than one reason for the observed behavior.

@jarlehal typo, fixed, thanks for pointing it out.

thanks @antonioortizpola, I will take a look. Are you suggesting 3.0 fixes the problem?

BTW, any chance @antonioortizpola that you can share a core dump from the time when it is failing? (It does not need to reach port exhaustion - we just need a few sockets in a half-closed state.) It will be large and it may contain secrets or private data, but if we can work that out I think we would be able to sort this out. If that is not possible, we may be able to script the dump file processing, or I can guide you through a sequence to get useful info out.

Thanks @antonioortizpola. It would be nice to get to the bottom of this. Seeing half-closed TCP connections is certainly a clue.

@dmiller02 are you in a position to get back into the bad state and help us collect some logs?

Shouldn’t be too difficult. We have the images for the service in question and can re-create the error. What logging would you need?

Not quite. We’re not the only ones referring to the UDP bug here.

Yes, this is causing confusion, so it’s good to make clear the difference. The issue reported here is for HttpClient/TCP, and it was assumed the UDP fix would solve it, which is not the case.

Thanks @softworkz for the repro!!! That is a HUGE step towards the root cause and a solution. Let’s hope we can reproduce it too 😃

@karelz @softworkz is talking about a UDP issue https://github.com/dotnet/corefx/issues/32027 which was decided not to be backported: https://github.com/dotnet/corefx/issues/32027#issuecomment-418447086.

The main issue reported here is a TCP issue observed when using HttpClient.

Based on the replies here, I don’t think it will be simple to create a repro (although it would be most helpful). I would recommend getting any repro environment where we can experiment - collect more data, try private builds, etc. If anyone has such an environment (incl. production) where they can experiment and work closely with us, please let me know and let’s dig deeper into it …

Just to confirm: Did anyone hit it in 2.1 at all?

App working for 6 months on 2.1 without any issues, now happening on 2.2.

I’m trying to filter exactly which call is throwing the error, so I can isolate and run more tests.

Just to confirm: Did anyone hit it in 2.1 at all?

Not for us, I can confirm, no problems in 3 months with 2.1, the problem started the day we switched to 2.2.

I am in the process of making my WSDL clients singletons; I hope to finish by next week. That way I can confirm whether it is a problem with HttpClient alone or whether WCF is doing something wrong.

could you please tell us exactly what is wrong so they can make the updates?

SocketsHttpHandler, which is the default handler implementation backing the HttpClient starting in .NET Core 2.1, has two properties on it: PooledConnectionIdleTimeout (https://docs.microsoft.com/en-us/dotnet/api/system.net.http.socketshttphandler.pooledconnectionidletimeout?view=netcore-2.2) and PooledConnectionLifetime (https://docs.microsoft.com/en-us/dotnet/api/system.net.http.socketshttphandler.pooledconnectionlifetime?view=netcore-2.2). The latter governs how long a connection is allowed to be reused. It defaults to Infinite, which means it won’t be proactively torn down by the client for reasons of how long it’s been around. But if you set it to something shorter, like 5 minutes, that tells the handler that once the connection has been around for 5 minutes, it shouldn’t be reused again and the handler will prevent further requests from using it and will tear it down. Any subsequent requests would be forced to get a new connection, which will again consult DNS to determine where the connection should be opened.
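
For example, a long-lived client with a bounded connection lifetime could be set up like this (a sketch; the timeout values are arbitrary):

using System;
using System.Net.Http;

static class SharedHttp
{
    // One client for the whole process, but with a bounded connection lifetime so
    // pooled connections get recycled and DNS changes are eventually picked up.
    public static readonly HttpClient Client = new HttpClient(new SocketsHttpHandler
    {
        PooledConnectionLifetime = TimeSpan.FromMinutes(5),    // arbitrary example value
        PooledConnectionIdleTimeout = TimeSpan.FromMinutes(2)  // arbitrary example value
    });
}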

@stephentoub well, if the docs are wrong then I am lost.

The last update is from 01/06/2019; should I ask for an update? Could you please tell us exactly what is wrong so they can make the updates?

Also, if the best solution is to keep the HttpClient for as long as you can, wouldn’t it be better to just use it as a singleton? That would render IHttpClientFactory pretty much useless; it would be just a fancy name for a singleton.

Also, it would be great to make that clear in the Core documentation; that you can use HttpClient as a singleton should be mentioned as an option in the “Making HTTP requests” part, since it still states:

Manages the pooling and lifetime of underlying HttpClientMessageHandler instances to avoid common DNS problems that occur when manually managing HttpClient lifetimes

@rbrugnollo according to the docs, you should not be using HttpClient directly; you should be using IHttpClientFactory or one of the other kinds of clients (named, typed, or generated).
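
For reference, a typed client with IHttpClientFactory is registered roughly like this (a sketch; MyServiceClient and the base address are made up):

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;

public class MyServiceClient
{
    private readonly HttpClient _httpClient;

    // The factory supplies a pooled HttpClient; do not cache it statically or dispose it yourself.
    public MyServiceClient(HttpClient httpClient) => _httpClient = httpClient;

    public Task<string> GetStatusAsync() => _httpClient.GetStringAsync("status");
}

public class StartupSnippet
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Hypothetical base address; the factory manages handler pooling and lifetimes.
        services.AddHttpClient<MyServiceClient>(c => c.BaseAddress = new Uri("https://example.com/api/"));
    }
}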

Our team is using typed clients for the REST requests, so there should not be a problem. However, thinking more deeply, with the WSDL clients we do not have access to the HttpClient directly. I do not know if that could be related to the socket exhaustion problem, in which case I would not know how to make a fix or workaround, unless I drop all my WSDL clients and use direct requests, but that is too much work and would basically mean dropping support for WSDL clients.

When I run netstat I do not see anything weird; the ports look pretty much the same as with 2.1.

Still waiting for someone who has an environment where it happens with some frequency (aka a production repro) and who can try to deploy a private patch out of the 2.1 or 2.1 branch. Do we have someone like that? Without that, this issue is sadly blocked …

@karelz, I already updated to 3 and the problem still exists; the error shows up in 8-12 hours. Is there anything else that I can do to help with the problem?

I know that Bing is running on Core 2.1; have you updated yourselves to 2.2? This problem is becoming really frustrating. I do not understand how a simple project that just calls some HTTP services is causing this issue. This is really causing trust issues in the team; now I want to update for security fixes, but I am not sure whether something internal and hidden is going to break in the next release.

We just got hit by this as well. It’s very rare, but I’m (somewhat) glad to see it’s a known issue.

@karelz I have an app with a large number of users affected by this. Any DLNA media app will be affected by this. A backport would be much appreciated. Thanks.

We have to restart our docker containers to fix the problem

I’m doing the same. I set the Docker container restart mode to “restart=always”, and in my app I catch this SocketException. If it’s caught, I kill the app and the Docker engine restarts it. It works fine, but for complex apps this should be fixed properly at the .NET level.

BTW: It might be good for you to register as MS employees - at least by linking your accounts: https://github.com/dotnet/core/blob/master/Documentation/microsoft-team.md … that allows other FTEs to see you are MSFT 😉

It is happening in some of the clusters. It doesn’t seem to repro consistently, but some pods go into this state and stay in it until they are terminated. I understand 3.0 is still a few months away (I don’t know the actual timeline though), so my question about 2.2 was based on the assumption that a hotfix for 2.2 could come earlier than the 3.0 release. We will look into trying to upgrade and see if it solves the issue.

I’ll try upgrading. Thanks!