NuGetGallery: DNS lookups fail for api.nuget.org using Alpine-based dotnet Docker images in AWS us-east-1

Impact

I’m unable to use api.nuget.org from inside an Alpine-based Docker image running in AWS us-east-1.

Describe the bug

We’ve found an issue where running an Alpine-based dotnet image inside AWS us-east-1 (e.g. running an image on an EC2 instance with Docker) causes DNS lookups for api.nuget.org to fail, breaking many tools that integrate with NuGet. I’ve noticed this behaviour affecting builds running in Bitbucket Pipelines (our CI/CD service), and have reproduced similar issues directly on EC2. This happens when using Route53 as the DNS resolver (the default when starting up a new EC2 instance).

It appears the problem is due to Alpine’s inability to handle truncated DNS responses. Running dig to perform a DNS lookup for api.nuget.org, we see the tc flag set in the response headers, indicating a truncated DNS response. The following was executed from an EC2 instance in us-east-1; we’ve found truncation does not occur in us-west-2. In the response below, we don’t receive any A records for api.nuget.org because of the truncation.

+ dig +noedns +ignore api.nuget.org
; <<>> DiG 9.18.11 <<>> +noedns +ignore api.nuget.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42774
;; flags: qr tc rd ra; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;api.nuget.org.			IN	A
;; ANSWER SECTION:
api.nuget.org.		22	IN	CNAME	nugetapiprod.trafficmanager.net.
nugetapiprod.trafficmanager.net. 22 IN	CNAME	apiprod-mscdn.azureedge.net.
apiprod-mscdn.azureedge.net. 300 IN	CNAME	apiprod-mscdn.afd.azureedge.net.
apiprod-mscdn.afd.azureedge.net. 6 IN	CNAME	star-azureedge-prod.trafficmanager.net.
star-azureedge-prod.trafficmanager.net.	55 IN CNAME shed.dual-low.part-0012.t-0009.fdv2-t-msedge.net.
shed.dual-low.part-0012.t-0009.fdv2-t-msedge.net. 172 IN CNAME global-entry-afdthirdparty-fallback-first.trafficmanager.net.
global-entry-afdthirdparty-fallback-first.trafficmanager.net. 49 IN CNAME shed.dual-low.part-0012.t-0009.fb-t-msedge.net.
shed.dual-low.part-0012.t-0009.fb-t-msedge.net.	49 IN CNAME part-0012.t-0009.fb-t-msedge.net.
;; Query time: 0 msec
;; SERVER: 10.30.0.2#53(10.30.0.2) (UDP)
;; WHEN: Wed Feb 22 04:13:48 UTC 2023
;; MSG SIZE  rcvd: 366

Running the same query from us-west-2 gives back a correct response with an A record:

; <<>> DiG 9.18.11 <<>> +noedns +ignore api.nuget.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44616
;; flags: qr rd ra; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;api.nuget.org.			IN	A
;; ANSWER SECTION:
api.nuget.org.		179	IN	CNAME	nugetapiprod.trafficmanager.net.
nugetapiprod.trafficmanager.net. 179 IN	CNAME	apiprod-mscdn.azureedge.net.
apiprod-mscdn.azureedge.net. 300 IN	CNAME	apiprod-mscdn.afd.azureedge.net.
apiprod-mscdn.afd.azureedge.net. 30 IN	CNAME	star-azureedge-prod.trafficmanager.net.
star-azureedge-prod.trafficmanager.net.	10 IN CNAME shed.dual-low.part-0012.t-0009.fdv2-t-msedge.net.
shed.dual-low.part-0012.t-0009.fdv2-t-msedge.net. 30 IN	CNAME part-0012.t-0009.fdv2-t-msedge.net.
part-0012.t-0009.fdv2-t-msedge.net. 42 IN A	13.107.238.40
part-0012.t-0009.fdv2-t-msedge.net. 42 IN A	13.107.237.40
;; Query time: 0 msec
;; SERVER: 10.30.0.2#53(10.30.0.2) (UDP)
;; WHEN: Wed Feb 22 04:24:17 UTC 2023
;; MSG SIZE  rcvd: 356

This prevents Alpine-based Docker images running in us-east-1 with Route53 as the DNS resolver from communicating with NuGet. Swapping to an alternative DNS provider, such as Cloudflare at 1.1.1.1, or hardcoding api.nuget.org in /etc/hosts resolves the problem. It’s unclear whether this is a problem with AWS, NuGet, or a combination of the two.
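As a concrete sketch of both workarounds (the 1.1.1.1 resolver is Cloudflare’s public DNS; the pinned IP below is taken from the us-west-2 answer above and is purely illustrative, since CDN A records rotate):

# Workaround 1: point the container at a resolver that isn't affected
docker run --rm --dns 1.1.1.1 alpine wget api.nuget.org

# Workaround 2: skip DNS for this host entirely by pinning it in /etc/hosts
docker run --rm --add-host api.nuget.org:13.107.238.40 alpine wget api.nuget.org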

Maybe something has changed that caused NuGet’s DNS responses to increase in size, breaking Alpine? Comparing the responses above from us-east-1 vs us-west-2, the us-east-1 chain contains several additional CNAME entries. Alpine’s musl libc cannot handle DNS responses larger than 512 bytes, because it does not retry truncated UDP responses over TCP (see https://christoph.luppri.ch/fixing-dns-resolution-for-ruby-on-alpine-linux). The result is that we are unable to use any dotnet Alpine image to talk to NuGet from AWS us-east-1.
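One way to check the size hypothesis directly, using standard dig flags: repeat the query over TCP, which is not subject to the classic 512-byte UDP limit, and compare the flags and message size:

# Classic 512-byte UDP behaviour (mirrors what musl sends): tc flag set, no A records
dig +noedns +ignore api.nuget.org | grep -E 'flags|MSG SIZE'

# Same query over TCP: no truncation, full CNAME chain plus A records
dig +tcp api.nuget.org | grep -E 'flags|MSG SIZE'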

Repro Steps

Steps to reproduce:

  • Launch an EC2 instance with Docker installed in the AWS us-east-1 region.
  • Start any Alpine-based image on the instance.
  • Run wget api.nuget.org
  • Observe that hostname resolution fails (a one-line version of these steps is sketched below).
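The steps above collapse into a single command run from the instance (assuming Docker is installed; busybox wget resolves hostnames through musl, so it exercises exactly the failing path):

docker run --rm alpine wget api.nuget.org
# typically fails with: wget: bad address 'api.nuget.org'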

Expected Behavior

We can successfully resolve and call api.nuget.org (although without an appropriate path and credentials, the request itself will fail with an HTTP 4xx response).

Screenshots

No response

Additional Context and logs

We’ve detected this issue inside Bitbucket Pipelines, and can reproduce it directly on EC2 instances across unrelated AWS accounts where Route53 is used as the DNS resolver.

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Reactions: 4
  • Comments: 20 (10 by maintainers)

Most upvoted comments

Hi! @ggatus-atlassian, @nslusher-sf, @RichardD012, @KondorosiAttila, @Boojapho, @mhoeper, @ilia-cy. Our apologies for the inconvenience again! Please take a look at this issue https://github.com/NuGet/NuGetGallery/issues/9736 for the root cause and next steps. Feel free to reach out to us at support@nuget.org or by commenting on the discussion issue: https://github.com/NuGet/Home/discussions/12985. Thanks!

However, until this is implemented, will NuGet.org try not to exceed the 512-byte limit? Otherwise, we would plan to migrate off Alpine…

@mhoeper, it’s currently not possible for us to guarantee that the 512-byte limit will not be exceeded. After further conversations with our primary CDN provider, we learned this occurs during “shedding”: a relatively rare case in which the CDN determines it needs to provide an alternate DNS chain, likely due to high load in the area. That would align with the impacted customers being in a highly popular AWS region.

We’ve mitigated the current situation by using our secondary CDN provider, which happens to have a smaller DNS response size, but we can’t use that solution forever for scalability reasons. Given the relatively narrow scope of the impact (Alpine Linux plus AWS regions that encounter CDN shedding), we may need to revert to the previous state if no better solution is found.

From some research online, this seems to be a common problem for Alpine users (not just NuGet.org, not just .NET, not just Docker). I believe retrying over TCP is the proper solution for Alpine, but I can’t speak authoritatively since I’m not an expert in musl libc (Alpine’s libc implementation, which is where this problem originates) or in Alpine’s desired use cases. I also don’t know Alpine/musl’s timeline for addressing this problem; it is likely much longer than we want to keep relying on our secondary CDN provider.
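For context on what that fallback looks like elsewhere: glibc already retries truncated responses over TCP automatically, and can even be forced to use TCP for every lookup via a resolv.conf option; to my knowledge musl does not implement this option, which is the gap being described:

# /etc/resolv.conf on a glibc system: force DNS over TCP (glibc 2.14+)
options use-vc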

I’ll gently suggest moving to a non-Alpine Docker image in the short term to avoid any of these DNS problems. Alpine should probably be fine for runtime cases where NuGet.org is not needed, but for SDK cases it’s probably best to avoid Alpine until this issue is resolved one way or another.
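For instance, a quick way to verify that a non-Alpine image is unaffected (the tag below is illustrative; any Debian-based dotnet tag should behave the same, since glibc retries truncated responses over TCP):

# glibc-based SDK image: resolution succeeds even when the UDP answer is truncated
docker run --rm mcr.microsoft.com/dotnet/sdk:7.0 getent hosts api.nuget.org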

We’re speaking internally to our partners about alternatives both in the CDN space and in the Docker image configuration. I can’t guarantee any solution since these settings are outside of the NuGet.org team’s control.