go: net: 512 byte DNS response size limit causes "cannot unmarshal DNS" error

So, you found this issue by googling for “cannot unmarshal DNS”.

There’s good news: your issue has largely been fixed. I originally filed the issue below because I hit it on my own network and operating system, but further investigation showed that it has affected every major OS, users of VPNs, DNS providers written in Go, and more.

If you are a maintainer of code and someone has reported this issue: if you can update your build system to use Go 1.16.15, Go 1.17.8, or Go 1.18, the error should go away and your users’ issues should be resolved.

If you are a user of a program and see this error, you need to ask the maintainer or creator of that program to do likewise. Unfortunately, there isn’t a single set of instructions I can give for a workaround. If you’re using a VPN, try running the program without the VPN; that is the most common user-reported scenario I’ve seen.


Original bug report:

What version of Go are you using (go version)?

$ go version
go version go1.17.6 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

Note: WSL2 on Windows. This is relevant, but it is not the only scenario in which the error can occur; see below.

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/friel/.cache/go-build"
GOENV="/home/friel/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/friel/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/friel/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/friel/.local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/friel/.local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.6"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/friel/go/src/github.com/pulumi/pulumi-yaml/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3112884807=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Use infrastructure as code tools to manage Azure, and/or attempt to execute net.LookupIP("management.azure.com").

Example program:

package main

import (
	"fmt"
	"net"
)

func main() {
	ips, err := net.LookupIP("management.azure.com")
	if err != nil {
		panic(err)
	}
	for _, ip := range ips {
		fmt.Printf("%v\n", ip)
	}
}

What did you expect to see?

I expected to see the current IP, 13.86.219.80, as shown by the last line of:

$ host management.azure.com
management.azure.com is an alias for management.privatelink.azure.com.
management.privatelink.azure.com is an alias for arm-frontdoor-prod.trafficmanager.net.
arm-frontdoor-prod.trafficmanager.net is an alias for westus.management.azure.com.
westus.management.azure.com is an alias for arm-frontdoor-westus.trafficmanager.net.
arm-frontdoor-westus.trafficmanager.net is an alias for westus.cs.management.azure.com.
westus.cs.management.azure.com is an alias for rpfd-prod-by-01.cloudapp.net.
rpfd-prod-by-01.cloudapp.net has address 13.86.219.80

What did you see instead?

$ go run resolve-test.go 
panic: lookup management.azure.com on 172.20.32.1:53: cannot unmarshal DNS message

goroutine 1 [running]:
main.main()
        /home/friel/c/resolve-test/resolve-test.go:11 +0xe8
exit status 2

Miscellany

It looks like this issue is widely affecting infrastructure as code tools such as Pulumi, Terraform, and others when they make API calls to Microsoft Azure on the Windows Subsystem for Linux 2, on Microsoft Windows.

This is a bit of a rock-and-a-hard-place situation. Microsoft is unlikely to update their DNS server to adhere to the pre-1999 DNS specification. The Go language team is in a position to be much more agile and issue a point release that supports a larger buffer size; even going up to a single standard MTU of ~1500 bytes would resolve this issue in the near term.

As this problem primarily affects programs written in Go, in this author’s estimation a change in Windows’ DNS server behavior is unlikely to arrive as quickly, even if the stars were to align on the need to change the implementation. Note that host, dig, nslookup, etc. all behave correctly.

Collected notes and root cause analysis:

DNS Flag Day 2020 had an explicit goal of ensuring that resolvers had a minimum accepted buffer size of 1232 bytes: https://dnsflagday.net/2020/#action-dns-resolver-operators
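For context, a client advertises its UDP buffer size by adding an EDNS0 OPT pseudo-record (RFC 6891) to the additional section of its query; the record's CLASS field carries the size. The sketch below builds such a query by hand with only the standard library. It is not how Go's resolver is implemented, the name `buildEDNSQuery` is hypothetical, and it does no validation of label lengths.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	typeA   = 1  // QTYPE for an IPv4 address record
	typeOPT = 41 // EDNS0 OPT pseudo-record type (RFC 6891)
	classIN = 1  // QCLASS for the Internet class
)

// appendUint16 appends v to b in network (big-endian) byte order.
func appendUint16(b []byte, v uint16) []byte {
	return append(b, byte(v>>8), byte(v))
}

// buildEDNSQuery returns a raw A query for name that advertises udpSize
// as the largest UDP response the sender can accept.
// Hypothetical helper; labels longer than 63 bytes are not rejected.
func buildEDNSQuery(id uint16, name string, udpSize uint16) []byte {
	msg := make([]byte, 0, 64)
	msg = appendUint16(msg, id)
	msg = appendUint16(msg, 0x0100) // flags: recursion desired
	msg = appendUint16(msg, 1)      // QDCOUNT: one question
	msg = appendUint16(msg, 0)      // ANCOUNT
	msg = appendUint16(msg, 0)      // NSCOUNT
	msg = appendUint16(msg, 1)      // ARCOUNT: the OPT record

	// Question section: length-prefixed labels, then a zero byte for the root.
	for _, label := range strings.Split(strings.TrimSuffix(name, "."), ".") {
		msg = append(msg, byte(len(label)))
		msg = append(msg, label...)
	}
	msg = append(msg, 0)
	msg = appendUint16(msg, typeA)
	msg = appendUint16(msg, classIN)

	// OPT pseudo-record in the additional section.
	msg = append(msg, 0)             // owner name: root
	msg = appendUint16(msg, typeOPT) // TYPE
	msg = appendUint16(msg, udpSize) // CLASS field: requestor's UDP payload size
	msg = append(msg, 0, 0, 0, 0)    // TTL field: extended RCODE, version 0, no flags
	msg = appendUint16(msg, 0)       // RDLENGTH: no options
	return msg
}

func main() {
	q := buildEDNSQuery(0x1234, "management.azure.com", 1232)
	fmt.Printf("%d bytes, OPT record: % x\n", len(q), q[len(q)-11:])
}
```

A server that honors EDNS0 may then send UDP responses up to the advertised size instead of truncating at 512 bytes.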

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 17
  • Comments: 38 (18 by maintainers)

Most upvoted comments

Thanks for the report.

“Microsoft is unlikely to update their DNS server to adhere to the pre-1999 DNS specification.”

Why not? It’s been a while since I’ve read DNS RFCs, but my impression is still today that DNS servers are not allowed to send >512-byte responses unless the client explicitly indicates support for such using EDNS.

As such, I feel like emphasizing “pre-1999” is unfair. I think Microsoft should update their DNS server to adhere to the DNS specification. I’d prefer we don’t add hacks to accommodate non-spec behavior.

However, #6464 remains open if someone wants to update Go’s DNS client to use EDNS, and to support+advertise a larger buffer size. I think that’s the standards-conforming way to address this issue, if folks aren’t willing to wait on the issue being fixed in WSL2.

@seankhliao

I would push back on the notion that this should be resolved elsewhere.

Go is the exception to behaving correctly: other userland programs such as dig(1), nslookup(1), host(1), as well as glibc API calls such as getaddrinfo(3) work. I can write Python, C#, Rust, C, etc, and those will work correctly in this networking environment.

Go is adhering strictly to an antiquated standard. EDNS0 has been a standard since 1999; larger responses are not a new specification, nor the result of rapidly moving network standards or the ground shifting under Go. Strict adherence to 512-byte responses is not followed by other tools in the same ecosystem. Go ought to “be liberal in what it accepts”, within reason, unless doing so would violate memory safety or other safety criteria of the software.

End users are not in a position to solve their upstream DNS server’s issues, nor are software maintainers. We don’t have control over our end users’ DNS servers.

This error isn’t unique to the situation I described; it’s just most acute right now for users in the specific scenario I documented. 112 issues have been reported on GitHub with the text “cannot unmarshal DNS”, and a survey of those shows that they have occurred across all platforms (Mac, Windows, *nix) and in extraordinarily widely used pieces of software. Those issues show that various VPN providers, ISPs, and routers have all behaved similarly. Going back to the earlier points, users don’t have control over those things, and we shouldn’t expect all users of Go software to be software engineers or to be able to modify their DNS configuration.

Lastly, I strongly believe that software that works is superior to software that does not, and end-users of the software will not care what link in the chain is causing it not to work.

There is an opportunity to mitigate, in one place, an issue end users are facing. I think bringing Go into alignment with the rest of the ecosystem will positively impact users.

@ianlancetaylor First, you’re right, the WSL2 DNS server is out of spec. No question there.

Second, let’s take a step back: this isn’t a WSL2-specific issue. Fixing the acute issue users are facing in WSL2 is WSL2-specific, but I’d encourage you to read the many, many comments on GitHub issues: https://github.com/search?o=asc&q=%22cannot+unmarshal+DNS%22&s=created&type=Issues

Starting with these issues which predate WSL2.

I’m using a red circle to indicate that a user’s problem was never solved, a yellow circle to indicate that a workaround was implemented to mitigate customer issues, but didn’t root cause them, and a green circle when a project that is actually a DNS server solved the issue. I’m also using GitHub Markdown’s list notation to provide partially unfurled data about the link destination via just pasting in URLs.

Consul

Confd

Docker

Kubernetes

Weave

rakyll/drive / odeke-em/drive

Mesos, again

Resolvable, a Docker DNS resolver

Goproxy

Moby / then Docker

  • 🔴 Various users report DNS not working; the workaround posted near the bottom can hardly be called such. 33 comments, with more than a dozen users reporting issues. This was in 2016, so users were on various Linux distributions. https://github.com/moby/moby/issues/20037
  • This is still an open issue.

freegeoip

  • 🔴 User gets an error when trying to perform a Get in a go application using Docker. https://github.com/fiorix/freegeoip/issues/160
  • “Could have been. I’ve literally just changed service providers from yesterday so I’m using different DNS servers. The error has gone away. Odd.”

heroku

clair

  • 🔴 user has “unmarshal DNS…” error on www.redhat.com, cannot use container scanning CLI tool clair https://github.com/quay/clair/issues/171
  • Clair maintainers don’t control user’s DNS server.

Docker for Mac

gorush application server

Docker for Mac

I think we should try, here, to solve customer, end-user problems.

We’ve identified two ways to do that already: have WSL2 fix their DNS server (https://github.com/microsoft/WSL/issues/7642), or implement #6464.

Workaround: We were able to work around the problem by adding an entry to the hosts file:

51.107.60.33 management.azure.com

When using WSL, the hosts file can be edited in Windows at %windir%\system32\drivers\etc\hosts; restart WSL afterwards. So at least we could use Terraform again.

At this stage of DNS I don’t see a reason to make EDNS(0) opt-in. It was always intended to be fully backward compatible. The edns0 option was added to glibc in 2007. I think it’s safe to use by default today.

Understood, though I’d like to chat with someone on the Go language team about the scope & impact of this issue. It’s affecting customers of major Go language-built software & has for about seven years. It’s particularly acute because, I suspect, none of the players wants to take responsibility for fixing this.

End users do not care why their software is broken, but we have an opportunity here to address, at least partially, thousands of issues raised by users over the past 7 years. And if the Pareto principle is applicable here, I suspect those users knowledgeable enough and motivated enough to comment on GitHub are just a fraction of those impacted.