go: net: 512 byte DNS response size limit causes "cannot unmarshal DNS" error
So, you found this issue by googling for “cannot unmarshal DNS”.
There’s good news: your issue has largely been fixed. I originally created the issue below because I ran into it on my own network and operating system, but further investigation showed that it has affected every major OS, as well as users of VPNs, DNS providers written in Go, and more.
If you are a maintainer of code and someone has reported this issue: if you can update your build system to use Go 1.16.15 or 1.17.8, or Go 1.18, then you should see this go away and solve your users’ issues.
If you are a user of a program and see this error, you need to ask the maintainer or creator of that program to do likewise. Unfortunately, there isn’t a single set of workaround instructions I can give. If you’re using a VPN, try running the program without the VPN; that is the most common user-reported scenario I’ve seen.
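For maintainers, one quick way to confirm that a build picked up the fix is to check the toolchain version at runtime. The following is a minimal sketch, not part of the original report: the helper name `includesFix` is hypothetical, and the thresholds are simply the releases named above (1.16.15, 1.17.8, 1.18). Prerelease and development version strings are deliberately reported as "unknown" (false).

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"strings"
)

// includesFix reports whether a Go version string such as "go1.17.8" is at
// least one of the releases named above (1.16.15, 1.17.8, or 1.18).
// Hypothetical helper: anything that is not a plain "goX.Y[.Z]" string
// (for example a devel or rc build) returns false, meaning "unknown".
func includesFix(version string) bool {
	parts := strings.Split(strings.TrimPrefix(version, "go"), ".")
	if len(parts) < 2 || len(parts) > 3 {
		return false
	}
	nums := make([]int, 3) // major, minor, patch (patch defaults to 0)
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return false
		}
		nums[i] = n
	}
	major, minor, patch := nums[0], nums[1], nums[2]
	switch {
	case major != 1:
		return major > 1
	case minor >= 18:
		return true
	case minor == 17:
		return patch >= 8
	case minor == 16:
		return patch >= 15
	default:
		return false
	}
}

func main() {
	v := runtime.Version()
	fmt.Printf("%s includes the larger DNS buffer: %t\n", v, includesFix(v))
}
```

For a binary that is already built, `go version path/to/binary` prints the toolchain it was compiled with, which can be compared against the same thresholds.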
Original bug report:
What version of Go are you using (go version)?
$ go version
go version go1.17.6 linux/amd64
Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (go env)?
Note: WSL2 on Windows. This is relevant, but it is not the only scenario in which the problem can occur; see below.
go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/friel/.cache/go-build"
GOENV="/home/friel/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/friel/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/friel/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/friel/.local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/friel/.local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.6"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/friel/go/src/github.com/pulumi/pulumi-yaml/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3112884807=/tmp/go-build -gno-record-gcc-switches"
What did you do?
Use infrastructure as code tools to manage Azure, and/or attempt to execute net.LookupIP("management.azure.com").
Example program:
package main

import (
	"fmt"
	"net"
)

func main() {
	ips, err := net.LookupIP("management.azure.com")
	if err != nil {
		panic(err)
	}
	for _, ip := range ips {
		fmt.Println(ip) // one address per line
	}
}
What did you expect to see?
I expected to see the current IP, 13.86.219.80, as shown by the last line of:
$ host management.azure.com
management.azure.com is an alias for management.privatelink.azure.com.
management.privatelink.azure.com is an alias for arm-frontdoor-prod.trafficmanager.net.
arm-frontdoor-prod.trafficmanager.net is an alias for westus.management.azure.com.
westus.management.azure.com is an alias for arm-frontdoor-westus.trafficmanager.net.
arm-frontdoor-westus.trafficmanager.net is an alias for westus.cs.management.azure.com.
westus.cs.management.azure.com is an alias for rpfd-prod-by-01.cloudapp.net.
rpfd-prod-by-01.cloudapp.net has address 13.86.219.80
What did you see instead?
$ go run resolve-test.go
panic: lookup management.azure.com on 172.20.32.1:53: cannot unmarshal DNS message
goroutine 1 [running]:
main.main()
/home/friel/c/resolve-test/resolve-test.go:11 +0xe8
exit status 2
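A program that wants to report this failure more helpfully than a panic could inspect the returned error. The sketch below is an illustration under assumptions rather than a recommended API: `isUnmarshalError` and `sampleError` are hypothetical helpers, and matching on the message text relies on the wording shown in the output above, which is not a stable interface. `net.DNSError` and `errors.As` are standard library.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strings"
)

// isUnmarshalError reports whether err wraps a *net.DNSError whose message
// says the resolver could not parse the DNS response. The string match is an
// assumption based on the error text in the report.
func isUnmarshalError(err error) bool {
	var dnsErr *net.DNSError
	if !errors.As(err, &dnsErr) {
		return false
	}
	return strings.Contains(dnsErr.Err, "cannot unmarshal DNS message")
}

// sampleError builds an error shaped like the one in the report, so the
// check can be exercised without touching the network.
func sampleError() error {
	return &net.DNSError{
		Err:    "cannot unmarshal DNS message",
		Name:   "management.azure.com",
		Server: "172.20.32.1:53",
	}
}

func main() {
	_, err := net.LookupIP("management.azure.com")
	switch {
	case err == nil:
		fmt.Println("lookup succeeded")
	case isUnmarshalError(err):
		fmt.Println("DNS response could not be parsed; it may be larger than this Go build accepts:", err)
	default:
		fmt.Println("lookup failed for another reason:", err)
	}
}
```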
Miscellany
It looks like this issue is widely affecting infrastructure as code tools such as Pulumi, Terraform, and others when they make API calls to Microsoft Azure on the Windows Subsystem for Linux 2, on Microsoft Windows.
This is a bit of a rock and a hard place situation. Microsoft is unlikely to update their DNS server to adhere to the pre-1999 DNS specification. The Go language team is in a position to be much more agile and issue a point release update to support a larger buffer size; even going up to a single standard MTU of ~1500 bytes would resolve this issue in the near term.
As this problem primarily affects programs written in Go, in this author’s estimation it seems unlikely that a change in Windows’ DNS server behavior could occur as quickly, even if the stars were to align on the need to change the implementation. Note that host, dig, nslookup, etc. all behave correctly.
Collected notes and root cause analysis:
1. Microsoft Windows WSL2 uses a DNS server that sends additional metadata, so its responses can exceed the 512-byte response size mandated by the DNS RFC (https://github.com/microsoft/WSL/issues/5806, https://github.com/microsoft/WSL/issues/7642)
2. Go’s DNS resolver in the net package applies a strict 512-byte limit to the buffer it will fill with a response (https://github.com/golang/go/issues/21160, https://github.com/golang/go/issues/44135)
3. Microsoft Azure appears to have added a new CNAME to their management.azure.com endpoint, likely within the last week (?), pushing the response size over the 512-byte limit because of (1) and, because of (2), causing a “cannot unmarshal DNS message” error. (https://github.com/microsoft/WSL/issues/5806#issuecomment-1034053175, https://github.com/golang/go/issues/44135#issuecomment-1034062312, https://github.com/microsoft/WSL/issues/8022, https://github.com/microsoft/WSL/issues/7642#issuecomment-1032751472)
DNS Flag Day 2020 had an explicit goal of ensuring that resolvers had a minimum accepted buffer size of 1232 bytes: https://dnsflagday.net/2020/#action-dns-resolver-operators
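To get a feel for how little headroom the alias chain leaves, the arithmetic can be sketched as follows. This is a rough estimate under stated assumptions, not a measurement: it takes the chain printed by `host` above, assumes no name compression and no additional records, and uses the RFC 1035 layout (12-byte header, 10 fixed bytes per resource record). The helper names `wireLen` and `uncompressedSize` are hypothetical.

```go
package main

import "fmt"

// chain is the alias chain printed by host(1) in the report.
var chain = []string{
	"management.azure.com",
	"management.privatelink.azure.com",
	"arm-frontdoor-prod.trafficmanager.net",
	"westus.management.azure.com",
	"arm-frontdoor-westus.trafficmanager.net",
	"westus.cs.management.azure.com",
	"rpfd-prod-by-01.cloudapp.net",
}

const (
	headerLen  = 12 // fixed DNS header
	rrFixedLen = 10 // TYPE, CLASS, TTL, and RDLENGTH of each resource record
)

// wireLen returns the uncompressed wire-format length of a non-empty domain
// name written without a trailing dot: each dot becomes a length byte, plus
// one leading length byte and one terminating zero byte.
func wireLen(name string) int {
	return len(name) + 2
}

// uncompressedSize estimates the size of a response carrying the CNAME chain
// followed by one A record. It assumes a non-empty chain, no name
// compression, and no additional records.
func uncompressedSize(chain []string) int {
	size := headerLen + wireLen(chain[0]) + 4 // header + question (QTYPE, QCLASS)
	for i := 0; i+1 < len(chain); i++ {
		// One CNAME record: owner name, fixed fields, target name.
		size += wireLen(chain[i]) + rrFixedLen + wireLen(chain[i+1])
	}
	// Final A record: owner name, fixed fields, 4-byte IPv4 address.
	size += wireLen(chain[len(chain)-1]) + rrFixedLen + 4
	return size
}

func main() {
	size := uncompressedSize(chain)
	fmt.Printf("estimated size: %d bytes (over 512: %t, over 1232: %t)\n",
		size, size > 512, size > 1232)
}
```

Under these assumptions the estimate comes to 544 bytes. Real servers compress repeated name suffixes, so an actual response is smaller, but any extra records or metadata eat into the remaining margin below 512 bytes.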
About this issue
- State: closed
- Created 2 years ago
- Reactions: 17
- Comments: 38 (18 by maintainers)
Commits related to this issue
- net/dns: Increase UDP response buffer to 1232 bytes This resolves #51127 in the near term by defaulting to a larger buffer size. This is not a permanent fix or implementation of EDNS(0) or [IETF RFC6... — committed to AaronFriel/go by AaronFriel 2 years ago
- net: send EDNS(0) packet length in DNS query We used to only accept up to 512 bytes in a DNS packet, per RFC 1035. Increase the size we accept to 1232 bytes, per https://dnsflagday.net/2020/, and adv... — committed to golang/go by ianlancetaylor 2 years ago
- Revert "net: send EDNS(0) packet length in DNS query" This reverts https://go.dev/cl/385035. For 1.18 we will use a simple change to increase the accepted DNS packet size, to handle what appear to be... — committed to golang/go by ianlancetaylor 2 years ago
- net: increase maximum accepted DNS packet to 1232 bytes The existing value of 512 bytes as is specified by RFC 1035. However, the WSL resolver reportedly sends larger packets without setting the trun... — committed to golang/go by ianlancetaylor 2 years ago
- [release-branch.go1.17] net: increase maximum accepted DNS packet to 1232 bytes The existing value of 512 bytes as is specified by RFC 1035. However, the WSL resolver reportedly sends larger packets ... — committed to golang/go by ianlancetaylor 2 years ago
- [release-branch.go1.16] net: increase maximum accepted DNS packet to 1232 bytes The existing value of 512 bytes as is specified by RFC 1035. However, the WSL resolver reportedly sends larger packets ... — committed to golang/go by ianlancetaylor 2 years ago
- net: send EDNS(0) packet length in DNS query Advertise to DNS resolvers that we are willing and able to accept up to 1232 bytes in a DNS packet. The value 1232 was chosen based on https://dnsflagday.... — committed to golang/go by ianlancetaylor 2 years ago
- Bump Go version to 1.16.15, prepare 3.2.2 release Reference: https://github.com/golang/go/issues/51127 Reference: https://github.com/hashicorp/terraform-provider-dns/issues/157 Reference: https://git... — committed to hashicorp/terraform-provider-dns by bflad 2 years ago
- Bump Go version to 1.16.15, prepare 3.2.2 release (#199) Reference: https://github.com/golang/go/issues/51127 Reference: https://github.com/hashicorp/terraform-provider-dns/issues/157 Reference: ht... — committed to hashicorp/terraform-provider-dns by bflad 2 years ago
Thanks for the report.
Why not? It’s been a while since I’ve read DNS RFCs, but my impression is still today that DNS servers are not allowed to send >512-byte responses unless the client explicitly indicates support for such using EDNS.
As such, I feel like emphasizing “pre-1999” is unfair. I think Microsoft should update their DNS server to adhere to the DNS specification. I’d prefer we don’t add hacks to accommodate non-spec behavior.
However, #6464 remains open if someone wants to update Go’s DNS client to use EDNS, and to support+advertise a larger buffer size. I think that’s the standards-conforming way to address this issue, if folks aren’t willing to wait on the issue being fixed in WSL2.
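For readers unfamiliar with what "advertising a larger buffer size" looks like on the wire, here is a minimal sketch of a query carrying an EDNS(0) OPT pseudo-record, following the RFC 6891 layout in which the record's CLASS field holds the UDP payload size the client will accept. It is hand-rolled for illustration only and is not how Go's resolver is implemented; `buildQuery` and its helpers are hypothetical names, and label length limits are not validated.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	typeA   = 1  // A record query type
	typeOPT = 41 // EDNS(0) OPT pseudo-record type
	classIN = 1  // Internet class
)

// appendUint16 appends v in network (big-endian) byte order.
func appendUint16(b []byte, v uint16) []byte {
	return append(b, byte(v>>8), byte(v))
}

// appendName appends a domain name as length-prefixed labels followed by a
// zero byte. It does not check the 63-byte label limit.
func appendName(b []byte, name string) []byte {
	for _, label := range strings.Split(strings.TrimSuffix(name, "."), ".") {
		if label == "" {
			continue
		}
		b = append(b, byte(len(label)))
		b = append(b, label...)
	}
	return append(b, 0)
}

// buildQuery returns an A query for name with one OPT record in the
// additional section advertising udpSize as the accepted payload size.
func buildQuery(id uint16, name string, udpSize uint16) []byte {
	var b []byte
	b = appendUint16(b, id)
	b = appendUint16(b, 0x0100) // flags: recursion desired
	b = appendUint16(b, 1)      // QDCOUNT: one question
	b = appendUint16(b, 0)      // ANCOUNT
	b = appendUint16(b, 0)      // NSCOUNT
	b = appendUint16(b, 1)      // ARCOUNT: the OPT record

	b = appendName(b, name) // question section
	b = appendUint16(b, typeA)
	b = appendUint16(b, classIN)

	b = append(b, 0) // OPT owner name is the root
	b = appendUint16(b, typeOPT)
	b = appendUint16(b, udpSize) // CLASS field carries the UDP payload size
	b = append(b, 0, 0, 0, 0)    // TTL field: extended RCODE, version 0, flags
	b = appendUint16(b, 0)       // RDLENGTH: no options
	return b
}

func main() {
	q := buildQuery(0x1234, "management.azure.com", 1232)
	fmt.Printf("%d bytes: % x\n", len(q), q)
}
```

The value 1232 mirrors the DNS Flag Day 2020 recommendation referenced in the report; a server that honors EDNS(0) may then send UDP responses up to that size instead of truncating at 512 bytes.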
@seankhliao
I would push back on the notion that this should be resolved elsewhere.
Go is the exception to behaving correctly: other userland programs such as dig(1), nslookup(1), host(1), as well as glibc API calls such as getaddrinfo(3) work. I can write Python, C#, Rust, C, etc, and those will work correctly in this networking environment.
Go is adhering strictly to an antiquated standard. EDNS0 has been a standard since 1999, and larger responses are not a new specification or the result of rapidly moving network standards or the ground shifting under Go. Strict adherence to 512-byte responses is not followed by other tools in the same ecosystem. Go ought to “be liberal in what it accepts”, within reason and, of course, unless doing so would violate memory safety or other safety criteria of the software.
End-users are not in a position to solve their upstream DNS server’s issues, nor are software maintainers. We don’t have control over our end user’s DNS servers.
This error isn’t unique to the situation I described, it’s just most acute right now for those users in the specific scenario I documented. 112 issues have been reported on GitHub with the text “cannot unmarshal DNS”, and a survey of those shows that they have occurred across all platforms and among extraordinarily widely used pieces of software across Mac, Windows, *nix. Those issues show that various other VPN providers, ISPs, routers, have all behaved similarly. And going back to the earlier points, users don’t have control over those things and we shouldn’t expect all Go software users to be software engineers or to be able to modify their DNS configuration.
Lastly, I strongly believe that software that works is superior to software that does not, and end-users of the software will not care what link in the chain is causing it not to work.
There is an opportunity to mitigate an issue end-users are facing in one place, I think bringing Golang into alignment with the rest of the ecosystem will positively impact users.
@ianlancetaylor First, you’re right, the WSL2 DNS server is out of spec. No question there.
Second, let’s take a step back - this isn’t a WSL2 specific issue. Fixing the acute issue users are facing in WSL2 is WSL2 specific, but I’d encourage you to read the many, many comments on GitHub issues. https://github.com/search?o=asc&q=%22cannot+unmarshal+DNS%22&s=created&type=Issues
Starting with these issues which predate WSL2.
I’m using a red circle to indicate that a user’s problem was never solved, a yellow circle to indicate that a workaround was implemented to mitigate customer issues, but didn’t root cause them, and a green circle when a project that is actually a DNS server solved the issue. I’m also using GitHub Markdown’s list notation to provide partially unfurled data about the link destination via just pasting in URLs.
Consul
Confd
Docker
Kubernetes
Weave
rakyll/drive / odeke-em/drive
Mesos, again
Resolvable, a Docker DNS resolver
Goproxy
Moby / then Docker
freegeoip
heroku
clair
Docker for Mac
gorush application server
Docker for Mac
We’ve identified two ways to do that already: have WSL2 fix their DNS server (https://github.com/microsoft/WSL/issues/7642), or implement #6464.
Workaround: We were able to work around the problem by adding a DNS entry to the hosts file:
51.107.60.33 management.azure.com
When using WSL, the hosts file can be edited in Windows at %windir%\system32\drivers\etc\hosts; then restart WSL. So at least we could use Terraform again.
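For maintainers who cannot yet rebuild with a newer Go release, a programmatic alternative to editing the hosts file might be to make Go's built-in resolver speak DNS over TCP, where messages are length-prefixed and the 512-byte UDP limit does not apply. This is an untested sketch for the environments discussed here: it assumes the upstream DNS server answers over TCP and that the pure Go resolver is in use. `net.Resolver`, its `PreferGo` and `Dial` fields, and `net.Dialer.DialContext` are standard library; `newTCPResolver` is a hypothetical helper name.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newTCPResolver returns a resolver that uses Go's built-in DNS client and
// always dials the DNS server over TCP, ignoring the network the resolver
// asked for (normally "udp").
func newTCPResolver() *net.Resolver {
	return &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 5 * time.Second}
			return d.DialContext(ctx, "tcp", address)
		},
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	addrs, err := newTCPResolver().LookupHost(ctx, "management.azure.com")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, addr := range addrs {
		fmt.Println(addr)
	}
}
```

Unlike the hosts-file entry, this does not pin an IP address that may later change, but it only affects code paths that use the custom resolver explicitly.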
At this stage of DNS I don’t see a reason to make EDNS(0) opt-in. It was always intended to be fully backward compatible. The edns0 option was added to glibc in 2007. I think it’s safe to use by default today.
Understood, though I’d like to chat with someone on the Go language team about the scope & impact of this issue. It’s affecting customers of major Go language-built software & has for about seven years. It’s particularly acute because, I suspect, none of the players wants to take responsibility for fixing this.
End users do not care why their software is broken, but we have an opportunity here to address, at least partially, thousands of issues raised by users over the past 7 years. And if the Pareto principle is applicable here, I suspect those users knowledgeable enough and motivated enough to comment on GitHub are just a fraction of those impacted.