go: runtime: mlock of signal stack failed: 12

What version of Go are you using (go version)?

$ go version
go version go1.14rc1 linux/amd64

Does this issue reproduce with the latest release?

I hit this with the golang:1.14-rc-alpine Docker image; the error does not happen with 1.13.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build968395959=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Clone https://github.com/ethereum/go-ethereum, replace the builder image in the Dockerfile with golang:1.14-rc-alpine (or use the Dockerfile below), then build the Docker image from the repository root:

$ docker build .

FROM golang:1.14-rc-alpine

RUN apk add --no-cache make gcc musl-dev linux-headers git

ADD . /go-ethereum
RUN cd /go-ethereum && make geth

What did you expect to see?

Go should run our build scripts successfully.

What did you see instead?

Step 4/9 : RUN cd /go-ethereum && make geth
 ---> Running in 67781151653c
env GO111MODULE=on go run build/ci.go install ./cmd/geth
runtime: mlock of signal stack failed: 12
runtime: increase the mlock limit (ulimit -l) or
runtime: update your kernel to 5.3.15+, 5.4.2+, or 5.5+
fatal error: mlock failed

runtime stack:
runtime.throw(0xa3b461, 0xc)
	/usr/local/go/src/runtime/panic.go:1112 +0x72
runtime.mlockGsignal(0xc0004a8a80)
	/usr/local/go/src/runtime/os_linux_x86.go:72 +0x107
runtime.mpreinit(0xc000401880)
	/usr/local/go/src/runtime/os_linux.go:341 +0x78
runtime.mcommoninit(0xc000401880)
	/usr/local/go/src/runtime/proc.go:630 +0x108
runtime.allocm(0xc000033800, 0xa82400, 0x0)
	/usr/local/go/src/runtime/proc.go:1390 +0x14e
runtime.newm(0xa82400, 0xc000033800)
	/usr/local/go/src/runtime/proc.go:1704 +0x39
runtime.startm(0x0, 0xc000402901)
	/usr/local/go/src/runtime/proc.go:1869 +0x12a
runtime.wakep(...)
	/usr/local/go/src/runtime/proc.go:1953
runtime.resetspinning()
	/usr/local/go/src/runtime/proc.go:2415 +0x93
runtime.schedule()
	/usr/local/go/src/runtime/proc.go:2527 +0x2de
runtime.mstart1()
	/usr/local/go/src/runtime/proc.go:1104 +0x8e
runtime.mstart()
	/usr/local/go/src/runtime/proc.go:1062 +0x6e

...
make: *** [Makefile:16: geth] Error 2

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 29
  • Comments: 131 (90 by maintainers)

Most upvoted comments

The kernel bug manifested as random memory corruption in Go 1.13 (both with and without preemptive scheduling). What is new in Go 1.14 is that we detect the presence of the bug, attempt to work around it, and prefer to crash early and loudly if that is not possible. You can see the details in the issue I referred you to.

Since you have called me dishonest and nasty, I will remind you again about the code of conduct: https://golang.org/conduct. I am also done participating in this conversation.

So, the official solution to Go crashing is to point fingers at everyone else and have them hack around your code? Makes sense.

A little summary because I had to piece it together myself:

So it seems like Ubuntu’s kernel is patched, but the workaround gets enabled anyway.

This issue does not happen with Go 1.13. Ergo, it is a bug introduced in Go 1.14.

Saying you can’t fix it and telling people to use workarounds is dishonest, because reverting a piece of code would actually fix it. An alternative solution would be to detect the problematic platforms/kernels and provide a fallback mechanism baked into Go.

Telling people to use a different kernel is especially nasty, because it’s not as if most people can go around and build themselves a new kernel. If Alpine doesn’t release a new kernel, there’s not much most devs can do. And lastly, if your project relies on stable infrastructure where you can’t just swap out kernels, you’re again in a pickle.

It is standard practice to redirect issues to the correct issue tracking system.

The fact that Go crashes is not the fault of Docker. Redirecting a Go crash to a Docker repo is deflection.

@karalabe I would like to remind you of https://golang.org/conduct. In particular, please be respectful and be charitable.

Please answer the question

Based on discussion with @aclements, @dr2chase, @randall77, and others, our plan for the 1.14.1 release is:

  • write a wiki page describing the problem
  • continue to use mlock on a kernel version that may be buggy
  • if mlock fails, silently note that fact and continue executing
  • if we see an unexpected SIGSEGV or SIGBUS, and mlock failed, then in the crash stack trace point people at the wiki page

The hope is that this will provide a good combination of executing correctly in the normal case while directing people on potentially buggy kernels to information that helps them decide whether the problem is their kernel, their program, or a bug in Go itself.

This can also be combined with better attempts to identify whether a particular kernel has been patched, based on the uname version field (we currently only check the release field).
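
For illustration, the plan above amounts to roughly the following (a sketch with assumed names such as kernelLooksVulnerable; this is not the actual runtime code):

package main

import (
    "fmt"

    "golang.org/x/sys/unix"
)

// mlockFailed records that the workaround could not be applied.
var mlockFailed bool

// kernelLooksVulnerable stands in for the uname-release check the runtime
// already performs today.
func kernelLooksVulnerable() bool { return true }

// mlockSignalStack tries the workaround on possibly buggy kernels; if the
// mlock limit is too low, it silently notes the failure and keeps executing.
func mlockSignalStack(stack []byte) {
    if !kernelLooksVulnerable() {
        return
    }
    if err := unix.Mlock(stack); err != nil {
        mlockFailed = true
    }
}

// reportFatalSignal shows where a crash report would point people at the
// wiki page, but only if mlock had already failed.
func reportFatalSignal(sig string) {
    fmt.Println("fatal error: unexpected", sig)
    if mlockFailed {
        fmt.Println("runtime note: mlock of signal stack failed earlier; your kernel may have the signal-stack bug (see the wiki page)")
    }
}

func main() {
    stack := make([]byte, 4096)
    mlockSignalStack(stack)
    reportFatalSignal("SIGSEGV")
}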

Everyone please keep in mind that successful communication is hard and a skill one needs to practice. Emotions can work against us and hinder our goal of successful communication, I’ve been there myself. Yes, there has been a violation of the code of conduct and pointing it out is good. A voluntary apology is also helpful. Now let’s try to make sure that every post has a positive net impact on collaboration and solving this issue.

@networkimprov Disabling signal preemption makes the bug less likely to occur but it is still present. It’s a bug in certain Linux kernel versions. The bug affects all programs in all languages. It’s particularly likely to be observable with Go programs that use signal preemption, but it’s present for all other programs as well.

Go tries to work around the bug by mlocking the signal stack. That works fine unless you run into the mlock limit. I suppose that one downside of this workaround is that we make the problem very visible, rather than occasionally failing due to random memory corruption as would happen if we didn’t do the mlock.

At some point there is no way to work around a kernel bug.

I’m on the latest Ubuntu and the latest available kernel. Based on the error message, apparently all available Ubuntu kernels (https://packages.ubuntu.com/search?keywords=linux-image-generic) are unsuitable for Go 1.14.

And the pile keeps on piling. See, this is why I got angry at the beginning of this thread (which was a bad mistake on my part, I agree). Even though I made a lot of effort to explain, and provided a repro showing that this is a blocker, I was shut down so as not to interfere with the release. Even after it was clear that it’s not a Docker issue.

Now we’re in a much worse place, since various projects are blacklisting Go 1.14. This bug is currently slated to be fixed only in Go 1.15. Based on the issues linked above, are we confident that it’s a good idea to postpone this by 8 months? I think it would be nice to acknowledge the mess-up and try to fix it in a patch release, rather than wait for more projects to be bitten.

Yes, I’m aware that I’m just nagging people here instead of fixing it myself. I’m sorry I can’t contribute more meaningfully; I just don’t want to fragment the ecosystem. Go modules were already a blow to many projects; let’s not double down with yet another quirk that tools need to become aware of.

I encountered this issue when running go applications on a self-hosted Kubernetes cluster. I was able to allow the mlock workaround to take effect by increasing the relevant ulimit. However, as the process for changing ulimits for Docker containers running in Kubernetes isn’t exactly easy to find, it might help someone else to put the details here.

  1. Update /etc/security/limits.conf to include something like

     * - memlock unlimited

  2. Update /etc/docker/daemon.json to include

     "default-ulimits": { "memlock": { "name": "memlock", "hard": -1, "soft": -1 } }

  3. Restart Docker/Kubernetes and bring your pods back up.

  4. Enter a running container and verify that the ulimit has been increased:

     $ kubectl exec -it pod-name -- /bin/sh
     / # ulimit -l
     unlimited

You may be able to get away with using something more subtle than the unlimited hammer; for me, a limit of 128KiB (131072) seemed to work.

It’s “maintained by the Docker Community”. Issues should be filed at

https://github.com/docker-library/golang/issues

EDIT: the problem is the host kernel, not the Docker library image, so they can’t fix it.

Well, @randall77 / @ianlancetaylor, I tend to disagree that this is a golang issue at all. Golang discovered the memory corruption issue, but it is a very severe kernel bug.

As such, it should be escalated through the usual kernel channels. Distributions picked up the patch and shipped it. It was backported. Every new installation will get an unaffected kernel. If you roll your own kernel, you have to do that work yourself. As usual.

Be helpful to users that hit it, as helpful as possible. But I don’t think it is golang’s responsibility to fix a kernel bug or even to force users to apply the patch.

@karalabe, in as strong terms as I can muster: you were not “shut down to not interfere with the release”. Josh, who originally closed the issue, is not on the Go team at Google (i.e. not a decision maker), nor am I. We initially assumed that the Docker project could (and should) mitigate the problem in their build. When it became clear they couldn’t, I promptly raised this on golang-dev.

Furthermore, I was the first to note that the problem stems from your host kernel, not the Docker module. You didn’t mention you’re on Ubuntu until I pointed that out.

I think you owe us yet another apology, after that note.

EDIT: I also asked you to remove the goroutine stack traces (starting goroutine x [runnable]) from your report, as they make the page difficult to read/navigate. [Update: Russ has edited out the stacks.]

Just a heads up: Go 1.15 is about to be released, and a beta is already out, but the temporary workaround has not yet been removed (there are TODO comments to remove it at Go 1.15).

I think it is important to remove the workaround since Ubuntu 20.04 LTS uses a patched 5.4.0 kernel. This means that any user on Ubuntu 20.04 will still unnecessarily mlock pages, and anyone running in a Docker container will see that warning for every crash, even though their kernel is not actually buggy. Those users may be sent on a wild goose chase trying to read and understand all of this information when it has nothing to do with their bug, probably for the entire Ubuntu 20.04 life cycle.

Disabling async preemption is a distraction. Programs running on faulty kernels are still broken. It’s just that the brokenness shows up as weird memory corruption rather than as an error about running into an mlock limit that points to kernel versions. While obviously we want to fix the problem entirely, I think that given the choice of a clear error or random memory corruption we should always pick the clear error.

I agree that kernel version detection is terrible, it’s just that we don’t know of any other option. If anybody has any suggestions in that regard, that would be very helpful.

One thing that we could do is add a GODEBUG setting to disable mlocking the signal stack. That would give people a workaround that is focused on the actual problem. We can mention that setting in the error message. I’m afraid that it will lead people to turn on the setting whether they have a patched kernel or not. But at least it will give people who really do have a patched kernel a way to work around this problem. CC @aclements

When you say you are on the latest Ubuntu and kernel, what exactly do you mean (i.e. the output of dpkg -l linux-image-*, lsb_release -a, uname -a, that sort of thing)? As far as I can see, the fix is in the kernel in the updates pocket for both 19.10 (the current stable release) and 20.04 (the devel release). It’s not in the GA kernel for 18.04 but is in the HWE kernel; on the other hand, those aren’t built with gcc 9 and so shouldn’t be affected anyway.

Here is something we could do:

  1. Use uname to check the kernel version for a vulnerable kernel, as we do today.
  2. If the kernel is vulnerable according to the version, read /proc/version.
  3. If /proc/version contains the string "2020", assume that the kernel is patched.
  4. If /proc/version contains the string "gcc version 8" assume that the kernel works even if patched (as the bug only occurs when the kernel is compiled with GCC 9 or later).
  5. Otherwise, call mlock on signal stacks as we do today on vulnerable kernels.

The point of this is to reduce the number of times that Go programs run out of mlock space.

Does anybody know of any unpatched kernels that may have the string "2020" in /proc/version?

For safety we should probably try to identify the times when the kernel was patched for the major distros. Is there anybody who can identify that for any particular distro? Thanks.
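
For illustration, steps 2-5 could look roughly like this (a sketch only; needsMlockWorkaround is an assumed name, not the runtime’s code):

package main

import (
    "fmt"
    "os"
    "strings"
)

// needsMlockWorkaround sketches the heuristic in steps 2-5 above. It assumes
// the uname release field has already flagged the kernel version as
// potentially vulnerable (step 1).
func needsMlockWorkaround() bool {
    data, err := os.ReadFile("/proc/version")
    if err != nil {
        return true // can't tell, so stay on the safe side
    }
    v := string(data)
    if strings.Contains(v, "2020") {
        return false // built in 2020: assume the patch has been applied
    }
    if strings.Contains(v, "gcc version 8") {
        return false // the bug only bites kernels built with GCC 9 or later
    }
    return true // vulnerable version and no evidence of a fix: mlock as today
}

func main() {
    fmt.Println("apply mlock workaround:", needsMlockWorkaround())
}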

The discussion seems to be about either accepting more false positives or false negatives. Here’s a summary:

False positive: The workaround gets enabled on a patched kernel.

  • Reproducible. Instructions can be shown.
  • Looks like a regression.
  • Hard to fix in certain environments.
  • Go binary may run in some environments but fails to run in others.

False negative: The workaround is not enabled on an unpatched kernel.

  • Failure only happens rarely, especially if async preemption is disabled.
  • Possibly severe consequences due to memory corruption.
  • Hard to debug.

I may have missed something (this thread got long fast!), but what’s the downside or difficulty of just raising the mlock limit? There’s little reason not to just set it to unlimited, but even if you don’t want to do that, you only need 4 KiB per thread, so a mere 64 MiB is more than the runtime of a single process will ever mlock. AFAIK, most distros leave it unlimited by default. The only notable exception I’m aware of is Docker, which sets it to (I think) 64 KiB by default, but this can be raised by passing --ulimit memlock=67108864 to Docker.

It seems like we already have a fairly simple workaround in place. Is there something preventing people from doing this?
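
If it helps, here is a small standalone check (a sketch using golang.org/x/sys/unix, not part of the runtime) that prints the mlock limit a process actually sees, which is handy inside containers where the effective ulimit may not be obvious:

package main

import (
    "fmt"

    "golang.org/x/sys/unix"
)

func main() {
    // RLIMIT_MEMLOCK is the limit that "ulimit -l" reports (in KiB there,
    // in bytes here).
    var rl unix.Rlimit
    if err := unix.Getrlimit(unix.RLIMIT_MEMLOCK, &rl); err != nil {
        fmt.Println("getrlimit:", err)
        return
    }
    if rl.Cur == unix.RLIM_INFINITY {
        fmt.Println("RLIMIT_MEMLOCK: unlimited")
        return
    }
    // At 4 KiB of locked signal stack per thread, this is roughly how many
    // threads the workaround can cover before mlock fails with ENOMEM.
    fmt.Printf("RLIMIT_MEMLOCK: %d bytes (~%d threads)\n", rl.Cur, rl.Cur/4096)
}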

@rtreffer That’s what we’re trying to do: be as helpful as possible.

On buggy kernels, Go programs built with Go 1.14 behaved unpredictably and badly. We don’t want to do that even on a buggy kernel. If a program would just fail quickly and cleanly, that would be one thing. But what we saw was memory corruption leading to obscure errors. See #35326, among others.

Do you think we should take some action different than what we are doing now?

@fcuello-fudo the problem is that if the workaround is not enabled on a bad kernel, the symptoms are very obscure.

How about reusing the “tainted” concept from the Linux kernel? The Go runtime would keep detecting bad kernels and applying the mlock workaround, but would mark itself tainted if mlock fails (instead of crashing). Then, add a note to any panic and throw messages if the taint flag is set.

The upside is that false positive crashes are avoided, while still providing a clear indication in case a bad kernel causes a crash.

The downside is that a bad kernel may silently corrupt memory, not causing a visible crash.

@jrockway Thanks, the problem is not that we don’t have the kernel version, it’s that Ubuntu is using a kernel version that has the bug, but Ubuntu has applied a patch for the bug, so the kernel actually works, but we don’t know how to detect that fact.

@neelance

Is it common for Ubuntu (and other distributions?) to use cherry-picking instead of following Linux kernel patch releases?

A lot of distributions do it, not just Ubuntu. Debian does it, Red Hat Enterprise Linux does it, and I expect that SUSE does it for their enterprise distributions as well. Cherry-picking is the only way to get any bug-fixes at all if you cannot aggressively follow upstream stable releases (and switching stable releases as upstream support goes away). Fedora is an exception; it rebases to the latest stable release upstream kernel after a bit.

There’s also the matter of proprietary kernels used by container engines. We can’t even look at sources for them, and some of them have lied about kernel version numbers in the past. I expect they also use cherry-picking.

Generally, version checks for kernel features (or bugs) are a really bad idea. It’s worse for Go due to the static linking, so it’s impossible to swap out the run-time underneath an application to fix its kernel expectations.

Unfortunately not. If mlock fails and you have a buggy kernel, then memory corruption might be occurring. Just because the program isn’t crashing doesn’t mean there wasn’t corruption somewhere. Crashing is a side-effect of memory corruption - just the mlock failing will not cause a crash. (We used to do that in 1.14. That’s one of the things we changed for 1.14.1.) Even if you turn async preemption off, memory corruption might still be occurring. Just at a lower rate, as your program is probably still getting other signals (timers, etc.).

@ianlancetaylor I totally agree with the way forward, the patch and wiki page look great.

I wanted to emphasize that the corruption is not golang’s fault or bug to begin with, and distributions are shipping fixed kernels. The problem should already be fading away.

As a result, I don’t think anything more than the suggested hints (wiki + panic message) is needed.

@rtreffer Well, sorta. We have some production 5.2 kernels that are not affected because they aren’t compiled with gcc9, and we could also easily patch the fix into our kernel line without affecting anything else and be fine. The kernel bug doesn’t exist in our environment, and upgrading major versions takes a lot more testing and careful rollout across the fleet, so just “upgrade your kernel” isn’t a good answer.

On the flip side, the workaround based on kernel version numbers caused us to move to mlock, which DID fail due to ulimit issues. That isn’t a kernel bug.

That being said I am not sure there is a better solution here and the Go team probably made the right call.

I am not sure if this is at all helpful, but Ubuntu apparently does make the standard kernel version available to those that go looking:

$ cat /proc/version_signature
Ubuntu 5.3.0-1013.14-azure 5.3.18
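
For illustration, a diagnostic along these lines could read that file as follows (a sketch with an assumed helper name, not runtime code; it relies on the last field being the upstream version, as in the output above):

package main

import (
    "fmt"
    "os"
    "strings"
)

// ubuntuUpstreamVersion reads the Ubuntu-specific /proc/version_signature
// file; its last field is the upstream stable kernel the distro kernel is
// based on ("5.3.18" in the output above), which is exactly the patch-level
// information missing from the uname release string.
func ubuntuUpstreamVersion() (string, bool) {
    data, err := os.ReadFile("/proc/version_signature")
    if err != nil {
        return "", false // not Ubuntu, or the file is unavailable
    }
    fields := strings.Fields(strings.TrimSpace(string(data)))
    if len(fields) == 0 {
        return "", false
    }
    return fields[len(fields)-1], true
}

func main() {
    if v, ok := ubuntuUpstreamVersion(); ok {
        fmt.Println("upstream kernel:", v)
        return
    }
    fmt.Println("/proc/version_signature not available")
}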

https://github.com/golang/go/issues/37436#issuecomment-591237929 states that disabling async preemption is not a proper solution, is it?

Yes, but in my previous posts I highlighted that I’m already on the latest Ubuntu and have installed the latest available kernel from the package repository. I don’t see how I could update my kernel to work with Go 1.14 apart from rebuilding the entire kernel from source. Maybe I’m missing something?

Hm yes, I have just reproduced on focal too. The fix is present in the git for the Ubuntu eoan kernel: https://kernel.ubuntu.com/git/ubuntu/ubuntu-eoan.git/commit/?id=59e7e6398a9d6d91cd01bc364f9491dc1bf2a426 and that commit is in the ancestry for the 5.3.0-40.32 so the fix should be in the kernel you are using. In other words, I think we need to get the kernel team involved – I’ll try to do that.

I’m not fully sure who maintains the Docker images (https://hub.docker.com/_/golang) or how, but the Docker Hub repo is an “Official Image”, which is a super hard status to obtain, so I assume someone high enough up the food chain is responsible.

The error message suggests the only two known available fixes: increase the ulimit or upgrade to a newer kernel.

Well, I’m running the official Alpine Docker image, the purpose of which is to be able to build a Go program. Apparently it cannot. IMHO the upstream image should be fixed to fulfill its purpose, rather than our build infra having to hack around a bug in the upstream image.

Going to try explicitly specifying the Go version then; I expected go get golang.org/dl/go1.14 to load the latest 1.14. Will report back.

Edit: it seems 1.14.3 is the latest 1.14 as of today.

Update: looks good with go get golang.org/dl/go1.14.3. It’s unexpected that, without the patch version, it does not load the latest release; good to know (I would never have landed in this issue otherwise).

I’ve sent out https://golang.org/cl/223121. It would be helpful if people having trouble with 1.14 could see if that change fixes their problem. Thanks.

@networkimprov That is a good idea, but since people send compiled programs across machines that may have different kernel versions, I think we also need the approach described above.

@randall77 @aarzilli Upon consideration, I actually don’t think it is a great idea to add additional partial mitigations like touching the signal stack page or disabling asynchronous preemption. It’s a kernel level bug that can affect any program that receives a signal. People running with a buggy kernel should upgrade the kernel one way or another. Using mlock is a reliable mitigation that should always work, and as such it’s reasonable to try it. Touching the signal stack before sending a preemption signal, or disabling signal preemption entirely, is not a reliable mitigation. I think that we should not fall back to an unreliable mitigation; we should tell the user to upgrade the kernel.

People who can’t upgrade the kernel have the option of running with GODEBUG=asyncpreemptoff=1, which will be just as effective a partial mitigation as the other two.

Is there a way to test (outside of the Go runtime) whether your kernel is patched? We maintain our own kernels outside of the distro, which makes this even worse.

Ian Lance Taylor has said that the fix will be backported once there is one: https://groups.google.com/d/msg/golang-dev/_FbRwBmfHOg/mmtMSjO1AQAJ

Also running into this issue trying to build a Docker image for https://github.com/RTradeLtd/Temporal on Go 1.14.

I think it should be opt-in to avoid all false positives. For example, add a GOWORKAROUNDS env var, either as a boolean or as a list of workarounds, or have it enable the heuristic that tries to find them.

This would be the least intrusive solution IMO.

@ucirello There is no problem on 4.4.x Linux kernels. The bug first appeared in kernel version 5.2.

@ianlancetaylor I’ve created a quick and dirty script to check what uname reports: https://gist.github.com/Tasssadar/7424860a2764e3ef42c7dcce7ecfd341

Here’s the result on up-to-date (well, -ish) Debian testing:

tassadar@dorea:~/tmp$ go run gouname.go 
real uname
Linux dorea 5.4.0-3-amd64 #1 SMP Debian 5.4.13-1 (2020-01-19) x86_64 GNU/Linux

our uname
sysname Linux
nodename dorea
release 5.4.0-3-amd64  <-- used by go
version #1 SMP Debian 5.4.13-1 (2020-01-19)
machine x86_64
domainname (none)

Since Go only uses the release string, the patch-version check basically does not work anywhere but on vanilla kernels - both Debian and RHEL/CentOS (which luckily has too old a kernel to be affected) do it this way: they keep the .0 and specify the real patch version elsewhere. Unfortunately, they don’t use the same format for the version field.

EDIT: and to make it even more awkward, Ubuntu does not put the patch number into uname at all, even though it probably has all the fixes incorporated. Perhaps the best course of action is to make this a warning instead of a crash? At this point, most kernels are probably already updated anyway.
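
For reference, the heart of the linked script is just a uname call; a minimal equivalent using golang.org/x/sys/unix (illustrative only, not the runtime’s code) that prints the two fields under discussion:

package main

import (
    "fmt"

    "golang.org/x/sys/unix"
)

func main() {
    var u unix.Utsname
    if err := unix.Uname(&u); err != nil {
        panic(err)
    }
    // The runtime currently looks only at Release; on Debian the real patch
    // level lives in Version, and on Ubuntu it appears in neither field.
    fmt.Println("release:", unix.ByteSliceToString(u.Release[:]))
    fmt.Println("version:", unix.ByteSliceToString(u.Version[:]))
}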

Perhaps we could use the date recorded in /proc/version as an additional signal. It should probably be release specific, which is a pain. But the whole thing is painful.

Actually, could the runtime fork a child process on startup (for 5.3.x & 5.4.x) that triggers the bug and enable the workaround if it does? IIRC there is a reliable reproducer, see #35326 (comment)

It’s an interesting idea but I think that in this case the test is much too expensive to run at startup for every single Go program.

@karalabe - I’ve just realised my mistake: I thought I was using the latest Ubuntu; I am in fact using eoan.

@mwhudson - just one thing to note (although you’re probably already aware of this), a superficial glance at the code responsible for this switch:

https://github.com/golang/go/blob/20a838ab94178c55bc4dc23ddc332fce8545a493/src/runtime/os_linux_x86.go#L56-L61

seems to suggest that the Go side is checking for patch release 15 or greater. What does 5.3.0-40.32 report as a patch version? I’m guessing 0?
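
For illustration, a simplified major.minor.patch parse (not the runtime’s exact code) shows why distro release strings defeat the patch check:

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// parseRelease illustrates a plain major.minor.patch parse of the uname
// release string. A distro string such as "5.3.0-40-generic" yields patch 0,
// so a ">= 15" patch check treats a kernel with a backported fix as still
// vulnerable.
func parseRelease(rel string) (major, minor, patch int) {
    // Cut everything after the first character that is not a digit or a dot
    // (e.g. the "-40-generic" suffix).
    if end := strings.IndexFunc(rel, func(r rune) bool {
        return (r < '0' || r > '9') && r != '.'
    }); end >= 0 {
        rel = rel[:end]
    }
    var nums [3]int
    for i, part := range strings.SplitN(rel, ".", 3) {
        nums[i], _ = strconv.Atoi(part)
    }
    return nums[0], nums[1], nums[2]
}

func main() {
    fmt.Println(parseRelease("5.3.0-40-generic")) // 5 3 0
    fmt.Println(parseRelease("5.3.15"))           // 5 3 15
}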

Re-opening this discussion until we round out the issue here.

Just to emphasize, I do understand what the workaround is and if I want to make it work, I can. I opened this issue report because I’d expect other people to hit the same problem eventually. If just updating my system would fix the issue I’d gladly accept that as a solution, but unless I’m missing something, the fixed kernel is not available for (recent) Ubuntu users, so quite a large userbase might be affected.