go: runtime: mlock of signal stack failed: 12

What version of Go are you using (go version)?

$ go version
go version go1.14rc1 linux/amd64

Does this issue reproduce with the latest release?

I hit this with the golang:1.14-rc-alpine Docker image; the error does not happen with 1.13.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build968395959=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Clone https://github.com/ethereum/go-ethereum, replace the builder image in the Dockerfile with golang:1.14-rc-alpine (or use the Dockerfile below), then build the Docker image from the repository root:

$ docker build .

FROM golang:1.14-rc-alpine

RUN apk add --no-cache make gcc musl-dev linux-headers git

ADD . /go-ethereum
RUN cd /go-ethereum && make geth

What did you expect to see?

Go should run our build scripts successfully.

What did you see instead?

Step 4/9 : RUN cd /go-ethereum && make geth
 ---> Running in 67781151653c
env GO111MODULE=on go run build/ci.go install ./cmd/geth
runtime: mlock of signal stack failed: 12
runtime: increase the mlock limit (ulimit -l) or
runtime: update your kernel to 5.3.15+, 5.4.2+, or 5.5+
fatal error: mlock failed

runtime stack:
runtime.throw(0xa3b461, 0xc)
	/usr/local/go/src/runtime/panic.go:1112 +0x72
runtime.mlockGsignal(0xc0004a8a80)
	/usr/local/go/src/runtime/os_linux_x86.go:72 +0x107
runtime.mpreinit(0xc000401880)
	/usr/local/go/src/runtime/os_linux.go:341 +0x78
runtime.mcommoninit(0xc000401880)
	/usr/local/go/src/runtime/proc.go:630 +0x108
runtime.allocm(0xc000033800, 0xa82400, 0x0)
	/usr/local/go/src/runtime/proc.go:1390 +0x14e
runtime.newm(0xa82400, 0xc000033800)
	/usr/local/go/src/runtime/proc.go:1704 +0x39
runtime.startm(0x0, 0xc000402901)
	/usr/local/go/src/runtime/proc.go:1869 +0x12a
runtime.wakep(...)
	/usr/local/go/src/runtime/proc.go:1953
runtime.resetspinning()
	/usr/local/go/src/runtime/proc.go:2415 +0x93
runtime.schedule()
	/usr/local/go/src/runtime/proc.go:2527 +0x2de
runtime.mstart1()
	/usr/local/go/src/runtime/proc.go:1104 +0x8e
runtime.mstart()
	/usr/local/go/src/runtime/proc.go:1062 +0x6e

...
make: *** [Makefile:16: geth] Error 2

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 29
  • Comments: 131 (90 by maintainers)

Most upvoted comments

The kernel bug manifested as random memory corruption in Go 1.13 (both with and without preemptive scheduling). What is new in Go 1.14 is that we detect the presence of the bug, attempt to work around it, and prefer to crash early and loudly if that is not possible. You can see the details in the issue I referred you to.

Since you have called me dishonest and nasty, I will remind you again about the code of conduct: https://golang.org/conduct. I am also done participating in this conversation.

So, the official solution to Go crashing is to point fingers at everyone else and have them hack around your code? Makes sense.

A little summary because I had to piece it together myself:

So it seems like Ubuntu’s kernel is patched, but the workaround gets enabled anyway.

This issue does not happen with Go 1.13. Ergo, it is a bug introduced in Go 1.14.

Saying you can’t fix it and telling people to use workarounds is dishonest, because reverting a piece of code would actually fix it. An alternative solution would be to detect the problematic platforms/kernels and provide a fallback mechanism baked into Go.

Telling people to use a different kernel is especially nasty, because it’s not as if most people can go around and build themselves a new kernel. If Alpine doesn’t release a new kernel, there’s not much most devs can do. And lastly, if your project relies on stable infrastructure where you can’t just swap out kernels, you’re again in a pickle.

It is standard practice to redirect issues to the correct issue tracking system.

The fact that Go crashes is not the fault of Docker. Redirecting a Go crash to a Docker repo is deflection.

@karalabe I would like to remind you of https://golang.org/conduct. In particular, please be respectful and be charitable.

Please answer the question

Based on discussion with @aclements, @dr2chase, @randall77, and others, our plan for the 1.14.1 release is:

  • write a wiki page describing the problem
  • continue to use mlock on a kernel version that may be buggy
  • if mlock fails, silently note that fact and continue executing
  • if we see an unexpected SIGSEGV or SIGBUS, and mlock failed, then in the crash stack trace point people at the wiki page

The hope is that this will provide a good combination of executing correctly in the normal case while directing people on potentially buggy kernels to information that helps them decide whether the problem is their kernel, their program, or a bug in Go itself.

This can also be combined with better attempts to identify whether a particular kernel has been patched, based on the uname version field (we currently only check the release field).
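
For illustration, the plan above amounts to roughly the following (a sketch with assumed names such as kernelLooksVulnerable; this is not the actual runtime code):

package main

import (
    "fmt"

    "golang.org/x/sys/unix"
)

// mlockFailed records that the workaround could not be applied.
var mlockFailed bool

// kernelLooksVulnerable stands in for the uname-release check the runtime
// already performs today.
func kernelLooksVulnerable() bool { return true }

// mlockSignalStack tries the workaround on possibly buggy kernels; if the
// mlock limit is too low, it silently notes the failure and keeps executing.
func mlockSignalStack(stack []byte) {
    if !kernelLooksVulnerable() {
        return
    }
    if err := unix.Mlock(stack); err != nil {
        mlockFailed = true
    }
}

// reportFatalSignal shows where a crash report would point people at the
// wiki page, but only if mlock had already failed.
func reportFatalSignal(sig string) {
    fmt.Println("fatal error: unexpected", sig)
    if mlockFailed {
        fmt.Println("runtime note: mlock of signal stack failed earlier; your kernel may have the signal-stack bug (see the wiki page)")
    }
}

func main() {
    stack := make([]byte, 4096)
    mlockSignalStack(stack)
    reportFatalSignal("SIGSEGV")
}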

Everyone please keep in mind that successful communication is hard and a skill one needs to practice. Emotions can work against us and hinder our goal of successful communication, I’ve been there myself. Yes, there has been a violation of the code of conduct and pointing it out is good. A voluntary apology is also helpful. Now let’s try to make sure that every post has a positive net impact on collaboration and solving this issue.

@networkimprov Disabling signal preemption makes the bug less likely to occur but it is still present. It’s a bug in certain Linux kernel versions. The bug affects all programs in all languages. It’s particularly likely to be observable with Go programs that use signal preemption, but it’s present for all other programs as well.

Go tries to work around the bug by mlocking the signal stack. That works fine unless you run into the mlock limit. I suppose that one downside of this workaround is that we make the problem very visible, rather than occasionally failing due to random memory corruption as would happen if we didn’t do the mlock.

At some point there is no way to work around a kernel bug.

I’m on the latest Ubuntu and the latest available kernel. Based on the error message, apparently all available Ubuntu kernels (https://packages.ubuntu.com/search?keywords=linux-image-generic) are unsuitable for Go 1.14.

And the pile keeps on piling. See, this is why I got angry at the beginning of this thread (which was a bad mistake on my part, I agree). Even though I made a lot of effort to explain, and provided a repro showing that this is a blocker, I was shut down so as not to interfere with the release. Even after it was clear that it’s not a Docker issue.

Now we’re in a much worse place, since various projects are blacklisting Go 1.14. This bug is currently slated to be fixed only in Go 1.15. Based on the issues linked above, are we confident that it’s a good idea to postpone this by 8 months? I think it would be nice to acknowledge the mess-up and try to fix it in a patch release, rather than wait for more projects to be bitten.

Yes, I’m aware that I’m just nagging people here instead of fixing it myself. I’m sorry I can’t contribute more meaningfully; I just don’t want to fragment the ecosystem. Go modules were already a blow to many projects; let’s not double down with yet another quirk that tools need to become aware of.

I encountered this issue when running go applications on a self-hosted Kubernetes cluster. I was able to allow the mlock workaround to take effect by increasing the relevant ulimit. However, as the process for changing ulimits for Docker containers running in Kubernetes isn’t exactly easy to find, it might help someone else to put the details here.

  1. Update /etc/security/limits.conf to include something like

     * - memlock unlimited

  2. Update /etc/docker/daemon.json to include

     "default-ulimits": { "memlock": { "name": "memlock", "hard": -1, "soft": -1 } }

  3. Restart Docker/Kubernetes and bring your pods back up.

  4. Enter a running container and verify that the ulimit has been increased:

     $ kubectl exec -it pod-name -- /bin/sh
     / # ulimit -l
     unlimited

You may be able to get away with using something more subtle than the unlimited hammer; for me, a limit of 128KiB (131072) seemed to work.

It’s “maintained by the Docker Community”. Issues should be filed at

https://github.com/docker-library/golang/issues

EDIT: the problem is the host kernel, not the Docker library image, so they can’t fix it.

Well, @randall77 / @ianlancetaylor, I tend to disagree that this is a golang issue at all. Golang discovered the memory corruption issue, but it is a very severe kernel bug.

As such, it should be escalated through the usual kernel channels. Distributions picked up the patch and shipped it. It was backported. Every new installation will get an unaffected kernel. If you roll your own kernel, you have to do that work yourself. As usual.

Be helpful to users that hit it, as helpful as possible. But I don’t think it is golang’s responsibility to fix a kernel bug or even to force users to apply the patch.

@karalabe, in as strong terms as I can muster: you were not “shut down to not interfere with the release”. Josh, who originally closed the issue, is not on the Go team at Google (i.e. not a decision maker), nor am I. We initially assumed that the Docker project could (and should) mitigate the problem in their build. When it became clear they couldn’t, I promptly raised this on golang-dev.

Furthermore, I was the first to note that the problem stems from your host kernel, not the Docker module. You didn’t mention you’re on Ubuntu until I pointed that out.

I think you owe us yet another apology, after that note.

EDIT: I also asked you to remove the goroutine stack traces (starting goroutine x [runnable]) from your report, as they make the page difficult to read/navigate. [Update: Russ has edited out the stacks.]

Just a heads up: Go 1.15 is about to be released, and a beta is already out, but the temporary workaround has not yet been removed (there are TODO comments to remove it at Go 1.15).

I think it is important to remove the workaround since Ubuntu 20.04 LTS uses a patched 5.4.0 kernel. This means that any user on Ubuntu 20.04 will still unnecessarily mlock pages, and anyone running in a Docker container will see that warning for every crash, even though their kernel is not actually buggy. Those users may be sent on a wild goose chase trying to read and understand all of this information when it has nothing to do with their bug, probably for the entire Ubuntu 20.04 life cycle.

Disabling async preemption is a distraction. Programs running on faulty kernels are still broken. It’s just that the brokenness shows up as weird memory corruption rather than as an error about running into an mlock limit that points to kernel versions. While obviously we want to fix the problem entirely, I think that given the choice of a clear error or random memory corruption we should always pick the clear error.

I agree that kernel version detection is terrible, it’s just that we don’t know of any other option. If anybody has any suggestions in that regard, that would be very helpful.

One thing that we could do is add a GODEBUG setting to disable mlocking the signal stack. That would give people a workaround that is focused on the actual problem. We can mention that setting in the error message. I’m afraid that it will lead people to turn on the setting whether they have a patched kernel or not. But at least it will give people who really do have a patched kernel a way to work around this problem. CC @aclements

When you say you are on the latest Ubuntu and kernel, what exactly do you mean (i.e. the output of dpkg -l linux-image-*, lsb_release -a, uname -a, that sort of thing)? As far as I can see, the fix is in the kernel in the updates pocket for both 19.10 (the current stable release) and 20.04 (the devel release). It’s not in the GA kernel for 18.04 but is in the HWE kernel; on the other hand, those aren’t built with gcc 9 and so shouldn’t be affected anyway.

Here is something we could do:

  1. Use uname to check the kernel version for a vulnerable kernel, as we do today.
  2. If the kernel is vulnerable according to the version, read /proc/version.
  3. If /proc/version contains the string "2020", assume that the kernel is patched.
  4. If /proc/version contains the string "gcc version 8" assume that the kernel works even if patched (as the bug only occurs when the kernel is compiled with GCC 9 or later).
  5. Otherwise, call mlock on signal stacks as we do today on vulnerable kernels.

The point of this is to reduce the number of times that Go programs run out of mlock space.

Does anybody know of any unpatched kernels that may have the string "2020" in /proc/version?

For safety we should probably try to identify the times when the kernel was patched for the major distros. Is there anybody who can identify that for any particular distro? Thanks.
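
For illustration, steps 2-5 could look roughly like this (a sketch only; needsMlockWorkaround is an assumed name, not the runtime’s code):

package main

import (
    "fmt"
    "os"
    "strings"
)

// needsMlockWorkaround sketches the heuristic in steps 2-5 above. It assumes
// the uname release field has already flagged the kernel version as
// potentially vulnerable (step 1).
func needsMlockWorkaround() bool {
    data, err := os.ReadFile("/proc/version")
    if err != nil {
        return true // can't tell, so stay on the safe side
    }
    v := string(data)
    if strings.Contains(v, "2020") {
        return false // built in 2020: assume the patch has been applied
    }
    if strings.Contains(v, "gcc version 8") {
        return false // the bug only bites kernels built with GCC 9 or later
    }
    return true // vulnerable version and no evidence of a fix: mlock as today
}

func main() {
    fmt.Println("apply mlock workaround:", needsMlockWorkaround())
}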

The discussion seems to be about either accepting more false positives or false negatives. Here’s a summary:

False positive: The workaround gets enabled on a patched kernel.

  • Reproducible. Instructions can be shown.
  • Looks like a regression.
  • Hard to fix in certain environments.
  • Go binary may run in some environments but fails to run in others.

False negative: The workaround is not enabled on an unpatched kernel.

  • Failure only happens rarely, especially if async preemption is disabled.
  • Possibly severe consequences due to memory corruption.
  • Hard to debug.

I may have missed something (this thread got long fast!), but what’s the downside or difficulty of just raising the mlock limit? There’s little reason not to just set it to unlimited, but even if you don’t want to do that, you only need 4 KiB per thread, so a mere 64 MiB is more than the runtime of a single process will ever mlock. AFAIK, most distros leave it unlimited by default. The only notable exception I’m aware of is Docker, which sets it to (I think) 64 KiB by default, but this can be raised by passing --ulimit memlock=67108864 to Docker.

It seems like we already have a fairly simple workaround in place. Is there something preventing people from doing this?
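
If it helps, here is a small standalone check (a sketch using golang.org/x/sys/unix, not part of the runtime) that prints the mlock limit a process actually sees, which is handy inside containers where the effective ulimit may not be obvious:

package main

import (
    "fmt"

    "golang.org/x/sys/unix"
)

func main() {
    // RLIMIT_MEMLOCK is the limit that "ulimit -l" reports (in KiB there,
    // in bytes here).
    var rl unix.Rlimit
    if err := unix.Getrlimit(unix.RLIMIT_MEMLOCK, &rl); err != nil {
        fmt.Println("getrlimit:", err)
        return
    }
    if rl.Cur == unix.RLIM_INFINITY {
        fmt.Println("RLIMIT_MEMLOCK: unlimited")
        return
    }
    // At 4 KiB of locked signal stack per thread, this is roughly how many
    // threads the workaround can cover before mlock fails with ENOMEM.
    fmt.Printf("RLIMIT_MEMLOCK: %d bytes (~%d threads)\n", rl.Cur, rl.Cur/4096)
}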

@rtreffer That’s what we’re trying to do: be as helpful as possible.

On buggy kernels, Go programs built with Go 1.14 behaved unpredictably and badly. We don’t want to do that even on a buggy kernel. If a program would just fail quickly and cleanly, that would be one thing. But what we saw was memory corruption leading to obscure errors. See #35326, among others.

Do you think we should take some action different than what we are doing now?

@fcuello-fudo the problem is that if the workaround is not enabled on a bad kernel, the symptoms are very obscure.

How about reusing the “tainted” concept from the Linux kernel? The Go runtime would keep detecting bad kernels and applying the mlock workaround, but would mark itself tainted if mlock fails (instead of crashing). Then, add a note to any panic and throw messages if the taint flag is set.

The upside is that false positive crashes are avoided, while still providing a clear indication in case a bad kernel causes a crash.

The downside is that a bad kernel may silently corrupt memory, not causing a visible crash.

@jrockway Thanks, the problem is not that we don’t have the kernel version, it’s that Ubuntu is using a kernel version that has the bug, but Ubuntu has applied a patch for the bug, so the kernel actually works, but we don’t know how to detect that fact.

@neelance

Is it common for Ubuntu (and other distributions?) to use cherry-picking instead of following Linux kernel patch releases?

A lot of distributions do it, not just Ubuntu. Debian does it, Red Hat Enterprise Linux does it, and I expect that SUSE does it for their enterprise distributions as well. Cherry-picking is the only way to get any bug-fixes at all if you cannot aggressively follow upstream stable releases (and switching stable releases as upstream support goes away). Fedora is an exception; it rebases to the latest stable release upstream kernel after a bit.

There’s also the matter of proprietary kernels used by container engines. We can’t even look at sources for them, and some of them have lied about kernel version numbers in the past. I expect they also use cherry-picking.

Generally, version checks for kernel features (or bugs) are a really bad idea. It’s worse for Go due to the static linking, so it’s impossible to swap out the run-time underneath an application to fix its kernel expectations.

Unfortunately not. If mlock fails and you have a buggy kernel, then memory corruption might be occurring. Just because the program isn’t crashing doesn’t mean there wasn’t corruption somewhere. Crashing is a side-effect of memory corruption - just the mlock failing will not cause a crash. (We used to do that in 1.14. That’s one of the things we changed for 1.14.1.) Even if you turn async preemption off, memory corruption might still be occurring. Just at a lower rate, as your program is probably still getting other signals (timers, etc.).

@ianlancetaylor I totally agree with the way forward, the patch and wiki page look great.

I wanted to emphasize that the corruption is not golang’s fault or bug to begin with, and distributions are shipping fixed kernels. The problem should already be fading away.

As a result, I don’t think anything more than the suggested hints (wiki + panic message) is needed.

@rtreffer Well, sorta. We have some production 5.2 kernels that are not affected because they aren’t compiled with gcc9, and we could also easily patch the fix into our kernel line without affecting anything else and be fine. The kernel bug doesn’t exist in our environment, and upgrading major versions takes a lot more testing and careful rollout across the fleet, so just “upgrade your kernel” isn’t a good answer.

On the flip side, the workaround based on kernel version numbers caused us to move to mlock, which DID fail due to ulimit issues. That isn’t a kernel bug.

That being said I am not sure there is a better solution here and the Go team probably made the right call.

I am not sure if this is at all helpful, but Ubuntu apparently does make the standard kernel version available to those that go looking:

$ cat /proc/version_signature
Ubuntu 5.3.0-1013.14-azure 5.3.18
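
For illustration, a diagnostic along these lines could read that file as follows (a sketch with an assumed helper name, not runtime code; it relies on the last field being the upstream version, as in the output above):

package main

import (
    "fmt"
    "os"
    "strings"
)

// ubuntuUpstreamVersion reads the Ubuntu-specific /proc/version_signature
// file; its last field is the upstream stable kernel the distro kernel is
// based on ("5.3.18" in the output above), which is exactly the patch-level
// information missing from the uname release string.
func ubuntuUpstreamVersion() (string, bool) {
    data, err := os.ReadFile("/proc/version_signature")
    if err != nil {
        return "", false // not Ubuntu, or the file is unavailable
    }
    fields := strings.Fields(strings.TrimSpace(string(data)))
    if len(fields) == 0 {
        return "", false
    }
    return fields[len(fields)-1], true
}

func main() {
    if v, ok := ubuntuUpstreamVersion(); ok {
        fmt.Println("upstream kernel:", v)
        return
    }
    fmt.Println("/proc/version_signature not available")
}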

https://github.com/golang/go/issues/37436#issuecomment-591237929 states that disabling async preemption is not a proper solution, is it?

Yes, but in my previous posts I highlighted that I’m already on the latest Ubuntu and have installed the latest available kernel from the package repository. I don’t see how I could update my kernel to work with Go 1.14 apart from rebuilding the entire kernel from source. Maybe I’m missing something?

Hm yes, I have just reproduced on focal too. The fix is present in the git for the Ubuntu eoan kernel: https://kernel.ubuntu.com/git/ubuntu/ubuntu-eoan.git/commit/?id=59e7e6398a9d6d91cd01bc364f9491dc1bf2a426 and that commit is in the ancestry for the 5.3.0-40.32 so the fix should be in the kernel you are using. In other words, I think we need to get the kernel team involved – I’ll try to do that.

I’m not fully sure who maintains the Docker images (https://hub.docker.com/_/golang) or how, but the Docker Hub repo is an “Official Image”, which is a super hard status to obtain, so I assume someone high enough up the food chain is responsible.

The error message suggests the only two known available fixes: increase the ulimit or upgrade to a newer kernel.

Well, I’m running the official Alpine Docker image, the purpose of which is to be able to build a Go program. Apparently it cannot. IMHO the upstream image should be fixed to fulfill its purpose, rather than our build infra having to hack around a bug in the upstream image.

Going to try explicitly specifying the Go version then; I expected go get golang.org/dl/go1.14 to load the latest 1.14. Will report back.

Edit: it seems 1.14.3 is the latest 1.14 as of today.

Update: looks good with go get golang.org/dl/go1.14.3. It’s unexpected that, without the patch version, it does not load the latest release; good to know (I would never have landed in this issue otherwise).

I’ve sent out https://golang.org/cl/223121. It would be helpful if people having trouble with 1.14 could see if that change fixes their problem. Thanks.

@networkimprov That is a good idea, but since people send compiled programs across machines that may have different kernel versions, I think we also need the approach described above.

@randall77 @aarzilli Upon consideration, I actually don’t think it is a great idea to add additional partial mitigations like touching the signal stack page or disabling asynchronous preemption. It’s a kernel level bug that can affect any program that receives a signal. People running with a buggy kernel should upgrade the kernel one way or another. Using mlock is a reliable mitigation that should always work, and as such it’s reasonable to try it. Touching the signal stack before sending a preemption signal, or disabling signal preemption entirely, is not a reliable mitigation. I think that we should not fall back to an unreliable mitigation; we should tell the user to upgrade the kernel.

People who can’t upgrade the kernel have the option of running with GODEBUG=asyncpreemptoff=1, which will be just as effective a partial mitigation as the other two.

Is there a way to test (outside of the Go runtime) whether your kernel is patched? We maintain our own kernels outside of the distro, which makes this even worse.

Ian Lance Taylor has said that the fix will be backported once there is one: https://groups.google.com/d/msg/golang-dev/_FbRwBmfHOg/mmtMSjO1AQAJ

Also running into this issue trying to build a Docker image for https://github.com/RTradeLtd/Temporal on Go 1.14.

I think it should be opt-in to avoid all false positives. For example, add a GOWORKAROUNDS env var, either as a boolean or as a list of workarounds, or have it enable the heuristic that tries to find them.

This would be the least intrusive solution IMO.

@ucirello There is no problem on 4.4.x Linux kernels. The bug first appeared in kernel version 5.2.

@ianlancetaylor I’ve created a quick and dirty script to check what uname reports: https://gist.github.com/Tasssadar/7424860a2764e3ef42c7dcce7ecfd341

Here’s the result on up-to-date (well, -ish) Debian testing:

tassadar@dorea:~/tmp$ go run gouname.go 
real uname
Linux dorea 5.4.0-3-amd64 #1 SMP Debian 5.4.13-1 (2020-01-19) x86_64 GNU/Linux

our uname
sysname Linux
nodename dorea
release 5.4.0-3-amd64  <-- used by go
version #1 SMP Debian 5.4.13-1 (2020-01-19)
machine x86_64
domainname (none)

Since Go only uses the release string, the patch-version check basically does not work anywhere but on vanilla kernels - both Debian and RHEL/CentOS (which luckily has too old a kernel to be affected) do it this way: they keep the .0 and specify the real patch version elsewhere. Unfortunately, they don’t use the same format for the version field.

EDIT: and to make it even more awkward, Ubuntu does not put the patch number into uname at all, even though it probably has all the fixes incorporated. Perhaps the best course of action is to make this a warning instead of a crash? At this point, most kernels are probably already updated anyway.
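
For reference, the heart of the linked script is just a uname call; a minimal equivalent using golang.org/x/sys/unix (illustrative only, not the runtime’s code) that prints the two fields under discussion:

package main

import (
    "fmt"

    "golang.org/x/sys/unix"
)

func main() {
    var u unix.Utsname
    if err := unix.Uname(&u); err != nil {
        panic(err)
    }
    // The runtime currently looks only at Release; on Debian the real patch
    // level lives in Version, and on Ubuntu it appears in neither field.
    fmt.Println("release:", unix.ByteSliceToString(u.Release[:]))
    fmt.Println("version:", unix.ByteSliceToString(u.Version[:]))
}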

Perhaps we could use the date recorded in /proc/version as an additional signal. It should probably be release specific, which is a pain. But the whole thing is painful.

Actually, could the runtime fork a child process on startup (for 5.3.x & 5.4.x) that triggers the bug and enable the workaround if it does? IIRC there is a reliable reproducer, see #35326 (comment)

It’s an interesting idea but I think that in this case the test is much too expensive to run at startup for every single Go program.

@karalabe - I’ve just realised my mistake: I thought I was using the latest Ubuntu; I am in fact using eoan.

@mwhudson - just one thing to note (although you’re probably already aware of this), a superficial glance at the code responsible for this switch:

https://github.com/golang/go/blob/20a838ab94178c55bc4dc23ddc332fce8545a493/src/runtime/os_linux_x86.go#L56-L61

seems to suggest that the Go side is checking for patch release 15 or greater. What does 5.3.0-40.32 report as a patch version? I’m guessing 0?
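
For illustration, a simplified major.minor.patch parse (not the runtime’s exact code) shows why distro release strings defeat the patch check:

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// parseRelease illustrates a plain major.minor.patch parse of the uname
// release string. A distro string such as "5.3.0-40-generic" yields patch 0,
// so a ">= 15" patch check treats a kernel with a backported fix as still
// vulnerable.
func parseRelease(rel string) (major, minor, patch int) {
    // Cut everything after the first character that is not a digit or a dot
    // (e.g. the "-40-generic" suffix).
    if end := strings.IndexFunc(rel, func(r rune) bool {
        return (r < '0' || r > '9') && r != '.'
    }); end >= 0 {
        rel = rel[:end]
    }
    var nums [3]int
    for i, part := range strings.SplitN(rel, ".", 3) {
        nums[i], _ = strconv.Atoi(part)
    }
    return nums[0], nums[1], nums[2]
}

func main() {
    fmt.Println(parseRelease("5.3.0-40-generic")) // 5 3 0
    fmt.Println(parseRelease("5.3.15"))           // 5 3 15
}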

Re-opening this discussion until we round out the issue here.

Just to emphasize, I do understand what the workaround is and if I want to make it work, I can. I opened this issue report because I’d expect other people to hit the same problem eventually. If just updating my system would fix the issue I’d gladly accept that as a solution, but unless I’m missing something, the fixed kernel is not available for (recent) Ubuntu users, so quite a large userbase might be affected.