go: runtime: MADV_HUGEPAGE causes stalls when allocating memory
Environment: linux/amd64
I’ve bisected stalls in one of my applications to 8fa9e3beee8b0e6baa7333740996181268b60a3a — after discussion with @mknyszek, the stalls seem to be caused by Linux directly reclaiming pages, and taking significant time to do so (100+ ms in my case.)
The direct reclaim is caused by the combination of Go marking memory as MADV_HUGEPAGE and the way Transparent Huge Pages are configured on my system (which AFAICT is the NixOS default; I don’t recall changing this):
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise [madvise] never
In particular, the `madvise` setting for `defrag` has the following effect:

> will enter direct reclaim like always but only for regions that have used madvise(MADV_HUGEPAGE). This is the default behaviour.

with `always` meaning:

> means that an application requesting THP will stall on allocation failure and directly reclaim pages and compact memory in an effort to allocate a THP immediately. This may be desirable for virtual machines that benefit heavily from THP use and are willing to delay the VM start to utilise them
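For anyone who wants to check whether their system is configured this way, here’s a rough sketch (mine, not something the runtime does; the bracket-parsing of the sysfs files is an assumption based on the output above):

```go
// Reads the two THP knobs shown above and reports whether
// madvise(MADV_HUGEPAGE) regions will enter direct reclaim on this system.
package main

import (
	"fmt"
	"os"
	"regexp"
)

// currentSetting extracts the bracketed value from a sysfs THP file,
// e.g. "always defer defer+madvise [madvise] never" -> "madvise".
func currentSetting(path string) string {
	data, err := os.ReadFile(path)
	if err != nil {
		return "unknown"
	}
	m := regexp.MustCompile(`\[([^\]]+)\]`).FindSubmatch(data)
	if m == nil {
		return "unknown"
	}
	return string(m[1])
}

func main() {
	enabled := currentSetting("/sys/kernel/mm/transparent_hugepage/enabled")
	defrag := currentSetting("/sys/kernel/mm/transparent_hugepage/defrag")
	fmt.Printf("THP enabled=%q defrag=%q\n", enabled, defrag)

	// Per the kernel docs quoted above, these defrag settings make
	// MADV_HUGEPAGE-tagged regions stall in direct reclaim when no huge
	// pages are immediately available.
	switch defrag {
	case "always", "madvise", "defer+madvise":
		fmt.Println("madvise(MADV_HUGEPAGE) regions may enter direct reclaim here")
	}
}
```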
It seems to me that one of the reasons for setting MADV_HUGEPAGE is to undo setting MADV_NOHUGEPAGE and that there is no other way to do that.
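To make the point concrete, here is a minimal sketch (my own, using golang.org/x/sys/unix rather than the runtime’s internal calls) of marking a region MADV_NOHUGEPAGE and then re-enabling THP for it, which as far as I can tell requires MADV_HUGEPAGE:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Map an anonymous region large enough to span several 2 MiB huge pages.
	const size = 64 << 20
	mem, err := unix.Mmap(-1, 0, size,
		unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_PRIVATE|unix.MAP_ANONYMOUS)
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(mem)

	// Opt the region out of THP entirely...
	if err := unix.Madvise(mem, unix.MADV_NOHUGEPAGE); err != nil {
		panic(err)
	}

	// ...and, as far as I can tell, the only way to undo that is
	// MADV_HUGEPAGE, which with defrag=madvise also opts the region into
	// direct reclaim when a huge page can't be allocated immediately.
	if err := unix.Madvise(mem, unix.MADV_HUGEPAGE); err != nil {
		panic(err)
	}
	fmt.Println("region is THP-eligible again (and direct-reclaim-eligible)")
}
```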
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 32 (23 by maintainers)
Commits related to this issue
- runtime: avoid MADV_HUGEPAGE for heap memory Currently the runtime marks all new memory as MADV_HUGEPAGE on Linux and manages its hugepage eligibility status. Unfortunately, the default THP behavior ... — committed to cellularmitosis/go by mknyszek a year ago
- [release-branch.go1.21] runtime: avoid MADV_HUGEPAGE for heap memory Currently the runtime marks all new memory as MADV_HUGEPAGE on Linux and manages its hugepage eligibility status. Unfortunately, t... — committed to golang/go by mknyszek a year ago
- _content/doc: discuss transparent huge pages in the GC guide For golang/go#8832. For golang/go#55328. For golang/go#61718. Change-Id: I1ee51424dc2591a84f09ca8687c113f0af3550d1 Reviewed-on: https://g... — committed to golang/website by mknyszek 10 months ago
- runtime: don't eagerly collapse hugepages This has caused performance issues in production environments. Disable it until further notice. Fixes #63334. Related to #61718 and #59960. Change-Id: If8... — committed to golang/go by mknyszek 9 months ago
- runtime: delete hugepage tracking dead code After the previous CL, this is now all dead code. This change is separated out to make the previous one easy to backport. For #63334. Related to #61718 an... — committed to golang/go by mknyszek 9 months ago
- [release-branch.go1.21] runtime: don't eagerly collapse hugepages This has caused performance issues in production environments. MADV_COLLAPSE can go into direct reclaim, but we call it with the hea... — committed to golang/go by mknyszek 9 months ago
- _content/doc: discuss transparent huge pages in the GC guide For golang/go#8832. For golang/go#55328. For golang/go#61718. Change-Id: I1ee51424dc2591a84f09ca8687c113f0af3550d1 Reviewed-on: https://g... — committed to orijtech/website by mknyszek 10 months ago
`/proc/<pid>/smaps_rollup` can do the totalling for you.

👋 We noticed after updating to Go 1.21.0 that some of our apps were using more off-heap memory than others. There was a wide gap between the reported container memory (Kubernetes) and the process RSS or the Go MemStats heap_sys value.
We figured that this issue might be related, and indeed updating to Go 1.21.1 solves the issue for us, but we aren’t sure how the problem described here could contribute to larger amounts of retained off-heap memory.

First, does the problem described here sound like it could also cause the issue that we are/were seeing? Second, do you have any suggestions for telemetry that we could look at to confirm? None of the usual suspects were showing anything apart from the aforementioned gap between the container memory and the Go heap memory.
The graph below illustrates the gap, with Go 1.21.0 running and then the same service being deployed with 1.20.7.
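For reference, this is roughly the comparison we were doing by hand; a minimal Linux-only sketch (my own, not from the runtime; the VmRSS parsing assumes the usual /proc/self/status formatting):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// vmRSSKiB parses the VmRSS line from /proc/self/status (Linux only).
func vmRSSKiB() (int64, error) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		if strings.HasPrefix(s.Text(), "VmRSS:") {
			fields := strings.Fields(s.Text()) // e.g. ["VmRSS:", "123456", "kB"]
			if len(fields) >= 2 {
				return strconv.ParseInt(fields[1], 10, 64)
			}
		}
	}
	return 0, s.Err()
}

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	rss, err := vmRSSKiB()
	if err != nil {
		panic(err)
	}
	// A large, persistent difference between RSS and what the runtime
	// accounts for is the "off heap" gap we were seeing.
	fmt.Printf("VmRSS:        %8d KiB\n", rss)
	fmt.Printf("HeapSys:      %8d KiB\n", ms.HeapSys/1024)
	fmt.Printf("HeapReleased: %8d KiB\n", ms.HeapReleased/1024)
	fmt.Printf("Sys (total):  %8d KiB\n", ms.Sys/1024)
}
```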
And I forgot to say, thank you for your time and effort in looking into this @dominikh!
Glad to hear!
Unfortunately, Linux doesn’t provide a very good way to observe how much memory is going to huge pages. You can occasionally dump `/proc/<pid>/smaps` and total up the AnonHugePages counts. Other than that, I don’t think there’s a lot you can do. 😦

Fortunately, I don’t think you’ll have to worry about this being a problem from Go in the future. Now that we have a better understanding of the landscape of hugepage-related `madvise` syscalls, I don’t think we’ll be trying to explicitly back the heap with hugepages outside of specific cases, and only on a best-effort basis (like Go 1.21.1 now does).
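For completeness, a rough sketch of that totalling (my own; it assumes the usual `AnonHugePages: <n> kB` line format in smaps):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// anonHugePagesKiB sums the per-mapping AnonHugePages values in
// /proc/<pid>/smaps.
func anonHugePagesKiB(pid string) (int64, error) {
	f, err := os.Open("/proc/" + pid + "/smaps")
	if err != nil {
		return 0, err
	}
	defer f.Close()

	var total int64
	s := bufio.NewScanner(f)
	for s.Scan() {
		// Lines look like: "AnonHugePages:      2048 kB"
		if strings.HasPrefix(s.Text(), "AnonHugePages:") {
			fields := strings.Fields(s.Text())
			if len(fields) >= 2 {
				if kib, err := strconv.ParseInt(fields[1], 10, 64); err == nil {
					total += kib
				}
			}
		}
	}
	return total, s.Err()
}

func main() {
	pid := "self"
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}
	kib, err := anonHugePagesKiB(pid)
	if err != nil {
		panic(err)
	}
	fmt.Printf("AnonHugePages total for %s: %d KiB\n", pid, kib)
}
```

On kernels that have it, `/proc/<pid>/smaps_rollup` (mentioned earlier in the thread) already reports the summed value, so the loop is only needed on older kernels.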
Go 1.21.1 does indeed solve the issue for us.

Below is the output:
Are there any metrics that could have helped us track this down further, or that could help with similar issues in the future? All we were able to tell was that the memory was being retained somewhere “off heap”, but we had little visibility into what it was.
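Not something we verified would have caught this, but one breakdown worth dumping next time is the `/memory/classes/` group from runtime/metrics (Go 1.16+), which splits the runtime’s mapped memory into heap, stacks, metadata and “other” buckets; a short sketch:

```go
package main

import (
	"fmt"
	"runtime/metrics"
	"strings"
)

func main() {
	// Collect every /memory/classes/ metric the runtime exposes.
	var samples []metrics.Sample
	for _, d := range metrics.All() {
		if strings.HasPrefix(d.Name, "/memory/classes/") {
			samples = append(samples, metrics.Sample{Name: d.Name})
		}
	}
	metrics.Read(samples)
	for _, s := range samples {
		if s.Value.Kind() == metrics.KindUint64 {
			fmt.Printf("%-45s %12d bytes\n", s.Name, s.Value.Uint64())
		}
	}
}
```

If the retained memory sits outside everything the runtime tracks, though, this will only confirm the gap rather than explain it.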
Meanwhile, I can reliably reproduce the issue with Go at master, and not at all with Go 1.20. In fact, seeing such high latencies with Go 1.20 seems particularly weird, as it doesn’t actually make much use of huge pages at all.
The problem is very sensitive to the amount of memory fragmentation, which can make it difficult to trigger. There have to be few enough (or no) allocations available for huge pages, and “direct reclaim” must not be able to quickly merge pages, either.
I did end up having more luck with Michael’s stress test than with `stress`. At this point, running 10 instances of Michael’s stress test (compiled with Go master) followed by my reproducer leads to a maximum pause of 1.29s when the reproducer is built with Go master, and ~20ms when built with Go 1.20.6, with the 20ms pauses likely being due to Linux scheduler pressure, as the stress test keeps all cores busy.

It’s also worth noting that the stress test acts as an easier way of reproducing the problem, not a requirement. I originally reproduced the problem just by having a lot of typical, heavy desktop software open (Firefox with a significant number of tabs, Discord, Slack), which left memory quite fragmented. The stress test is simply a much easier way of using up large allocations.