go: runtime: 1.13 performance regression on kubernetes scaling

Together with the golang team, we (kubernetes sig-scalability) are testing the not-yet-released golang1.13 against kubernetes scale tests.

What version of Go are you using (go version)?

$ go version
devel +c290cb6338 Sun Jun 23 22:20:39 2019 +0000

Does this issue reproduce with the latest release?

Yes, it reproduces with golang head.

What did you do?

We’ve run k8s scale tests with kubernetes compiled with go1.13. There is a visible increase in the 99th percentile of api-call latency compared to the same version of kubernetes compiled with go1.12.5.

What did you expect to see?

We expected api-call latency similar to the baseline run compiled with go1.12.5.

What did you see instead?

The api-call latency was visibly worse in the run compiled with golang1.13.

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 3
  • Comments: 57 (29 by maintainers)

Most upvoted comments

Hey Folks,

I think I have great news.

I’ve re-run the full tests on golang1.12 (baseline) and on golang1.13 (at https://github.com/golang/go/commit/2754118731b81baf0c812116a9b72b6153abf79d) with the two patches we had already tried (which helped, but didn’t fully fix the performance issue), i.e. https://go-review.googlesource.com/c/go/+/189957/3 and https://go-review.googlesource.com/c/go/+/183857/8, plus @mknyszek’s new patch, https://go-review.googlesource.com/c/go/+/193040/2. There is no visible difference in api-call latency; in some aspects the golang1.13 run is even slightly better than the baseline.

I’ll post a more detailed comparison here later, but the summary is that if we can land all three patches, then we, K8s SIG-Scalability, will be completely satisfied with golang1.13 performance and can basically close this issue.

@aclements, @mknyszek, any thoughts on that? How hard would it be to release these patches? Any estimate on the potential timeline?

This issue is now closed because the fixes landed on master; for Go 1.13 we have two backport issues that track the two problems and their fixes.

They are #34556 and #34149.

The comparison I promised in the previous post.

Api-call latency (non-LIST)

[Graphs: api-call latency, baseline (api-call-baseline) vs. golang1.13 (api-call-113)]

The results are very similar. golang1.13 has 3 spikes exceeding 1.5s that we don’t see in the baseline, but this is most likely noise; I’m quite sure that if we re-ran the tests, these spikes might disappear or even show up in the baseline. Single spikes are acceptable, which is why our performance SLIs are based not on instantaneous api-call latency but on aggregation over the whole test, which we compare below.
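To make that aggregation concrete, here is a rough, self-contained sketch of how a whole-test percentile can be computed. The `percentile` helper and the sample values are illustrative only, not the actual perf-tests code:

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0 < p <= 100) of the given latency
// samples using the nearest-rank method.
func percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(float64(len(sorted)) * p / 100))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical whole-test sample set: ~10,000 calls in the 100-300ms range
	// plus three isolated 1.6s spikes, mimicking the graphs above.
	var samples []time.Duration
	for i := 0; i < 10000; i++ {
		samples = append(samples, time.Duration(100+i%200)*time.Millisecond)
	}
	samples = append(samples, 1600*time.Millisecond, 1600*time.Millisecond, 1600*time.Millisecond)

	// Three spikes among ~10,000 samples barely move the whole-test p99,
	// which is why isolated spikes alone don't fail the SLI.
	fmt.Println("p99 over the whole test:", percentile(samples, 99))
}
```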

Aggregated api-call latency

[Graphs: aggregated api-call latency, baseline (api-call-sli-baseline) vs. golang1.13 (api-call-sli-113)]

During the test, the graph for golang1.13 looks worse because of the 3 spikes above, but at the end of the test the golang1.13 run actually looks better: only 6 metrics exceed 500ms (vs. 8 in the baseline) and none exceed 1s (vs. 1 in the baseline).

Aggregated LIST latency

[Graphs: aggregated LIST latency, baseline (list-latency-baseline) vs. golang1.13 (list-latency-113)]

This time the comparison clearly favors golang1.13: the aggregated LIST latency is visibly better in the golang1.13 run.

To sum up, some aspects are worse for golang1.13 and some are better, but most of the differences are most likely just noise. Overall the golang1.13 run looks quite good compared to the baseline; nothing stands out, and golang1.13 performance looks reasonable from the k8s scalability point of view. In other words, with the three patches we don’t observe any obvious performance degradation between golang1.12 and golang1.13.

I believe we can close this issue once all three patches merge.

Let me know if there is anything more we can assist you with and thank you so much for your tireless work on this issue!

I think I’ve narrowed down the range of CLs which contain the source of the regression, and they’re all mine.

It turns out it has nothing to do with the background scavenger, but rather with changes I made to the treap implementation in the runtime, which feels familiar. Reducing the number of treap operations required for allocation/free/scavenging probably isn’t possible without a huge refactoring, so I think the main thing to do here is to try to make the operations themselves faster. I didn’t expect what I added to have such a large effect, but given that there’s a lot of heap lock contention, I suppose I shouldn’t be surprised.
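For readers less familiar with the runtime internals, the shape of the problem looks roughly like the toy model below: span allocation and free both happen under a single heap lock, so any extra per-operation cost is amplified by contention on that lock. This is only an illustrative sketch, not the runtime’s actual mheap/treap code; the `freeSpans` type and its linear search stand in for the real treap.

```go
package spantoy

import "sync"

// span is a stand-in for a run of free pages tracked by the runtime.
type span struct {
	base, npages uintptr
}

// freeSpans is a toy model of the structure holding free spans. In the real
// runtime this is a treap; here only the locking shape matters.
type freeSpans struct {
	mu    sync.Mutex // analogous to the runtime's heap lock
	spans []span
}

// alloc removes and returns a span with at least npages pages. The search
// runs while holding the lock, so making it slower directly increases
// heap-lock contention -- the effect observed in the regression.
func (f *freeSpans) alloc(npages uintptr) (span, bool) {
	f.mu.Lock()
	defer f.mu.Unlock()
	for i, s := range f.spans { // stands in for the O(log n) treap walk
		if s.npages >= npages {
			f.spans = append(f.spans[:i], f.spans[i+1:]...)
			return s, true
		}
	}
	return span{}, false
}

// free returns a span to the structure, again under the same lock.
func (f *freeSpans) free(s span) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.spans = append(f.spans, s) // the real code re-inserts into the treap
}
```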

In the past I attempted to add a cache to speed up allocations, but it didn’t help much; perhaps there’s something still unexplored here. I’ll think about it some more.
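The details of that earlier caching attempt aren’t in this thread, but the general idea of putting a small cache in front of a lock-protected allocator looks roughly like the hypothetical sketch below, which reuses the toy `freeSpans` model above. It is not the change that was actually tried in the runtime.

```go
// cachingAllocator is a purely hypothetical illustration: it keeps the most
// recently freed span in a one-entry cache and tries it before falling back
// to the full search. Not the actual runtime experiment.
type cachingAllocator struct {
	inner *freeSpans
	mu    sync.Mutex
	last  span // most recently freed span; zero value means "empty"
}

func (c *cachingAllocator) alloc(npages uintptr) (span, bool) {
	c.mu.Lock()
	if c.last.npages >= npages {
		s := c.last
		c.last = span{}
		c.mu.Unlock()
		return s, true // cache hit: the treap-equivalent walk is skipped
	}
	c.mu.Unlock()
	return c.inner.alloc(npages) // cache miss: full search under the heap lock
}

func (c *cachingAllocator) free(s span) {
	c.mu.Lock()
	c.last, s = s, c.last // keep the newest span in the cache
	c.mu.Unlock()
	if s.npages > 0 {
		c.inner.free(s) // return the displaced span to the main structure
	}
}
```

Whether a cache like this pays off depends on how often consecutive allocations can reuse recently freed spans; in the real runtime it also has to interact with span coalescing and scavenging, which presumably is part of why the earlier attempt didn’t help much.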