actions-runner-controller: Race condition in upscale webhook
There is a race condition in the webhook-based upscale path when two events occur at the same time. Each event should scale up by +1, yielding 2 extra runners, but due to concurrency only 1 extra runner is booted.
Here is a log from the webhook server: 4 events are handled, but only 3 runners got reserved/booted. I’ve been suspecting this, and I’m super happy I found it, because it means we can fix it.
This could explain many of the reported issues (and we ourselves are hitting this multiple times daily): https://github.com/actions-runner-controller/actions-runner-controller/issues/1279 https://github.com/actions-runner-controller/actions-runner-controller/issues/1238
2022-04-08T13:55:31Z DEBUG controllers.webhookbasedautoscaler Patching hra runner-16-ramdisk for capacityReservations update {"before": 0, "expired": 0, "amount": 1, "after": 1}
2022-04-08T13:55:31Z INFO controllers.webhookbasedautoscaler scaled runner-16-ramdisk by 1
2022-04-08T13:55:33Z DEBUG controllers.webhookbasedautoscaler Found some runner groups are managed by ARC {"event": "workflow_job", "hookID": "332460040", "delivery": "8f26cfa0-b743-11ec-93e7-d12e236a53a0", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:33Z DEBUG controllers.webhookbasedautoscaler groups {"event": "workflow_job", "hookID": "332460040", "delivery": "8f26cfa0-b743-11ec-93e7-d12e236a53a0", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:33Z DEBUG controllers.webhookbasedautoscaler job scale up target found {"event": "workflow_job", "hookID": "332460040", "delivery": "8f26cfa0-b743-11ec-93e7-d12e236a53a0", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "enterprise": "", "organization": "companyname", "repository": "companyname-frontend", "key": "companyname"}
2022-04-08T13:55:33Z DEBUG controllers.webhookbasedautoscaler Patching hra runner-16-ramdisk for capacityReservations update {"before": 1, "expired": 0, "amount": 1, "after": 2}
2022-04-08T13:55:33Z DEBUG controllers.webhookbasedautoscaler Found some runner groups are managed by ARC {"event": "workflow_job", "hookID": "332460040", "delivery": "8eb1c6b0-b743-11ec-8966-73bcb8c0ea68", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:33Z DEBUG controllers.webhookbasedautoscaler groups {"event": "workflow_job", "hookID": "332460040", "delivery": "8eb1c6b0-b743-11ec-8966-73bcb8c0ea68", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:33Z DEBUG controllers.webhookbasedautoscaler job scale up target found {"event": "workflow_job", "hookID": "332460040", "delivery": "8eb1c6b0-b743-11ec-8966-73bcb8c0ea68", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "enterprise": "", "organization": "companyname", "repository": "companyname-frontend", "key": "companyname"}
2022-04-08T13:55:33Z DEBUG controllers.webhookbasedautoscaler Patching hra runner-16-ramdisk for capacityReservations update {"before": 1, "expired": 0, "amount": 1, "after": 2}
2022-04-08T13:55:33Z INFO controllers.webhookbasedautoscaler scaled runner-16-ramdisk by 1
2022-04-08T13:55:33Z INFO controllers.webhookbasedautoscaler scaled runner-16-ramdisk by 1
2022-04-08T13:55:34Z DEBUG controllers.webhookbasedautoscaler Found some runner groups are managed by ARC {"event": "workflow_job", "hookID": "332460040", "delivery": "8f55cee0-b743-11ec-84dd-7ac8ac6c725b", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:34Z DEBUG controllers.webhookbasedautoscaler groups {"event": "workflow_job", "hookID": "332460040", "delivery": "8f55cee0-b743-11ec-84dd-7ac8ac6c725b", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:34Z DEBUG controllers.webhookbasedautoscaler job scale up target found {"event": "workflow_job", "hookID": "332460040", "delivery": "8f55cee0-b743-11ec-84dd-7ac8ac6c725b", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "enterprise": "", "organization": "companyname", "repository": "companyname-frontend", "key": "companyname"}
2022-04-08T13:55:34Z DEBUG controllers.webhookbasedautoscaler Patching hra runner-16-ramdisk for capacityReservations update {"before": 2, "expired": 0, "amount": 1, "after": 3}
2022-04-08T13:55:34Z INFO controllers.webhookbasedautoscaler scaled runner-16-ramdisk by 1
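For reference, below is a minimal, self-contained Go sketch of the read-modify-write pattern that produces the duplicated {"before": 1, "after": 2} patches above. The hra type and handleEvent function are hypothetical stand-ins rather than ARC code, and the sleep only makes the unlucky interleaving easy to reproduce.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// hra is a hypothetical stand-in for the HorizontalRunnerAutoscaler spec; the
// real controller reads and patches capacityReservations on the API server.
type hra struct {
	reservations int
}

// handleEvent mimics the webhook handler's read-modify-write: read the current
// count ("before"), compute "after", write it back.
func handleEvent(h *hra) {
	before := h.reservations          // read "before"
	time.Sleep(10 * time.Millisecond) // a second webhook event is handled meanwhile
	h.reservations = before + 1       // stale write: both events saw the same "before"
}

func main() {
	h := &hra{reservations: 1}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // two workflow_job events arrive at (almost) the same time
		wg.Add(1)
		go func() {
			defer wg.Done()
			handleEvent(h)
		}()
	}
	wg.Wait()

	// Expected 3 (1 existing + 2 events), but this interleaving yields 2,
	// matching the duplicated {"before": 1, "after": 2} patches in the log.
	fmt.Println("reservations:", h.reservations)
}
```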
This is on k8s 1.22 with actions-runner-controller release 0.22.2.
I propose a mutex around the patch of the k8s resource to avoid the stale write seen above.
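A rough sketch of that idea, reusing the hypothetical hra type from the sketch above (an illustration of the approach, not a patch against the actual controller):

```go
package main

import (
	"fmt"
	"sync"
)

// Same hypothetical hra type as in the sketch above.
type hra struct {
	reservations int
}

// patchMu serializes the read-modify-write around the HRA patch, so a second
// concurrent event always observes the first event's write.
var patchMu sync.Mutex

func handleEventLocked(h *hra) {
	patchMu.Lock()
	defer patchMu.Unlock()
	before := h.reservations
	h.reservations = before + 1
}

func main() {
	h := &hra{reservations: 1}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			handleEventLocked(h)
		}()
	}
	wg.Wait()

	fmt.Println("reservations:", h.reservations) // now reliably 3
}
```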
About this issue
- State: closed
- Created 2 years ago
- Comments: 28 (10 by maintainers)
Commits related to this issue
- Make webhook-based scale operation asynchronous Ref #1321 — committed to actions/actions-runner-controller by mumoshu 2 years ago
- Make webhook-based scale operation asynchronous This prevents race condition in the webhook-based autoscaler when it received another webhook event while processing another webhook event and both end... — committed to actions/actions-runner-controller by mumoshu 2 years ago
- Fix confusing logs from pv and pvc controllers Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1321#issuecomment-1137431212 — committed to actions/actions-runner-controller by mumoshu 2 years ago
- Fix confusing logs from pv and pvc controllers (#1487) Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1321#issuecomment-1137431212 — committed to actions/actions-runner-controller by mumoshu 2 years ago
- Fix confusing logs from pv and pvc controllers (#1487) Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1321#issuecomment-1137431212 — committed to cablespaghetti/actions-runner-controller by mumoshu 2 years ago
- Make webhook-based scale race-free (#1477) * Make webhook-based scale operation asynchronous This prevents race condition in the webhook-based autoscaler when it received another webhook event whi... — committed to actions/actions-runner-controller by mumoshu 2 years ago
Took longer than I thought to get to it. But we’re now running the branch. I’ll report back how things are once it runs for a bit.
Forgot to update: so far no problems, and it seems to have fixed our issues.
@cablespaghetti Ah, good catch! That must be the thing that isn’t handled by my quick fix. Let me add one more fix so that it uses optimistic locking on the HRA spec update to fully fix the race, and perhaps batches updates that happen within a few seconds so that it won’t stress your K8s API server too much.
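For context, here is a hedged sketch of what optimistic locking on the HRA spec update could look like using client-go’s conflict-retry helper. The type and field names (HorizontalRunnerAutoscaler, Spec.CapacityReservations, Replicas) are assumed from ARC’s v1alpha1 API and simplified; this is not an excerpt of the eventual fix.

```go
package webhookscale

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"

	v1alpha1 "github.com/actions-runner-controller/actions-runner-controller/api/v1alpha1"
)

// addCapacityReservation adds one reservation to the named HRA using
// optimistic concurrency: if another writer updated the object in between
// (409 Conflict on the stale resourceVersion), the Get+mutate+Update is
// retried instead of silently overwriting the other write.
func addCapacityReservation(ctx context.Context, c client.Client, name types.NamespacedName) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		var hra v1alpha1.HorizontalRunnerAutoscaler
		if err := c.Get(ctx, name, &hra); err != nil {
			return err
		}
		// Field set simplified/assumed; a real reservation would also carry
		// an expiration so it can be cleaned up later.
		hra.Spec.CapacityReservations = append(hra.Spec.CapacityReservations,
			v1alpha1.CapacityReservation{Replicas: 1})
		return c.Update(ctx, &hra)
	})
}
```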
@genisd @cablespaghetti Hey! I finally had some time to actually work on this, and here’s my attempt to fix it without affecting the overall throughput of the webhook server: #1477. I would greatly appreciate it if you could give it a shot and report any issues. You should be able to build and test it in a local cluster by following our contribution guide.
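The commits above describe the fix as making the webhook-based scale operation asynchronous. A minimal sketch of that general shape is below: enqueue from the webhook handler, apply from a single worker so patches are serialized. scaleRequest and scaleCh are hypothetical names; this is not the actual implementation in #1477.

```go
package main

import (
	"log"
	"sync"
)

// scaleRequest is a hypothetical message type: one per accepted workflow_job
// event that should add a capacity reservation to an HRA.
type scaleRequest struct {
	hraName string
	delta   int
}

func main() {
	// A buffered channel decouples webhook handling from API-server writes:
	// handlers only enqueue, so HTTP latency stays low even if patching is slow.
	scaleCh := make(chan scaleRequest, 1024)

	var wg sync.WaitGroup
	wg.Add(1)

	// A single consumer applies all HRA patches one after another, so two
	// events can no longer race on the same read-modify-write.
	go func() {
		defer wg.Done()
		for req := range scaleCh {
			// The real worker would Get/patch the HRA here.
			log.Printf("scaled %s by %d", req.hraName, req.delta)
		}
	}()

	// A webhook handler would do this instead of patching inline:
	scaleCh <- scaleRequest{hraName: "runner-16-ramdisk", delta: 1}
	scaleCh <- scaleRequest{hraName: "runner-16-ramdisk", delta: 1}

	close(scaleCh)
	wg.Wait()
}
```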
So I’m going to close this. Fixed in the 0.25 release.
I’m running tagged 0.25.1 as of right now. So far so good; I’ll keep an eye on it for the next few days.
@Uriende any updates for us?
I’m currently extremely busy with other stuff, and with maternity leave around the corner I cannot promise anything, sorry 😞 I did look at the code and it looked better than my hack 😄
I’ll try to give it a go today or tomorrow. Thanks! @mumoshu
@raunyoliveira-hotmart That’s one of the expected debug log messages, so you shouldn’t worry about it! (Note that our Helm chart enables debug logging by default.)
// See https://github.com/actions-runner-controller/actions-runner-controller/issues/1511
From my limited testing this branch does seem to work reliably. Thanks! 🎉
I am getting a LOT of log messages about unrelated PVCs/PVs, even though I’m not using persistent volumes with my runners, but I imagine that’s unrelated.
So I’m running a mutex in our environment. The codebase is not structured to gate only the read/write critical path, so in my test I’m currently gating all HTTP requests 🙈
The latencies are not terrible, which means this is feasible for us for now; the 95th percentile mostly sits at about 100ms. For medium-sized environments that approach would probably work, but not for really large ones. (Edit: I really have to see how this goes when hooks start rolling in.)
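For illustration, a minimal sketch of the blunt gating described above: a wrapper that serializes every request through one mutex, which is why latency grows once many hooks arrive at once. Handler and function names are hypothetical.

```go
package main

import (
	"log"
	"net/http"
	"sync"
)

// serialize wraps a handler so that only one request is processed at a time.
// This gates every request, not just the HRA read-modify-write.
func serialize(next http.Handler) http.Handler {
	var mu sync.Mutex
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		defer mu.Unlock()
		next.ServeHTTP(w, r)
	})
}

func main() {
	webhook := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Webhook processing (and the HRA patch) would happen here.
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", serialize(webhook)))
}
```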