actions-runner-controller: Race condition in upscale webhook
There is a race condition in the webhook-based upscale path when two events occur at the same time. Each event should scale up by +1, yielding 2 extra runners, but due to concurrency only 1 extra runner is booted.
Here is a log from the webhook server: 4 events are handled, but only 3 runners got reserved/booted. I’ve been suspecting this, and I’m super happy I found it, because it means we can fix it.
This could explain many of the reported issues (and we ourselves are hitting this multiple times daily): https://github.com/actions-runner-controller/actions-runner-controller/issues/1279 https://github.com/actions-runner-controller/actions-runner-controller/issues/1238
2022-04-08T13:55:31Z DEBUG controllers.webhookbasedautoscaler Patching hra runner-16-ramdisk for capacityReservations update {"before": 0, "expired": 0, "amount": 1, "after": 1}
2022-04-08T13:55:31Z INFO controllers.webhookbasedautoscaler scaled runner-16-ramdisk by 1
2022-04-08T13:55:33Z DEBUG controllers.webhookbasedautoscaler Found some runner groups are managed by ARC {"event": "workflow_job", "hookID": "332460040", "delivery": "8f26cfa0-b743-11ec-93e7-d12e236a53a0", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:33Z DEBUG controllers.webhookbasedautoscaler groups {"event": "workflow_job", "hookID": "332460040", "delivery": "8f26cfa0-b743-11ec-93e7-d12e236a53a0", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:33Z DEBUG controllers.webhookbasedautoscaler job scale up target found {"event": "workflow_job", "hookID": "332460040", "delivery": "8f26cfa0-b743-11ec-93e7-d12e236a53a0", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "enterprise": "", "organization": "companyname", "repository": "companyname-frontend", "key": "companyname"}
2022-04-08T13:55:33Z DEBUG controllers.webhookbasedautoscaler Patching hra runner-16-ramdisk for capacityReservations update {"before": 1, "expired": 0, "amount": 1, "after": 2}
2022-04-08T13:55:33Z DEBUG controllers.webhookbasedautoscaler Found some runner groups are managed by ARC {"event": "workflow_job", "hookID": "332460040", "delivery": "8eb1c6b0-b743-11ec-8966-73bcb8c0ea68", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:33Z DEBUG controllers.webhookbasedautoscaler groups {"event": "workflow_job", "hookID": "332460040", "delivery": "8eb1c6b0-b743-11ec-8966-73bcb8c0ea68", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:33Z DEBUG controllers.webhookbasedautoscaler job scale up target found {"event": "workflow_job", "hookID": "332460040", "delivery": "8eb1c6b0-b743-11ec-8966-73bcb8c0ea68", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "enterprise": "", "organization": "companyname", "repository": "companyname-frontend", "key": "companyname"}
2022-04-08T13:55:33Z DEBUG controllers.webhookbasedautoscaler Patching hra runner-16-ramdisk for capacityReservations update {"before": 1, "expired": 0, "amount": 1, "after": 2}
2022-04-08T13:55:33Z INFO controllers.webhookbasedautoscaler scaled runner-16-ramdisk by 1
2022-04-08T13:55:33Z INFO controllers.webhookbasedautoscaler scaled runner-16-ramdisk by 1
2022-04-08T13:55:34Z DEBUG controllers.webhookbasedautoscaler Found some runner groups are managed by ARC {"event": "workflow_job", "hookID": "332460040", "delivery": "8f55cee0-b743-11ec-84dd-7ac8ac6c725b", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:34Z DEBUG controllers.webhookbasedautoscaler groups {"event": "workflow_job", "hookID": "332460040", "delivery": "8f55cee0-b743-11ec-84dd-7ac8ac6c725b", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:34Z DEBUG controllers.webhookbasedautoscaler job scale up target found {"event": "workflow_job", "hookID": "332460040", "delivery": "8f55cee0-b743-11ec-84dd-7ac8ac6c725b", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "enterprise": "", "organization": "companyname", "repository": "companyname-frontend", "key": "companyname"}
2022-04-08T13:55:34Z DEBUG controllers.webhookbasedautoscaler Patching hra runner-16-ramdisk for capacityReservations update {"before": 2, "expired": 0, "amount": 1, "after": 3}
2022-04-08T13:55:34Z INFO controllers.webhookbasedautoscaler scaled runner-16-ramdisk by 1
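For reference, below is a minimal, self-contained Go sketch of the read-modify-write pattern that produces the duplicated {"before": 1, "after": 2} patches above. The hra type and handleEvent function are hypothetical stand-ins rather than ARC code, and the sleep only makes the unlucky interleaving easy to reproduce.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// hra is a hypothetical stand-in for the HorizontalRunnerAutoscaler spec; the
// real controller reads and patches capacityReservations on the API server.
type hra struct {
	reservations int
}

// handleEvent mimics the webhook handler's read-modify-write: read the current
// count ("before"), compute "after", write it back.
func handleEvent(h *hra) {
	before := h.reservations          // read "before"
	time.Sleep(10 * time.Millisecond) // a second webhook event is handled meanwhile
	h.reservations = before + 1       // stale write: both events saw the same "before"
}

func main() {
	h := &hra{reservations: 1}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // two workflow_job events arrive at (almost) the same time
		wg.Add(1)
		go func() {
			defer wg.Done()
			handleEvent(h)
		}()
	}
	wg.Wait()

	// Expected 3 (1 existing + 2 events), but this interleaving yields 2,
	// matching the duplicated {"before": 1, "after": 2} patches in the log.
	fmt.Println("reservations:", h.reservations)
}
```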
This is on k8s 1.22 with actions-runner-controller release 0.22.2.
I propose a mutex around the patch of the k8s resource to avoid the stale write seen above.
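A rough sketch of that idea, reusing the hypothetical hra type from the sketch above (an illustration of the approach, not a patch against the actual controller):

```go
package main

import (
	"fmt"
	"sync"
)

// Same hypothetical hra type as in the sketch above.
type hra struct {
	reservations int
}

// patchMu serializes the read-modify-write around the HRA patch, so a second
// concurrent event always observes the first event's write.
var patchMu sync.Mutex

func handleEventLocked(h *hra) {
	patchMu.Lock()
	defer patchMu.Unlock()
	before := h.reservations
	h.reservations = before + 1
}

func main() {
	h := &hra{reservations: 1}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			handleEventLocked(h)
		}()
	}
	wg.Wait()

	fmt.Println("reservations:", h.reservations) // now reliably 3
}
```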
About this issue
- State: closed
- Created 2 years ago
- Comments: 28 (10 by maintainers)
Commits related to this issue
- Make webhook-based scale operation asynchronous Ref #1321 — committed to actions/actions-runner-controller by mumoshu 2 years ago
- Make webhook-based scale operation asynchronous This prevents race condition in the webhook-based autoscaler when it received another webhook event while processing another webhook event and both end... — committed to actions/actions-runner-controller by mumoshu 2 years ago
- Fix confusing logs from pv and pvc controllers Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1321#issuecomment-1137431212 — committed to actions/actions-runner-controller by mumoshu 2 years ago
- Fix confusing logs from pv and pvc controllers (#1487) Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1321#issuecomment-1137431212 — committed to actions/actions-runner-controller by mumoshu 2 years ago
- Fix confusing logs from pv and pvc controllers (#1487) Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1321#issuecomment-1137431212 — committed to cablespaghetti/actions-runner-controller by mumoshu 2 years ago
- Make webhook-based scale race-free (#1477) * Make webhook-based scale operation asynchronous This prevents race condition in the webhook-based autoscaler when it received another webhook event whi... — committed to actions/actions-runner-controller by mumoshu 2 years ago
Took longer than I thought to get to it. But we’re now running the branch. I’ll report back how things are once it runs for a bit.
Forgot to update: so far no problems, and it seems to have fixed our issues.
@cablespaghetti Ah, good catch! That must be the thing that isn’t handled by my quick fix. Let me add one more fix so that it uses optimistic locking on the HRA spec update to fully fix the race, and perhaps batches updates that happen within a few seconds so that it won’t stress your K8s API server too much.
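For context, here is a hedged sketch of what optimistic locking on the HRA spec update could look like using client-go’s conflict-retry helper. The type and field names (HorizontalRunnerAutoscaler, Spec.CapacityReservations, Replicas) are assumed from ARC’s v1alpha1 API and simplified; this is not an excerpt of the eventual fix.

```go
package webhookscale

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"

	v1alpha1 "github.com/actions-runner-controller/actions-runner-controller/api/v1alpha1"
)

// addCapacityReservation adds one reservation to the named HRA using
// optimistic concurrency: if another writer updated the object in between
// (409 Conflict on the stale resourceVersion), the Get+mutate+Update is
// retried instead of silently overwriting the other write.
func addCapacityReservation(ctx context.Context, c client.Client, name types.NamespacedName) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		var hra v1alpha1.HorizontalRunnerAutoscaler
		if err := c.Get(ctx, name, &hra); err != nil {
			return err
		}
		// Field set simplified/assumed; a real reservation would also carry
		// an expiration so it can be cleaned up later.
		hra.Spec.CapacityReservations = append(hra.Spec.CapacityReservations,
			v1alpha1.CapacityReservation{Replicas: 1})
		return c.Update(ctx, &hra)
	})
}
```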
@genisd @cablespaghetti Hey! I finally had some time to actually work on this, and here’s my attempt to fix it without affecting the overall throughput of the webhook server: #1477. I would greatly appreciate it if you could give it a shot and report any issues. You should be able to build and test it in a local cluster by following our contribution guide.
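The commits above describe the fix as making the webhook-based scale operation asynchronous. A minimal sketch of that general shape is below: enqueue from the webhook handler, apply from a single worker so patches are serialized. scaleRequest and scaleCh are hypothetical names; this is not the actual implementation in #1477.

```go
package main

import (
	"log"
	"sync"
)

// scaleRequest is a hypothetical message type: one per accepted workflow_job
// event that should add a capacity reservation to an HRA.
type scaleRequest struct {
	hraName string
	delta   int
}

func main() {
	// A buffered channel decouples webhook handling from API-server writes:
	// handlers only enqueue, so HTTP latency stays low even if patching is slow.
	scaleCh := make(chan scaleRequest, 1024)

	var wg sync.WaitGroup
	wg.Add(1)

	// A single consumer applies all HRA patches one after another, so two
	// events can no longer race on the same read-modify-write.
	go func() {
		defer wg.Done()
		for req := range scaleCh {
			// The real worker would Get/patch the HRA here.
			log.Printf("scaled %s by %d", req.hraName, req.delta)
		}
	}()

	// A webhook handler would do this instead of patching inline:
	scaleCh <- scaleRequest{hraName: "runner-16-ramdisk", delta: 1}
	scaleCh <- scaleRequest{hraName: "runner-16-ramdisk", delta: 1}

	close(scaleCh)
	wg.Wait()
}
```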
So I’m going to close this. Fixed in the 0.25 release.
I’m running tagged 0.25.1 as of right now. So far so good; I’ll keep an eye on it for the next few days.
@Uriende any updates for us?
I’m currently extremely busy with other stuff, and with maternity leave around the corner I cannot promise anything, sorry 😞 I did look at the code and it looked better than my hack 😄
I’ll try to give it a go today or tomorrow. Thanks! @mumoshu
@raunyoliveira-hotmart That’s one of the expected debug log messages, so you shouldn’t worry about it! (Note that our Helm chart enables debug logging by default.)
// See https://github.com/actions-runner-controller/actions-runner-controller/issues/1511
From my limited testing this branch does seem to work reliably. Thanks! 🎉
I am getting a LOT of log messages about unrelated PVCs/PVs, even though I’m not using persistent volumes with my runners, but I imagine that’s unrelated.
So I’m running a mutex in our environment. The codebase is not structured to gate only the read/write critical path, so in my test I’m currently gating all HTTP requests 🙈
The latencies are not terrible, which means this is feasible for us for now; the 95th percentile mostly sits at about 100ms. For medium-sized environments that approach would probably work, but not for really large ones. (Edit: I really have to see how this goes when hooks start rolling in.)
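For illustration, a minimal sketch of the blunt gating described above: a wrapper that serializes every request through one mutex, which is why latency grows once many hooks arrive at once. Handler and function names are hypothetical.

```go
package main

import (
	"log"
	"net/http"
	"sync"
)

// serialize wraps a handler so that only one request is processed at a time.
// This gates every request, not just the HRA read-modify-write.
func serialize(next http.Handler) http.Handler {
	var mu sync.Mutex
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		defer mu.Unlock()
		next.ServeHTTP(w, r)
	})
}

func main() {
	webhook := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Webhook processing (and the HRA patch) would happen here.
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", serialize(webhook)))
}
```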