actions-runner-controller: Race condition in upscale webhook

There is a race condition in the webhook-based upscale path when two events arrive at (nearly) the same time. Each of the two events should scale +1, yielding 2 extra runners, but due to the concurrent handling only 1 extra runner is booted.

Here is a log from the webhook server: 4 events are handled, but only 3 runners got reserved/booted. Note the two "Patching hra" lines at 13:55:33 that both record {"before": 1, ..., "after": 2} — the second write clobbers the first. I've been suspecting this and I'm super happy I found it, because it means we can fix it.

This could explain many reported issues (we ourselves are hitting it multiple times daily), e.g. https://github.com/actions-runner-controller/actions-runner-controller/issues/1279 and https://github.com/actions-runner-controller/actions-runner-controller/issues/1238.

2022-04-08T13:55:31Z	DEBUG	controllers.webhookbasedautoscaler	Patching hra runner-16-ramdisk for capacityReservations update	{"before": 0, "expired": 0, "amount": 1, "after": 1}
2022-04-08T13:55:31Z	INFO	controllers.webhookbasedautoscaler	scaled runner-16-ramdisk by 1
2022-04-08T13:55:33Z	DEBUG	controllers.webhookbasedautoscaler	Found some runner groups are managed by ARC	{"event": "workflow_job", "hookID": "332460040", "delivery": "8f26cfa0-b743-11ec-93e7-d12e236a53a0", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:33Z	DEBUG	controllers.webhookbasedautoscaler	groups	{"event": "workflow_job", "hookID": "332460040", "delivery": "8f26cfa0-b743-11ec-93e7-d12e236a53a0", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:33Z	DEBUG	controllers.webhookbasedautoscaler	job scale up target found	{"event": "workflow_job", "hookID": "332460040", "delivery": "8f26cfa0-b743-11ec-93e7-d12e236a53a0", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "enterprise": "", "organization": "companyname", "repository": "companyname-frontend", "key": "companyname"}
2022-04-08T13:55:33Z	DEBUG	controllers.webhookbasedautoscaler	Patching hra runner-16-ramdisk for capacityReservations update	{"before": 1, "expired": 0, "amount": 1, "after": 2}
2022-04-08T13:55:33Z	DEBUG	controllers.webhookbasedautoscaler	Found some runner groups are managed by ARC	{"event": "workflow_job", "hookID": "332460040", "delivery": "8eb1c6b0-b743-11ec-8966-73bcb8c0ea68", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:33Z	DEBUG	controllers.webhookbasedautoscaler	groups	{"event": "workflow_job", "hookID": "332460040", "delivery": "8eb1c6b0-b743-11ec-8966-73bcb8c0ea68", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:33Z	DEBUG	controllers.webhookbasedautoscaler	job scale up target found	{"event": "workflow_job", "hookID": "332460040", "delivery": "8eb1c6b0-b743-11ec-8966-73bcb8c0ea68", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "enterprise": "", "organization": "companyname", "repository": "companyname-frontend", "key": "companyname"}
2022-04-08T13:55:33Z	DEBUG	controllers.webhookbasedautoscaler	Patching hra runner-16-ramdisk for capacityReservations update	{"before": 1, "expired": 0, "amount": 1, "after": 2}
2022-04-08T13:55:33Z	INFO	controllers.webhookbasedautoscaler	scaled runner-16-ramdisk by 1
2022-04-08T13:55:33Z	INFO	controllers.webhookbasedautoscaler	scaled runner-16-ramdisk by 1
2022-04-08T13:55:34Z	DEBUG	controllers.webhookbasedautoscaler	Found some runner groups are managed by ARC	{"event": "workflow_job", "hookID": "332460040", "delivery": "8f55cee0-b743-11ec-84dd-7ac8ac6c725b", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:34Z	DEBUG	controllers.webhookbasedautoscaler	groups	{"event": "workflow_job", "hookID": "332460040", "delivery": "8f55cee0-b743-11ec-84dd-7ac8ac6c725b", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:34Z	DEBUG	controllers.webhookbasedautoscaler	job scale up target found	{"event": "workflow_job", "hookID": "332460040", "delivery": "8f55cee0-b743-11ec-84dd-7ac8ac6c725b", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "enterprise": "", "organization": "companyname", "repository": "companyname-frontend", "key": "companyname"}
2022-04-08T13:55:34Z	DEBUG	controllers.webhookbasedautoscaler	Patching hra runner-16-ramdisk for capacityReservations update	{"before": 2, "expired": 0, "amount": 1, "after": 3}
2022-04-08T13:55:34Z	INFO	controllers.webhookbasedautoscaler	scaled runner-16-ramdisk by 1

This is on k8s 1.22 with actions-runner-controller release 0.22.2.

I propose a mutex around the patch of the k8s resource to avoid the stale write seen above.
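For illustration, here is a minimal sketch in Go of what that mutex could look like. All names and types here (scaler, scaleUp, the reservations counter) are stand-ins, not the actual ARC code: the point is just that the lock makes the read-modify-write atomic, so a second delivery observes the first one's write instead of the stale "before" value.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// hra stands in for the HorizontalRunnerAutoscaler's capacityReservations.
type hra struct {
	reservations int
}

type scaler struct {
	mu  sync.Mutex // guards the read-modify-write below
	hra hra        // stand-in for the object read from / written to the API server
}

// scaleUp is what each webhook delivery would call. Without the mutex, two
// goroutines can both read reservations==1 and both write 2, losing one of
// the two increments (the stale write visible in the logs above).
func (s *scaler) scaleUp(ctx context.Context, amount int) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	before := s.hra.reservations // read
	after := before + amount     // modify
	s.hra.reservations = after   // write (in ARC this is a PATCH to the API server)
	return nil
}

func main() {
	s := &scaler{}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // two "concurrent webhook deliveries"
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = s.scaleUp(context.Background(), 1)
		}()
	}
	wg.Wait()
	fmt.Println(s.hra.reservations) // always 2 with the mutex; 1 or 2 without
}
```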

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 28 (10 by maintainers)

Most upvoted comments

Took longer than I thought to get to it. But we're now running the branch. I'll report back how things are once it has run for a bit.

Forgot to update: so far no issues, and it seems to have fixed things for us.

Could this be due to me running two instances of the webhook pods?

@cablespaghetti Ah, good catch! That must be the thing that isn't handled by my quick fix. Let me add one more fix so that it uses optimistic locking on the HRA spec update to fully fix the race, and perhaps batch updates that happen within a few seconds so that it won't stress your K8s API server too much.
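For reference, a hedged sketch of what optimistic locking on the HRA update can look like using client-go's conflict-retry helper. This is illustrative, not the actual #1477 implementation: addReservation is a hypothetical helper, and the v1alpha1 field names are assumed from ARC's API.

```go
package hrascale

import (
	"context"

	arcv1alpha1 "github.com/actions-runner-controller/actions-runner-controller/api/v1alpha1"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// addReservation re-reads the HRA and retries the update whenever the API
// server rejects it with a resourceVersion conflict. Unlike an in-process
// mutex, this also covers two webhook pods racing against each other: a
// concurrent writer forces a retry instead of a lost increment.
func addReservation(ctx context.Context, c client.Client, key client.ObjectKey, replicas int) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		var hra arcv1alpha1.HorizontalRunnerAutoscaler
		if err := c.Get(ctx, key, &hra); err != nil {
			return err
		}
		// Mutate the freshly read object; its resourceVersion makes the
		// Update conditional on no one else having written in between.
		hra.Spec.CapacityReservations = append(hra.Spec.CapacityReservations,
			arcv1alpha1.CapacityReservation{Replicas: replicas})
		return c.Update(ctx, &hra)
	})
}
```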

@genisd @cablespaghetti Hey! I finally had some time to actually work on this, and here’s my attempt to fix it without affecting the overall throughput of the webhook server: #1477 I would greatly appreciate it if you could give it a shot and report any issues. You should be able to build and test it in a local cluster following our contribution guide.

So I’m going to close this. Fixed in the 0.25 release.

I’m running tagged 0.25.1 as of right now. So far so good; I’ll keep an eye on it for the next few days.

@Uriende any updates for us?

I’m currently extremely busy with other stuff, and with maternity leave around the corner I cannot promise anything, sorry 😞 I did look at the code and it looked better than my hack 😄

I’ll try to give it a go today or tomorrow. Thanks! @mumoshu

@raunyoliveira-hotmart That’s one of the expected debug log messages, so you shouldn’t worry about it! (Note that our Helm chart enables debug logging by default.)

// See https://github.com/actions-runner-controller/actions-runner-controller/issues/1511

From my limited testing this branch does seem to work reliably. Thanks! 🎉

I am getting a LOT of log messages about unrelated PVCs/PVs, even though I’m not using persistent volumes with my runners. But I imagine that’s unrelated.

2022-05-25T15:20:40Z	INFO	actions-runner-controller.runnerpersistentvolumeclaim	Reconciling runner pvc	{"pvc": "observability/prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0"}
2022-05-25T15:20:40Z	INFO	actions-runner-controller.runnerpersistentvolumeclaim	Reconciling runner pvc	{"pvc": "monolith-planning-page-capacity-test/data-pet-db-postgresql-0"}
2022-05-25T15:20:42Z	DEBUG	actions-runner-controller.runnerpersistentvolume	Retrying sync until pvc gets removed	{"pv": "/pvc-c91139d3-e44d-44cc-bf94-45b1d8e478e6", "requeueAfter": "10s"}
2022-05-25T15:20:42Z	DEBUG	actions-runner-controller.runnerpersistentvolume	Retrying sync until pvc gets removed	{"pv": "/pvc-e6e8810f-8933-42ff-a122-a70ed466be26", "requeueAfter": "10s"}

So I’m running a mutex in our environment. The codebase is not structured to let me gate only the read/write critical path, so in my test I’m currently gating all HTTP requests 🙈
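In case it helps anyone else, this is roughly the shape of that workaround: a hypothetical net/http middleware (serialize is my name, not anything in ARC) that lets only one webhook request through at a time.

```go
package main

import (
	"net/http"
	"sync"
)

// serialize wraps a handler so only one request is processed at a time.
// This removes the read-modify-write race at the cost of throughput,
// hence the latency numbers below.
func serialize(next http.Handler) http.Handler {
	var mu sync.Mutex
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		defer mu.Unlock()
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		// ... scale-up handling would go here ...
		w.WriteHeader(http.StatusOK)
	})
	_ = http.ListenAndServe(":8080", serialize(mux))
}
```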

The latencies are not super horrible, which means this is feasible for us for now: the 95th percentile sits mostly at about 100ms. So for medium-sized environments that approach would probably work, but not for really large environments. (Edit: I really have to see how this goes when hooks start rolling in.)