promo-tools: image promotion is very slow despite low resource utilization
When running in Kubernetes CI, image promotion takes more than an hour, often multiple hours.
Worse, it takes an hour just running presubmit tests in the k8s.io repo. As a result, a quickly approved image promotion PR may take 3 hours or more to take effect.
I took a cursory pass at this with https://github.com/kubernetes/test-infra/pull/27743 + https://github.com/kubernetes/test-infra/pull/27765, which set large (7 core, 40 Gi RAM) allocations, configure GOMAXPROCS to match, and increase `--threads` (worker goroutines).
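For context, `--threads` just bounds how many promotion workers run concurrently. A hypothetical sketch of that worker-pool pattern (illustrative only, not the promo-tools implementation):

```go
// Hypothetical sketch of the worker-pool pattern a --threads flag controls;
// illustrative only, not promo-tools code.
package main

import (
	"flag"
	"fmt"
	"runtime"
	"sync"
)

func main() {
	threads := flag.Int("threads", runtime.GOMAXPROCS(0), "number of worker goroutines")
	flag.Parse()

	work := make(chan string)
	var wg sync.WaitGroup

	// Start a fixed number of workers; each pulls images off the channel.
	for i := 0; i < *threads; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for img := range work {
				fmt.Println("promoting", img) // stand-in for copying/verifying one image
			}
		}()
	}

	for _, img := range []string{"image-a", "image-b"} {
		work <- img
	}
	close(work)
	wg.Wait()
}
```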
Since those changes, we can follow from a running job, to its pod, to the node the pod is on, to the backing GCE VM. On that VM there should only be kubelet / system daemons; other workloads will not be scheduled there because essentially no schedulable resources are available (the VM has 8 cores, we request 7, and some are reserved for system agents).
We can see that CPU usage spikes when the job starts (git cloning?) and then the entire machine settles to just 3-7% utilization, with only ~2/s to ~3/s outgoing connections, disk writes < 0.6 MiB/s, and network egress < 0.7 MiB/s, far below the initial spike.
This implies we are bottlenecking heavily on something like waiting for the sigstore API …?
Copying images could otherwise conceivably bottleneck on CPU or network, but both of those are hardly in use.
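One way to confirm that hypothesis would be a goroutine profile: if the workers are mostly parked waiting on sigstore/registry round trips rather than doing CPU work, the profile will show them blocked in net/http. A hedged diagnostic sketch (not wired into promo-tools today):

```go
// Hedged sketch: expose pprof so a goroutine profile can show whether workers
// are blocked on network I/O instead of using CPU. Not part of promo-tools.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers
)

func main() {
	go func() {
		// Inspect with: go tool pprof http://localhost:6060/debug/pprof/goroutine
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... run the promotion workload here ...
	select {}
}
```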
I half-joked about synchronized logging and the volume of logs, but I don’t seriously think it’s contention over the logger.
Besides investigating what exactly is causing this slowdown, I think we should consider running N postsubmit jobs, each responsible for a subset of the images and triggered only when the respective manifest directories change (via run_if_changed), plus some option to the image promoter to specify which manifests to read.
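A minimal sketch of the kind of manifest filter that option could be; the `--manifest-dirs` flag name and the helper below are hypothetical, not an existing promoter option:

```go
// Hypothetical manifest filter: only read manifests under the requested
// subdirectories, so each sharded job promotes its own subset.
package main

import (
	"flag"
	"fmt"
	"path/filepath"
	"strings"
)

// keepManifest reports whether a manifest path falls under one of the
// requested directories; with no filter, everything is kept (as today).
func keepManifest(path string, dirs []string) bool {
	if len(dirs) == 0 {
		return true
	}
	for _, d := range dirs {
		prefix := filepath.Clean(d) + string(filepath.Separator)
		if strings.HasPrefix(filepath.Clean(path), prefix) {
			return true
		}
	}
	return false
}

func main() {
	only := flag.String("manifest-dirs", "", "comma-separated manifest subdirectories to promote")
	flag.Parse()

	var dirs []string
	if *only != "" {
		dirs = strings.Split(*only, ",")
	}
	for _, m := range []string{"images/example-a/images.yaml", "images/example-b/images.yaml"} {
		fmt.Println(m, keepManifest(m, dirs))
	}
}
```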
About this issue
- State: closed
- Created 2 years ago
- Comments: 32 (29 by maintainers)
Working on a faster version of image signature verification in https://github.com/kubernetes-sigs/release-sdk/pull/123
Can get a current sample as soon as I’m back at my desk, and thank you for working on this
This quota is per region though, so we would have to be exceeding 1000 QPS in a single region.
We also need a better release-sdk API for the usage of `VerifyImage`: https://github.com/kubernetes-sigs/promo-tools/blob/289d224460f8bd1d5d5b0cffadac7cd3ff1a9751/internal/promoter/image/sign.go#L135-L157
The method again checks whether the image is signed (not required in our case) and does not reuse the transport.
Edit: Some ideas are now in https://github.com/kubernetes-sigs/release-sdk/issues/124
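On the transport point, the gist would be building one HTTP transport (and its connection pool) up front and passing it to every registry call, instead of constructing new clients per image. A rough sketch assuming go-containerregistry; the image reference is a placeholder:

```go
// Sketch of sharing one HTTP transport across registry lookups so connections
// are reused; assumes go-containerregistry, reference below is a placeholder.
package main

import (
	"fmt"
	"net/http"

	"github.com/google/go-containerregistry/pkg/name"
	"github.com/google/go-containerregistry/pkg/v1/remote"
)

func main() {
	// One transport with a warm connection pool, shared by all lookups.
	shared := &http.Transport{MaxIdleConnsPerHost: 64}

	refs := []string{"gcr.io/k8s-staging-example/example:v1.0.0"} // placeholder
	for _, r := range refs {
		ref, err := name.ParseReference(r)
		if err != nil {
			panic(err)
		}
		desc, err := remote.Get(ref, remote.WithTransport(shared))
		if err != nil {
			panic(err)
		}
		fmt.Println(ref, desc.Digest)
	}
}
```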
Looking at the “Validating signatures from staging: 20 secs” item mentioned in the comment above. The method we measure is `ValidateStagingSignatures`: https://github.com/kubernetes-sigs/promo-tools/blob/b4fc93c17d39c0155179ff8c38b3ffbc4f91088a/internal/promoter/image/sign.go#L123-L125
It’s suspicious that we create a signer in `ValidateStagingSignatures` as well as in `FindSingedEdges`; especially the multiple use of the throttlers in combination with the signing API is rather hard to debug: https://github.com/kubernetes-sigs/promo-tools/blob/b4fc93c17d39c0155179ff8c38b3ffbc4f91088a/internal/promoter/image/sign.go#L74-L91
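A hedged sketch of the alternative, constructing the signer once and handing the same instance to both code paths; the types and names below are illustrative stand-ins, not the release-sdk API:

```go
// Illustrative only: build the signer once and reuse it, instead of creating
// it inside each of ValidateStagingSignatures and FindSingedEdges.
package main

import "sync"

// Signer stands in for the release-sdk signer type.
type Signer struct{}

type promoter struct {
	signerOnce sync.Once
	signer     *Signer
}

// getSigner lazily builds the signer on first use and returns the same
// instance on every later call.
func (p *promoter) getSigner() *Signer {
	p.signerOnce.Do(func() {
		p.signer = &Signer{} // e.g. construct the real signer with its options here
	})
	return p.signer
}

func main() {
	p := &promoter{}
	_ = p.getSigner() // first use constructs it
	_ = p.getSigner() // later uses reuse it
}
```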
I propose to start with some more verbose logging and tracing the timestamps between log messages in https://github.com/kubernetes-sigs/promo-tools/pull/640
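For reference, the kind of timing that PR is after can be as simple as wrapping each phase and logging its duration, so the gaps between log lines become explicit numbers; the phase names below are illustrative:

```go
// Small sketch of per-phase timing; phase names are illustrative, not the
// PR's actual log messages.
package main

import (
	"log"
	"time"
)

// timed runs fn and logs how long it took, labeled by phase.
func timed(phase string, fn func() error) error {
	start := time.Now()
	err := fn()
	log.Printf("%s took %s (err=%v)", phase, time.Since(start), err)
	return err
}

func main() {
	_ = timed("validate staging signatures", func() error {
		time.Sleep(50 * time.Millisecond) // stand-in for the real work
		return nil
	})
	_ = timed("sign promoted images", func() error {
		time.Sleep(80 * time.Millisecond)
		return nil
	})
}
```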