promo-tools: image promotion is very slow despite low resource utilization

When running in Kubernetes CI, image promotion takes more than an hour, often multiple hours.

Worse, it takes an hour just to run the presubmit tests in the k8s.io repo. As a result, a quickly approved image promotion PR may take 3 hours or more to take effect.

I took a cursory pass at this with https://github.com/kubernetes/test-infra/pull/27743 + https://github.com/kubernetes/test-infra/pull/27765, which set large (7-core, 40 Gi RAM) allocations, configure GOMAXPROCS to match, and increase --threads (worker goroutines).

Since those changes, we can follow from a running job to its pod, to the node the pod is on, to the backing GCE VM. On that VM there should be only kubelet / system daemons; other workloads will not be scheduled there because essentially no schedulable resources remain (the VM has 8 cores, we request 7, and some are reserved for system agents).

We can see that CPU usage spikes when the job starts (git cloning?) and then the entire machine settles to just 3–7% utilization, with only ~2–3 outgoing connections/s, disk writes < 0.6 MiB/s, and network egress < 0.7 MiB/s, far below the initial spike: [screenshot: Screen Shot 2022-10-18 at 8 34 40 PM]

This implies we are bottlenecking heavily on something like waiting for the sigstore API …?

Copying images could otherwise conceivably bottleneck on CPU or network, but both are barely in use.

I half-joked about synchronized logging and the volume of logs, but I don’t seriously think it’s contention over the logger.

Besides investigating what exactly is causing this slowdown, I think we should consider running N postsubmit jobs, each responsible for a subset of the images and triggered only when the respective manifest directories change, based on run_if_changed plus some option to the image promoter to specify which manifests to read.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 32 (29 by maintainers)

Most upvoted comments

Working on a faster version of image signature verification in https://github.com/kubernetes-sigs/release-sdk/pull/123

Can get a current sample as soon as I’m back at my desk, and thank you for working on this

The promoter looks for every image in every mirror to see how the target registry compares to the source. I think that is it: we have a ton of images, and the new mirrors multiplied those lookups about 9x, which is why the logs are now humongous too 😓

This quota is per region, though, so we have to be exceeding 1000 QPS in one region.

We also need a better release-sdk API for the usage of VerifyImage:

https://github.com/kubernetes-sigs/promo-tools/blob/289d224460f8bd1d5d5b0cffadac7cd3ff1a9751/internal/promoter/image/sign.go#L135-L157

The method again checks whether the image is signed (not required in our case) and also does not reuse the HTTP transport.

Edit: Some ideas are now in https://github.com/kubernetes-sigs/release-sdk/issues/124

Looking at the “Validating signatures from staging: 20 secs” issue mentioned in the comment above: the method we measure is ValidateStagingSignatures:

https://github.com/kubernetes-sigs/promo-tools/blob/b4fc93c17d39c0155179ff8c38b3ffbc4f91088a/internal/promoter/image/sign.go#L123-L125

It’s suspicious that we create a signer in both ValidateStagingSignatures and FindSignedEdges; in particular, the multiple uses of the throttlers in combination with the signing API are kinda hard to debug:

https://github.com/kubernetes-sigs/promo-tools/blob/b4fc93c17d39c0155179ff8c38b3ffbc4f91088a/internal/promoter/image/sign.go#L74-L91

I propose to start with some more verbose logging and tracing the timestamps between log messages in https://github.com/kubernetes/kubernetes-sigs/promo-tools/pull/640