kube-image-keeper: Infinite loop of caching failures on some images and all subsequent new images

Hi, I’m currently running kuik 1.20 with an S3 backend and noticed that several images failed to be cached, with the following event:

Events:
  Type     Reason       Age                     From                    Message
  ----     ------       ----                    ----                    -------
  Normal   Caching      6m11s (x300 over 3d9h)  cachedimage-controller  Start caching image docker.io/bitnami/redis:6.2.10-debian-11-r13
  Warning  CacheFailed  6m9s (x300 over 3d9h)   cachedimage-controller  Failed to cache image docker.io/bitnami/redis:6.2.10-debian-11-r13, reason: POST http://kube-image-keeper-registry:5000/v2/docker.io/bitnami/redis/blobs/uploads/: unexpected status code 405 Method Not Allowed: Method not allowed

Restarting the controller or registry pod doesn’t seem to fix it. I’ve also tried deleting the cached image, but got the same result.

The only workaround that seemed to solve it was to uninstall kuik, delete the S3 bucket, and reinstall it. But soon after, once an image fails to be cached, no new images can be cached either.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

So far so good, the issue no longer happens. I think we can conclude that this is due to unclean garbage collection, a side effect of ArgoCD’s continuous reconciliation conflicting with kuik’s garbage collection job.
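For context, a sketch of the suspected mechanism (an assumption based on the MAINTENANCE-mode symptom described below): the REGISTRY_STORAGE_MAINTENANCE_READONLY environment variable corresponds to the Docker Distribution read-only maintenance setting, and while it is enabled the registry rejects blob uploads, which matches the 405 Method Not Allowed in the events above.

# Docker Distribution (registry) config roughly equivalent to what
# REGISTRY_STORAGE_MAINTENANCE_READONLY toggles. Illustration only, not kuik's
# exact mechanism: if a garbage collection run is interrupted and this setting
# is never switched back off, POSTs to /v2/.../blobs/uploads/ keep failing.
storage:
  maintenance:
    readonly:
      enabled: true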

thanks @paullaffitte and @donch

Hi @rucciva,

I’m able to reproduce your issue with the registry staying in MAINTENANCE mode.

I’ve identified two cases in which this can happen:

  • The garbage collector job runs while no images are cached, causing the garbage collector command to fail.
  • The garbage collector command can’t be executed correctly. This is what’s happening in your case, because Linkerd injection is enabled in kuik’s namespace.

I’ve tried to force the command to be executed inside the registry container here, but even with this method the job can’t complete due to this behavior.

To fix your issue, we suggest you disable the Linkerd injection annotation for kuik’s namespace.
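For instance, a minimal sketch of what that could look like at the namespace level (the namespace name is an assumption; adjust it to wherever kuik is installed):

# Sketch: opt kuik's namespace out of Linkerd proxy injection so the garbage
# collection job's command can run and terminate cleanly.
# The namespace name below is an assumption.
apiVersion: v1
kind: Namespace
metadata:
  name: kube-image-keeper
  annotations:
    linkerd.io/inject: disabled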

Hi @donch

To confirm that, is it possible to disable the “autosync” feature in your ArgoCD, at least for kuik’s components?

Yes, currently I’ve set it up so that changes to the REGISTRY_STORAGE_MAINTENANCE_READONLY environment variable are ignored, hoping that the problem does not re-occur.
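For reference, with Argo CD this kind of exclusion can be expressed as an ignoreDifferences rule. The sketch below assumes the registry runs as a StatefulSet named kube-image-keeper-registry; adjust the kind and name to match your actual setup:

# Sketch of an Argo CD Application ignoreDifferences entry that stops the sync
# loop from reverting the REGISTRY_STORAGE_MAINTENANCE_READONLY env var while
# kuik's garbage collection is running. The target kind and name are assumptions.
spec:
  ignoreDifferences:
    - group: apps
      kind: StatefulSet
      name: kube-image-keeper-registry
      jqPathExpressions:
        - .spec.template.spec.containers[].env[]? | select(.name == "REGISTRY_STORAGE_MAINTENANCE_READONLY")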

I’ve just tried to drain a node on my cluster, and all my workloads were correctly rescheduled without issues.

This is also random and happens after some time when using the stateless registry. I’m not sure what triggers it.