spinnaker: Clouddriver: stops syncing Docker registry at intermittent intervals
Issue Summary:
Potentially related issue: #4780
Clouddriver will stop syncing with our Docker registry at intermittent intervals. There is no error message logged by Clouddriver, even after setting the log level to TRACE.
Cloud Provider(s):
AWS EKS
Environment:
- EKS v1.18
- Halyard 1.40.0
- Spinnaker 1.23.3
- Clouddriver 7.0.3
Feature Area:
Clouddriver
Description:
Steps to Reproduce:
Additional Details:
When working correctly, Clouddriver outputs an info message stating that it is going to start describing the cached items in the registry, followed by the number of IDs and tags it grabs:
2020-12-14 07:01:10.838 INFO 1 --- [cutionAction-29] .d.r.p.a.DockerRegistryImageCachingAgent : Describing items in <docker-registry>/DockerRegistryImageCachingAgent[1/1]
2020-12-14 07:01:13.103 INFO 1 --- [cutionAction-29] .d.r.p.a.DockerRegistryImageCachingAgent : Caching 21174 tagged images in <docker-registry>/DockerRegistryImageCachingAgent[1/1]
2020-12-14 07:01:13.103 INFO 1 --- [cutionAction-29] .d.r.p.a.DockerRegistryImageCachingAgent : Caching 21174 image ids in <docker-registry>/DockerRegistryImageCachingAgent[1/1]
However, I’ve noticed that after Clouddriver stops syncing the registry, the messages for the cached tags and IDs are no longer printed, and we only get the message from the describe call:
2020-12-13 08:09:57.750 INFO 1 --- [utionAction-116] .d.r.p.a.DockerRegistryImageCachingAgent : Describing items in <docker-registry>/DockerRegistryImageCachingAgent[1/1]
From this point onwards, Clouddriver will not sync with the registry until it (Clouddriver) has been restarted. Once it has been restarted, it will sync with the registry just fine until this intermittent failure occurs again.
It seems like, if there is a problem with Clouddriver proper, the offending code would be somewhere in the buildCacheResult method of the DockerRegistryImageCachingAgent: https://github.com/spinnaker/clouddriver/blob/874cef3644c826a1932d9670d59ccb77a19e19a2/clouddriver-docker/src/main/groovy/com/netflix/spinnaker/clouddriver/docker/registry/provider/agent/DockerRegistryImageCachingAgent.groovy#L135
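This is not Clouddriver's actual code, but the symptom (the "Describing" log line appears and the "Caching" lines never follow) is consistent with the agent thread blocking inside the registry call between those two log statements. A minimal, self-contained Java sketch of that failure mode, with all names hypothetical: a registry endpoint that accepts the connection but never responds will hang the caller forever unless a read timeout is set.

```java
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URL;

public class AgentHangDemo {
    public static void main(String[] args) throws Exception {
        // A fake "registry" that accepts connections but never sends a
        // response, mimicking a stalled Docker registry endpoint.
        ServerSocket registry = new ServerSocket(0);
        Thread stalled = new Thread(() -> {
            try {
                registry.accept();
                Thread.sleep(60_000); // hold the connection open, say nothing
            } catch (Exception ignored) {
            }
        });
        stalled.setDaemon(true);
        stalled.start();

        // Corresponds to the first log line the agent always prints.
        System.out.println("Describing items in <docker-registry>");

        HttpURLConnection conn = (HttpURLConnection) new URL(
                "http://127.0.0.1:" + registry.getLocalPort() + "/v2/_catalog")
                .openConnection();
        conn.setConnectTimeout(2_000);
        // Without this read timeout, the read below would block
        // indefinitely and the "Caching ..." lines would never appear,
        // matching the symptom in this issue.
        conn.setReadTimeout(2_000);
        try {
            conn.getInputStream().read();
            // Corresponds to the follow-up log lines that stop appearing.
            System.out.println("Caching tagged images in <docker-registry>");
        } catch (SocketTimeoutException e) {
            System.out.println("Registry call timed out: " + e.getMessage());
        } finally {
            registry.close();
        }
    }
}
```

If the real client underneath the agent lacks such a timeout (an assumption, not something confirmed in this issue), a single stalled connection would silently wedge the caching agent until Clouddriver is restarted.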
I have also checked connectivity to our Docker registry during these outages, and I can connect to it just fine while this issue occurs with Clouddriver.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 15
- Comments: 57 (1 by maintainers)
@spinnakerbot remove-label stale
The same issue is present with the SQL backend. The only thing that helps is a daily restart of Clouddriver.
This issue is tagged as ‘stale’ and hasn’t been updated in 45 days, so we are tagging it as ‘to-be-closed’. It will be closed in 45 days unless updates are made. If you want to remove this label, comment:
@kskewes Thank you! I will edit my comment with your suggestion.
Karl, thank you, the suggested restart is even better since there is no interruption of service anymore. The issue is too elusive but painful nonetheless.
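For reference, the zero-downtime restart mentioned above is presumably a Kubernetes rolling restart (`kubectl rollout restart`). Assuming a Halyard install in a `spinnaker` namespace with the default `spin-clouddriver` deployment name (both assumptions, not stated in this thread), a daily restart could be scheduled with a CronJob like this; `batch/v1beta1` matches the EKS v1.18 cluster in this report:

```yaml
# Illustrative only: deployment/namespace names are assumptions.
# The ServiceAccount needs RBAC permission to patch deployments.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: clouddriver-daily-restart
  namespace: spinnaker
spec:
  schedule: "0 4 * * *"   # 04:00 UTC daily
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: clouddriver-restarter
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl
              command:
                - kubectl
                - -n
                - spinnaker
                - rollout
                - restart
                - deployment/spin-clouddriver
```

A rolling restart replaces pods one at a time behind the service, which is why it avoids the interruption a plain delete-and-recreate would cause.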
It seems that upping the resource limits on our Redis instance has solved this issue for us for now. I’m going to consider this issue closed and I can create a new ticket or re-open if I encounter any more issues linked to the caching agent and the Docker registry.
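For anyone hitting the same wall: "upping the resource limits" on the Redis pod would be a fragment like the following in its container spec. The values here are illustrative assumptions, not the ones used in this report; the point is that an under-provisioned Redis backing Clouddriver's cache can be OOM-killed or CPU-throttled under a large tag load (21k+ tags per cycle, per the logs above).

```yaml
# Illustrative only: values are assumptions, not taken from the issue.
resources:
  requests:
    cpu: 500m
    memory: 2Gi
  limits:
    cpu: "1"
    memory: 4Gi
```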
Thank you for your help @german-muzquiz! I will be sure to inform you further and get a thread dump if we run into this again.