spinnaker: Clouddriver: stops syncing Docker registry at intermittent intervals
Issue Summary:
Potentially related issue: #4780
Clouddriver will stop syncing with our Docker registry at intermittent intervals. There is no error message logged by Clouddriver, even after setting the log level to TRACE.
Cloud Provider(s):
AWS EKS
Environment:
- EKS v1.18
- Halyard 1.40.0
- Spinnaker 1.23.3
- Clouddriver 7.0.3
Feature Area:
Clouddriver
Description:
Steps to Reproduce:
Additional Details:
When working correctly, Clouddriver outputs an info message stating that it is going to start describing the cached items in the registry, followed by the number of IDs and tags it grabs:
2020-12-14 07:01:10.838 INFO 1 --- [cutionAction-29] .d.r.p.a.DockerRegistryImageCachingAgent : Describing items in <docker-registry>/DockerRegistryImageCachingAgent[1/1]
2020-12-14 07:01:13.103 INFO 1 --- [cutionAction-29] .d.r.p.a.DockerRegistryImageCachingAgent : Caching 21174 tagged images in <docker-registry>/DockerRegistryImageCachingAgent[1/1]
2020-12-14 07:01:13.103 INFO 1 --- [cutionAction-29] .d.r.p.a.DockerRegistryImageCachingAgent : Caching 21174 image ids in <docker-registry>/DockerRegistryImageCachingAgent[1/1]
However, I’ve noticed that after Clouddriver stops syncing the registry, the messages for the cached tags and IDs are no longer printed, and we only get the message from the describe call:
2020-12-13 08:09:57.750 INFO 1 --- [utionAction-116] .d.r.p.a.DockerRegistryImageCachingAgent : Describing items in <docker-registry>/DockerRegistryImageCachingAgent[1/1]
From this point onwards, Clouddriver will not sync with the registry until it (Clouddriver) has been restarted. Once it has been restarted, it will sync with the registry just fine until this intermittent failure occurs again.
It seems like, if there is a problem with Clouddriver proper, the offending code would be somewhere in the buildCacheResult method of the DockerRegistryImageCachingAgent: https://github.com/spinnaker/clouddriver/blob/874cef3644c826a1932d9670d59ccb77a19e19a2/clouddriver-docker/src/main/groovy/com/netflix/spinnaker/clouddriver/docker/registry/provider/agent/DockerRegistryImageCachingAgent.groovy#L135
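This is not Clouddriver's actual code, but the symptom (the "Describing" log line appears and the "Caching" lines never follow) is consistent with the agent thread blocking inside the registry call between those two log statements. A minimal, self-contained Java sketch of that failure mode, with all names hypothetical: a registry endpoint that accepts the connection but never responds will hang the caller forever unless a read timeout is set.

```java
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URL;

public class AgentHangDemo {
    public static void main(String[] args) throws Exception {
        // A fake "registry" that accepts connections but never sends a
        // response, mimicking a stalled Docker registry endpoint.
        ServerSocket registry = new ServerSocket(0);
        Thread stalled = new Thread(() -> {
            try {
                registry.accept();
                Thread.sleep(60_000); // hold the connection open, say nothing
            } catch (Exception ignored) {
            }
        });
        stalled.setDaemon(true);
        stalled.start();

        // Corresponds to the first log line the agent always prints.
        System.out.println("Describing items in <docker-registry>");

        HttpURLConnection conn = (HttpURLConnection) new URL(
                "http://127.0.0.1:" + registry.getLocalPort() + "/v2/_catalog")
                .openConnection();
        conn.setConnectTimeout(2_000);
        // Without this read timeout, the read below would block
        // indefinitely and the "Caching ..." lines would never appear,
        // matching the symptom in this issue.
        conn.setReadTimeout(2_000);
        try {
            conn.getInputStream().read();
            // Corresponds to the follow-up log lines that stop appearing.
            System.out.println("Caching tagged images in <docker-registry>");
        } catch (SocketTimeoutException e) {
            System.out.println("Registry call timed out: " + e.getMessage());
        } finally {
            registry.close();
        }
    }
}
```

If the real client underneath the agent lacks such a timeout (an assumption, not something confirmed in this issue), a single stalled connection would silently wedge the caching agent until Clouddriver is restarted.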
I have also checked connectivity to our Docker registry during these outages, and I can connect to it just fine while this issue occurs with Clouddriver.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 15
- Comments: 57 (1 by maintainers)
@spinnakerbot remove-label stale
The same issue is present with the SQL backend. The only thing that helps is a daily restart of Clouddriver.
This issue is tagged as ‘stale’ and hasn’t been updated in 45 days, so we are tagging it as ‘to-be-closed’. It will be closed in 45 days unless updates are made. If you want to remove this label, comment:
@kskewes Thank you! I will edit my comment with your suggestion.
Karl, thank you, the suggested restart is even better since there is no interruption of service anymore. The issue is too elusive but painful nonetheless.
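For reference, the zero-downtime restart mentioned above is presumably a Kubernetes rolling restart (`kubectl rollout restart`). Assuming a Halyard install in a `spinnaker` namespace with the default `spin-clouddriver` deployment name (both assumptions, not stated in this thread), a daily restart could be scheduled with a CronJob like this; `batch/v1beta1` matches the EKS v1.18 cluster in this report:

```yaml
# Illustrative only: deployment/namespace names are assumptions.
# The ServiceAccount needs RBAC permission to patch deployments.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: clouddriver-daily-restart
  namespace: spinnaker
spec:
  schedule: "0 4 * * *"   # 04:00 UTC daily
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: clouddriver-restarter
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl
              command:
                - kubectl
                - -n
                - spinnaker
                - rollout
                - restart
                - deployment/spin-clouddriver
```

A rolling restart replaces pods one at a time behind the service, which is why it avoids the interruption a plain delete-and-recreate would cause.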
It seems that upping the resource limits on our Redis instance has solved this issue for us for now. I’m going to consider this issue closed and I can create a new ticket or re-open if I encounter any more issues linked to the caching agent and the Docker registry.
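For anyone hitting the same wall: "upping the resource limits" on the Redis pod would be a fragment like the following in its container spec. The values here are illustrative assumptions, not the ones used in this report; the point is that an under-provisioned Redis backing Clouddriver's cache can be OOM-killed or CPU-throttled under a large tag load (21k+ tags per cycle, per the logs above).

```yaml
# Illustrative only: values are assumptions, not taken from the issue.
resources:
  requests:
    cpu: 500m
    memory: 2Gi
  limits:
    cpu: "1"
    memory: 4Gi
```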
Thank you for your help @german-muzquiz! I will be sure to inform you further and get a thread dump if we run into this again.