spinnaker: CloudDriver Stops Receiving Tags From GCR Until Restarted

Cloud Provider

Google Container Registry

Environment

I am running Spinnaker 1.4.2 deployed onto a small, dedicated Kubernetes 1.8 cluster running in AWS, one pod for each service.

Feature Area

Loading tags from Google Container Registry

Description

Pulling tags from GCR works fine at first, but after a few days it suddenly stops. Calls to Gate to retrieve tags then return a 400 error. Logs from Clouddriver are included below. Recycling the Clouddriver pod fixes the problem for a few more days.

Steps to Reproduce

  1. Configure GCR using a service account per these instructions: https://www.spinnaker.io/setup/providers/docker-registry/#google-container-registry. This includes enabling Google APIs, creating a service account, and using the JSON file to configure access via Halyard.
  2. Use Halyard to deploy the configuration
  3. Observe that Deck will show tags for registries in GCR
  4. Wait a few days and observe that Deck no longer shows tags, and that the browser's Developer Console shows 400 responses for the tag request
  5. Recycle Clouddriver and observe that functionality in Deck is restored
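
Steps 4 and 5 can be checked without opening Deck by replaying the failing request from a script. The following is a minimal sketch, not part of the original report: the URL is whatever tag request the Developer Console shows returning 400 (no specific Gate endpoint is assumed here), the function names are hypothetical, and it assumes Gate is reachable without extra authentication from wherever the script runs.

```python
import urllib.error
import urllib.request


def tag_request_status(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the HTTP status code for a GET on `url`.

    `url` is an assumption: copy the tag request that the browser's
    Developer Console reports as failing. `opener` exists only so the
    function can be exercised without a live Gate.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still what we want.
        return err.code


def tags_look_broken(status):
    """True when the status matches the symptom described in this issue."""
    return status == 400


# Example use (hypothetical URL, replace with the one from your console):
#   status = tag_request_status("https://gate.example.com/<tag request path>")
#   print(status, "broken" if tags_look_broken(status) else "ok")
```

Running this on a schedule would only confirm when the 400s begin; it does not explain why Clouddriver stops loading tags.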

Additional Details

Note: redacted actual Docker registry names below

2017-10-17 14:49:51.043 ERROR 1 --- [tionAction-2905] .d.r.p.a.DockerRegistryImageCachingAgent : Could not load tags for xxxx/yyyyy
2017-10-17 14:49:51.062 ERROR 1 --- [tionAction-2905] .d.r.p.a.DockerRegistryImageCachingAgent : Could not load tags for xxxx/zzzzz
2017-10-17 14:49:51.063  WARN 1 --- [tionAction-2905] n.s.c.d.r.a.v.a.DockerBearerTokenService : Your registry password has trailing whitespace, if this is unintentional authentication will fail.
2017-10-17 14:49:51.236  WARN 1 --- [tionAction-2905] n.s.c.d.r.a.v.a.DockerBearerTokenService : Your registry password has trailing whitespace, if this is unintentional authentication will fail.
2017-10-17 14:49:51.373  WARN 1 --- [tionAction-2905] n.s.c.d.r.a.v.a.DockerBearerTokenService : Your registry password has trailing whitespace, if this is unintentional authentication will fail.
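
To see which repositories are affected when this happens, the Clouddriver log can be summarized with a short script. This is a rough sketch that assumes log lines follow the same layout as the excerpt above; the regular expression and the function name `failed_repositories` are illustrative and not taken from Clouddriver.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpt above:
# "<date> <time> <LEVEL> <pid> --- [<thread>] <logger> : <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\d+\s+---\s+\[(?P<thread>[^\]]*)\]\s+"
    r"(?P<logger>\S+)\s+:\s+(?P<msg>.*)$"
)
FAILED_PREFIX = "Could not load tags for "


def failed_repositories(lines):
    """Count 'Could not load tags for <repo>' ERROR lines per repository."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match or match.group("level") != "ERROR":
            continue
        message = match.group("msg")
        if message.startswith(FAILED_PREFIX):
            counts[message[len(FAILED_PREFIX):].strip()] += 1
    return counts
```

For example, `failed_repositories(open("clouddriver.log"))` (the file name is an assumption) returns a `Counter` keyed by repository name, which makes it easier to tell whether every repository on an account fails or only some.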

We had also noticed the issue previously when running in LocalDebian mode, though we didn’t investigate it fully or collect logs at that time.

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 13
  • Comments: 40 (14 by maintainers)

Most upvoted comments

I’ve also experienced issues with GKE+GCR.

TWIMC, I’ve been running a patched version of clouddriver (https://github.com/brantburnett/clouddriver/commit/188fb31f9e5e39e15e4f86916db296be8f54278a) in our Spinnaker 1.8 environment for the last 3 days and we have not had to recycle our pod in that time. That’s not a definitive result, but it is encouraging.

I can confirm that adding the extra roles did not resolve the issue for us. We’re currently running 1.7.2.

This seems to be a bug in GCR; here is the fix: https://github.com/spinnaker/clouddriver/pull/2215

@lwander I haven’t been able to find any suspicious log messages in my deployment (running via Kubernetes on GKE). However, right now I have to kick the clouddriver pod every couple of days to keep it communicating with our GCR. As others mentioned above, it only affects certain “applications” when it happens, even when two or more applications use the same Docker registry account. One application keeps working, and I can manually choose a tag to deploy, while another application loses the ability to get tags from that same registry, and I can no longer deploy it manually either (the gear spins while trying to get the tags). Each time I kick the clouddriver pod, it fixes it.

https://github.com/spinnaker/spinnaker.github.io/pull/585

I found and followed the instructions here for the service account roles… have not seen the issue again.

UPDATE:

IGNORE ABOVE… this does not resolve anything and we still see the issue and have to restart clouddriver daily.