spinnaker: CloudDriver Stops Receiving Tags From GCR Until Restarted
Cloud Provider
Google Container Registry
Environment
I am running Spinnaker 1.4.2, deployed onto a small, dedicated Kubernetes 1.8 cluster running in AWS, with one pod for each service.
Feature Area
Loading tags from Google Container Registry
Description
Pulling tags from GCR works fine at first, but after a few days it suddenly stops. Calls to Gate to retrieve tags then return a 400 error. Clouddriver logs are included below. Recycling the Clouddriver pod fixes the problem for a few more days.
Steps to Reproduce
- Configure GCR using a service account per these instructions: https://www.spinnaker.io/setup/providers/docker-registry/#google-container-registry. This includes enabling the Google APIs, creating a service account, and using its JSON key file to configure access via Halyard (see the sketch after this list).
- Use Halyard to deploy the configuration
- Observe that Deck shows tags for repositories in GCR
- Wait a few days and observe that Deck no longer shows tags, and that the browser developer console shows 400 responses for the tag requests
- Recycle Clouddriver and observe that functionality in Deck is restored
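For reference, a minimal sketch of the Halyard configuration described in step one, assuming a placeholder account name and key path (not taken from the original setup):

```
# Enable the docker-registry provider and add a GCR account that
# authenticates with the service-account JSON key.
hal config provider docker-registry enable

hal config provider docker-registry account add my-gcr-account \
  --address gcr.io \
  --username _json_key \
  --password-file /path/to/gcr-service-account.json

# Push the configuration to the running deployment.
hal deploy apply
```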
Additional Details
Note: actual Docker registry names are redacted below.
2017-10-17 14:49:51.043 ERROR 1 --- [tionAction-2905] .d.r.p.a.DockerRegistryImageCachingAgent : Could not load tags for xxxx/yyyyy
2017-10-17 14:49:51.062 ERROR 1 --- [tionAction-2905] .d.r.p.a.DockerRegistryImageCachingAgent : Could not load tags for xxxx/zzzzz
2017-10-17 14:49:51.063 WARN 1 --- [tionAction-2905] n.s.c.d.r.a.v.a.DockerBearerTokenService : Your registry password has trailing whitespace, if this is unintentional authentication will fail.
2017-10-17 14:49:51.236 WARN 1 --- [tionAction-2905] n.s.c.d.r.a.v.a.DockerBearerTokenService : Your registry password has trailing whitespace, if this is unintentional authentication will fail.
2017-10-17 14:49:51.373 WARN 1 --- [tionAction-2905] n.s.c.d.r.a.v.a.DockerBearerTokenService : Your registry password has trailing whitespace, if this is unintentional authentication will fail.
We had also noticed the issue previously when running in Local Debian mode, though we did not investigate it thoroughly or collect logs at that time.
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 13
- Comments: 40 (14 by maintainers)
Commits related to this issue
- [WIP] fix(provider/docker): Clear docker token cache after 401 If a 401 is received from the Docker registry, the cached token is cleared so it isn't reused if a new token isn't acquired. Subsequent ... — committed to brantburnett/clouddriver by brantburnett 6 years ago
- fix(provider/docker): Clear docker token cache after 401 If a 401 is received from the Docker registry, the cached token is cleared so it isn't reused if a new token isn't acquired. Subsequent reques... — committed to brantburnett/clouddriver by brantburnett 6 years ago
- fix(provider/docker): Clear docker token cache after 401 If a 401 is received from the Docker registry, the cached token is cleared so it isn't reused if a new token isn't acquired. Subsequent reques... — committed to brantburnett/clouddriver by brantburnett 6 years ago
- fix(provider/docker): Clear docker token cache after 401 (#2817) If a 401 is received from the Docker registry, the cached token is cleared so it isn't reused if a new token isn't acquired. Subseque... — committed to spinnaker/clouddriver by brantburnett 6 years ago
- fix(provider/docker): Clear docker token cache after 401 (#2817) If a 401 is received from the Docker registry, the cached token is cleared so it isn't reused if a new token isn't acquired. Subseque... — committed to bolcom/clouddriver by brantburnett 6 years ago
- fix(provider/docker): Clear docker token cache after 401 If a 401 is received from the Docker registry, the cached token is cleared so it isn't reused if a new token isn't acquired. Subsequent reques... — committed to brantburnett/clouddriver by brantburnett 6 years ago
- fix(provider/docker): Clear docker token cache after 401 (#2888) If a 401 is received from the Docker registry, the cached token is cleared so it isn't reused if a new token isn't acquired. Subseque... — committed to spinnaker/clouddriver by brantburnett 6 years ago
- fix(provider/docker): Clear docker token cache after 401 (#2818) If a 401 is received from the Docker registry, the cached token is cleared so it isn't reused if a new token isn't acquired. Subseque... — committed to spinnaker/clouddriver by brantburnett 6 years ago
- fix(provider/docker): Clear docker token cache after 401 (#2818) If a 401 is received from the Docker registry, the cached token is cleared so it isn't reused if a new token isn't acquired. Subseque... — committed to clanesf/clouddriver by brantburnett 6 years ago
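The commits above all describe the same approach: when the registry returns a 401, clear the cached bearer token so a stale token is not reused, forcing re-authentication on the next request. A minimal sketch of that pattern follows; it is not the actual clouddriver code, and the class and method names here are illustrative only:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of a per-repository bearer-token cache that is
// invalidated when the registry answers with 401 Unauthorized.
class BearerTokenCache {
  private final Map<String, String> tokensByRepository = new ConcurrentHashMap<>();

  String getToken(String repository) {
    // Request a fresh token only when nothing is cached for this repository.
    return tokensByRepository.computeIfAbsent(repository, this::requestNewToken);
  }

  void onResponse(String repository, int httpStatus) {
    // The core of the fix: a 401 means the cached token is stale or invalid,
    // so drop it; the next getToken() call re-authenticates instead of
    // reusing the dead token forever.
    if (httpStatus == 401) {
      tokensByRepository.remove(repository);
    }
  }

  private String requestNewToken(String repository) {
    // Placeholder for the real token exchange against the registry's auth
    // endpoint; omitted because it is not the point of this sketch.
    return "<token-from-auth-endpoint-for-" + repository + ">";
  }
}
```

Without the 401 handling, a token that expires or is rejected stays in the cache indefinitely, which matches the symptom of tags loading fine until, days later, they silently stop until Clouddriver is restarted.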
I’ve also experienced issues with GKE+GCR.
TWIMC, I’ve been running a patched version of clouddriver (https://github.com/brantburnett/clouddriver/commit/188fb31f9e5e39e15e4f86916db296be8f54278a) in our Spinnaker 1.8 environment for the last 3 days and we have not had to recycle our pod in that time. That’s not a definitive result, but it is encouraging.
I can confirm that adding the extra roles did not resolve the issue for us. We’re currently running 1.7.2.
It seems this is a bug in the GCR provider; here is the fix: https://github.com/spinnaker/clouddriver/pull/2215
@lwander I haven’t been able to find any suspicious log messages in my deployment (running via Kubernetes on GKE). However, right now I have to kick the Clouddriver pod every couple of days to keep it communicating with our GCR. As others mentioned above, when it happens it only affects certain “applications”, even though two or more applications use the same Docker registry account: one application keeps working and I can manually choose a tag to deploy, while another application loses the ability to get tags from that same registry and I can no longer manually deploy it either (just a spinning gear while it tries to fetch the tags). Each time I kick the Clouddriver pod, it fixes it.
https://github.com/spinnaker/spinnaker.github.io/pull/585
I found and followed the instructions here for the service account roles… and have not seen the issue again.
UPDATE:
IGNORE THE ABOVE… this does not resolve anything; we still see the issue and have to restart Clouddriver daily.