spinnaker: CloudDriver Stops Receiving Tags From GCR Until Restarted
Cloud Provider
Google Container Registry
Environment
I am running Spinnaker 1.4.2, deployed onto a small, dedicated Kubernetes 1.8 cluster running in AWS, with one pod for each service.
Feature Area
Loading tags from Google Container Registry
Description
Pulling tags from GCR works fine at first, but after a few days it suddenly stops. Calls to Gate to retrieve tags then return a 400 error. Clouddriver logs are included below. Recycling the Clouddriver pod fixes the problem for a few more days.
Steps to Reproduce
- Configure GCR using a service account per these instructions: https://www.spinnaker.io/setup/providers/docker-registry/#google-container-registry. This includes enabling the Google APIs, creating a service account, and using its JSON key file to configure access via Halyard (see the sketch after this list).
- Use Halyard to deploy the configuration
- Observe that Deck shows tags for repositories in GCR
- Wait a few days and observe that Deck no longer shows tags, and that the browser developer console shows 400 responses for the tag requests
- Recycle Clouddriver and observe that functionality in Deck is restored
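For reference, a minimal sketch of the Halyard configuration described in step one, assuming a placeholder account name and key path (not taken from the original setup):

```
# Enable the docker-registry provider and add a GCR account that
# authenticates with the service-account JSON key.
hal config provider docker-registry enable

hal config provider docker-registry account add my-gcr-account \
  --address gcr.io \
  --username _json_key \
  --password-file /path/to/gcr-service-account.json

# Push the configuration to the running deployment.
hal deploy apply
```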
Additional Details
Note: actual Docker registry names are redacted below.
2017-10-17 14:49:51.043 ERROR 1 --- [tionAction-2905] .d.r.p.a.DockerRegistryImageCachingAgent : Could not load tags for xxxx/yyyyy
2017-10-17 14:49:51.062 ERROR 1 --- [tionAction-2905] .d.r.p.a.DockerRegistryImageCachingAgent : Could not load tags for xxxx/zzzzz
2017-10-17 14:49:51.063 WARN 1 --- [tionAction-2905] n.s.c.d.r.a.v.a.DockerBearerTokenService : Your registry password has trailing whitespace, if this is unintentional authentication will fail.
2017-10-17 14:49:51.236 WARN 1 --- [tionAction-2905] n.s.c.d.r.a.v.a.DockerBearerTokenService : Your registry password has trailing whitespace, if this is unintentional authentication will fail.
2017-10-17 14:49:51.373 WARN 1 --- [tionAction-2905] n.s.c.d.r.a.v.a.DockerBearerTokenService : Your registry password has trailing whitespace, if this is unintentional authentication will fail.
We had also noticed the issue previously when running in Local Debian mode, though we did not investigate it thoroughly or collect logs at that time.
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 13
- Comments: 40 (14 by maintainers)
Commits related to this issue
- [WIP] fix(provider/docker): Clear docker token cache after 401 If a 401 is received from the Docker registry, the cached token is cleared so it isn't reused if a new token isn't acquired. Subsequent ... — committed to brantburnett/clouddriver by brantburnett 6 years ago
- fix(provider/docker): Clear docker token cache after 401 If a 401 is received from the Docker registry, the cached token is cleared so it isn't reused if a new token isn't acquired. Subsequent reques... — committed to brantburnett/clouddriver by brantburnett 6 years ago
- fix(provider/docker): Clear docker token cache after 401 If a 401 is received from the Docker registry, the cached token is cleared so it isn't reused if a new token isn't acquired. Subsequent reques... — committed to brantburnett/clouddriver by brantburnett 6 years ago
- fix(provider/docker): Clear docker token cache after 401 (#2817) If a 401 is received from the Docker registry, the cached token is cleared so it isn't reused if a new token isn't acquired. Subseque... — committed to spinnaker/clouddriver by brantburnett 6 years ago
- fix(provider/docker): Clear docker token cache after 401 (#2817) If a 401 is received from the Docker registry, the cached token is cleared so it isn't reused if a new token isn't acquired. Subseque... — committed to bolcom/clouddriver by brantburnett 6 years ago
- fix(provider/docker): Clear docker token cache after 401 If a 401 is received from the Docker registry, the cached token is cleared so it isn't reused if a new token isn't acquired. Subsequent reques... — committed to brantburnett/clouddriver by brantburnett 6 years ago
- fix(provider/docker): Clear docker token cache after 401 (#2888) If a 401 is received from the Docker registry, the cached token is cleared so it isn't reused if a new token isn't acquired. Subseque... — committed to spinnaker/clouddriver by brantburnett 6 years ago
- fix(provider/docker): Clear docker token cache after 401 (#2818) If a 401 is received from the Docker registry, the cached token is cleared so it isn't reused if a new token isn't acquired. Subseque... — committed to spinnaker/clouddriver by brantburnett 6 years ago
- fix(provider/docker): Clear docker token cache after 401 (#2818) If a 401 is received from the Docker registry, the cached token is cleared so it isn't reused if a new token isn't acquired. Subseque... — committed to clanesf/clouddriver by brantburnett 6 years ago
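The commits above all describe the same approach: when the registry returns a 401, clear the cached bearer token so a stale token is not reused, forcing re-authentication on the next request. A minimal sketch of that pattern follows; it is not the actual clouddriver code, and the class and method names here are illustrative only:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of a per-repository bearer-token cache that is
// invalidated when the registry answers with 401 Unauthorized.
class BearerTokenCache {
  private final Map<String, String> tokensByRepository = new ConcurrentHashMap<>();

  String getToken(String repository) {
    // Request a fresh token only when nothing is cached for this repository.
    return tokensByRepository.computeIfAbsent(repository, this::requestNewToken);
  }

  void onResponse(String repository, int httpStatus) {
    // The core of the fix: a 401 means the cached token is stale or invalid,
    // so drop it; the next getToken() call re-authenticates instead of
    // reusing the dead token forever.
    if (httpStatus == 401) {
      tokensByRepository.remove(repository);
    }
  }

  private String requestNewToken(String repository) {
    // Placeholder for the real token exchange against the registry's auth
    // endpoint; omitted because it is not the point of this sketch.
    return "<token-from-auth-endpoint-for-" + repository + ">";
  }
}
```

Without the 401 handling, a token that expires or is rejected stays in the cache indefinitely, which matches the symptom of tags loading fine until, days later, they silently stop until Clouddriver is restarted.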
I’ve also experienced issues with GKE+GCR.
TWIMC, I’ve been running a patched version of clouddriver (https://github.com/brantburnett/clouddriver/commit/188fb31f9e5e39e15e4f86916db296be8f54278a) in our Spinnaker 1.8 environment for the last 3 days and we have not had to recycle our pod in that time. That’s not a definitive result, but it is encouraging.
I can confirm that adding the extra roles did not resolve the issue for us. We’re currently running 1.7.2.
It seems this is a bug in the GCR provider; here is the fix: https://github.com/spinnaker/clouddriver/pull/2215
@lwander I haven’t been able to find any suspicious log messages in my deployment (running via Kubernetes on GKE). However, right now I have to kick the Clouddriver pod every couple of days to keep it communicating with our GCR. As others mentioned above, when it happens it only affects certain “applications”, even though two or more applications use the same Docker registry account: one application keeps working and I can manually choose a tag to deploy, while another application loses the ability to get tags from that same registry and I can no longer manually deploy it either (just a spinning gear while it tries to fetch the tags). Each time I kick the Clouddriver pod, it fixes it.
https://github.com/spinnaker/spinnaker.github.io/pull/585
I found and followed the instructions here for the service account roles… and have not seen the issue again.
UPDATE:
IGNORE THE ABOVE… this does not resolve anything; we still see the issue and have to restart Clouddriver daily.