argo-cd: Error log on missing cache key
Hello
I am getting this error log sometimes:
[argo-cd-argocd-server-bf74d748b-2xmgb] time="2020-12-16T03:45:46Z" level=error msg="finished streaming call with code Unknown" error="cache: key is missing" grpc.code=Unknown grpc.method=WatchResourceTree grpc.service=application.ApplicationService grpc.start_time="2020-12-16T03:45:22Z" grpc.time_ms=24038.105 span.kind=server system=grpc
I can see that it is logged at error level. In my opinion, a missing cache key could be logged at warning level instead.
What do you think?
About this issue
- State: open
- Created 4 years ago
- Reactions: 4
- Comments: 27 (9 by maintainers)
This is what happens when a user clicks on an object that should take them to a detail page.
We’re continuing to run into this as well periodically, but restarting the argocd-application-controller pod does not appear to resolve it for us.
I tried 2.5.0 and I still keep getting this error. Any recommendation on how to fix it permanently?
Note: for us, this meant we couldn’t see the resources for the application. This wasn’t some trivial thing. Eventually the problem went away, but it took a long time.
I’m actually seeing this error getting logged as fatal using argocd 1.8.2:
level=fatal msg="rpc error: code = Unknown desc = cache: key is missing"
@james-callahan that is correct. The controller tries to minimize the number of writes to redis and doesn’t write the same message twice. The logic to skip the second write is implemented here: https://github.com/argoproj/argo-cd/blob/a08282bf6bcd7b44ea2dd3ef7fa0e2a77498063e/util/cache/twolevelclient.go#L24
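To make the write-skipping behaviour described above easier to picture, here is a minimal sketch in Go. It is not the Argo CD code from the linked file: the names store, twoLevel, newTwoLevel and the key "app|tree" are all hypothetical, and a plain map stands in for redis. It only illustrates the general idea of remembering the last written value locally and skipping an identical write, plus what could happen if the external store loses its data while the local copy survives.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrCacheMiss mirrors the "cache: key is missing" text seen in the logs.
var ErrCacheMiss = errors.New("cache: key is missing")

// store is a hypothetical stand-in for an external cache such as redis.
type store map[string]string

// twoLevel remembers every value it has written so that it can skip
// writing an unchanged value to the external store a second time.
type twoLevel struct {
	local    map[string]string
	external store
	writes   int // number of writes that actually reached the external store
}

func newTwoLevel(external store) *twoLevel {
	return &twoLevel{local: map[string]string{}, external: external}
}

// Set skips the external write when the local copy already holds the value.
func (c *twoLevel) Set(key, value string) {
	if prev, ok := c.local[key]; ok && prev == value {
		return
	}
	c.local[key] = value
	c.external[key] = value
	c.writes++
}

// Get reads from the external store only, as a separate reader process would.
func (c *twoLevel) Get(key string) (string, error) {
	v, ok := c.external[key]
	if !ok {
		return "", ErrCacheMiss
	}
	return v, nil
}

func main() {
	external := store{}
	c := newTwoLevel(external)

	c.Set("app|tree", "v1")
	c.Set("app|tree", "v1") // skipped: value unchanged
	fmt.Println("external writes:", c.writes)

	// Simulate the external store losing its data, for example on a restart.
	delete(external, "app|tree")
	c.Set("app|tree", "v1") // still skipped, so the key is not restored

	_, err := c.Get("app|tree")
	fmt.Println("read after data loss:", err)
}
```

In this sketch the writer keeps skipping the write until the value changes or the writer itself is restarted, which is one way a reader could keep seeing a missing key. Whether this matches the real cause in Argo CD is an assumption, not something the sketch proves.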
If “empty secret” is something affecting this (in newer versions of ArgoCD), here’s an example of how I’ve found such secrets (but as you can see here; there are cases where we don’t create them and might not have an easy way of just “removing them to get ArgoCD to work as it should”):
Experiencing the same issue.
How to reproduce:
Refreshing the page doesn’t help. The issue eventually disappears, but not in minutes, rather in tens of minutes. It can be resolved immediately by restarting the argocd-application-controller pod.
@alexmt As you closed https://github.com/argoproj/argo-cd/issues/6009 as a duplicate, I guess this should be marked as a bug, not an enhancement. I think there is an issue reconnecting to redis from the controller when the connection is lost.
Ran into this today. For some reason, in the HA setup, the HAProxy health check was only running against the svc for servers R0 and R1; R2 was skipped in every case.
I had to cause a manual sentinel failover:
I ran this from the ha proxy box because I was checking connectivity anyways. After sentinel selected and promoted a new redis master, the checks against the R2 server started to work again, and this issue went away entirely.
I downgraded to 2.7.8, and it is back to normal now.
@alexmt: Are you opposed to just removing that line?
And in the interim, can the description be updated to suggest shooting the application controller?
I got this today when I restarted redis. Doing a hard refresh of my application(s) didn’t help. Restarting the application controller fixed it.
I assume the issue I hit is that the application controller checked something in redis and is assuming that going forward it remains in redis.
Hi @alexmt
In the documentation I could see this
The system keeps working fine when the cache is missing. In my opinion, error level should be reserved for cases where the system is not functioning properly and behaves unexpectedly. Warning level sounds more suitable here.