argo-cd: Error log on missing cache key

Hello

I'm getting this error log sometimes:

[argo-cd-argocd-server-bf74d748b-2xmgb] time="2020-12-16T03:45:46Z" level=error msg="finished streaming call with code Unknown" error="cache: key is missing" grpc.code=Unknown grpc.method=WatchResourceTree grpc.service=application.ApplicationService grpc.start_time="2020-12-16T03:45:22Z" grpc.time_ms=24038.105 span.kind=server system=grpc

I can see that it is logged at error level. In my opinion, a missing cache key could be logged at warning level instead.

What do you think?

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 4
  • Comments: 27 (9 by maintainers)

Most upvoted comments

(screenshot) This is what happens when a user clicks on an object that should take them to a detail page.

We’re continuing to run into this as well periodically, but restarting the argocd-application-controller pod does not appear to resolve it for us.

I tried 2.5.0 and I still keep getting this error. Any recommendation on how to fix it permanently?

Note: for us, this meant we couldn’t see the resources for the application. This wasn’t some trivial thing. Eventually the problem went away, but it took a long time.

I’m actually seeing this error getting logged as fatal using argocd 1.8.2: level=fatal msg="rpc error: code = Unknown desc = cache: key is missing"

@james-callahan that is correct. The controller tries to minimize the number of writes to Redis and doesn't write the same message twice. The logic to skip the second write is implemented here: https://github.com/argoproj/argo-cd/blob/a08282bf6bcd7b44ea2dd3ef7fa0e2a77498063e/util/cache/twolevelclient.go#L24
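To make the failure mode concrete, here is a minimal sketch of that skip-the-second-write behaviour. This is not the actual argo-cd code; the type and field names below are invented for illustration. The point is that if Redis loses a key (restart, failover) while the in-memory copy still matches the value being set, the external write is skipped and the key is never restored until the value actually changes, which lines up with the reports in this thread that only restarting the application controller clears the error.

// Minimal sketch of a two-level cache that skips the external write when the
// in-memory value is unchanged. Not the actual argo-cd implementation; the
// names are made up for illustration only.
package main

import (
	"fmt"
	"reflect"
	"sync"
)

type twoLevelCache struct {
	mu    sync.Mutex
	local map[string]interface{} // in-memory layer
	redis map[string]interface{} // stand-in for the external (Redis) layer
}

func (c *twoLevelCache) Set(key string, value interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	// Skip the external write if the in-memory copy is already identical.
	if prev, ok := c.local[key]; ok && reflect.DeepEqual(prev, value) {
		return // Redis is never refreshed here, even if it lost the key
	}
	c.local[key] = value
	c.redis[key] = value
}

func main() {
	c := &twoLevelCache{local: map[string]interface{}{}, redis: map[string]interface{}{}}
	c.Set("app/resource-tree", "v1")
	delete(c.redis, "app/resource-tree") // simulate a Redis restart losing the key
	c.Set("app/resource-tree", "v1")     // same value again, so the write is skipped
	_, ok := c.redis["app/resource-tree"]
	fmt.Println("key present in redis after re-set:", ok) // false -> "cache: key is missing"
}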

If “empty secrets” are something affecting this (in newer versions of ArgoCD), here’s an example of how I’ve found such secrets. But as you can see below, there are cases where we don’t create them ourselves and might not have an easy way of just removing them to get ArgoCD to work as it should:

$ kubectl get secrets -A -o json | jq '.items[] | select(has("data") == false)'
{
  "apiVersion": "v1",
  "kind": "Secret",
  "metadata": {
    "creationTimestamp": "2021-06-30T15:06:11Z",
    "labels": {
      "managed-by": "prometheus-operator"
    },
    "name": "alertmanager-monitoring-prometheus-oper-alertmanager-tls-assets",
    "namespace": "monitoring",
    "ownerReferences": [
      {
        "apiVersion": "monitoring.coreos.com/v1",
        "blockOwnerDeletion": true,
        "controller": true,
        "kind": "Alertmanager",
        "name": "monitoring-prometheus-oper-alertmanager",
        "uid": "bd94541f-20a1-4625-bd4e-53262ea30f3a"
      }
    ],
    "resourceVersion": "34611341",
    "uid": "4727ce81-ed44-454e-9272-16a02acc868a"
  },
  "type": "Opaque"
}

Experiencing the same issue.

How to reproduce:

  1. Navigate to the application detail page in the web UI.
  2. Restart the redis pod (non-HA setup).
  3. Click on any app resource.
  4. The message mentioned in https://github.com/argoproj/argo-cd/issues/5068#issuecomment-878594783 is displayed.

Refreshing the page doesn’t help. The issue eventually disappears, but not in minutes, rather in tens of minutes. It can be resolved immediately by restarting the argocd-application-controller pod.

@alexmt As you closed https://github.com/argoproj/argo-cd/issues/6009 as a duplicate, I guess this should be marked as a bug rather than an enhancement. I think there is an issue with the controller reconnecting to Redis when the connection is lost.

Ran into this today. For some reason, in the HA setup, the haproxy health check was only executing checks against the services for servers R0 and R1; R2 was skipped in all cases.

I had to cause a manual sentinel failover:

/usr/local/etc/haproxy $ nc argocd-redis-ha-announce-0 26379
SENTINEL failover argocd

I ran this from the haproxy box because I was checking connectivity anyway. After Sentinel selected and promoted a new Redis master, the checks against the R2 server started to work again, and this issue went away entirely.

I’m new to argocd. I started with 2.7.8 and it worked well for ten days, then I upgraded to 2.8 and hit the same error a day after upgrading. Not sure if it is related.

I downgraded to 2.7.8, and it is back to normal now.

@alexmt: Are you opposed to just removing that line?

And in the interim, can the description be updated to suggest shooting the application controller?

I got this today when I restarted redis. Doing a hard refresh of my application(s) didn’t help. Restarting the application controller fixed it.

I assume the issue I hit is that the application controller checked something in Redis and assumed that, going forward, it would remain in Redis.

Hi @alexmt

In the documentation I can see this:

Argo CD is largely stateless, all data is persisted as Kubernetes objects, which in turn is stored in Kubernetes' etcd. Redis is only used as a throw-away cache and can be lost. When lost, it will be rebuilt without loss of service.

The system keeps working alright when the cache is missing. IMO, the error level is for when the system is not functioning properly and is causing unexpected behavior; printing a warning sounds more suitable.
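For what it’s worth, the log line in the original report (“finished streaming call with code Unknown”, grpc.code, span.kind=server, system=grpc) has the shape produced by go-grpc-middleware’s logrus interceptor, whose default mapping logs the Unknown code at error level. Purely as a sketch of what a warning-level mapping could look like, assuming that interceptor is what the server uses (this is not a proposed Argo CD patch, and note it would demote every Unknown-coded call, not just the cache-miss case):

// Sketch only: override the code-to-level mapping of go-grpc-middleware's
// logrus interceptor so that Unknown completions log as warnings.
package main

import (
	grpc_middleware "github.com/grpc-ecosystem/go-grpc-middleware"
	grpc_logrus "github.com/grpc-ecosystem/go-grpc-middleware/logging/logrus"
	"github.com/sirupsen/logrus"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
)

// codeToLevel demotes Unknown to warning and defers to the default mapping
// for every other status code.
func codeToLevel(code codes.Code) logrus.Level {
	if code == codes.Unknown {
		return logrus.WarnLevel
	}
	return grpc_logrus.DefaultCodeToLevel(code)
}

func newServer(entry *logrus.Entry) *grpc.Server {
	return grpc.NewServer(
		grpc.StreamInterceptor(grpc_middleware.ChainStreamServer(
			grpc_logrus.StreamServerInterceptor(entry, grpc_logrus.WithLevels(codeToLevel)),
		)),
	)
}

func main() {
	_ = newServer(logrus.NewEntry(logrus.New()))
}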