cilium: Unable update CRD identity information with a reference for this node

User report from Slack:

Curious if anyone has seen an error like Unable update CRD identity information with a reference for this node in their cilium pod logs? Might be a red herring. I’m running Cilium on EKS in CNI chaining mode. It’s been fine for a couple of weeks, but today it degraded to a state where if a pod gets scheduled on a node, it gets stuck in ContainerCreating. In the pod’s events, I see errors like: NetworkPlugin cni failed to set up pod "REDACTED" network: unable to create endpoint: Put http:///var/run/cilium/cilium.sock/v1/endpoint/cilium-local:0: context deadline exceeded. If I delete the cilium pod on the node, all other stuck pods get scheduled successfully once the cilium pod comes up again.

I see a lot of this error as well:

2019-12-04T23:00:19.707077314Z level=warning msg="Key allocation attempt failed" attempt=11 error="unable to create slave key 'k8s:app=REDACTED;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=REDACTED;k8s:io.kubernetes.pod.namespace=default;k8s:service-type=rest;': Operation cannot be fulfilled on ciliumidentities.cilium.io \"64380\": StorageError: invalid object, Code: 4, Key: /registry/cilium.io/ciliumidentities/64380, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: f6d81d4c-0a28-11ea-b6ee-122bfb3499ba, UID in object meta: " key="[k8s:app=REDACTED k8s:io.cilium.k8s.policy.cluster=default k8s:io.cilium.k8s.policy.serviceaccount=REDACTED k8s:io.kubernetes.pod.namespace=default k8s:service-type=rest]" subsys=allocator

The error message StorageError: invalid object, Code: 4 seems to come from kubernetes/kube-apiserver. I also notice that the crd kvstore implementation uses a cache in front of kube-api, so I’m wondering if there could be some cache invalidation issues at play here.
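For context on the precondition failure above: "Operation cannot be fulfilled on …" is how the apiserver reports a conflict, here because the UID/ResourceVersion the client sent (taken from a cached copy) no longer matches the live object, for example after the identity was deleted and recreated. The Go sketch below is not Cilium's code; it only illustrates the general client-go pattern of re-reading the live object and retrying on conflict. The function name, the use of the dynamic client, and the CiliumIdentity GroupVersionResource are assumptions made for illustration.

// Minimal sketch (illustrative only, not Cilium's implementation) of recovering
// from a stale-cache conflict by re-reading the live object before each update.
package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/retry"
)

// Assumed GVR for the cluster-scoped CiliumIdentity CRD.
var ciliumIdentityGVR = schema.GroupVersionResource{
	Group:    "cilium.io",
	Version:  "v2",
	Resource: "ciliumidentities",
}

// updateIdentity is a hypothetical helper showing the retry-on-conflict pattern.
func updateIdentity(ctx context.Context, cfg *rest.Config, name string) error {
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}
	identities := client.Resource(ciliumIdentityGVR)

	// RetryOnConflict re-runs the closure whenever the apiserver answers with
	// "Operation cannot be fulfilled ..." (a Conflict), which is what the
	// UID/ResourceVersion precondition failure in the log surfaces as.
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Re-read the live object instead of trusting a cached copy; a stale
		// UID or ResourceVersion from a cache is what makes the write fail.
		obj, err := identities.Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			// The identity was deleted (and possibly recreated with a new UID);
			// the caller must re-resolve or re-create it rather than update.
			return fmt.Errorf("identity %s no longer exists: %w", name, err)
		}
		if err != nil {
			return err
		}
		// ... mutate obj here (e.g. refresh labels or owner references) ...
		_, err = identities.Update(ctx, obj, metav1.UpdateOptions{})
		return err
	})
}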

So we think we know what causes this issue: Amazon periodically upgrades the “eks-version”, which includes upgrading the control plane to a newer patch version. During this time the control plane is restarted by AWS. These changes are not rolled out to all EKS clusters at the same time; they apparently follow some kind of random rollout process. Today two of our clusters got upgraded from eks.2->eks.6. After some pod churn, both clusters started seeing pods failing to start. We also believe a similar upgrade happened around the time we first saw this issue a few weeks ago (we think the version went from 1->2).

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 16 (12 by maintainers)

Most upvoted comments

Let me know if we can help with contributing a fix for this, we’d be happy to do so.