cilium: Unable update CRD identity information with a reference for this node
User report from Slack:
Curious if anyone has seen an error like
"Unable update CRD identity information with a reference for this node" in their cilium pod logs? Might be a red herring. I’m running Cilium on EKS in CNI chaining mode. It’s been fine for a couple of weeks, but today it degraded to a state where if a pod gets scheduled on a node, it gets stuck in ContainerCreating. In the pod’s events, I see errors like: NetworkPlugin cni failed to set up pod "REDACTED" network: unable to create endpoint: Put http:///var/run/cilium/cilium.sock/v1/endpoint/cilium-local:0: context deadline exceeded. If I delete the cilium pod on the node, all other stuck pods get scheduled successfully once the cilium pod comes up again.
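The "Put http:///var/run/cilium/cilium.sock/..." timeout means the CNI plugin never got a response from the cilium-agent REST API on its unix socket. A quick way to check whether that API is hung is to probe it directly; this is a minimal sketch, assuming the agent serves a healthz endpoint at /v1/healthz on the default socket path shown in the log above:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

func main() {
	// Route the HTTP request over the agent's unix socket; the host part of
	// the URL is ignored because the custom DialContext decides where to connect.
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/cilium/cilium.sock")
			},
		},
	}
	resp, err := client.Get("http://localhost/v1/healthz")
	if err != nil {
		// Same symptom the CNI plugin sees: the agent API is not answering.
		fmt.Println("agent API not responding:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```

If this also times out while the cilium pod is still running, the agent itself is stuck rather than the CNI plugin, which matches the observation that restarting the cilium pod unblocks the queued pods.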
I see a lot of this error as well
2019-12-04T23:00:19.707077314Z level=warning msg="Key allocation attempt failed" attempt=11 error="unable to create slave key 'k8s:app=REDACTED;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=REDACTED;k8s:io.kubernetes.pod.namespace=default;k8s:service-type=rest;': Operation cannot be fulfilled on ciliumidentities.cilium.io \"64380\": StorageError: invalid object, Code: 4, Key: /registry/cilium.io/ciliumidentities/64380, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: f6d81d4c-0a28-11ea-b6ee-122bfb3499ba, UID in object meta: " key="[k8s:app=REDACTED k8s:io.cilium.k8s.policy.cluster=default k8s:io.cilium.k8s.policy.serviceaccount=REDACTED k8s:io.kubernetes.pod.namespace=default k8s:service-type=rest]" subsys=allocator
The error message "StorageError: invalid object, Code: 4" seems to come from kubernetes/kube-apiserver. I also notice that the CRD kvstore implementation uses a cache in front of kube-api. I’m wondering if there could be some cache invalidation issues at play here.
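The "Precondition failed: UID in precondition: ..., UID in object meta:" part usually means the write carried a UID from a copy of the object that no longer exists on the apiserver, i.e. the CiliumIdentity was deleted (or recreated) while the client was still working from a stale cached copy. A minimal sketch of a cache-bypassing upsert against the ciliumidentities resource, using the dynamic client; this is illustrative and not the agent's actual allocator code:

```go
package main

import (
	"context"

	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// identityGVR addresses the CiliumIdentity CRD (cilium.io/v2, ciliumidentities).
var identityGVR = schema.GroupVersionResource{
	Group: "cilium.io", Version: "v2", Resource: "ciliumidentities",
}

// upsertIdentity re-reads the live object straight from the apiserver before
// writing, so the update never carries a UID or resourceVersion taken from a
// stale cached copy. If the identity was deleted out from under us, it is
// recreated instead of updated.
func upsertIdentity(ctx context.Context, c dynamic.Interface, desired *unstructured.Unstructured) error {
	live, err := c.Resource(identityGVR).Get(ctx, desired.GetName(), metav1.GetOptions{})
	if errors.IsNotFound(err) {
		_, err = c.Resource(identityGVR).Create(ctx, desired, metav1.CreateOptions{})
		return err
	}
	if err != nil {
		return err
	}
	// Carry over the live object's identity fields so the apiserver's
	// preconditions match what is actually stored in etcd.
	desired.SetUID(live.GetUID())
	desired.SetResourceVersion(live.GetResourceVersion())
	_, err = c.Resource(identityGVR).Update(ctx, desired, metav1.UpdateOptions{})
	return err
}
```

This only addresses the stale-copy symptom, not why the agent's view of the apiserver went stale in the first place.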
So we think we know what causes this issue: Amazon periodically upgrades the “eks-version”, which includes upgrading the control plane to a newer patch version. During this time the control plane is restarted by AWS. These changes are not rolled out to all EKS clusters at the same time; they apparently use some kind of random rollout process. Today two of our clusters got upgraded from eks.2->eks.6. After some pod churn, both clusters started seeing pods fail to start. We also believe a similar upgrade happened around the time when we first saw this issue a few weeks ago (we think that the version went from 1->2).
About this issue
- State: closed
- Created 5 years ago
- Comments: 16 (12 by maintainers)
Commits related to this issue
- Add heartbeat that can kill client connections Adding a periodic heartbeat to check for stale connections to the apiserver. If a heartbeat request times out then client-go's conntrack dialer is used ... — committed to tom-hadlaw-hs/cilium by tom-hadlaw-hs 4 years ago
- Add heartbeat that can kill client connections Adding a periodic heartbeat to check for stale connections to the apiserver. If a heartbeat request times out then client-go's conntrack dialer is used ... — committed to cilium/cilium by tom-hadlaw-hs 4 years ago
- Add heartbeat that can kill client connections [ upstream commit e81979c189e54021da8be8f4fc4b00457f9dc166 ] Adding a periodic heartbeat to check for stale connections to the apiserver. If a heartbea... — committed to cilium/cilium by tom-hadlaw-hs 4 years ago
- Add heartbeat that can kill client connections [ upstream commit e81979c189e54021da8be8f4fc4b00457f9dc166 ] Adding a periodic heartbeat to check for stale connections to the apiserver. If a heartbea... — committed to cilium/cilium by tom-hadlaw-hs 4 years ago
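For reference, the fix described in the commits above is a periodic apiserver heartbeat that force-closes client connections when a probe times out, so the agent stops waiting on TCP connections that died when AWS restarted the control plane. A rough sketch of that idea, assuming client-go's connrotation dialer (what the commit message calls the conntrack dialer) and a /healthz probe; the function names, intervals, and wiring here are illustrative, not the actual Cilium implementation:

```go
package main

import (
	"context"
	"net"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/connrotation"
)

// newClientWithRotation wires a connection-tracking dialer into the rest.Config
// so every TCP connection to the apiserver can be force-closed later.
func newClientWithRotation(cfg *rest.Config) (*kubernetes.Clientset, *connrotation.Dialer, error) {
	dialer := connrotation.NewDialer((&net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
	}).DialContext)
	cfg.Dial = dialer.DialContext
	cs, err := kubernetes.NewForConfig(cfg)
	return cs, dialer, err
}

// heartbeat issues a cheap request to the apiserver on every tick. If the
// probe hits its deadline, all tracked connections are closed so later
// requests re-dial instead of hanging on a half-open connection left behind
// by a control-plane restart.
func heartbeat(ctx context.Context, cs *kubernetes.Clientset, dialer *connrotation.Dialer, interval, timeout time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			probeCtx, cancel := context.WithTimeout(ctx, timeout)
			err := cs.Discovery().RESTClient().Get().AbsPath("/healthz").Do(probeCtx).Error()
			timedOut := probeCtx.Err() == context.DeadlineExceeded
			cancel()
			if err != nil && timedOut {
				dialer.CloseAll()
			}
		}
	}
}
```

Force-closing the tracked connections makes every subsequent request re-dial the apiserver, which is effectively what restarting the cilium pod achieved, minus the pod restart.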
Let me know if we can help with contributing a fix for this; we’d be happy to do so.