cilium: Endpoint contains wrong Pod IPv4 address
Is there an existing issue for this?
- I have searched the existing issues
What happened?
We observed dropped traffic between Pods for which the corresponding NetworkPolicy should actually allow traffic. Upon investigation we figured out that only one specific Pod (from a StatefulSet) was affected, and that the problem is caused by Cilium storing a wrong IPv4 address in the Pod’s CiliumEndpoint object.
The CiliumEndpoint’s status.networking.addressing.ipv4 field contained a wrong IPv4 address (10.8.142.14) that does not match the actual IPv4 address of the Pod (10.8.2.32).
We have not yet figured out whether we can reliably reproduce this; right now it looks more like it is caused by a (rare) race condition (maybe related to StatefulSet specifics).
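For anyone hitting a similar mismatch: since the CiliumEndpoint carries the same name as its Pod, affected Pods can be spotted by comparing the Pod IP reported by Kubernetes with the IPv4 stored in the matching CEP. A minimal sketch of such a check (not part of the original report, namespace hard-coded to the one from this issue):

# Compare each Pod's IP with the IPv4 stored in the CiliumEndpoint of the same name.
NS=grafana-agent-system
for pod in $(kubectl -n "$NS" get pods -o jsonpath='{.items[*].metadata.name}'); do
  pod_ip=$(kubectl -n "$NS" get pod "$pod" -o jsonpath='{.status.podIP}')
  cep_ip=$(kubectl -n "$NS" get cep "$pod" -o jsonpath='{.status.networking.addressing[0].ipv4}' 2>/dev/null)
  [ "$pod_ip" != "$cep_ip" ] && echo "MISMATCH: $pod pod=$pod_ip cep=$cep_ip"
done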
Endpoint
apiVersion: cilium.io/v2
kind: CiliumEndpoint
metadata:
  creationTimestamp: '2022-05-23T11:39:49Z'
  generation: 2
  labels:
    app.kubernetes.io/instance: grafana-agent-metrics
    app.kubernetes.io/managed-by: grafana-agent-operator
    app.kubernetes.io/name: grafana-agent
    app.kubernetes.io/version: v0.23.0
    controller-revision-hash: grafana-agent-metrics-shard-1-788dbb9c87
    grafana-agent: grafana-agent-metrics
    nx-k8s-topology-id: 8a8f0140-b805-44db-9dfe-625c6b5df899
    operator.agent.grafana.com/name: grafana-agent-metrics
    operator.agent.grafana.com/shard: '1'
    operator.agent.grafana.com/type: metrics
    statefulset.kubernetes.io/pod-name: grafana-agent-metrics-shard-1-0
  name: grafana-agent-metrics-shard-1-0
  namespace: grafana-agent-system
  ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      kind: Pod
      name: grafana-agent-metrics-shard-1-0
      uid: 296f1523-ceac-4906-b4e1-0b8f4c4d5c01
  resourceVersion: '456543044'
  uid: 1c88af6b-67c0-4043-85f7-b821165f00ab
  selfLink: >-
    /apis/cilium.io/v2/namespaces/grafana-agent-system/ciliumendpoints/grafana-agent-metrics-shard-1-0
status:
  encryption: {}
  external-identifiers:
    container-id: 1fd8732c92dbf8283a2ecd25908604c4db0420965ee7e4f0eaba1870002e017a
    k8s-namespace: grafana-agent-system
    k8s-pod-name: grafana-agent-metrics-shard-1-0
    pod-name: grafana-agent-system/grafana-agent-metrics-shard-1-0
  id: 873
  identity:
    id: 54072
    labels:
      - k8s:app.kubernetes.io/instance=grafana-agent-metrics
      - k8s:app.kubernetes.io/managed-by=grafana-agent-operator
      - k8s:app.kubernetes.io/name=grafana-agent
      - k8s:app.kubernetes.io/version=v0.23.0
      - >-
        k8s:io.cilium.k8s.namespace.labels.grafana-agent-system.tree.hnc.x-k8s.io/depth=0
      - >-
        k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=grafana-agent-system
      - >-
        k8s:io.cilium.k8s.namespace.labels.kustomize.toolkit.fluxcd.io/name=grafana-agent
      - >-
        k8s:io.cilium.k8s.namespace.labels.kustomize.toolkit.fluxcd.io/namespace=kube-system
      - k8s:io.cilium.k8s.namespace.labels.scheduling.nexxiot.com/fargate=false
      - k8s:io.cilium.k8s.policy.cluster=default
      - k8s:io.cilium.k8s.policy.serviceaccount=grafana-agent
      - k8s:io.kubernetes.pod.namespace=grafana-agent-system
  named-ports:
    - name: http-metrics
      port: 8080
      protocol: TCP
  networking:
    addressing:
      - ipv4: 10.8.142.14
    node: 10.8.3.174
  state: ready
Pod
apiVersion: v1
kind: Pod
metadata:
  name: grafana-agent-metrics-shard-1-0
  generateName: grafana-agent-metrics-shard-1-
  namespace: grafana-agent-system
  uid: 296f1523-ceac-4906-b4e1-0b8f4c4d5c01
  resourceVersion: '456543083'
  creationTimestamp: '2022-05-23T11:39:48Z'
  labels:
    app.kubernetes.io/instance: grafana-agent-metrics
    app.kubernetes.io/managed-by: grafana-agent-operator
    app.kubernetes.io/name: grafana-agent
    app.kubernetes.io/version: v0.23.0
    controller-revision-hash: grafana-agent-metrics-shard-1-788dbb9c87
    grafana-agent: grafana-agent-metrics
    nx-k8s-topology-id: 8a8f0140-b805-44db-9dfe-625c6b5df899
    operator.agent.grafana.com/name: grafana-agent-metrics
    operator.agent.grafana.com/shard: '1'
    operator.agent.grafana.com/type: metrics
    statefulset.kubernetes.io/pod-name: grafana-agent-metrics-shard-1-0
  annotations:
    kubectl.kubernetes.io/default-container: grafana-agent
    kubernetes.io/psp: k8s.privileged-host
  ownerReferences:
    - apiVersion: apps/v1
      kind: StatefulSet
      name: grafana-agent-metrics-shard-1
      uid: 116802d1-f8a2-43d1-863e-6de8e5ca8150
      controller: true
      blockOwnerDeletion: true
  selfLink: /api/v1/namespaces/grafana-agent-system/pods/grafana-agent-metrics-shard-1-0
status:
  phase: Running
  conditions:
    - type: Initialized
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2022-05-23T11:39:49Z'
    - type: Ready
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2022-05-23T11:39:54Z'
    - type: ContainersReady
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2022-05-23T11:39:54Z'
    - type: PodScheduled
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2022-05-23T11:39:48Z'
  hostIP: 10.8.4.230
  podIP: 10.8.2.32
  podIPs:
    - ip: 10.8.2.32
  startTime: '2022-05-23T11:39:49Z'
  containerStatuses:
    - name: config-reloader
      state:
        running:
          startedAt: '2022-05-23T11:39:52Z'
      lastState: {}
      ready: true
      restartCount: 0
      image: quay.io/prometheus-operator/prometheus-config-reloader:v0.47.0
      imageID: >-
        quay.io/prometheus-operator/prometheus-config-reloader@sha256:0029252e7cf8cf38fc58795928d4e1c746b9e609432a2ee5417a9cab4633b864
      containerID: >-
        containerd://ade201bd0787e0e4ae9aeff887a091fad689eeec70406c65918e63908eb6a328
      started: true
    - name: grafana-agent
      state:
        running:
          startedAt: '2022-05-23T11:39:52Z'
      lastState: {}
      ready: true
      restartCount: 0
      image: docker.io/grafana/agent:v0.23.0
      imageID: >-
        docker.io/grafana/agent@sha256:a0beeaa6642c69efa472d509be2e2cf97dcb1c7e74047cca59a9452bf068f763
      containerID: >-
        containerd://8535bdd6e890f8028dc90eb68dab2ab317ee8e615b90322b81d6ce8e125e1438
      started: true
  qosClass: Burstable
spec:
  ...
Cilium Version
1.11.3
Kernel Version
5.4.188-104.359.amzn2.x86_64
Kubernetes Version
1.21 (v1.21.5-eks-9017834)
Sysdump
Note: this is a ZIP-compressed BZIP2 tar archive (I had to work around GitHub's upload restrictions: BZIP2 to reduce the size and ZIP to make GitHub accept the file format).
cilium-sysdump-20220524-105632.tar.bz2.zip
Relevant log output
> hubble observe -f --verdict DROPPED
May 24 08:49:02.049: 10.8.2.32:45754 <> kube-system/kube-state-metrics-2:8081 Policy denied DROPPED (TCP Flags: SYN)
May 24 08:49:02.049: 10.8.2.32:45754 <> kube-system/kube-state-metrics-2:8081 Policy denied DROPPED (TCP Flags: SYN)
May 24 08:49:03.077: 10.8.2.32:45754 <> kube-system/kube-state-metrics-2:8081 Policy denied DROPPED (TCP Flags: SYN)
May 24 08:49:03.077: 10.8.2.32:45754 <> kube-system/kube-state-metrics-2:8081 Policy denied DROPPED (TCP Flags: SYN)
May 24 08:49:05.093: 10.8.2.32:45754 <> kube-system/kube-state-metrics-2:8081 Policy denied DROPPED (TCP Flags: SYN)
May 24 08:49:05.093: 10.8.2.32:45754 <> kube-system/kube-state-metrics-2:8081 Policy denied DROPPED (TCP Flags: SYN)
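The drops are consistent with Cilium attributing the source IP 10.8.2.32 to a different (or no) identity, since the CEP of the grafana-agent Pod claims 10.8.142.14. As a rough way to double-check this from a Cilium agent Pod (placeholder pod name, exact output format varies by version):

# Which identity does the datapath currently hold for the affected Pod IP?
kubectl -n kube-system exec <cilium-agent-pod> -- cilium bpf ipcache get 10.8.2.32
# Which endpoint does the agent on the Pod's node manage for that IP?
kubectl -n kube-system exec <cilium-agent-pod-on-10.8.4.230> -- cilium endpoint list | grep 10.8.2.32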
Anything else?
NetworkPolicy
Just for completeness, the corresponding NetworkPolicy object:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    kustomize.toolkit.fluxcd.io/name: kube-state-metrics
    kustomize.toolkit.fluxcd.io/namespace: kube-system
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  ingress:
    - ports:
        - protocol: TCP
          port: 8080
        - protocol: TCP
          port: 8081
      from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: grafana-agent
          namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: grafana-agent-system
    - ports:
        - protocol: TCP
          port: 8080
        - protocol: TCP
          port: 8081
      from:
        - podSelector: {}
  egress:
    - ports:
        - protocol: TCP
          port: 443
        - protocol: TCP
          port: 6443
  policyTypes:
    - Ingress
    - Egress
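Since Cilium enforces this policy by identity rather than by IP, the policy itself appears fine; traffic from a throwaway Pod carrying the same app.kubernetes.io/name=grafana-agent label in the same namespace should be allowed, because that Pod gets a fresh, consistent CEP. A hedged check (the kube-state-metrics Service name and port path are assumptions, only the Pod names appear in the hubble output above):

kubectl -n grafana-agent-system run policy-test --rm -it --restart=Never \
  --labels=app.kubernetes.io/name=grafana-agent \
  --image=curlimages/curl --command -- \
  curl -sS --max-time 3 http://kube-state-metrics.kube-system.svc:8081/metrics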
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 19 (9 by maintainers)
Commits related to this issue
- k8s/watchers: add uid to patch request document. This is intended to prevent endpoints from overwriting ciliumendpoints that have the same name but are being managed by a new endpoint sync. This can... — committed to tommyp1ckles/cilium by tommyp1ckles 2 years ago
- pkg/watchers: prevent endpoints overwriting existing ciliumendpoints. Prevents endpointsynchronizer from taking ownership and managing ciliumendpoints, except in the case of endpoint restore where th... — committed to tommyp1ckles/cilium by tommyp1ckles 2 years ago
- k8s/watchers: add uid to patch request document. This is intended to prevent endpoints from overwriting ciliumendpoints that have the same name but are being managed by a new endpoint sync. This can... — committed to cilium/cilium by tommyp1ckles 2 years ago
- pkg/watchers: prevent endpoints overwriting existing ciliumendpoints. Prevents endpointsynchronizer from taking ownership and managing ciliumendpoints, except in the case of endpoint restore where th... — committed to cilium/cilium by tommyp1ckles 2 years ago
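For context on the "add uid to patch request document" commit above: including metadata.uid in a patch acts as a precondition on the Kubernetes API server, so a stale endpoint synchronizer that still holds the UID of an old CEP gets a conflict instead of silently overwriting the CEP that was created for the new Pod. A rough illustration with kubectl (not the actual Cilium code; the UID is a placeholder):

# A merge patch carrying a metadata.uid that no longer matches the live object is
# rejected by the API server with a 409 Conflict ("Precondition failed: UID ...")
# instead of being applied:
kubectl -n grafana-agent-system patch ciliumendpoint grafana-agent-metrics-shard-1-0 \
  --type merge \
  -p '{"metadata":{"uid":"<uid-of-the-old-cep>","labels":{"example":"value"}}}'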
And another one, now even with timestamps:
Let me know if you need more details. Generally about the cluster: it is made up of 128-thread workers that run a very mixed workload, from quickly spawning CI jobs to hundreds of tiny apps and long-running big applications, and there is constant load on it, so it is conceivable that Cilium sees many events per second.
I have been working on making a proof of concept to reproduce this: https://github.com/timbuchwaldt/cilium-19931-poc
I have seen it generate transient differences where CEPs and Pods went out of sync, and at least two cases where the mismatch became stable and stayed like that.
You have to apply the sts.yaml to create a 2-pod StatefulSet running nginx, alter scheduler.sh to select only schedulable nodes (our workers have this brawn label) and run it, then start run.sh to begin restarting the pods. Upon success the run.sh script aborts, showing the diff between Pod IPs and CEP IPs. Further validation is needed to see whether the mismatch stays stable; I recommend running both kubectl get pods -o wide -w as well as kubectl get cep -o wide -w to watch the changes occurring. We are still working on making this reproduce the problem more cleanly, as it is currently unclear whether it happens when the pods are scheduled to the same node or different nodes, whether labels have any effect, and so on. I'll keep this updated in case we get any closer to the actual problem.
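The gist of the comparison run.sh performs can be sketched roughly as follows (a minimal bash sketch, not the actual script from the PoC repo; the namespace, the app=nginx label, and the 30 s settle time are assumptions):

#!/usr/bin/env bash
# Restart the PoC StatefulSet pods in a loop and stop once Pod IPs and CEP IPs diverge.
NS=default   # namespace of the PoC StatefulSet (assumption)
while true; do
  kubectl -n "$NS" delete pod -l app=nginx --wait=false
  sleep 30   # give the pods time to come back up and Cilium time to sync the CEPs
  pods=$(kubectl -n "$NS" get pods -l app=nginx \
    -o jsonpath='{range .items[*]}{.metadata.name}={.status.podIP}{"\n"}{end}' | sort)
  ceps=$(kubectl -n "$NS" get cep \
    -o jsonpath='{range .items[*]}{.metadata.name}={.status.networking.addressing[0].ipv4}{"\n"}{end}' | sort)
  if [ "$pods" != "$ceps" ]; then
    echo "Pod IPs and CEP IPs diverged:"
    diff <(echo "$pods") <(echo "$ceps")
    break
  fi
done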
Update:
Caught another stable reproduction:
In this case both pods were selected=false, the k8s Endpoints objects were correct, and the CEPs were and stayed out of sync with the actual values. Our cluster is (mostly) on 1.23.7, running Cilium 1.12.1.
@aanm No (as mentioned above), we cannot reliably reproduce this (yet). Nevertheless, this should not happen, and maybe the reconciliation loop should be fixed to regularly check whether the CiliumEndpoint still matches the Pod.