cilium: Cilium egress gateway drops returning traffic with INVALID IDENTITY (171)

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Egress Gateway (using the NAT policy CRD) drops the return traffic on the node where the originating pod runs.

What happens:

node1 (FORWARD) -> egress gateway (FORWARD) -> destination
node1 (DROP)    <- egress gateway (FORWARD) <- destination

where the pod that originates traffic resides on node1.
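
For context, a minimal policy of the kind referenced above looks roughly like the sketch below. This is not the policy from this cluster: the selector labels are only loosely based on the pod labels visible in the flow log further down, and the CIDR, egress IP, and apiVersion (which changed between Cilium releases) are placeholders.

# CiliumEgressNATPolicy sketch; all values are placeholders, not the real config
apiVersion: cilium.io/v2alpha1
kind: CiliumEgressNATPolicy
metadata:
  name: egress-sample
spec:
  egress:
  - podSelector:
      matchLabels:
        io.kubernetes.pod.namespace: infra   # namespace of the source pods
        app: minio                           # label on the source pods
  destinationCIDRs:
  - 10.140.14.0/24                           # external destination range (placeholder)
  egressSourceIP: 10.10.1.50                 # SNAT IP owned by the gateway node (placeholder)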

Cilium Version

1.11.6

Kernel Version

5.15.48-flatcar

Kubernetes Version

1.23

Sysdump

No response

Relevant log output

{"time":"2022-07-21T11:07:33.540030606Z","verdict":"FORWARDED","ethernet":{"source":"aa:08:f6:94:af:ea","destination":"c6:b7:d8:63:8f:07"},"IP":{"source":"10.0.6.113","destination":"10.140.14.106","ipVersion":"IPv4"},"l4":{"TCP":{"source_port":47434,"destination_port":80,"flags":{"SYN":true}}},"source":{"ID":3553,"identity":375,"namespace":"infra","labels":["k8s:app=minio","k8s:io.cilium.k8s.namespace.labels.argocd.argoproj.io/instance=argocd","k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=infra","k8s:io.cilium.k8s.namespace.labels.tanka.dev/environment=efcd11957e5ae2299d24b4b536353399e5414fd5f269fc18","k8s:io.cilium.k8s.policy.cluster=default","k8s:io.cilium.k8s.policy.serviceaccount=minio-sa","k8s:io.kompose.service=minio","k8s:io.kubernetes.pod.namespace=infra","k8s:release=minio","k8s:statefulset.kubernetes.io/pod-name=minio-0"],"pod_name":"minio-0","workloads":[{"name":"minio","kind":"StatefulSet"}]},"destination":{"identity":16777217,"labels":["cidr:0.0.0.0/1","reserved:world"]},"Type":"L3_L4","node_name":"minio1","event_type":{"type":4,"sub_type":4},"trace_observation_point":"TO_OVERLAY","interface":{"index":7,"name":"cilium_vxlan"},"Summary":"TCP Flags: SYN"}
{"time":"2022-07-21T11:07:33.541186788Z","verdict":"DROPPED","drop_reason":171,"ethernet":{"source":"32:ad:11:cb:0e:c1","destination":"16:47:a4:64:b5:4c"},"IP":{"source":"10.140.14.106","destination":"10.0.6.113","ipVersion":"IPv4"},"l4":{"TCP":{"source_port":80,"destination_port":47434,"flags":{"SYN":true,"ACK":true}}},"source":{"identity":1,"labels":["reserved:host"]},"destination":{"ID":3553,"identity":375,"namespace":"infra","labels":["k8s:app=minio","k8s:io.cilium.k8s.namespace.labels.argocd.argoproj.io/instance=argocd","k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=infra","k8s:io.cilium.k8s.namespace.labels.tanka.dev/environment=efcd11957e5ae2299d24b4b536353399e5414fd5f269fc18","k8s:io.cilium.k8s.policy.cluster=default","k8s:io.cilium.k8s.policy.serviceaccount=minio-sa","k8s:io.kompose.service=minio","k8s:io.kubernetes.pod.namespace=infra","k8s:release=minio","k8s:statefulset.kubernetes.io/pod-name=minio-0"],"pod_name":"minio-0","workloads":[{"name":"minio","kind":"StatefulSet"}]},"Type":"L3_L4","node_name":"minio1","event_type":{"type":1,"sub_type":171},"drop_reason_desc":"INVALID_IDENTITY","Summary":"TCP Flags: SYN, ACK"}

Anything else?

No response

Code of Conduct

  • I agree to follow this project’s Code of Conduct

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 20 (2 by maintainers)

Most upvoted comments

Hi, I have the same issue on Cilium 1.12.4.

  • Everything works fine when the pod runs on the same node that holds the egress IP
  • However, when the traffic is forwarded to another node, connectivity breaks: the packet appears to be dropped on the pod’s node

Relevant tcpdump on the host running the pod:


10:09:19.174365 lxc0213f60fa9fe In  IP 10.42.2.13.38184 > 10.10.0.113.smtp: Flags [S], seq 1524265224, win 64860, options [mss 1410,sackOK,TS val 2939330477 ecr 0,nop,wscale 7], length 0
10:09:19.174408 cilium_vxlan Out IP 10.42.2.13.38184 > 10.10.0.113.smtp: Flags [S], seq 1524265224, win 64860, options [mss 1410,sackOK,TS val 2939330477 ecr 0,nop,wscale 7], length 0

--- here the packet goes to the relevant egress node, and comes back ---

10:09:19.176641 cilium_vxlan P   IP 10.10.0.113.smtp > 10.42.2.13.38184: Flags [S.], seq 870859402, ack 1524265225, win 27960, options [mss 1410,sackOK,TS val 1263761709 ecr 2939330477,nop,wscale 7], length 0

--- We don't see the packet on the lxc iface ---

And the relevant cilium monitor -t drop logs:

xx drop (Invalid identity) flow 0x0 to endpoint 0, file unknown line 339, , identity host->unknown: 10.10.0.113:25 -> 10.42.2.13:38184 tcp SYN, ACK

Here’s the cilium status output:

KVStore:                 Ok   Disabled
Kubernetes:              Ok   1.24 (v1.24.8+rke2r1) [linux/amd64]
Kubernetes APIs:         ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEgressGatewayPolicy", "cilium/v2::CiliumEgressNATPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement:    Strict   [ens192 10.10.1.23]
Host firewall:           Disabled
CNI Chaining:            portmap
Cilium:                  Ok   1.12.4 (v1.12.4-6eaecaf)
NodeMonitor:             Listening for events on 4 CPUs with 64x4096 of shared memory
Cilium health daemon:    Ok  
IPAM:                    IPv4: 19/254 allocated from 10.42.2.0/24,
BandwidthManager:        Disabled
Host Routing:            BPF
Masquerading:            BPF   [ens192]   10.42.2.0/24 [IPv4: Enabled, IPv6: Disabled]
Controller Status:       104/104 healthy
Proxy Status:            OK, ip 10.42.2.10, 13 redirects active on ports 10000-20000
Global Identity Range:   min 256, max 65535
Hubble:                  Ok   Current/Max Flows: 4095/4095 (100.00%), Flows/s: 53.41   Metrics: Disabled
Encryption:              Disabled
Cluster health:          11/11 reachable   (2023-01-10T10:50:05Z)

I’m at your disposal if you need more information. Thanks!

Try changing this:

  - podSelector:
      matchLabels:
        io.kompose.service: charon-1

to this:

  selectors:
  - namespaceSelector:
      matchLabels:
        ns: charon-1

That works for us.
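
For completeness, here is roughly where that selectors block sits in a full CiliumEgressGatewayPolicy (the 1.12 CRD). Everything besides the namespaceSelector snippet above is a placeholder, not a confirmed working configuration:

# CiliumEgressGatewayPolicy sketch using a namespaceSelector; placeholder values
apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: egress-charon-1
spec:
  selectors:
  - namespaceSelector:
      matchLabels:
        ns: charon-1                # namespace label, as in the snippet above
  destinationCIDRs:
  - 0.0.0.0/0                       # placeholder destination range
  egressGateway:
    nodeSelector:
      matchLabels:
        egress-gateway: "true"      # placeholder label on the gateway node
    egressIP: 10.10.1.100           # placeholder egress IP held by that node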

I will try to follow up with somebody on Slack.

It’s not stale. Still an acute issue.