cilium: Memory leak with FQDN policies

Bug report

General Information

  • Cilium version (run cilium version)
Client: 1.9.5 079bdaf 2021-03-10T13:12:19-08:00 go version go1.15.8 linux/amd64
Daemon: 1.9.5 079bdaf 2021-03-10T13:12:19-08:00 go version go1.15.8 linux/amd64
  • Kernel version (run uname -a)
Linux 5.8.0-40-lowlatency #45~20.04.1-Ubuntu SMP PREEMPT Fri Jan 15 12:34:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Orchestration system version in use (e.g. kubectl version, …)
v1.20.4+k3s1

Description

One pod has a memory leak. The same thing happened a few days ago with another node (pod); restarting the pod helped.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 33 (32 by maintainers)

Most upvoted comments

@yuriydzobak Thanks, so I’m assuming you’re using at least DNS policies? How many policies and what does the application activity look like in terms of DNS?

Yes, and coredns-nodecache with local-redirect

---
apiVersion: "cilium.io/v2"
kind: CiliumLocalRedirectPolicy
metadata:
  name: "nodelocaldns"
  namespace: kube-system
spec:
  redirectFrontend:
    serviceMatcher:
      serviceName: kube-dns
      namespace: kube-system
  redirectBackend:
    localEndpointSelector:
      matchLabels:
        k8s-app: node-local-dns
    toPorts:
      - port: "53"
        name: dns
        protocol: UDP
      - port: "53"
        name: dns-tcp
        protocol: TCP
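
To confirm the redirect is actually active, the agent's list of local redirect policies can be checked from inside a Cilium pod; the pod name below is a placeholder:

kubectl -n kube-system exec cilium-xxxxx -- cilium lrp list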

and

apiVersion: "cilium.io/v2"
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: 07-allow-kube-dns
specs:
  - description: "Policy for ingress allow to kube-dns from all PODs in the cluster"
    endpointSelector:
      matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
        k8s-app: kube-dns
    ingress:
      - fromEndpoints:
          - {}
        toPorts:
          - ports:
              - port: "53"
                protocol: ANY
  - description: "Policy for ingress allow to coredns-nodecache from all PODs in the cluster"
    endpointSelector:
      matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
        k8s-app: coredns-nodecache
    ingress:
      - fromEndpoints:
          - {}
        toPorts:
          - ports:
              - port: "53"
                protocol: ANY
  - description: "Policy for egress allow from any PODs in the cluster to kube-dns"
    endpointSelector: {}
    egress:
      - toEndpoints:
          - matchLabels:
              k8s:io.kubernetes.pod.namespace: kube-system
              k8s:k8s-app: kube-dns
        toPorts:
          - ports:
              - port: "53"
                protocol: ANY
            rules:
              dns:
                - matchPattern: "*"
  - description: "Policy for egress allow from any PODs in the cluster to coredns-nodecache"
    endpointSelector: {}
    egress:
      - toEndpoints:
          - matchLabels:
              k8s:io.kubernetes.pod.namespace: kube-system
              k8s:k8s-app: coredns-nodecache
        toPorts:
          - ports:
              - port: "53"
                protocol: ANY
            rules:
              dns:
                - matchPattern: "*"
  - description: "Policy for egress allow from any PODs in the cluster to node-local-dns"
    endpointSelector: {}
    egress:
      - toEndpoints:
          - matchLabels:
              k8s:io.kubernetes.pod.namespace: kube-system
              k8s:k8s-app: node-local-dns
        toPorts:
          - ports:
              - port: "53"
                protocol: ANY
            rules:
              dns:
                - matchPattern: "*"

The application CNP includes this egress fragment:

      - toFQDNs:
          - matchName: "example.domain.com"
        toPorts:
          - ports:
              - port: "80"
                protocol: TCP
              - port: "443"
                protocol: TCP
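
For context, that fragment sits under the egress section of a full CiliumNetworkPolicy. A minimal sketch, assuming a policy name, namespace, and endpoint selector that are not in the original report:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-example-domain"   # assumed name
  namespace: default             # assumed namespace
spec:
  endpointSelector:
    matchLabels:
      app: example-app           # assumed application labels
  egress:
    - toFQDNs:
        - matchName: "example.domain.com"
      toPorts:
        - ports:
            - port: "80"
              protocol: TCP
            - port: "443"
              protocol: TCP

The rules.dns matchPattern: "*" entries in the clusterwide policy above are what give the agent the DNS visibility that toFQDNs selectors rely on.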

Also, the application uses S3 and a Service of type ExternalName:

logstash                                         ExternalName   <none>          logstash.domain.com
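
That output corresponds to a Service of type ExternalName, declared roughly as below; the namespace is an assumption based on the logstash.monitoring.svc.cluster.local name that shows up in the GC log further down:

apiVersion: v1
kind: Service
metadata:
  name: logstash
  namespace: monitoring          # assumed from the FQDN in the log below
spec:
  type: ExternalName
  externalName: logstash.domain.com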

I see this in the logs from time to time, but I think it's ok:

....
....
level=info msg="FQDN garbage collector work deleted N name entries: echo-a.monitoring.svc.cluster.local.,www.google.com.,google.com..,logstash.dc01.lf.,logstash.monitoring.svc.cluster.local.,notification-controller.gotk-system.svc.cluster.local." controller=dns-garbage-collector-job subsys=daemon
......
......
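
One way to check whether the FQDN cache itself keeps growing is to dump it from the affected agent and watch the entry count over time; the pod name below is a placeholder:

kubectl -n kube-system exec cilium-xxxxx -- cilium fqdn cache list | wc -l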

@yuriydzobak can you check with 1.10.2 to see if the issue still persists?

Let’s wait until #16236 is also fixed then

@aanm It seems the bug still exists in 1.9.8.

I updated Cilium to version 1.9.8 on two clusters; I think we need to wait a couple of days.
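
For reference, the version bump on each cluster can be done with Helm along these lines; the release name, namespace, and --reuse-values flag are assumptions about the local setup:

helm upgrade cilium cilium/cilium --version 1.9.8 --namespace kube-system --reuse-values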