cilium: Memory leak with FQDN policies

Bug report

General Information

  • Cilium version (run cilium version)
Client: 1.9.5 079bdaf 2021-03-10T13:12:19-08:00 go version go1.15.8 linux/amd64
Daemon: 1.9.5 079bdaf 2021-03-10T13:12:19-08:00 go version go1.15.8 linux/amd64
  • Kernel version (run uname -a)
Linux 5.8.0-40-lowlatency #45~20.04.1-Ubuntu SMP PREEMPT Fri Jan 15 12:34:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Orchestration system version in use (e.g. kubectl version, …)
v1.20.4+k3s1

Description

One pod has a memory leak. The same thing happened a few days ago with another node (pod); restarting the pod helped.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 33 (32 by maintainers)

Most upvoted comments

@yuriydzobak Thanks, so I’m assuming you’re using at least DNS policies? How many policies and what does the application activity look like in terms of DNS?

Yes, and coredns-nodecache with local-redirect

---
apiVersion: "cilium.io/v2"
kind: CiliumLocalRedirectPolicy
metadata:
  name: "nodelocaldns"
  namespace: kube-system
spec:
  redirectFrontend:
    serviceMatcher:
      serviceName: kube-dns
      namespace: kube-system
  redirectBackend:
    localEndpointSelector:
      matchLabels:
        k8s-app: node-local-dns
    toPorts:
      - port: "53"
        name: dns
        protocol: UDP
      - port: "53"
        name: dns-tcp
        protocol: TCP
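
To confirm the redirect is actually active, the agent's list of local redirect policies can be checked from inside a Cilium pod; the pod name below is a placeholder:

kubectl -n kube-system exec cilium-xxxxx -- cilium lrp list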

and

apiVersion: "cilium.io/v2"
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: 07-allow-kube-dns
specs:
  - description: "Policy for ingress allow to kube-dns from all PODs in the cluster"
    endpointSelector:
      matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
        k8s-app: kube-dns
    ingress:
      - fromEndpoints:
          - {}
        toPorts:
          - ports:
              - port: "53"
                protocol: ANY
  - description: "Policy for ingress allow to coredns-nodecache from all PODs in the cluster"
    endpointSelector:
      matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
        k8s-app: coredns-nodecache
    ingress:
      - fromEndpoints:
          - {}
        toPorts:
          - ports:
              - port: "53"
                protocol: ANY
  - description: "Policy for egress allow from any PODs in the cluster to kube-dns"
    endpointSelector: {}
    egress:
      - toEndpoints:
          - matchLabels:
              k8s:io.kubernetes.pod.namespace: kube-system
              k8s:k8s-app: kube-dns
        toPorts:
          - ports:
              - port: "53"
                protocol: ANY
            rules:
              dns:
                - matchPattern: "*"
  - description: "Policy for egress allow from any PODs in the cluster to coredns-nodecache"
    endpointSelector: {}
    egress:
      - toEndpoints:
          - matchLabels:
              k8s:io.kubernetes.pod.namespace: kube-system
              k8s:k8s-app: coredns-nodecache
        toPorts:
          - ports:
              - port: "53"
                protocol: ANY
            rules:
              dns:
                - matchPattern: "*"
  - description: "Policy for egress allow from any PODs in the cluster to node-local-dns"
    endpointSelector: {}
    egress:
      - toEndpoints:
          - matchLabels:
              k8s:io.kubernetes.pod.namespace: kube-system
              k8s:k8s-app: node-local-dns
        toPorts:
          - ports:
              - port: "53"
                protocol: ANY
            rules:
              dns:
                - matchPattern: "*"

The application CNP includes this egress fragment:

      - toFQDNs:
          - matchName: "example.domain.com"
        toPorts:
          - ports:
              - port: "80"
                protocol: TCP
              - port: "443"
                protocol: TCP
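
For context, that fragment sits under the egress section of a full CiliumNetworkPolicy. A minimal sketch, assuming a policy name, namespace, and endpoint selector that are not in the original report:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-example-domain"   # assumed name
  namespace: default             # assumed namespace
spec:
  endpointSelector:
    matchLabels:
      app: example-app           # assumed application labels
  egress:
    - toFQDNs:
        - matchName: "example.domain.com"
      toPorts:
        - ports:
            - port: "80"
              protocol: TCP
            - port: "443"
              protocol: TCP

The rules.dns matchPattern: "*" entries in the clusterwide policy above are what give the agent the DNS visibility that toFQDNs selectors rely on.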

Also, the application uses S3 and a Service of type ExternalName:

logstash                                         ExternalName   <none>          logstash.domain.com
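
That output corresponds to a Service of type ExternalName, declared roughly as below; the namespace is an assumption based on the logstash.monitoring.svc.cluster.local name that shows up in the GC log further down:

apiVersion: v1
kind: Service
metadata:
  name: logstash
  namespace: monitoring          # assumed from the FQDN in the log below
spec:
  type: ExternalName
  externalName: logstash.domain.com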

I see this in the logs from time to time, but I think it's ok:

....
....
level=info msg="FQDN garbage collector work deleted N name entries: echo-a.monitoring.svc.cluster.local.,www.google.com.,google.com..,logstash.dc01.lf.,logstash.monitoring.svc.cluster.local.,notification-controller.gotk-system.svc.cluster.local." controller=dns-garbage-collector-job subsys=daemon
......
......
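
One way to check whether the FQDN cache itself keeps growing is to dump it from the affected agent and watch the entry count over time; the pod name below is a placeholder:

kubectl -n kube-system exec cilium-xxxxx -- cilium fqdn cache list | wc -l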

@yuriydzobak can you check with 1.10.2 to see if the issue still persists?

Let’s wait until #16236 is also fixed then

@aanm It seems the bug still exists in 1.9.8.

I updated Cilium to version 1.9.8 on two clusters; I think we need to wait a couple of days.
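
For reference, the version bump on each cluster can be done with Helm along these lines; the release name, namespace, and --reuse-values flag are assumptions about the local setup:

helm upgrade cilium cilium/cilium --version 1.9.8 --namespace kube-system --reuse-values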