falco: Falco runtime error in k8s_replicationcontroller_handler_state for large k8s clusters (400+ nodes)

Describe the bug

We upgraded from falco:0.28.1 to falco:0.31.0 due to this bug in large k8s environments and we seem to have hit a new runtime error. We’re now seeing:

* Setting up /usr/src links from host
* Running falco-driver-loader for: falco version=0.31.0, driver version=319368f1ad778691164d33d59945e00c5752cd27
* Running falco-driver-loader with: driver=module, compile=yes, download=yes
* Unloading falco module, if present
* Looking for a falco module locally (kernel 5.4.149-73.259.amzn2.x86_64)
* Trying to download a prebuilt falco module from https://download.falco.org/driver/319368f1ad778691164d33d59945e00c5752cd27/falco_amazonlinux2_5.4.149-73.259.amzn2.x86_64_1.ko
* Download succeeded
* Success: falco module found and inserted
Rules match ignored syscall: warning (ignored-evttype):
         loaded rules match the following events: access,brk,close,cpu_hotplug,drop,epoll_wait,eventfd,fcntl,fstat,fstat64,futex,getcwd,getdents,getdents64,getegid,geteuid,getgid,getpeername,getresgid,getresuid,getrlimit,getsockname,getsockopt,getuid,infra,k8s,llseek,lseek,lstat,lstat64,mesos,mmap,mmap2,mprotect,munmap,nanosleep,notification,page_fault,poll,ppoll,pread,preadv,procinfo,pwrite,pwritev,read,readv,recv,recvmmsg,select,semctl,semget,semop,send,sendfile,sendmmsg,setrlimit,shutdown,signaldeliver,splice,stat,stat64,switch,sysdigevent,timerfd_create,write,writev;
         but these events are not returned unless running falco with -A
2022-02-17T22:44:13+0000: Runtime error: SSL Socket handler (k8s_replicationcontroller_handler_state): Connection closed.. Exiting.

We downgraded to falco:0.30.0 which does not have the runtime error.

How to reproduce it

Upgrade to falco:0.31.0 and scale your Kubernetes cluster to around 400 nodes.

Expected behaviour

No runtime error

Screenshots

Environment

Falco version: 0.31.0

System info:

Cloud provider or hardware configuration: EKS v1.21.2 / ec2 instance size - r5dn.4xlarge
OS: Amazon Linux 2

Kernel: 5.4.149-73.259.amzn2.x86_64

Installation method: Kubernetes

Additional context

About this issue

State: closed
Created 2 years ago
Comments: 30 (9 by maintainers)

Most upvoted comments

I discussed this issue on the Falco Community Call today, so I’m sharing some of the information from that call for others who may be impacted.

As a workaround, you can consider removing the “-k <url>” command-line option. I was under the impression that this option was used to grab all the (non-audit) k8s.* metadata, but this is not the case. With or without this switch, Falco will pull a subset of information from the local kubelet API (perhaps based on the uppercase -K switch, but I’m unsure). Without the lowercase “-k” switch, Falco will not be able to retrieve some metadata that is only available from the cluster API, which I believe to be the following field types (from https://falco.org/docs/rules/supported-fields/): k8s.rc.*, k8s.svc.*, k8s.rs.*, and k8s.deployment.*

Check your rules to determine whether you are using any of these, and if not, you can probably remove that switch as a workaround and get yourself back up and running until this is fixed.
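
A minimal sketch of that rules check, assuming the four field prefixes listed above are the complete set that needs the cluster API (the comment itself is not certain of this). The function name and the regular expression are illustrative and not part of Falco:

```python
import re
import sys
from pathlib import Path

# Field prefixes that the comment above believes require the cluster API (-k).
# Assumption: this list is complete; verify it against the supported-fields page.
CLUSTER_API_FIELDS = re.compile(r"\bk8s\.(?:rc|svc|rs|deployment)\.\w+")


def find_cluster_api_fields(text):
    """Return (line_number, field) pairs for every matching field in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Skip YAML comment lines so commented-out rules are not reported.
        if line.lstrip().startswith("#"):
            continue
        for match in CLUSTER_API_FIELDS.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    # Each command-line argument is treated as the path to a rules file.
    for path in sys.argv[1:]:
        for lineno, field in find_cluster_api_fields(Path(path).read_text()):
            print(f"{path}:{lineno}: {field}")
```

Run it against every rules file you load. If it prints nothing, none of the listed field types appear in those files, and removing the switch is probably safe for your rules.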

IanRobertson-wpe on Sep 7, 2022

I was able to solve this issue by cleaning up the old replicasets (had about 5k of these)
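
A rough sketch of how such candidates could be listed, assuming "old" means ReplicaSets that are scaled to zero (the comment does not say how they were selected). It takes the parsed JSON from `kubectl get rs -A -o json`; the function name is hypothetical:

```python
def stale_replicasets(rs_list):
    """Return (namespace, name) for ReplicaSets scaled down to zero replicas."""
    stale = []
    for item in rs_list.get("items", []):
        # spec.replicas defaults to 1 in Kubernetes when it is omitted.
        desired = item.get("spec", {}).get("replicas", 1)
        current = item.get("status", {}).get("replicas", 0)
        if desired == 0 and current == 0:
            meta = item["metadata"]
            stale.append((meta["namespace"], meta["name"]))
    return stale
```

Load the kubectl output with `json.load` and pass the result to the function. Review the list before deleting anything, since zero-replica ReplicaSets are what Deployments keep as rollback history. Lowering `spec.revisionHistoryLimit` on the owning Deployments reduces how many of them are retained.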

jefimm on Aug 13, 2022

I am seeing a similar issue:

2022-05-12T01:16:36+0000: Runtime error: SSL Socket handler (k8s_namespace_handler_state): Connection closed.. Exiting.

Running falco-driver-loader for: falco version=0.31.0, driver version=319368f1ad778691164d33d59945e00c5752cd27
Running falco-driver-loader with: driver=bpf, compile=yes, download=yes

vnandha on May 12, 2022

As a workaround, you can consider removing the “-k <url>” command-line option.

This workaround worked for us. Thanks!

Falco: v0.32.2
Kubernetes (EKS): v1.21.14

EigoOda on Sep 8, 2022

I have the same experience with :latest (as of 5.9.2022) and 0.32.2 on a GCP cluster with 25 nodes.

The workaround worked: after deleting 2600 old ReplicaSets, everything was fine. Deleting them manually in other environments is not an option, though!

I am using the --k8s-node filter option, but I suspect Falco does not honor it when reading these ReplicaSets; see:

# parsing these lines.. [libs]: K8s [ADDED, ReplicaSet, ...........
❯ k logs -n monitoring falco-bk4t5 -c falco -p |grep ReplicaSet | wc -l
2071

Some other pods managed to read only 506 ReplicaSets and then failed.
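
To compare how far each pod got before failing, the same count can be broken down by event type and resource kind. This is a sketch that assumes every libs log line follows the `[libs]: K8s [ADDED, ReplicaSet, ...` shape quoted above; only that one excerpt is known, so the pattern may need adjusting:

```python
import re
from collections import Counter

# Assumption: lines look like "... [libs]: K8s [<EVENT>, <Kind>, ..." as in the excerpt.
K8S_EVENT = re.compile(r"\[libs\]: K8s \[(\w+), (\w+),")


def count_k8s_events(lines):
    """Count libs K8s events, keyed by (event_type, resource_kind)."""
    counts = Counter()
    for line in lines:
        match = K8S_EVENT.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Feeding it the lines of one pod's saved log gives a per-kind tally, which makes it easier to see which resource type was being read when the connection closed.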

The only error it throws, and does not recover from, is:

Mon Sep  5 12:57:15 2022: [libs]: Error fetching K8s data: SSL Socket handler (k8s_daemonset_handler_state): Connection closed.
Mon Sep  5 12:57:15 2022: Runtime error: SSL Socket handler (k8s_daemonset_handler_state): Connection closed.. Exiting.

Again the setup:

    spec:
      containers:
      - args:
        - /usr/bin/falco
        - --cri
        - /run/containerd/containerd.sock
        - --cri
        - /run/crio/crio.sock
        - -K
        - /var/run/secrets/kubernetes.io/serviceaccount/token
        - -k
        - https://$(KUBERNETES_SERVICE_HOST)
        - --k8s-node
        - $(FALCO_K8S_NODE_NAME)
        - -pk
        - -o
        - libs_logger.enabled=true
        - -o
        - libs_logger.severity=info
        env:
        - name: FALCO_BPF_PROBE
        - name: FALCO_K8S_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName

@jasondellaluce could this get some attention? This is a real blocker. Even if it works now, on older clusters, even small ones, this will break any Falco deployment.

@mac-abdon could you please remove "(400+ nodes)" from the title and mention it somewhere else?

epcim on Sep 6, 2022

We have the same issue with 0.32.1: Runtime error: SSL Socket handler (k8s_daemonset_handler_state): Connection closed… Exiting

ranjithmr on Aug 24, 2022

Hi @epcim, since Falco 0.32.1 you can have more debug info by adding the following args to Falco:

-o libs_logger.enabled=true -o libs_logger.severity=trace

jasondellaluce on Jul 26, 2022

I actually got the same issue even with fewer nodes (25/30): Runtime error: SSL Socket handler (k8s_replicationcontroller_handler_state): Connection closed.. Exiting.

Environment:

Falco version: 0.31.1
OpenShift: 4.8.35

Diliz on May 4, 2022