falco: Falco runtime error in k8s_replicationcontroller_handler_state for large k8s clusters (400+ nodes)
Describe the bug
We upgraded from falco:0.28.1 to falco:0.31.0 due to this bug in large k8s environments and we seem to have hit a new runtime error. We’re now seeing:
* Setting up /usr/src links from host
* Running falco-driver-loader for: falco version=0.31.0, driver version=319368f1ad778691164d33d59945e00c5752cd27
* Running falco-driver-loader with: driver=module, compile=yes, download=yes
* Unloading falco module, if present
* Looking for a falco module locally (kernel 5.4.149-73.259.amzn2.x86_64)
* Trying to download a prebuilt falco module from https://download.falco.org/driver/319368f1ad778691164d33d59945e00c5752cd27/falco_amazonlinux2_5.4.149-73.259.amzn2.x86_64_1.ko
* Download succeeded
* Success: falco module found and inserted
Rules match ignored syscall: warning (ignored-evttype):
loaded rules match the following events: access,brk,close,cpu_hotplug,drop,epoll_wait,eventfd,fcntl,fstat,fstat64,futex,getcwd,getdents,getdents64,getegid,geteuid,getgid,getpeername,getresgid,getresuid,getrlimit,getsockname,getsockopt,getuid,infra,k8s,llseek,lseek,lstat,lstat64,mesos,mmap,mmap2,mprotect,munmap,nanosleep,notification,page_fault,poll,ppoll,pread,preadv,procinfo,pwrite,pwritev,read,readv,recv,recvmmsg,select,semctl,semget,semop,send,sendfile,sendmmsg,setrlimit,shutdown,signaldeliver,splice,stat,stat64,switch,sysdigevent,timerfd_create,write,writev;
but these events are not returned unless running falco with -A
2022-02-17T22:44:13+0000: Runtime error: SSL Socket handler (k8s_replicationcontroller_handler_state): Connection closed.. Exiting.
We downgraded to falco:0.30.0 which does not have the runtime error.
How to reproduce it
Upgrade to falco:0.31.0 and scale your Kubernetes cluster to around 400 nodes.
Expected behaviour
No runtime error
Screenshots
Environment
- Falco version: 0.31.0
- System info:
- Cloud provider or hardware configuration: EKS v1.21.2 / ec2 instance size - r5dn.4xlarge
- OS: Amazon Linux 2
- Kernel: 5.4.149-73.259.amzn2.x86_64
- Installation method: Kubernetes
Additional context
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 30 (9 by maintainers)
I discussed this issue on the Falco Community Call today, so I’m sharing some of the information from that call for others who may be impacted.
As a workaround, you can consider removing the “-k <url>” command-line option. I was under the impression that this option was used to grab all the (non-audit) k8s.* metadata, but this is not the case. With or without this switch, Falco will pull a subset of information from the local kubelet API (perhaps based on the uppercase -K switch, but I’m unsure). Without the lowercase “-k” switch, Falco will not be able to retrieve some metadata that is only available from the cluster API, which I believe to be the following field types (from https://falco.org/docs/rules/supported-fields/): k8s.rc.*, k8s.svc.*, k8s.rs.*, k8s.deployment.*
Check your rules to determine whether you are using any of these, and if not, you can probably remove that switch as a workaround and get yourself back up and running until this is fixed.
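One way to run that check is to scan the rules files for the four field prefixes listed above. The following is a minimal sketch in Python, assuming the rules are available as plain text; the function name `find_cluster_api_fields` and the sample rule are hypothetical, and the prefix list is only the one believed above to depend on the cluster API.

```python
import re

# Field prefixes that (per the comment above) are believed to need the cluster API.
CLUSTER_API_PREFIXES = ("k8s.rc.", "k8s.svc.", "k8s.rs.", "k8s.deployment.")

# Build one pattern that matches any field starting with one of the prefixes,
# e.g. k8s.rs.name or k8s.deployment.label.app
_FIELD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in CLUSTER_API_PREFIXES) + r")[\w.]+"
)


def find_cluster_api_fields(rules_text):
    """Return the sorted, de-duplicated field names found in the rules text."""
    return sorted(set(_FIELD_RE.findall(rules_text)))


# Hypothetical rule, used only to illustrate the scan.
sample = """
- rule: Example rule (hypothetical)
  condition: spawned_process and k8s.deployment.name = "web"
  output: "shell in pod (pod=%k8s.pod.name rs=%k8s.rs.name)"
"""

print(find_cluster_api_fields(sample))
```

An empty result for all of your rules files suggests the switch can probably be dropped; any hit means those rules would lose that metadata without it.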
I was able to solve this issue by cleaning up the old ReplicaSets (I had about 5k of these).
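To get a feel for how many ReplicaSets would be affected before cleaning anything up, the output of `kubectl get rs --all-namespaces -o json` can be filtered offline. The sketch below assumes that "old" means a ReplicaSet scaled down to zero replicas, which is an assumption and not something stated in this thread; the function name `stale_replicasets` and the sample data are hypothetical. It only lists candidates and deletes nothing.

```python
import json


def stale_replicasets(rs_list):
    """Return (namespace, name) pairs for ReplicaSets with zero desired
    and zero current replicas.

    Assumption: a ReplicaSet scaled to zero is an "old" one. Check this
    against your own workloads before deleting anything.
    """
    stale = []
    for item in rs_list.get("items", []):
        # A missing spec.replicas is treated as 1 so the item is not listed.
        desired = item.get("spec", {}).get("replicas", 1)
        current = item.get("status", {}).get("replicas", 0)
        if desired == 0 and current == 0:
            meta = item["metadata"]
            stale.append((meta.get("namespace", "default"), meta["name"]))
    return stale


# Hypothetical data in the shape returned by `kubectl get rs -o json`.
sample_json = """
{"items": [
  {"metadata": {"namespace": "web", "name": "app-5d9c"},
   "spec": {"replicas": 0}, "status": {"replicas": 0}},
  {"metadata": {"namespace": "web", "name": "app-7f6b"},
   "spec": {"replicas": 3}, "status": {"replicas": 3}}
]}
"""

candidates = stale_replicasets(json.loads(sample_json))
print(len(candidates), candidates)
```

Against a real cluster, the same function could be fed `json.load(sys.stdin)` with the kubectl output piped in.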
I am seeing a similar issue.
This workaround worked for us. Thanks!
falco: v0.32.2
Kubernetes (EKS): v1.21.14
Same here: I have the same experience with the 5.9.2022 :latest image and with 0.32.2, on a GCP cluster with 25 nodes.
The workaround worked. The number of old ReplicaSets deleted was 2600, and then all was fine. Deleting them manually on other environments is not an option!
I am using the --k8s-node filter option, but I suspect Falco does not honor it when reading these ReplicaSets… see
Some other pods managed to read only 506 RS and then failed.
The only error it throws, and does not recover from, is:
Again the setup:
We have the same issue with 0.32.1:
Runtime error: SSL Socket handler (k8s_daemonset_handler_state): Connection closed… Exiting
Hi @epcim, since Falco 0.32.1 you can have more debug info by adding the following args to Falco:
I actually got the same issue even with fewer nodes (25/30):
Runtime error: SSL Socket handler (k8s_replicationcontroller_handler_state): Connection closed.. Exiting.
Environment:
- Falco version: 0.31.1
- OpenShift: 4.8.35