falco: Falco would occasionally restart due to a connection error to the Kubernetes API service
What happened:
Falco agents would occasionally restart due to a connection error to the Kubernetes API service at kubernetes.default.svc.cluster.local (10.195.238.1). This seems to happen more frequently on busy clusters.
...
Sun Nov 17 03:12:13 2019: Runtime error: Error during connection attempt to https://10.195.238.1 (socket=122, error=111): Connection refused. Exiting.
...
Sun Nov 17 03:08:02 2019: Runtime error: Socket handler (k8s_namespace_handler_event) an error occurred while connecting to https://10.195.238.1: Connection refused. Exiting.
What you expected to happen: Falco should run without overloading the Kubernetes API service (if that is indeed what is causing the connection issues to the API server).
How to reproduce it (as minimally and precisely as possible): We deployed Falco agents (as a DaemonSet) to each node (running COS) of a GKE cluster. The cluster has around 20 nodes with roughly 20 pods running on each. We have two customized rules that aggressively report every process execution and file write access (a rough sketch of such rules is shown below).
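For illustration only, the snippet below sketches what rules of that kind could look like; the rule names, conditions, and outputs are assumptions, not the actual rules deployed in this cluster.

```yaml
# Hypothetical, intentionally noisy rules: report every spawned process and every
# file opened for writing. Not the cluster's actual rules.
- rule: Report All Process Executions
  desc: Fire on every spawned process
  condition: evt.type = execve and evt.dir = <
  output: "Process spawned (user=%user.name command=%proc.cmdline container=%container.id)"
  priority: NOTICE

- rule: Report All File Write Opens
  desc: Fire on every file opened for writing
  condition: evt.type in (open, openat) and evt.is_open_write=true and fd.typechar='f'
  output: "File opened for writing (file=%fd.name command=%proc.cmdline container=%container.id)"
  priority: NOTICE
```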
Anything else we need to know?: Initially, we suspected that our two aggressive rules caused Falco to hit the metadata service so hard that it would become unavailable at times.
We noticed two kinds of network traffic going from the agent to Kubernetes services (kubernetes.default.svc.cluster.local, kube-dns.kube-system.svc.cluster.local). In the first kind, Falco appears to resolve the domain names of mining pools every five minutes:
20:56:23.256627 IP falco-daemonset-nxb72.38144 > kube-dns.kube-system.svc.cluster.local.53: 54927+ A? mine.moneropool.com.falco.svc.cluster.local. (61)
20:56:23.258072 IP kube-dns.kube-system.svc.cluster.local.53 > falco-daemonset-nxb72.38144: 54927 NXDomain 0/1/0 (154)
20:56:23.258194 IP falco-daemonset-nxb72.38144 > kube-dns.kube-system.svc.cluster.local.53: 18838+ AAAA? mine.moneropool.com.falco.svc.cluster.local. (61)
20:56:23.259073 IP kube-dns.kube-system.svc.cluster.local.53 > falco-daemonset-nxb72.38144: 18838 NXDomain 0/1/0 (154)
20:56:23.260583 IP falco-daemonset-nxb72.35302 > kube-dns.kube-system.svc.cluster.local.53: 7232+ A? mine.moneropool.com.svc.cluster.local. (55)
20:56:23.261660 IP kube-dns.kube-system.svc.cluster.local.53 > falco-daemonset-nxb72.35302: 7232 NXDomain 0/1/0 (148)
In the second kind, the agent hits the Kubernetes API service many times per second. With 20 agents running in this cluster, the total request volume is amplified accordingly.
00:34:51.359118 IP kubernetes.default.svc.cluster.local.443 > falco-daemonset-6xq4c.60896: Flags [P.], seq 2129670:2133795, ack 1, win 285, options [nop,nop,TS val 3858423463 ecr 1533327183], length 4125
00:34:51.359131 IP falco-daemonset-6xq4c.60896 > kubernetes.default.svc.cluster.local.443: Flags [.], ack 2133795, win 1945, options [nop,nop,TS val 1533328911 ecr 3858423463], length 0
00:34:51.359135 IP kubernetes.default.svc.cluster.local.443 > falco-daemonset-6xq4c.60896: Flags [P.], seq 2133795:2139997, ack 1, win 285, options [nop,nop,TS val 3858423463 ecr 1533327183], length 6202
00:34:51.359139 IP falco-daemonset-6xq4c.60896 > kubernetes.default.svc.cluster.local.443: Flags [.], ack 2139997, win 1911, options [nop,nop,TS val 1533328911 ecr 3858423463], length 0
00:34:51.359259 IP kubernetes.default.svc.cluster.local.443 > falco-daemonset-6xq4c.60896: Flags [P.], seq 2139997:2140028, ack 1, win 285, options [nop,nop,TS val 3858423463 ecr 1533328911], length 31
00:34:51.359262 IP falco-daemonset-6xq4c.60896 > kubernetes.default.svc.cluster.local.443: Flags [.], ack 2140028, win 1911, options [nop,nop,TS val 1533328911 ecr 3858423463], length 0
00:34:51.637833 IP kubernetes.default.svc.cluster.local.443 > falco-daemonset-6xq4c.60896: Flags [P.], seq 2140028:2144153, ack 1, win 285, options [nop,nop,TS val 3858423742 ecr 1533328911], length 4125
00:34:51.637863 IP falco-daemonset-6xq4c.60896 > kubernetes.default.svc.cluster.local.443: Flags [.], ack 2144153, win 1879, options [nop,nop,TS val 1533329189 ecr 3858423742], length 0
00:34:51.637868 IP kubernetes.default.svc.cluster.local.443 > falco-daemonset-6xq4c.60896: Flags [P.], seq 2144153:2149731, ack 1, win 285, options [nop,nop,TS val 3858423742 ecr 1533328911], length 5578
00:34:51.637871 IP falco-daemonset-6xq4c.60896 > kubernetes.default.svc.cluster.local.443: Flags [.], ack 2149731, win 1841, options [nop,nop,TS val 1533329190 ecr 3858423742], length 0
00:34:51.638019 IP kubernetes.default.svc.cluster.local.443 > falco-daemonset-6xq4c.60896: Flags [P.], seq 2149731:2149762, ack 1, win 285, options [nop,nop,TS val 3858423742 ecr 1533329189], length 31
00:34:51.638024 IP falco-daemonset-6xq4c.60896 > kubernetes.default.svc.cluster.local.443: Flags [.], ack 2149762, win 1841, options [nop,nop,TS val 1533329190 ecr 3858423742], length 0
We tried setting the pods' dnsConfig ndots option to 1 to avoid unnecessary search-domain lookups against kube-dns, and, perhaps coincidentally, the connection refusals have disappeared since then (see the sketch below).
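A minimal sketch, assuming a DaemonSet named falco-daemonset in a falco namespace (names and image tag are illustrative), of where the ndots option goes in the pod spec:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: falco-daemonset   # illustrative name
  namespace: falco
spec:
  selector:
    matchLabels:
      app: falco
  template:
    metadata:
      labels:
        app: falco
    spec:
      # ndots: 1 makes the resolver send names such as mine.moneropool.com upstream
      # directly instead of first trying the cluster search domains via kube-dns.
      dnsConfig:
        options:
          - name: ndots
            value: "1"
      containers:
        - name: falco
          image: falcosecurity/falco:0.18.0   # assumed tag
          securityContext:
            privileged: true
```

With the default dnsPolicy of ClusterFirst, these options are merged into the pod's /etc/resolv.conf.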
Environment:
- Falco version: falco-probe-bpf-0.18.0-x86_64-4.14.137
- System info
# falco --support | jq .system_info
Fri Nov 22 19:21:42 2019: Falco initialized with configuration file /etc/falco/falco.yaml
Fri Nov 22 19:21:42 2019: Loading rules from file /etc/falco/falco_rules.yaml:
Fri Nov 22 19:21:42 2019: Loading rules from file /etc/falco/falco_rules.local.yaml:
{
"machine": "x86_64",
"nodename": "falco-daemonset-nr758",
"release": "4.14.137+",
"sysname": "Linux",
"version": "#1 SMP Thu Aug 8 02:47:02 PDT 2019"
}
- Cloud provider or hardware configuration: GCP, n1-standard-4, Container-Optimized OS build 11647.267.0
- OS (e.g. cat /etc/os-release):
PRETTY_NAME="Debian GNU/Linux bullseye/sid"
NAME="Debian GNU/Linux"
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
- Kernel (e.g. uname -a):
Linux falco-daemonset-nr758 4.14.137+ #1 SMP Thu Aug 8 02:47:02 PDT 2019 x86_64 GNU/Linux
- Install tools (e.g. in kubernetes, rpm, deb, from source):
Kubernetes master node 1.13.11-gke.14
Kubernetes node 1.13.10-gke.0
- Others:
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 3
- Comments: 32 (5 by maintainers)
Investigated this a bit more, and after talking with @leodido we agreed that the problem here is that the library we use to enrich events with Kubernetes metadata (libsinsp) does not have an extensive enough exception hierarchy to let us tell whether an exception was thrown because of a connection error or something else.
Connection errors to Kubernetes can happen (the network can fail), but we should log the error instead of stopping Falco. At the moment we have no way to distinguish the types of errors that should stop Falco from the types that should not.
The interesting exception is here: https://github.com/draios/sysdig/blob/master/userspace/libsinsp/socket_handler.h#L1211
To solve this, we need to connect with the community behind libsinsp and make a plan with them to make the exceptions more explicit. This also needs at least a discussion on the community call, because it can have a big impact on how Falco works.
In the meantime, what we suggest is to let Kubernetes restart Falco when the errors occur.
Thanks to @leogr I found what was causing my issues: the ClusterRole configuration! Running on GKE, I fixed the ClusterRole used by the Falco service account (see the sketch below). Result: no restarts in the last 3 days 😃
Unfortunately, Falco only logged a "connection timeout", while in fact it had no rights to get info about deployments, jobs, statefulsets, and so on. I think the log message could be improved on the Falco side, to better identify whether the problem is bad communication with the k8s API server or something else (like, in my case, missing rights).
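For reference, a minimal sketch of the kind of read-only ClusterRole Falco's service account typically needs for metadata enrichment; the resource list below is an assumption, not the exact manifest used above:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: falco-cluster-role   # illustrative name
rules:
  # Core objects used for metadata enrichment (assumed list).
  - apiGroups: [""]
    resources: ["nodes", "namespaces", "pods", "services", "events", "replicationcontrollers"]
    verbs: ["get", "list", "watch"]
  # Workload controllers such as deployments, daemonsets, replicasets, statefulsets.
  - apiGroups: ["apps", "extensions"]
    resources: ["daemonsets", "deployments", "replicasets", "statefulsets"]
    verbs: ["get", "list", "watch"]
```

A ClusterRoleBinding is also needed to bind this role to the service account the Falco DaemonSet pods run under.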
I’m have the same issue on GKE
falco version: 0.26.2 k8s version: 1.17.14-gke400
@fntlnz is there a way to increase the timeout?
/remove-lifecycle stale
yes please @axot - it would be useful!