falco: OOM on physical servers

Describe the bug

On 0.34.x releases we experience a memory leak on physical instances, while the same setup on AWS is fine. It could be due to the node workload, but it is still clearly a memory leak.

As of now the root cause has not been identified:

  • looking for help to do some memory profiling or to debug the issue
  • has anyone seen similar behavior?

How to reproduce it

This is a somewhat customised deployment (not Helm, etc.).

This is the config Falco is given (we do use more rules, but the problem happens with only the upstream ones, i.e. the rules from the falcosecurity/rules repo):

data:
  falco.yaml: |
    rules_file:
      - /etc/falco-upstream/falco_rules.yaml                          
      - /etc/falco/rules.d
    
    plugins:
    - name: json
      library_path: libjson.so
      init_config: ""
    
    load_plugins: []
    watch_config_files: true
    time_format_iso_8601: false
    
    
    json_include_output_property: true
    json_include_tags_property: true
    json_output: true
    log_stderr: true
    log_syslog: false
    # "alert", "critical", "error", "warning", "notice", "info", "debug".
    log_level: error
    libs_logger:
      enabled: false
      severity: debug # "info", "debug", "trace".
    priority: warning
    
    buffered_outputs: false
    syscall_buf_size_preset: 4
    syscall_event_drops:
      threshold: 0.1
      actions:
        - log
      rate: 0.03333
      max_burst: 1
      simulate_drops: false
    
    
    syscall_event_timeouts:
      max_consecutives: 1000
    
    webserver:
      enabled: true
      k8s_healthz_endpoint: /healthz
      listen_port: 64765
      ssl_enabled: false
      ssl_certificate: /volterra/secrets/identity/server.crt
      threadiness: 0
      #k8s_audit_endpoint: /k8s-audit
    
    output_timeout: 2000
    outputs:
      rate: 1
      max_burst: 1000
    syslog_output:
      enabled: false
    file_output:
      enabled: false
      keep_alive: false
      filename: ./events.txt
    stdout_output:
      enabled: true
    program_output:
      enabled: false
      keep_alive: false
      program: "jq '{text: .output}' | curl -d @- -X POST https://hooks.slack.com/services/XXX"
    http_output:
      enabled: true
      url: "http://falco-sidekick.monitoring.svc.cluster.local:64801/"
      user_agent: falcosecurity/falco
    grpc:
      enabled: false
      bind_address: unix:///run/falco/falco.sock
      threadiness: 0
    grpc_output:
      enabled: false
    
    metadata_download:
      max_mb: 100
      chunk_wait_us: 1000
      watch_freq_sec: 1
    
    modern_bpf:
      cpus_for_each_syscall_buffer: 2"
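
Since this is a hand-rolled (non-Helm) deployment, it is also worth double-checking whether the Falco container carries an explicit memory request/limit at all; without one, growth only shows up as node pressure. A minimal, hypothetical DaemonSet fragment (container name, image tag and values are placeholders, not a recommendation):

# Hypothetical fragment of the custom Falco DaemonSet manifest
containers:
  - name: falco
    image: falcosecurity/falco:0.34.1
    resources:
      requests:
        memory: 512Mi
      limits:
        memory: 1Gi   # bounds the growth and makes it visible as an OOMKill instead of node pressure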

Expected behaviour

Memory should drop at regular intervals instead of growing until the process is OOM-killed.

Screenshots

Cloud instances of Falco on AWS (OK behaviour; the screenshot is, I believe, from a 0.33.x version): image

Instances on physical servers (OOM, on 0.34.1). The nodes in the cluster are exactly the same, though only 2 of 4 are affected by the memory increase (possibly due to a specific workload). Surprisingly, the same metric does not match the pattern from the AWS/GCP nodes above: image

Environment

K8s, Falco in a container, on a physical server under load.

  • Falco version:
{"default_driver_version":"4.0.0+driver","driver_api_version":"3.0.0","driver_schema_version":"2.0.0","engine_version":"16","falco_version":"0.34.1","libs_version":"0.10.4","plugin_api_version":"2.0.0"}
  • System info:
{
  "machine": "x86_64",
  "nodename": "master-1",
  "release": "4.18.0-240.10.1.ves1.el7.x86_64",
  "sysname": "Linux",
  "version": "#1 SMP Tue Mar 30 15:02:49 UTC 2021"
}
  • Cloud provider or hardware configuration:
  • OS: /etc/os-release is not relevant; it’s basically CentOS but customised
  • Kernel:
root@master-1:/# uname -a
Linux master-1 4.18.0-240.10.1.ves1.el7.x86_64 #1 SMP Tue Mar 30 15:02:49 UTC 2021 x86_64 GNU/Linux

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Comments: 34 (26 by maintainers)

Most upvoted comments

I simulated a noisy Falco config on my developer Linux box; enabling most supported syscalls was sufficient to reproduce the memory issues:

- rule: test
  desc: test
  condition: evt.type!=close
  enabled: true
  output: '%evt.type %evt.num %proc.aname[5] %proc.name %proc.tty %proc.exepath %fd.name'
  priority: NOTICE

Using valgrind massif heap profiler:

sudo insmod driver/falco.ko
sudo valgrind --tool=massif \
         userspace/falco/falco -c ../../falco.yaml -r ../../falco_rules_test.yaml > /tmp/out

massif-visualizer massif.out.$PID

image

Reading the tbb API docs (https://oneapi-src.github.io/oneTBB/main/tbb_userguide/Concurrent_Queue_Classes.html), the variant we use is documented as: "By default, a concurrent_bounded_queue is unbounded. It may hold any number of values, until memory runs out." Currently we do not set a safety capacity, or better, expose it as a parameter.

Here is a staging branch to correct this: https://github.com/incertum/falco/tree/queue-capacity-outputs. What do you all think?

However, the root cause is rather that the entire event flow is too slow: in these extreme cases we simply don’t get to pop from the queue in time, as we are seeing timeouts and also noticed heavy kernel-side drops. The pipe just does not hold up when trying to monitor so many syscalls, even on a more or less idle laptop. I would suggest we re-audit the entire Falco processing and outputs engine and look for improvement areas, because when I did the same profiling with the libs sinsp-example binary, memory and output logs were pretty stable over time.
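
For reference, later Falco releases expose this capacity in falco.yaml. A minimal sketch of bounding the outputs queue; the key name is assumed from the post-0.36 default config, so verify it against the falco.yaml shipped with your release:

# Hedged sketch: bound the (by default unbounded) tbb outputs queue.
# Key name assumed from the 0.36+ default falco.yaml; check your release.
outputs_queue:
  capacity: 10000   # 0 keeps the historical unbounded behaviour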

Yes, I upgraded to 0.36.0 last week. The Falco container is still getting OOMKilled by kubernetes/cgroups (Last state: Terminated with 137: OOMKilled) with the default queue capacity config.

Unfortunately I have a hard time exposing the stats/metrics to our TSDB.
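
If getting the numbers into a TSDB is the blocker, one stopgap is Falco’s built-in periodic metrics snapshot, which can be emitted as a regular Falco event and therefore forwarded by Falcosidekick. A minimal sketch; the field names are assumed from the 0.36-era default falco.yaml, so verify them against your release:

# Hedged sketch: periodic internal metrics snapshot (key names assumed).
metrics:
  enabled: true
  interval: 15m                        # how often to emit a snapshot
  output_rule: true                    # emit the snapshot as a Falco event so it reaches Falcosidekick
  resource_utilization_enabled: true   # assumed key: includes Falco's own memory/CPU counters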

Hi @emilgelman, thanks, it is great news that you have cgroups v2. By the way, we now also have the base_syscalls config in falco.yaml for radical control over syscall monitoring; check it out.
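
For anyone wanting to try that, a minimal sketch of what it could look like; the custom_set/repair key names and semantics are as I recall them from the 0.35+ default falco.yaml, so treat them as assumptions and check the shipped config:

# Hedged sketch: restrict which syscalls the drivers push to userspace.
base_syscalls:
  custom_set: []   # empty: derive the set from the loaded rules (assumed semantics)
  repair: true     # let Falco add back the syscalls needed for its internal state (assumed semantics)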

However, I think we need to investigate in different places more drastically (meaning going back to the drawing board), as this has also been reported for plugins-only setups. In that case we merely do event filtering in libsinsp, so most of the libsinsp complexity does not apply, which narrows down the search space.

I am going to prioritize 👀 into it; it will likely take some time.


In addition, in case you are curious to learn more about the underlying libs and kernel drivers with respect to memory:

  • Yes, we do build up a process cache table in libsinsp, but we also hook into the scheduler process-exit tracepoint to purge items from the table again, else the memory would skyrocket in no time.
  • The same applies to the container engine; therefore I suspect it must be something much more subtle, while still being event-driven.
  • Then there is the discussion around absolute memory usage regardless of drifts over time. For example, we learned the hard way that the new eBPF ring buffer wrongly accounts memory twice; check out our conversation on the kernel mailing list. Adjusting parameters such as syscall_buf_size_preset and modern_bpf.cpus_for_each_syscall_buffer can help (see the sketch after this list); again, this is just some extra insight, a bit unrelated to the subtle drifts over time we are investigating in this issue. I am also still hoping to one day meet someone who knows all the answers regarding Linux kernel memory management and accounting; often it is not even clear what the right metric is and whether it accounts memory in a meaningful way.
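
For concreteness, both knobs already appear in the config pasted at the top of this issue; an illustrative (not recommended) variation, just to show where they live:

# Values are illustrative only; syscall_buf_size_preset selects one of the
# predefined per-buffer sizes, and the modern_bpf setting only applies when
# the modern eBPF probe is in use.
syscall_buf_size_preset: 3          # one preset smaller than the 4 used above
modern_bpf:
  cpus_for_each_syscall_buffer: 4   # share one ring buffer across more CPUs than the 2 configured above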

@incertum the host is running cgroups v2:

# stat -fc %T /sys/fs/cgroup/
cgroup2fs

I am experimenting with the effect of the rules configuration on this. Disabling all rules does not seem to reproduce the issue, so I’m trying to understand whether I can isolate it to a specific rule or rules.
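
If it helps the bisection, one low-effort way to toggle individual upstream rules without editing the upstream file is to drop an override into the /etc/falco/rules.d directory already listed in rules_file above. A minimal sketch; the file name is hypothetical, the rule name is just an example from the upstream set, and newer releases may prefer the explicit override syntax:

# /etc/falco/rules.d/bisect.yaml (hypothetical file name)
# Appending a rule with the same name and enabled: false disables it.
- rule: Read sensitive file untrusted   # example upstream rule name
  enabled: false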