vector: Memory leak while using kubernetes_logs source

Vector Version

0.11.1-alpine

running on Kubernetes

v1.19.3

Vector Configuration File


    [sources.kube_logs_1]
      type = "kubernetes_logs"
      exclude_paths_glob_patterns = ["**/calico-node/**"]
      annotation_fields.container_name = "container_name"
      annotation_fields.container_image = "container_image"
      annotation_fields.pod_ip = "pod_ip"
      annotation_fields.pod_name = "pod_name"
      annotation_fields.pod_namespace = "namespace_name"
      annotation_fields.pod_node_name = "pod_node_name"
      annotation_fields.pod_uid = "pod_uid"

    [transforms.kube_logs_remapped]
      type = "remap"
      inputs = ["kube_logs_1"]
      source = '''
      .full_message = .message
      .message = "pod_stderr_stdout"
      .source = .pod_node_name
      del(.file)
      del(.pod_node_name)
      del(.kubernetes)
      '''
    [transforms.add_region_fields_to_logs]
      type = "add_fields"
      inputs = ["vector_1", "kube_logs_remapped"]
      fields.env = "<%= @environment %>"
      fields.region = "<%= @region %>"
      fields.region_domain = "<%= @region_domain %>"

    [sinks.graylog_gelf]
      type = "http"
      inputs = ["add_region_fields_to_logs"]
      uri = "xxx"
      encoding.codec = "ndjson"
      compression = "none"
      batch.max_bytes = 4096
      #batch.timeout_secs = 1
      #buffer.max_events = 1
      #buffer.type = "memory"
      tls.verify_hostname = false
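
For reference, here is a minimal sketch of what the kube_logs_remapped transform above does to a single event, using serde_json as a stand-in for Vector's internal event type. The field values are invented for illustration and only the fields touched by the VRL program are shown.

    // Illustrative only: mirrors the VRL program in kube_logs_remapped,
    // with serde_json standing in for Vector's event type.
    use serde_json::json;

    fn main() {
        // Roughly the shape of an event from the kubernetes_logs source
        // (sample values are made up).
        let mut event = json!({
            "message": "some container log line",
            "file": "/var/log/pods/example/0.log",
            "pod_node_name": "node-1",
            "kubernetes": { "pod_name": "example-pod", "namespace_name": "default" }
        });
        let obj = event.as_object_mut().unwrap();

        // .full_message = .message
        let original = obj["message"].clone();
        obj.insert("full_message".to_string(), original);

        // .message = "pod_stderr_stdout"
        obj.insert("message".to_string(), json!("pod_stderr_stdout"));

        // .source = .pod_node_name
        let node = obj["pod_node_name"].clone();
        obj.insert("source".to_string(), node);

        // del(.file); del(.pod_node_name); del(.kubernetes)
        obj.remove("file");
        obj.remove("pod_node_name");
        obj.remove("kubernetes");

        println!("{}", serde_json::to_string_pretty(&event).unwrap());
    }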

Debug Output

https://gist.github.com/karlmartink/095979bec3dea7d91430c91a842d3927

Expected Behavior

Stable memory usage, or a slight increase until some maximum level is reached.

Actual Behavior

Container memory and CPU usage grows slowly over time until the maximum limits are reached and the service is killed, while the number of processed events remains the same.

Example Data

Pod CPU and memory usage (screenshot: 2021-03-09 at 15:14:19)

Processed events (screenshot: 2021-03-09 at 15:14:28)

Additional Context

I am using Vector mostly to collect Kubernetes logs and manipulate them with some transforms. After that, they are forwarded to Graylog over HTTP to a GELF input.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 26 (13 by maintainers)

Most upvoted comments

This can be closed from my side. If there is anything else I can help test in the future, I am happy to do so.

@karlmartink excellent news! If you’re satisfied I’ll close this ticket out then.

@jszwedko yes, agreed. I think some of that is explicable by allocator fragmentation but I’ll do a long massif run to get more concrete details, at least for the minimal configs we used to diagnose the more serious leak here.

@blt that’s awesome! I deployed the new nightly version and will monitor the resource usage over the weekend. Will report back soon.

@karlmartink well, good news. I am reasonably certain that #7014 will address a good portion, if not the whole, of your problem. Our nightly build job starts at 4 AM UTC, so by 6 AM UTC there ought to be a build available with #7014 included. When you get a chance, can you try out the nightly and let us know how it goes?

Ah, yep, I believe I see the problem. We bridge our internal and metrics-rs metrics here:

https://github.com/timberio/vector/blob/267265a739bfd33f8b45d2fbcbf8155571bd7524/src/metrics/mod.rs#L118-L126

The important bit is in from_metric_kv:

https://github.com/timberio/vector/blob/267265a739bfd33f8b45d2fbcbf8155571bd7524/src/event/metric.rs#L292-L316

We’re calling read_histogram here, which leaves the underlying samples in place (the metrics-rs histogram is a linked list of fixed-size blocks of samples that grows as more samples are added), whereas metrics-rs’s own exporters call read_histogram_with_clear, a function that frees up the histogram’s internal storage. Experimentation shows this doesn’t quite do the trick on its own, but it does help some. We’re also behind the current metrics-rs release and may be hitting a bug that upstream has fixed.
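
To illustrate the difference in a self-contained way (a sketch only, which assumes nothing about the real metrics-rs API beyond the behavior described above): a histogram whose samples are only ever copied out keeps growing, while a draining read releases the accumulated storage.

    // Sketch of the growth pattern described above; not metrics-rs code.
    struct Histogram {
        samples: Vec<u64>,
    }

    impl Histogram {
        fn record(&mut self, value: u64) {
            self.samples.push(value);
        }

        // Like read_histogram: copies the samples but leaves them in place,
        // so the backing storage grows with every export.
        fn read(&self) -> Vec<u64> {
            self.samples.clone()
        }

        // Like read_histogram_with_clear: hands back the samples and releases
        // the backing storage, keeping memory bounded between exports.
        fn read_with_clear(&mut self) -> Vec<u64> {
            std::mem::take(&mut self.samples)
        }
    }

    fn main() {
        let mut h = Histogram { samples: Vec::new() };
        for i in 0..1_000 {
            h.record(i);
        }

        // Non-clearing read: all 1,000 samples stay resident after the export.
        assert_eq!(h.read().len(), 1_000);
        assert_eq!(h.samples.len(), 1_000);

        // Clearing read: the accumulated samples are freed.
        assert_eq!(h.read_with_clear().len(), 1_000);
        assert!(h.samples.is_empty());
    }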

@KHiis we are actively investigating and will report back as we learn more. We’re hoping to get to the root cause next week.