vector: Memory leak while using kubernetes_logs source

Vector Version

0.11.1-alpine

running on Kubernetes

v1.19.3

Vector Configuration File


    [sources.kube_logs_1]
      type = "kubernetes_logs"
      exclude_paths_glob_patterns = ["**/calico-node/**"]
      annotation_fields.container_name = "container_name"
      annotation_fields.container_image = "container_image"
      annotation_fields.pod_ip = "pod_ip"
      annotation_fields.pod_name = "pod_name"
      annotation_fields.pod_namespace = "namespace_name"
      annotation_fields.pod_node_name = "pod_node_name"
      annotation_fields.pod_uid = "pod_uid"

    [transforms.kube_logs_remapped]
      type = "remap"
      inputs = ["kube_logs_1"]
      source = '''
      .full_message = .message
      .message = "pod_stderr_stdout"
      .source = .pod_node_name
      del(.file)
      del(.pod_node_name)
      del(.kubernetes)
      '''
    [transforms.add_region_fields_to_logs]
      type = "add_fields"
      inputs = ["vector_1", "kube_logs_remapped"]
      fields.env = "<%= @environment %>"
      fields.region = "<%= @region %>"
      fields.region_domain = "<%= @region_domain %>"

    [sinks.graylog_gelf]
      type = "http"
      inputs = ["add_region_fields_to_logs"]
      uri = "xxx"
      encoding.codec = "ndjson"
      compression = "none"
      batch.max_bytes = 4096
      #batch.timeout_secs = 1
      #buffer.max_events = 1
      #buffer.type = "memory"
      tls.verify_hostname = false
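
For reference, here is a minimal sketch of what the kube_logs_remapped transform above does to a single event, using serde_json as a stand-in for Vector's internal event type. The field values are invented for illustration and only the fields touched by the VRL program are shown.

    // Illustrative only: mirrors the VRL program in kube_logs_remapped,
    // with serde_json standing in for Vector's event type.
    use serde_json::json;

    fn main() {
        // Roughly the shape of an event from the kubernetes_logs source
        // (sample values are made up).
        let mut event = json!({
            "message": "some container log line",
            "file": "/var/log/pods/example/0.log",
            "pod_node_name": "node-1",
            "kubernetes": { "pod_name": "example-pod", "namespace_name": "default" }
        });
        let obj = event.as_object_mut().unwrap();

        // .full_message = .message
        let original = obj["message"].clone();
        obj.insert("full_message".to_string(), original);

        // .message = "pod_stderr_stdout"
        obj.insert("message".to_string(), json!("pod_stderr_stdout"));

        // .source = .pod_node_name
        let node = obj["pod_node_name"].clone();
        obj.insert("source".to_string(), node);

        // del(.file); del(.pod_node_name); del(.kubernetes)
        obj.remove("file");
        obj.remove("pod_node_name");
        obj.remove("kubernetes");

        println!("{}", serde_json::to_string_pretty(&event).unwrap());
    }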

Debug Output

https://gist.github.com/karlmartink/095979bec3dea7d91430c91a842d3927

Expected Behavior

Stable memory usage, or a slight increase until some maximum level is reached.

Actual Behavior

Container memory and CPU usage grows slowly over time until the maximum limits are reached and the service is killed, while the number of processed events remains the same.

Example Data

Pod CPU and memory usage (screenshot: 2021-03-09 at 15:14:19)

Processed events (screenshot: 2021-03-09 at 15:14:28)

Additional Context

I am using Vector mostly to collect Kubernetes logs and manipulate them with some transforms. After that, they are forwarded to Graylog over HTTP to a GELF input.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 26 (13 by maintainers)

Most upvoted comments

This can be closed from my side. If there is anything else I can help test in the future, I am happy to do so.

@karlmartink excellent news! If you’re satisfied I’ll close this ticket out then.

@jszwedko yes, agreed. I think some of that is explicable by allocator fragmentation but I’ll do a long massif run to get more concrete details, at least for the minimal configs we used to diagnose the more serious leak here.

@blt that’s awesome! I deployed the new nightly version and will monitor the resource usage over the weekend. Will report back soon.

@karlmartink well, good news. I am reasonably certain that #7014 will address a good portion, if not the whole, of your problem. Our nightly build job starts at 4 AM UTC, so by 6 AM UTC there ought to be a build available with #7014 included. When you get a chance, can you try out the nightly and let us know how it goes?

Ah, yep, I believe I see the problem. We bridge our internal and metrics-rs metrics here:

https://github.com/timberio/vector/blob/267265a739bfd33f8b45d2fbcbf8155571bd7524/src/metrics/mod.rs#L118-L126

The important bit is in from_metric_kv:

https://github.com/timberio/vector/blob/267265a739bfd33f8b45d2fbcbf8155571bd7524/src/event/metric.rs#L292-L316

We’re calling read_histogram here, which leaves the underlying samples in place (the metrics-rs histogram is a linked list of fixed-size blocks of samples that grows as more samples are added), whereas metrics-rs’s own exporters call read_histogram_with_clear, a function that frees up the histogram’s internal storage. Experimentation shows this doesn’t quite do the trick on its own, but it does help some. We’re also behind the current metrics-rs release and may be hitting a bug that upstream has fixed.
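
To illustrate the difference in a self-contained way (a sketch only, which assumes nothing about the real metrics-rs API beyond the behavior described above): a histogram whose samples are only ever copied out keeps growing, while a draining read releases the accumulated storage.

    // Sketch of the growth pattern described above; not metrics-rs code.
    struct Histogram {
        samples: Vec<u64>,
    }

    impl Histogram {
        fn record(&mut self, value: u64) {
            self.samples.push(value);
        }

        // Like read_histogram: copies the samples but leaves them in place,
        // so the backing storage grows with every export.
        fn read(&self) -> Vec<u64> {
            self.samples.clone()
        }

        // Like read_histogram_with_clear: hands back the samples and releases
        // the backing storage, keeping memory bounded between exports.
        fn read_with_clear(&mut self) -> Vec<u64> {
            std::mem::take(&mut self.samples)
        }
    }

    fn main() {
        let mut h = Histogram { samples: Vec::new() };
        for i in 0..1_000 {
            h.record(i);
        }

        // Non-clearing read: all 1,000 samples stay resident after the export.
        assert_eq!(h.read().len(), 1_000);
        assert_eq!(h.samples.len(), 1_000);

        // Clearing read: the accumulated samples are freed.
        assert_eq!(h.read_with_clear().len(), 1_000);
        assert!(h.samples.is_empty());
    }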

@KHiis we are actively investigating and will report back as we learn more. We’re hoping to get to the root cause next week.