vector: Memory leak while using kubernetes_logs source
Vector Version
0.11.1-alpine
running on Kubernetes
v1.19.3
Vector Configuration File
[sources.kube_logs_1]
type = "kubernetes_logs"
exclude_paths_glob_patterns = ["**/calico-node/**"]
annotation_fields.container_name = "container_name"
annotation_fields.container_image = "container_image"
annotation_fields.pod_ip = "pod_ip"
annotation_fields.pod_name = "pod_name"
annotation_fields.pod_namespace = "namespace_name"
annotation_fields.pod_node_name = "pod_node_name"
annotation_fields.pod_uid = "pod_uid"
[transforms.kube_logs_remapped]
type = "remap"
inputs = ["kube_logs_1"]
source = '''
.full_message = .message
.message = "pod_stderr_stdout"
.source = .pod_node_name
del(.file)
del(.pod_node_name)
del(.kubernetes)
'''
[transforms.add_region_fields_to_logs]
type = "add_fields"
inputs = ["vector_1", "kube_logs_remapped"]
fields.env = "<%= @environment %>"
fields.region = "<%= @region %>"
fields.region_domain = "<%= @region_domain %>"
[sinks.graylog_gelf]
type = "http"
inputs = ["add_region_fields_to_logs"]
uri = xxx
encoding.codec = "ndjson"
compression = "none"
batch.max_bytes = 4096
#batch.timeout_secs = 1
#buffer.max_events = 1
#buffer.type = "memory"
tls.verify_hostname = false
Debug Output
https://gist.github.com/karlmartink/095979bec3dea7d91430c91a842d3927
Expected Behavior
Stable memory usage, or a slight increase until some maximum level is reached.
Actual Behavior
Container memory and CPU usage grows slowly over time until the maximum limits are reached and the service is killed, while the number of processed events stays the same.
Example Data
Pod CPU and memory usage (screenshot in original issue)
Processed events (screenshot in original issue)
Additional Context
I am using Vector mostly to collect Kubernetes logs and manipulate them with some transforms. After that they are forwarded to Graylog via HTTP to a GELF input.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 2
- Comments: 26 (13 by maintainers)
This can be closed from my side. If there is anything else I can help with testing in the future I am happy to do so.
@karlmartink excellent news! If you’re satisfied I’ll close this ticket out then.
@jszwedko yes, agreed. I think some of that is explicable by allocator fragmentation, but I'll do a long massif run to get more concrete details, at least for the minimal configs we used to diagnose the more serious leak here.
@blt that’s awesome! I deployed the new nightly version and will monitor the resource usage over the weekend. Will report back soon.
@karlmartink well, good news. I am reasonably certain that #7014 will address a good portion if not the whole of your problem. Our nightly build job starts at 4 AM UTC so by 6AM UTC there ought to be a build available with #7014 included. When you get a chance can you try out the nightly and let us know how it goes?
Ah, yep, I believe I see the problem. We bridge our internal and metrics-rs metrics here:
https://github.com/timberio/vector/blob/267265a739bfd33f8b45d2fbcbf8155571bd7524/src/metrics/mod.rs#L118-L126
The important bit is from_metric_kv:
https://github.com/timberio/vector/blob/267265a739bfd33f8b45d2fbcbf8155571bd7524/src/event/metric.rs#L292-L316
We're calling read_histogram here, which leaves the underlying samples in place – the metrics-rs histogram is a linked list of fixed-size blocks of samples that grows as you add more samples – whereas metrics-rs's own exporters call read_histogram_with_clear, a function that clears out the histogram's internal storage. Experimentation shows this doesn't quite do the trick, but it does help some. We're behind current metrics-rs and may be hitting a bug that upstream has fixed.
@KHiis we are actively investigating and will report back as we learn. We're hoping to get to the root cause next week.
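To make the failure mode described above concrete, here is a small, self-contained Rust sketch. The ToyHistogram type is illustrative only: it borrows the read_histogram / read_histogram_with_clear names from the comment above but is not the real metrics-rs implementation. Reading without clearing keeps every recorded sample resident, so the backing storage grows with each export cycle, while the clearing variant releases it.

// Toy model of the behaviour described above. Samples accumulate in
// fixed-size blocks (a stand-in for metrics-rs's linked list of sample
// blocks). `read_histogram` copies the samples out but keeps them stored,
// so repeated exports retain every sample ever recorded;
// `read_histogram_with_clear` drains the storage on each read.
struct ToyHistogram {
    blocks: Vec<Vec<u64>>,
}

impl ToyHistogram {
    fn new() -> Self {
        ToyHistogram { blocks: Vec::new() }
    }

    fn record(&mut self, sample: u64) {
        const BLOCK_SIZE: usize = 4;
        let needs_new_block = match self.blocks.last() {
            Some(block) => block.len() >= BLOCK_SIZE,
            None => true,
        };
        if needs_new_block {
            self.blocks.push(Vec::with_capacity(BLOCK_SIZE));
        }
        self.blocks.last_mut().unwrap().push(sample);
    }

    // Mirrors the leaky path: samples stay behind after the read.
    fn read_histogram(&self) -> Vec<u64> {
        self.blocks.iter().flatten().copied().collect()
    }

    // Mirrors the exporter path: storage is released after the read.
    fn read_histogram_with_clear(&mut self) -> Vec<u64> {
        let samples = self.read_histogram();
        self.blocks.clear();
        samples
    }
}

fn main() {
    let mut hist = ToyHistogram::new();

    // Simulate two scrape intervals, each recording 1_000 samples.
    for scrape in 0..2 {
        for i in 0..1_000 {
            hist.record(i);
        }
        // Reading without clearing: storage keeps growing across scrapes.
        let snapshot = hist.read_histogram();
        println!(
            "scrape {}: snapshot has {} samples, {} blocks retained",
            scrape,
            snapshot.len(),
            hist.blocks.len()
        );
    }

    // Clearing on read releases the retained blocks.
    hist.read_histogram_with_clear();
    assert!(hist.blocks.is_empty());
}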