vector: kubernetes_logs source possibly missing logs in AKS/containerd cluster
Vector Version
0.14.0
Vector Configuration File
```yaml
sources:
  k8s:
    type: kubernetes_logs
    rawConfig: |
      glob_minimum_cooldown_ms = 500
```
Expected Behavior
Vector should see and ship all data that the pods are logging to stdout.
Actual Behavior
Vector misses some logs that are created by the pods.
Additional Context
We’re currently doing performance testing to see if Vector would be a good fit for our logging system.
In the course of this test we noticed that some logs from our logging stress-test pods were missing.
After investigating, we found that changing `glob_minimum_cooldown_ms` from the default 4000 ms to 500 ms alleviates the issue.
We’ve also found that our AKS containerd cluster structures the logs as follows:
```
/var/log/pods/wld2_xload-zb8gs_d9fbd436-8dda-468f-83e3-c8c8ef42b4dd/xload# ls
0.log
0.log.20210618-133010.gz
0.log.20210618-133020.gz
0.log.20210618-133031.gz
0.log.20210618-133041
```
What we think is happening is the following:
1. Vector starts reading 0.log.
2. The file gets “full”.
3. The file gets rotated to 0.log.DATE.
4. The file 0.log gets cleared (recreated empty).
We think there might be two problems:
- The completely empty file in step 4 might confuse Vector’s checkpointing system.
- While Vector is working between steps 1 and 4, it loses any data it has not yet read from 0.log, since that data is gone once step 4 executes.
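To make this hypothesis easier to poke at outside Kubernetes, a simulation along the following lines could be used. This is a hypothetical sketch, not Vector’s code: the file name, rotation size, total line count, and the naive offset-remembering tailer are all assumptions, with the tailer’s slow poll loosely standing in for a long `glob_minimum_cooldown_ms`.

```python
import os
import threading
import time

# Hypothetical simulation of the rotation pattern described above: a workload
# fills 0.log, the runtime renames it to 0.log.<n> and recreates an empty 0.log,
# while a naive tailer that only watches 0.log polls on a slow interval.
LOG = "0.log"
LINES_PER_ROTATION = 5_000   # assumption: how many lines make a file "full"
TOTAL_LINES = 50_000         # assumption: total lines the fake workload emits


def writer() -> None:
    """Steps 1-4 of the hypothesis: fill 0.log, rotate it, recreate it empty."""
    written = 0
    generation = 0
    while written < TOTAL_LINES:
        with open(LOG, "a") as f:
            for _ in range(LINES_PER_ROTATION):
                f.write(f"line {written}\n")
                written += 1
        os.rename(LOG, f"{LOG}.{generation}")   # step 3: 0.log -> 0.log.<n>
        open(LOG, "w").close()                  # step 4: fresh, empty 0.log
        generation += 1
        time.sleep(0.1)


def naive_tailer(poll_interval: float, stop: threading.Event) -> int:
    """Reads only 0.log, remembering a byte offset between polls; rotated files
    are never opened, which is where the hypothesised data loss comes from."""
    offset = 0
    seen = 0
    while not stop.is_set():
        try:
            if os.path.getsize(LOG) < offset:
                offset = 0                      # the file was recreated/truncated
            with open(LOG, "rb") as f:
                f.seek(offset)
                seen += f.read().count(b"\n")
                offset = f.tell()
        except FileNotFoundError:
            pass                                # between the rename and the recreate
        time.sleep(poll_interval)
    return seen


if __name__ == "__main__":
    stop = threading.Event()
    result = {}
    reader = threading.Thread(
        target=lambda: result.update(seen=naive_tailer(4.0, stop))
    )
    reader.start()
    writer()
    time.sleep(5)        # give the tailer a few last polls
    stop.set()
    reader.join()
    print(f"wrote {TOTAL_LINES} lines, tailer saw {result['seen']}")
```

Because this tailer never opens the rotated `0.log.<n>` files, anything written after its last poll of a given `0.log` generation is lost; the shorter the poll interval relative to the rotation rate, the smaller that window, which would be consistent with lowering `glob_minimum_cooldown_ms` alleviating the problem.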
This is on OpenShift, so CRI-O is the container runtime. The log directory on the node looks like this:
Config looks like this:
Vector version: vector:0.11.1-alpine (I’ll see about bumping that today).
Thanks for looking into this!
That’s great to hear - for what it’s worth, we think the throughput volume was causing the files to be rotated from `0.log` to `0.log.DATE` before the source globbed for more files to watch, and right now the source is explicitly only looking for `*.log` in that directory.
Sorry, this is the public sharing link: https://lucid.app/lucidchart/invitations/accept/inv_2c2874c7-a202-4fa9-872d-e54705de517a
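To make the `*.log` globbing point above concrete, here is a tiny illustration using Python’s `fnmatch` as a stand-in for the matching (Vector’s actual glob implementation may differ), run against the file names from the directory listing in the issue description:

```python
from fnmatch import fnmatch

# File names copied from the directory listing in the issue description.
files = [
    "0.log",
    "0.log.20210618-133010.gz",
    "0.log.20210618-133020.gz",
    "0.log.20210618-133031.gz",
    "0.log.20210618-133041",
]

# Only the live file matches; every rotated file is invisible to a
# watcher that globs for *.log in the pod directory.
print([name for name in files if fnmatch(name, "*.log")])
# ['0.log']
```

Once `0.log` has been renamed, any unread tail of it lives in a file that this pattern never matches.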
@spencergilbert We’ve tested whether this behavior would stall the entire `kubernetes_logs` source as follows:
The outcome was as follows:
This behavior seems to indicate that the source does not get stalled. The spikes on the graph seem to be Vector catching some logs from the pod that logs 200k messages. We tested this with the default `glob_minimum_cooldown_ms`. We also ran the same test with `glob_minimum_cooldown_ms: 500`, which mitigated the problem.
I’m generating events with flog, so I should be able to crank up the throughput to something similar to what you’re seeing.
The 25k-line sample started at 17:13:54Z and ended at 17:19:01Z, so that makes it about 69 messages per second. It is very bursty though, so the peak is likely much higher. There are logs from other components on the node at the same time.
At the time I noticed this, we hadn’t had any shipped logs for the past 12 hours, and deleting the Vector pod so it got re-created made shipping happen again, so I doubt that’s it. I just bumped Vector in https://github.com/openshift/release/pull/19445 and right now there are logs, but the “repro” is “wait a couple of hours”, so that might be expected. I will check again later and update here.
In case it helps, here are the last 25k log lines of the app where I noticed this (probably noteworthy: it didn’t seem to happen for other apps, but I believe this one emits more logs than everything else): https://gist.github.com/alvaroaleman/e8e57e62435fcf62b04ba5498e5e1857