vector: kubernetes_logs source possibly missing logs in AKS/containerd cluster

Vector Version

0.14.0

Vector Configuration File

    sources:
      k8s:
        type: kubernetes_logs
        rawConfig: |
          glob_minimum_cooldown_ms = 500
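
For reference, the equivalent standalone vector.toml would look roughly like this (a minimal sketch, not the exact file the Helm chart renders; the data_dir path is only an assumption for the example):

    # Minimal sketch of a standalone configuration with the same source settings
    # (assumed layout, not the exact Helm-rendered file).
    data_dir = "/var/lib/vector"   # assumed checkpoint directory

    [sources.k8s]
      type = "kubernetes_logs"
      # Default is 4000 ms; lowering it makes the source re-glob the pod log
      # directories more often, so a file rotated away from 0.log is less
      # likely to be missed before it has been read.
      glob_minimum_cooldown_ms = 500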

Expected Behavior

Vector should pick up all of the data that the pods are logging to stdout.

Actual Behavior

Vector misses some logs that are created by the pods.

Additional Context

We’re currently doing performance testing to evaluate whether Vector would be a good fit for our logging system. During this test we noticed that some logs from our logging stress-test pods were missing. After investigating, we found that lowering glob_minimum_cooldown_ms from the default 4000 ms to 500 ms alleviates the issue. We also found that our AKS containerd cluster structures the log files as follows:

/var/log/pods/wld2_xload-zb8gs_d9fbd436-8dda-468f-83e3-c8c8ef42b4dd/xload# ls
0.log  0.log.20210618-133010.gz  0.log.20210618-133020.gz  0.log.20210618-133031.gz  0.log.20210618-133041

What we think is happening is the following:

  1. Vector starts reading 0.log
  2. The file gets “full”
  3. The file gets rotated to 0.log.DATE
  4. The file 0.log gets cleared (empty)

We think there might be two problems:

  • The completely empty file in step 4 might confuse Vector’s checkpointing system.
  • Between steps 1 and 4, Vector loses any data it has not yet read by the time step 4 clears the file.

Most upvoted comments

This is on OpenShift, so CRI-O is the container runtime. The log directory on the node looks like this:

sh-4.4# ls /var/log/pods/ci_tide-757f9cc7d6-7457t_aedf9209-8e5f-4b1f-8015-63408a6d4622/tide
0.log  0.log.20210618-141411.gz  0.log.20210618-142945.gz  0.log.20210618-144638.gz  0.log.20210618-150251

Config looks like this:

    # file: vector.toml
    # Configuration for vector-agent
    # Docs: https://vector.dev/docs/
    # Set global options
    data_dir = "/var/lib/vector"

    # Ingest logs from Kubernetes
    [sources.kubernetes]
      type = "kubernetes_logs"

    # captures services in the ci namespace
    [transforms.ci-services]
      type = "filter"
      inputs = ["kubernetes"]
      condition.type = "check_fields"
      condition."kubernetes.pod_namespace.eq" = "ci"
      condition."kubernetes.container_name.neq" = "test"
      condition."kubernetes.container_name.neq" = "sidecar"
      condition."kubernetes.container_name.neq" = "namespace-ttl-controller" # generates enormous amounts of logs

    # parse the json so we can filter
    [transforms.json_parsed]
      type = "json_parser"
      inputs = ["ci-services"]
      drop_invalid = true

    [sinks.aws_cloudwatch_logs]
      type = "aws_cloudwatch_logs"
      inputs = ["json_parsed"]
      group_name = "app-ci-pod-logs"
      region = "us-east-1"
      stream_name = "{{ kubernetes.pod_node_name }}"
      encoding.codec = "json"

    # Release-controller logs
    [transforms.release_controllers]
      type = "filter"
      inputs = ["kubernetes"]
      condition.type = "check_fields"
      condition."kubernetes.pod_namespace.eq" = "ci"
      condition."kubernetes.pod_name.starts_with" = "release-controller-"
      condition."kubernetes.container_name.eq" = ["controller"]

    [sinks.rc_aws_cloudwatch_logs]
      type = "aws_cloudwatch_logs"
      inputs = ["release_controllers"]
      group_name = "app-ci-logs/{{ kubernetes.pod_labels.app }}"
      region = "us-east-1"
      stream_name = "{{ kubernetes.pod_name }}"
      encoding.codec = "text"

Vector version: vector:0.11.1-alpine (I’ll see about bumping that today).

Thanks for looking into this!

@alvaroaleman have you tried adjusting the glob_minimum_cooldown_ms option to see if it helps? It’s undocumented, but it can be applied to the kubernetes_logs source.
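
Applied to the vector.toml above, that would look roughly like this (a sketch only; 500 ms is the value from the original report, not a tuned recommendation):

    [sources.kubernetes]
      type = "kubernetes_logs"
      # Undocumented option; the default is 4000 ms. 500 ms is the value from
      # the original report, not a recommendation.
      glob_minimum_cooldown_ms = 500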

Since I upgraded Vector to 0.14.X last Friday, this hasn’t happened anymore. Because there is no clear repro, it is hard to say whether whatever caused this was fixed or whether it is just coincidence. If it happens again, I’ll try glob_minimum_cooldown_ms (and keep watching this issue).

Thanks for all your help!

That’s great to hear. For what it’s worth, we think the throughput volume was causing the files to be rotated from 0.log to 0.log.DATE before the source globbed for more files to watch, and right now the source explicitly looks only for *.log in that directory.

@alvaroaleman thanks for the update! Do you have an estimate of the lines/s that controller is generating? When the pipeline appeared to have stalled, did it stall for the entire kubernetes_logs source, not just for that particularly chatty pod?

I’m wondering if that could be related to backpressure from the sink. As a user I ran into something similar where the source seemed to just silently stop running, and it turned out to be due to issues at the sink causing Vector to slow down event collection.

@spencergilbert We’ve tested if this behavior would stall the entire kubernetes_logs source as follows:

  • 5 pods logging a constant number of messages | 20k messages per 5 minutes | expected to succeed
  • 1 pod logging a large constant number of messages | 200k messages per 5 minutes | expected to fail

The outcome was as follows (see attached graph “Elastic Graph 01”).

This behavior seems to indicate that the source does not stall. The spikes on the graph appear to be Vector catching some of the logs from the pod that logs 200k messages. We tested this with the default glob_minimum_cooldown_ms.

We also ran the same test with glob_minimum_cooldown_ms: 500, which mitigated the problem (see attached graph “Elastic Graph 02”).

I’m generating events with flog, so I should be able to crank up the throughput to something similar to what you’re seeing.

Do you have an estimated line/s that controller is generating?

The 25k-line sample started at 17:13:54Z and ended at 17:19:01Z, so roughly 25,000 lines over about 307 seconds, i.e. around 81 messages/second on average. It is very bursty though, so the peak is likely much higher. There are logs from other components on the node at the same time.

Are the logs here unique, or could we be seeing the same log line emitted more than once? @alvaroaleman, @DC-NunoAl

At the time I noticed this, we hadn’t shipped any logs for the past 12 hours, and deleting the Vector pod to get it re-created made the shipping happen, so I doubt that’s it. I just bumped Vector in https://github.com/openshift/release/pull/19445 and right now there are logs, but the “repro” is “wait a couple of hours”, so that might be expected. I will check again later and update here.

In case it helps, here are the last 25k log lines of the app where I noticed this (probably noteworthy: it didn’t seem to happen for other apps, but I believe this one emits more logs than everything else): https://gist.github.com/alvaroaleman/e8e57e62435fcf62b04ba5498e5e1857