vector: Vector randomly stops shipping certain k8s logs

Problem

After a while, our Vector DaemonSet will randomly stop shipping logs for a select service, while some other pods keep shipping logs.

Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: vector
  namespace: observability
  labels:
    app.kubernetes.io/name: vector
    app.kubernetes.io/instance: vector
    app.kubernetes.io/component: Agent
    app.kubernetes.io/version: "0.20.0-distroless-libc"
data:
  agent.yaml: |
    data_dir: /vector-data-dir
    api:
      enabled: true
      address: 127.0.0.1:8686
      playground: false
    sources:
      kubernetes_logs:
        type: kubernetes_logs
      host_metrics:
        filesystem:
          devices:
            excludes: [binfmt_misc]
          filesystems:
            excludes: [binfmt_misc]
          mountPoints:
            excludes: ["*/proc/sys/fs/binfmt_misc"]
        type: host_metrics
      internal_metrics:
        type: internal_metrics
    transforms:
      setlevel:
        type: remap
        inputs: [kubernetes_logs]
        source: |-
          .temp = parse_json!(.message)
          if !exists(parse_json!(.message).level) {
            .level = "other"
          } else {
            .level = .temp.level
          }
          if exists(.temp.uri) {
            .uri = .temp.uri
          }
          if exists(.temp.msg) {
            .msg = .temp.msg
          }
          if exists(.temp.lat) {
            .lat = .temp.lat
            .lon = .temp.lon
          }
          del(.temp)
    sinks:
      prom_exporter:
        type: prometheus_exporter
        inputs: [host_metrics, internal_metrics]
        address: 0.0.0.0:9090
      # stdout:
      #   type: console
      #   inputs: [setlevel]
      #   encoding:
      #     codec: json
      loki:
        type: loki
        inputs:
          - "setlevel"
        endpoint: https://logs-prod-us-central1.grafana.net
        compression: gzip
        # remove_label_fields: true
        encoding:
          codec: json
        auth:
          password: ${LOKI_PASSWORD}
          user: "${LOKI_USERNAME}"
          strategy: basic
        labels:
          namespace: "{{ kubernetes.pod_namespace }}"
          pod: "{{ kubernetes.pod_name }}"
          level: "{{ level }}"
          app_label: "{{ kubernetes.pod_label.app }}"
          node: "{{ kubernetes.pod_node_name }}"
          pod_owner: "{{ kubernetes.pod_owner }}"
          cluster: ${CLUSTER_NAME}


Version

timberio/vector:0.20.0-distroless-libc

Debug Output

Working on trying to get relevant debugging output; it currently emits a lot of TRACE logs :P

Example Data

{"func":"github.com/xxx/xxx/crdb.ConnectToDB","level":"debug","msg":"Connected to CRDB","time":"2022-03-29T17:43:21.015Z"}

Additional Context

The only logs from Vector itself:

2022-03-29T17:35:57.183220Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] has been rate limited 10 times.
2022-03-29T17:35:57.183241Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Mapping failed with event. error="function call error for \"parse_json\" at (8:29): unable to parse json: trailing characters at line 1 column 5" internal_log_rate_secs=30
2022-03-29T17:35:57.200043Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] is being rate limited.
2022-03-29T17:36:32.201827Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] has been rate limited 8 times.
2022-03-29T17:36:32.201877Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Mapping failed with event. error="function call error for \"parse_json\" at (8:29): unable to parse json: trailing characters at line 1 column 5" internal_log_rate_secs=30
2022-03-29T17:36:32.437566Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] is being rate limited.
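
The warnings above mean parse_json! is hitting log lines that are not valid JSON; the position (8:29) corresponds to the first parse_json!(.message) call in the setlevel program. This doesn't explain the source-level "stops shipping" behaviour, but a more defensive variant of the transform would use VRL's fallible assignment form so that non-JSON lines are tagged rather than causing the program to error (what happens to errored events depends on the remap drop_on_error setting). A sketch only, not the reporter's config:

```yaml
transforms:
  setlevel:
    type: remap
    inputs: [kubernetes_logs]
    source: |-
      # Fallible assignment: on a non-JSON line, `err` is set instead of the
      # whole program failing for that event.
      .temp, err = parse_json(.message)
      if err != null {
        # Plain-text or multi-line output: tag it and move on.
        .level = "other"
      } else {
        if exists(.temp.level) { .level = .temp.level } else { .level = "other" }
        if exists(.temp.uri) { .uri = .temp.uri }
        if exists(.temp.msg) { .msg = .temp.msg }
        if exists(.temp.lat) {
          .lat = .temp.lat
          .lon = .temp.lon
        }
      }
      del(.temp)
```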

References

This is similar to an issue I've had in the past: https://github.com/vectordotdev/vector/discussions/8634

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Reactions: 36
  • Comments: 66 (19 by maintainers)

Most upvoted comments

@jszwedko we have our glob cooldown set to 2 seconds and have still observed it. Ultimately we will have to move to something that doesn't drop logs, because we depend on logs to know when errors occur.

I can't imagine that Kubernetes is not a massive part of Vector's user base. We aren't logging very quickly either, maybe 10-20 lines/s.

What would we need to do to get more urgency behind improving the Kubernetes experience? I truly want to use Vector but can't.

@jszwedko We also ran into this issue after we started rolling Vector out to production. We found that the safest version to use with the kubernetes_logs source is currently 0.19.3. If we upgrade to 0.2x.x we immediately start hitting two issues:

  • Vector simply stops watching new pod logs after a short period of time. It appears to be random and does not affect all pods, but it usually takes about 2-10 minutes to start happening.
  • https://github.com/vectordotdev/vector/issues/13467 - which affects StatefulSet pods and is easily reproducible when you have 2-3 nodes with the same number of StatefulSet pods.

With the newer versions (0.20.x) the only working solution right now is to restart the Vector pods once either of these issues starts happening - this helps in both cases. Both issues make Vector effectively unusable in production environments with kubernetes_logs, since there is a big risk of losing logs. I really hope the root cause of these issues will be identified and fixed in upcoming releases.

I was able to find some settings that are now giving me a much more reliable deployment on Kubernetes after doing the following:

  1. Update Vector to version 0.25.X. It doesn’t seem to be mentioned in the release notes, but release 0.25.0 contains a very impactful fix for the kubernetes_logs source to help with annotations (relevant issue: https://github.com/vectordotdev/vector/issues/13467)
  2. Add these settings to your kubernetes_logs source config:
    max_line_bytes: 16777216
    max_read_bytes: 16777216
    glob_minimum_cooldown_ms: 1000
    

The combination of these two changes has led to zero dropped/missing logs over the past couple of weeks. Previously I was using lower max_line_bytes / max_read_bytes values, as I've seen recommended above and elsewhere (though still much higher than the defaults), and after many attempts with little to no change I stopped increasing them. After revisiting it and trying something wildly large, to my surprise it worked!

I was still receiving annotation failures, though. After deep-diving into the commit history I saw that a seemingly unannounced workaround for https://github.com/vectordotdev/vector/issues/13467 was in place in the latest version, and since upgrading I haven't seen any annotation issues.
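
For anyone wanting to try the same thing, these options go directly on the kubernetes_logs source. A sketch using the values quoted in the comment above (the 16 MiB figures are that commenter's numbers, not documented defaults):

```yaml
sources:
  kubernetes_logs:
    type: kubernetes_logs
    max_line_bytes: 16777216          # 16 MiB, value from the comment above
    max_read_bytes: 16777216          # 16 MiB, value from the comment above
    glob_minimum_cooldown_ms: 1000    # rescan for new/rotated files every second
```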

@zamazan4ik fluentbit is a good alternative

Thanks for all of the thoughts everyone. I can appreciate the severity of these issues and believe this area (the kubernetes_logs source) is something we’ll be digging into again soon.

There seems to be a mix of issues reported in the comments above. I think they can be categorized into:

  • https://github.com/vectordotdev/vector/issues/13467
    • this issue is fairly well understood at this point, but not easily remedied
  • Vector simply stops watching new pod logs after a short period of time. It appears to be random and does not affect all pods, but it usually takes about 2-10 minutes to start happening.

    • From: https://github.com/vectordotdev/vector/issues/12014#issuecomment-1216723559
    • I don’t see anything in particular that jumps out from 0.19.3 -> 0.20.X that seems related to the kubernetes_logs source. One thing you could try to do there, @igor-nikiforov, is bisect down the nightly versions of Vector to identify the first one that manifests the issue. For v0.19.3 to v0.20.0, this would be from 2021-12-28 to 2022-02-10. If we had a narrower set of commits to look at, something might jump out.
  • Issues with Vector not picking up files because they are rotated too quickly. For this I'd recommend trying to tweak glob_minimum_cooldown_ms by dropping it down, so that Vector scans for new files more frequently, and max_read_bytes by increasing it, so that Vector reads more from each file before continuing on to the next file (see the config sketch after this comment).
  • Logs stop shipping and many annotation errors are seen. This could be due to https://github.com/vectordotdev/vector/issues/13467 or to the cached pod metadata being deleted before Vector finishes reading the associated files. For the latter you can try increasing delay_deletion_ms to a value higher than 60000 (60s) so that Vector holds on to the metadata longer. This can exacerbate https://github.com/vectordotdev/vector/issues/13467, though. Notably, the annotation issues shouldn't stop Vector from reading the logs; they will just be lacking the Kubernetes metadata (like pod namespace, etc.).

If you don’t think your issue is represented in the above set, please leave an additional comment!
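
Putting the knobs mentioned in this comment together, a hedged kubernetes_logs sketch (the concrete values are illustrative starting points, not maintainer recommendations):

```yaml
sources:
  kubernetes_logs:
    type: kubernetes_logs
    # Scan for new/rotated files more often so quickly-rotated logs are not missed.
    glob_minimum_cooldown_ms: 1000
    # Read more from each file before moving on to the next one (illustrative value).
    max_read_bytes: 1048576
    # Hold on to cached pod metadata longer than the 60000 ms default so files
    # still being read after pod deletion keep their annotations. Note that
    # raising this can exacerbate https://github.com/vectordotdev/vector/issues/13467.
    delay_deletion_ms: 300000
```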

This has just started happening again, with no error or warning logs. Anything else I should be looking for or trying? We can't be dropping our production logs.

I will add that restarting Vector seems to resolve the issue, but that's not a sustainable solution.

I can appreciate the severity of these issues and believe this area (the kubernetes_logs source) is something we’ll be digging into again soon.

2 years ago… It doesn't feel remotely appreciated. I'm wondering if the Datadog acquisition is to blame.

Great! For us, dropping the glob cooldown is what really did it.

Wow, that's good to know. I hope to see comments from the Vector dev team about possible fixes for these issues.

Thanks for the additional context @kitos9112, @Genry1, and @lpugoy! I split this off into a separate issue since it seems to be different from the original report here. See: https://github.com/vectordotdev/vector/issues/13467

Still getting the errors; the pods are not being removed, and logs are coming in much slower now as well.