vector: Vector agent stops watching logs from new pods
Vector Version
version="0.15.0" arch="x86_64" build_id="994d812 2021-07-16"
Vector Configuration File
# Configuration for vector.
# Docs: https://vector.dev/docs/
data_dir = "/vector-data-dir"
[api]
  enabled = false
  address = "0.0.0.0:8686"
  playground = true
[log_schema]
  host_key = "host"
  message_key = "log"
  source_type_key = "source_type"
  timestamp_key = "time"
# Ingest logs from Kubernetes.
[sources.kubernetes_logs]
  type = "kubernetes_logs"
  extra_field_selector = "metadata.namespace==default"
  max_line_bytes = 262144
# Emit internal Vector metrics.
[sources.internal_metrics]
  type = "internal_metrics"
# Expose metrics for scraping in the Prometheus format.
[sinks.prometheus_sink]
  address = "0.0.0.0:2020"
  inputs = ["internal_metrics"]
  type = "prometheus"
[transforms.cluster_tagging]
  inputs = ["kubernetes_logs"]
  source = "...parsing json...adding some fields"
  type = "remap"
[sinks.splunk]
  type = "splunk_hec"
  inputs = ["cluster_tagging"]
  # ...
  batch.timeout_secs = 10
  request.concurrency = "adaptive"
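For context, the elided remap source above just parses the JSON log line and adds a couple of cluster fields. A simplified sketch of what that transform looks like (the VRL below and the field values are illustrative, not the actual production program):
[transforms.cluster_tagging]
  inputs = ["kubernetes_logs"]
  type = "remap"
  source = '''
    # Try to parse the container log line as JSON (log_schema.message_key is "log").
    parsed, err = parse_json(string!(.log))
    if err == null {
      .parsed = parsed
    }
    # Attach cluster tags (placeholder values).
    .cluster = "example-eks-cluster"
    .environment = "production"
  '''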
Debug Output
The issue reproduces in production, where Vector cannot be run in debug mode.
Expected Behavior
Vector should not ignore logs from new pods. This is quite disturbing because it is difficult to detect: there are no errors or metrics I can alert on.
Actual Behavior
Vector is deployed in EKS (Kubernetes 1.17+) as an agent (DaemonSet). I do releases on a regular basis, meaning pods get deleted and re-created at least weekly. I noticed that after one such release multiple clusters stopped delivering logs. Although the containers were running and logging (nothing had really changed), Vector was simply ignoring the new pods. I upgraded Vector to 0.15 (from 0.13) because I had seen a few similar issues and some desync errors in the logs. However, it seems to have happened again: a cluster stopped delivering logs, except for a single service that wasn’t released. In the logs I see lots of desync errors, but they appeared days before Vector started ignoring logs.
Jul 28 11:18:52.156 ERROR source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
Jul 28 11:18:52.156  WARN source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync
And 3 days ago Vector simply stopped watching logs from the old pods and never started watching the new ones. There were no other errors prior to this; the lines below are the last logs from Vector. Once I restarted the DaemonSet, it detected the new log files and started consuming them.
Aug 03 10:21:46.149  INFO source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}:file_server: vector::internal_events::file::source: Found new file to watch. path=/var/log/pods/default_foo-856498f4fc-x795d_dc8d9e9e-f0a8-4594-9fc9-e3a83e5cbb77/foo/0.log
Aug 03 10:21:46.149  INFO source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}:file_server: vector::internal_events::file::source: Stopped watching file. path=/var/log/pods/default_foo-856498f4fc-x795d_dc8d9e9e-f0a8-4594-9fc9-e3a83e5cbb77/foo/0.log
Aug 03 10:43:54.928  INFO source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}:file_server: vector::internal_events::file::source: Stopped watching file. path=/var/log/pods/default_foo-856498f4fc-x795d_dc8d9e9e-f0a8-4594-9fc9-e3a83e5cbb77/foo/0.log
Additional Context
References
#7934 seems to show the same symptoms.
About this issue
 - State: closed
 - Created 3 years ago
 - Reactions: 19
 - Comments: 32 (12 by maintainers)
 
We’ve merged in a PR replacing our in-house implementation with kube’s library. The new code will be available in the 0.21 release (or in the nightly releases now). We’re hoping this change solves some of the failure cases and isolates the rest so they’ll be easier to diagnose and resolve. We’d love to get feedback from anyone who upgrades to the new code!
We’re working on cutting that release this week 👍
Hi,
I think there are two separate issues here. There are logs with
vector::kubernetes::reflector: Watcher error. error=BadStatus { status: 401 }
which are caused by the token rotation. And there are logs where Vector just stops watching for new pods, e.g.:
vector::internal_events::kubernetes::reflector: Handling desync. error=Desync
vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
This happens on all clusters for me with the 0.20.0 version: there’s no token rotation, and the “Watch stream failed” message starts appearing regularly right after the Vector pod starts up.
Hi, I have the same on 0.16 and 0.17, but 0.15.2 works flawlessly in my case. It repeats on a few k8s clusters.
@tomer-epstein @BredSt The 401 errors are most likely caused by https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume, which is the default since Kubernetes 1.21.
It seems that Vector doesn’t support token rotation, which is why it stops working once the token expires. You can work around this by either disabling the feature gate (impossible after 1.22) or manually mounting the service account token (a rough sketch follows below).
Vector freezing after a 401 error is still related to this bug, though.
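For anyone who wants to try the manual mount: the idea is to create a legacy (non-expiring) service account token Secret and mount it over the projected token path in the Vector DaemonSet. A rough sketch, assuming the service account is named vector (names here are illustrative):
# Legacy token Secret tied to the "vector" service account (assumed name).
apiVersion: v1
kind: Secret
metadata:
  name: vector-token
  annotations:
    kubernetes.io/service-account.name: vector
type: kubernetes.io/service-account-token
---
# DaemonSet pod spec fragment: replace the projected token with the static one.
spec:
  template:
    spec:
      automountServiceAccountToken: false
      volumes:
        - name: vector-token
          secret:
            secretName: vector-token
      containers:
        - name: vector
          volumeMounts:
            - name: vector-token
              mountPath: /var/run/secrets/kubernetes.io/serviceaccount
              readOnly: true
Mounting the whole Secret provides token, ca.crt, and namespace at the paths the in-cluster client expects.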
The upgrade guide and highlights can be seen here: https://vector.dev/highlights/2022-03-22-0-21-0-upgrade-guide/#kubernetes-logs and https://vector.dev/highlights/2022-03-28-kube-for-kubernetes_logs/
Having the same issue. version="0.18.1" arch="x86_64" build_id="c4adb60 2021-11-30"
2021-12-09T13:50:10.928892Z ERROR source{component_kind="source" component_id=kube_logs_1 component_type=kubernetes_logs component_name=kube_logs_1}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
2021-12-09T13:50:10.929016Z WARN source{component_kind="source" component_id=kube_logs_1 component_type=kubernetes_logs component_name=kube_logs_1}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync
I am also seeing the same errors:
Sep 21 09:02:48.653 WARN source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync
Sep 21 09:02:49.228 ERROR source{component_kind="source" component_name=k8s_all component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
@spencergilbert We are about to migrate our apps into EKS using the Vector Kubernetes plugin; is there any ETA for a fix? This is critical for us.