vector: Vector agent stops watching logs from new pods
Vector Version
version="0.15.0" arch="x86_64" build_id="994d812 2021-07-16"
Vector Configuration File
# Configuration for vector.
# Docs: https://vector.dev/docs/
data_dir = "/vector-data-dir"
[api]
  enabled = false
  address = "0.0.0.0:8686"
  playground = true
[log_schema]
  host_key = "host"
  message_key = "log"
  source_type_key = "source_type"
  timestamp_key = "time"
# Ingest logs from Kubernetes.
[sources.kubernetes_logs]
  type = "kubernetes_logs"
  extra_field_selector = "metadata.namespace==default"
  max_line_bytes = 262144
# Emit internal Vector metrics.
[sources.internal_metrics]
  type = "internal_metrics"
# Expose metrics for scraping in the Prometheus format.
[sinks.prometheus_sink]
  address = "0.0.0.0:2020"
  inputs = ["internal_metrics"]
  type = "prometheus"
[transforms.cluster_tagging]
  inputs = ["kubernetes_logs"]
  source = "...parsing json...adding some fields"
  type = "remap"
[sinks.splunk]
  type = "splunk_hec"
  inputs = ["cluster_tagging"]
  # ...
  batch.timeout_secs = 10
  request.concurrency = "adaptive"
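For context, the elided remap source above just parses the JSON log line and adds a couple of cluster fields. A simplified sketch of what that transform looks like (the VRL below and the field values are illustrative, not the actual production program):
[transforms.cluster_tagging]
  inputs = ["kubernetes_logs"]
  type = "remap"
  source = '''
    # Try to parse the container log line as JSON (log_schema.message_key is "log").
    parsed, err = parse_json(string!(.log))
    if err == null {
      .parsed = parsed
    }
    # Attach cluster tags (placeholder values).
    .cluster = "example-eks-cluster"
    .environment = "production"
  '''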
Debug Output
The issue reproduces in production, where Vector cannot be run in debug mode.
Expected Behavior
Vector should not ignore logs from new pods. This is quite disturbing because it is difficult to detect: there are no errors or metrics I can alert on.
Actual Behavior
Vector is deployed in EKS (Kubernetes 1.17+) as an agent (DaemonSet). I do releases on a regular basis, meaning pods get deleted and re-created at least weekly. I noticed that after one such release multiple clusters stopped delivering logs. Although the containers were running and logging (nothing had really changed), Vector was simply ignoring the new pods. I upgraded Vector to 0.15 (from 0.13) because I had seen a few similar issues and some desync errors in the logs. However, it seems to have happened again: a cluster stopped delivering logs, except for a single service that wasn’t released. In the logs I see lots of desync errors, but they appeared days before Vector started ignoring logs.
Jul 28 11:18:52.156 ERROR source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
Jul 28 11:18:52.156  WARN source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync
And 3 days ago Vector simply stopped watching logs from the old pods and never started watching the new ones. There were no other errors prior to this; the lines below are the last logs from Vector. Once I restarted the DaemonSet, it detected the new log files and started consuming them.
Aug 03 10:21:46.149  INFO source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}:file_server: vector::internal_events::file::source: Found new file to watch. path=/var/log/pods/default_foo-856498f4fc-x795d_dc8d9e9e-f0a8-4594-9fc9-e3a83e5cbb77/foo/0.log
Aug 03 10:21:46.149  INFO source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}:file_server: vector::internal_events::file::source: Stopped watching file. path=/var/log/pods/default_foo-856498f4fc-x795d_dc8d9e9e-f0a8-4594-9fc9-e3a83e5cbb77/foo/0.log
Aug 03 10:43:54.928  INFO source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}:file_server: vector::internal_events::file::source: Stopped watching file. path=/var/log/pods/default_foo-856498f4fc-x795d_dc8d9e9e-f0a8-4594-9fc9-e3a83e5cbb77/foo/0.log
Additional Context
References
#7934 seems to show the same symptoms.
About this issue
 - State: closed
 - Created 3 years ago
 - Reactions: 19
 - Comments: 32 (12 by maintainers)
 
We’ve merged in a PR replacing our in-house implementation with kube’s library. The new code will be available in the 0.21 release (or in the nightly releases now). We’re hoping this change solves some of the failure cases and isolates the rest so they’ll be easier to diagnose and resolve. We’d love to get feedback from anyone who upgrades to the new code!
We’re working on cutting that release this week 👍
Hi,
I think there are two separate issues here. There are logs with
vector::kubernetes::reflector: Watcher error. error=BadStatus { status: 401 }
which are caused by the token rotation. And there are logs where Vector just stops watching for new pods, e.g.:
vector::internal_events::kubernetes::reflector: Handling desync. error=Desync
vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
This happens on all clusters for me with the 0.20.0 version: there’s no token rotation, and the “Watch stream failed” message starts appearing regularly right after the Vector pod starts up.
Hi, I have the same on 0.16 and 0.17, but 0.15.2 works flawlessly in my case. It repeats on a few k8s clusters.
@tomer-epstein @BredSt The 401 errors are most likely caused by https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume, which is the default since Kubernetes 1.21.
It seems that Vector doesn’t support token rotation, which is why it stops working once the token expires. You can work around this by either disabling the feature gate (impossible after 1.22) or manually mounting the service account token (a rough sketch follows below).
Vector freezing after a 401 error is still related to this bug, though.
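For anyone who wants to try the manual mount: the idea is to create a legacy (non-expiring) service account token Secret and mount it over the projected token path in the Vector DaemonSet. A rough sketch, assuming the service account is named vector (names here are illustrative):
# Legacy token Secret tied to the "vector" service account (assumed name).
apiVersion: v1
kind: Secret
metadata:
  name: vector-token
  annotations:
    kubernetes.io/service-account.name: vector
type: kubernetes.io/service-account-token
---
# DaemonSet pod spec fragment: replace the projected token with the static one.
spec:
  template:
    spec:
      automountServiceAccountToken: false
      volumes:
        - name: vector-token
          secret:
            secretName: vector-token
      containers:
        - name: vector
          volumeMounts:
            - name: vector-token
              mountPath: /var/run/secrets/kubernetes.io/serviceaccount
              readOnly: true
Mounting the whole Secret provides token, ca.crt, and namespace at the paths the in-cluster client expects.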
The upgrade guide and highlights can be seen here: https://vector.dev/highlights/2022-03-22-0-21-0-upgrade-guide/#kubernetes-logs and https://vector.dev/highlights/2022-03-28-kube-for-kubernetes_logs/
Having the same issue. version="0.18.1" arch="x86_64" build_id="c4adb60 2021-11-30"
2021-12-09T13:50:10.928892Z ERROR source{component_kind="source" component_id=kube_logs_1 component_type=kubernetes_logs component_name=kube_logs_1}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
2021-12-09T13:50:10.929016Z WARN source{component_kind="source" component_id=kube_logs_1 component_type=kubernetes_logs component_name=kube_logs_1}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync
I am also seeing the same errors:
Sep 21 09:02:48.653 WARN source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync
Sep 21 09:02:49.228 ERROR source{component_kind="source" component_name=k8s_all component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
@spencergilbert We are about to migrate our apps into EKS using the Vector Kubernetes plugin; is there any ETA for a fix? This is critical for us.