vector: Vector randomly stops shipping certain k8s logs
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
After a while, our Vector DaemonSet will randomly stop shipping logs for a select service (some other pods keep shipping logs).
Configuration
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vector
  namespace: observability
  labels:
    app.kubernetes.io/name: vector
    app.kubernetes.io/instance: vector
    app.kubernetes.io/component: Agent
    app.kubernetes.io/version: "0.20.0-distroless-libc"
data:
  agent.yaml: |
    data_dir: /vector-data-dir
    api:
      enabled: true
      address: 127.0.0.1:8686
      playground: false
    sources:
      kubernetes_logs:
        type: kubernetes_logs
      host_metrics:
        filesystem:
          devices:
            excludes: [binfmt_misc]
          filesystems:
            excludes: [binfmt_misc]
          mountPoints:
            excludes: ["*/proc/sys/fs/binfmt_misc"]
        type: host_metrics
      internal_metrics:
        type: internal_metrics
    transforms:
      setlevel:
        type: remap
        inputs: [kubernetes_logs]
        source: |-
          .temp = parse_json!(.message)
          if !exists(parse_json!(.message).level) {
            .level = "other"
          } else {
            .level = .temp.level
          }
          if exists(.temp.uri) {
            .uri = .temp.uri
          }
          if exists(.temp.msg) {
            .msg = .temp.msg
          }
          if exists(.temp.lat) {
            .lat = .temp.lat
            .lon = .temp.lon
          }
          del(.temp)
    sinks:
      prom_exporter:
        type: prometheus_exporter
        inputs: [host_metrics, internal_metrics]
        address: 0.0.0.0:9090
      # stdout:
      #   type: console
      #   inputs: [setlevel]
      #   encoding:
      #     codec: json
      loki:
        type: loki
        inputs:
          - "setlevel"
        endpoint: https://logs-prod-us-central1.grafana.net
        compression: gzip
        # remove_label_fields: true
        encoding:
          codec: json
        auth:
          password: ${LOKI_PASSWORD}
          user: "${LOKI_USERNAME}"
          strategy: basic
        labels:
          namespace: "{{ kubernetes.pod_namespace }}"
          pod: "{{ kubernetes.pod_name }}"
          level: "{{ level }}"
          app_label: "{{ kubernetes.pod_label.app }}"
          node: "{{ kubernetes.pod_node_name }}"
          pod_owner: "{{ kubernetes.pod_owner }}"
          cluster: ${CLUSTER_NAME}
```
Version
timberio/vector:0.20.0-distroless-libc
Debug Output
```text
Working on trying to get relevant debug output; it currently emits a lot of TRACE logs :P
```
Example Data
```json
{"func":"github.com/xxx/xxx/crdb.ConnectToDB","level":"debug","msg":"Connected to CRDB","time":"2022-03-29T17:43:21.015Z"}
```
Additional Context
Only logs from vector:
```text
2022-03-29T17:35:57.183220Z WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] has been rate limited 10 times.
2022-03-29T17:35:57.183241Z WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Mapping failed with event. error="function call error for \"parse_json\" at (8:29): unable to parse json: trailing characters at line 1 column 5" internal_log_rate_secs=30
2022-03-29T17:35:57.200043Z WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] is being rate limited.
2022-03-29T17:36:32.201827Z WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] has been rate limited 8 times.
2022-03-29T17:36:32.201877Z WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Mapping failed with event. error="function call error for \"parse_json\" at (8:29): unable to parse json: trailing characters at line 1 column 5" internal_log_rate_secs=30
2022-03-29T17:36:32.437566Z WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] is being rate limited.
```
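A side note on those warnings: they come from `parse_json!` aborting whenever a line isn't valid JSON (the "trailing characters" error), so they show up for any non-JSON output. As a rough sketch only, not a fix for the missing-logs problem, and assuming the same field names as the config above, the `setlevel` transform could be made tolerant of non-JSON lines like this:

```yaml
transforms:
  setlevel:
    type: remap
    inputs: [kubernetes_logs]
    source: |-
      # Fall back to an empty object instead of aborting when .message is not JSON.
      .temp = parse_json(.message) ?? {}
      if exists(.temp.level) {
        .level = .temp.level
      } else {
        .level = "other"
      }
      if exists(.temp.uri) { .uri = .temp.uri }
      if exists(.temp.msg) { .msg = .temp.msg }
      if exists(.temp.lat) {
        .lat = .temp.lat
        .lon = .temp.lon
      }
      del(.temp)
```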
References
Similar to an issue I’ve had in the past: https://github.com/vectordotdev/vector/discussions/8634
About this issue
- Original URL
- State: open
- Created 2 years ago
- Reactions: 36
- Comments: 66 (19 by maintainers)
@jszwedko we have our glob cooldown set to 2 seconds and have still observed it. Ultimately we will have to move to something that doesn’t drop logs, because we depend on logs to know when errors occur.
I can’t imagine that Kubernetes is not a massive part of Vector’s user base, and we aren’t logging very quickly either, maybe 10-20 lines/s. What would we need to do to get more urgency behind improving the Kubernetes experience? I truly want to use Vector but can’t.
@jszwedko We also faced this issue after starting to roll out Vector to our production. We identified that the safest version to use with the `kubernetes_logs` source currently is 0.19.3. If we upgrade to 0.2x.x we immediately start facing 2 issues. With the new versions (0.20.x), the only working solution right now is to restart the Vector pods after one of these issues starts happening; this helps in both cases. Both issues make Vector completely unusable in production environments with `kubernetes_logs`, since there is a big risk of losing logs. I really hope the root cause of these issues will be identified and fixed in upcoming releases.

I was able to find some settings that are now giving me a much more reliable deployment on Kubernetes after doing the following:

- Upgrading to 0.25.X. It doesn’t seem to be mentioned in the release notes, but release 0.25.0 contains a very impactful fix for the `kubernetes_logs` source to help with annotations (relevant issue: https://github.com/vectordotdev/vector/issues/13467).
- Raising `max_line_bytes`/`max_read_bytes` in the `kubernetes_logs` source config (a sketch with illustrative values follows below).

The combination of these two changes has led to zero dropped/missing logs over the past couple of weeks. Previously I was using a lower `max_line_bytes`/`max_read_bytes` value, as I’ve seen recommended above and elsewhere, but still much higher than the default; after so many attempts with little to no change, I stopped increasing it. After revisiting it and trying something wildly large, to my surprise it worked! I was still receiving annotation failures, though. After deep-diving into the commit history I saw that a seemingly unannounced workaround for https://github.com/vectordotdev/vector/issues/13467 was in place in the latest version, and since upgrading I haven’t seen any annotation issues.
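For anyone wanting a concrete starting point, that tuning looks roughly like the snippet below. The numbers are illustrative assumptions, not the commenter’s actual values (their original snippet didn’t survive), so size them for your own log volume:

```yaml
sources:
  kubernetes_logs:
    type: kubernetes_logs
    # Illustrative values only -- not the commenter's actual config.
    # Raise the per-line size limit so very long JSON lines aren't discarded.
    max_line_bytes: 1048576   # 1 MiB
    # Read more from each file before moving on to the next one,
    # so Vector keeps up when many pods are logging at once.
    max_read_bytes: 1048576   # 1 MiB
```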
@zamazan4ik Fluent Bit is a good alternative.
Thanks for all of the thoughts, everyone. I can appreciate the severity of these issues and believe this area (the `kubernetes_logs` source) is something we’ll be digging into again soon. There seem to be a mix of issues reported in the comments above. I think they can be categorized into:

- A regression between v0.19.3 and v0.20.0 in the `kubernetes_logs` source. One thing you could try to do there, @igor-nikiforov, is bisect down the nightly versions of Vector to identify the first one that manifests the issue. For v0.19.3 to v0.20.0, this would be from 2021-12-28 to 2022-02-10. If we had a narrower set of commits to look at, something might jump out.
- Vector not keeping up with log files, which can sometimes be mitigated by tuning `glob_minimum_cooldown_ms` (dropping it down so that Vector scans for new files more frequently) and `max_read_bytes` (increasing it so that Vector reads more from each file before continuing on to the next file); see the sketch after this comment.

If you don’t think your issue is represented in the above set, please leave an additional comment!
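For anyone trying that suggestion, a minimal sketch of those two knobs; the values are illustrative assumptions (the 2-second cooldown echoes what was reported earlier in this thread), not official recommendations:

```yaml
sources:
  kubernetes_logs:
    type: kubernetes_logs
    # Scan for new/rotated log files more often than the default.
    glob_minimum_cooldown_ms: 2000
    # Read more from each file before moving on to the next one.
    max_read_bytes: 262144
```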
This has just started happening again, with no error or warning logs. Is there anything else I should be looking for or trying? We can’t be dropping our production logs.
I will add that restarting Vector seems to resolve the issue, but that’s not a sustainable solution.
2 years ago… It doesn’t feel remotely appreciated. I’m wondering if the Datadog acquisition is to blame.
Great! For us, dropping the glob cooldown really did the trick.
Wow, that’s good to know. I hope to see comments from the Vector dev team about possible fixes for these issues.
Thanks for the additional context, @kitos9112, @Genry1, and @lpugoy! I split this off to a separate issue since it seems to be different from the original report here. See: https://github.com/vectordotdev/vector/issues/13467
Still getting the errors, the pods are not being removed, and logs are coming in much more slowly now as well.