fluent-bit: Stackdriver stops working after one hour: Oauth2
Bug Report
Describe the bug We use gke@1.16.15-gke.7800 and fluentbit v1.7.2. The configured google service account has the following roles:
- Service Account Token Creator
- Logs Bucket Writer
- Logs Configuration Writer
- Logs Writer
- Monitoring Metric Writer
The configuration works fine and logs are forwarded to stackdriver. But after exactly one hour of log forwarding, the stackdriver plugin fails to push new logs:
To Reproduce
[2021/03/22 14:33:08] [debug] [output:stackdriver:stackdriver.0] JWT signature:
ey........
[2021/03/22 14:33:08] [debug] [http_client] not using http_proxy for header
[2021/03/22 14:33:08] [debug] [http_client] header=POST /oauth2/v4/token HTTP/1.1
Host: www.googleapis.com
Content-Length: 169553
Content-Type: application/x-www-form-urlencoded
[2021/03/22 14:33:08] [ info] [oauth2] HTTP Status=400
[2021/03/22 14:33:08] [ info] [oauth2] payload:
{
"error": "unsupported_grant_type",
"error_description": "Invalid grant_type: "
}
[2021/03/22 14:33:08] [error] [output:stackdriver:stackdriver.0] error retrieving oauth2 access token
[2021/03/22 14:33:08] [error] [output:stackdriver:stackdriver.0] cannot retrieve oauth2 token
- Steps to reproduce the problem:
- Create a service account with the roles listed above.
- Use fluent-bit version 1.7.2.
- The cluster was created with managed fluentbit logging; we scaled down the Google-managed fluentbit so that it does not interfere with our fluentbit testing.
- Create the daemonset and check the Stackdriver logs.
- After one hour we get the OAuth2 errors shown above.
- Restarting the pods helps: logs are forwarded for another hour until the error returns.
Expected behavior Fluentbit should refresh the OAuth2 token correctly and not fail after one hour.
Your Environment
- Version used: 1.7.2
- Configuration: see below
- Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.16.15-gke.7800
- Operating System and version: https://fluent.github.io/helm-charts 0.12.3 with image fluent/fluent-bit 1.7.2
- Filters and plugins: see config
config:
service: |
[SERVICE]
Flush 5
Grace 120
Log_Level trace
Daemon off
Parsers_File custom_parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_PORT 2020
inputs: |
[INPUT]
Name tail
Alias kube_containers_kube-system
Tag kube.<namespace_name>.<pod_name>.<container_name>
Tag_Regex (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
Path /var/log/containers/*_kube-system_*.log
DB /var/run/google-fluentbit/pos-files/flb_kube_kube-system.db
Buffer_Max_Size 1MB
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 5
Read_from_Head True
[INPUT]
Name tail
Alias kube_containers_gke-system
Tag kube.<namespace_name>.<pod_name>.<container_name>
Tag_Regex (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
Path /var/log/containers/*_gke-system_*.log
DB /var/run/google-fluentbit/pos-files/flb_kube_gke-system.db
Buffer_Max_Size 1MB
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 5
Read_from_Head True
[INPUT]
Name tail
Alias kube_containers
Tag kube.<namespace_name>.<pod_name>.<container_name>
Tag_Regex (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
Path /var/log/containers/*.log
Exclude_Path /var/log/containers/*_kube-system_*.log,/var/log/containers/*_istio-system_*.log,/var/log/containers/*_knative-serving_*.log,/var/log/containers/*_gke-system_*.log,/var/log/containers/*_config-management-system_*.log
DB /var/run/google-fluentbit/pos-files/flb_kube.db
Buffer_Max_Size 1MB
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 5
Read_from_Head True
# Example:
# Dec 21 23:17:22 gke-foo-1-1-4b5cbd14-node-4eoj startupscript: Finished running startup script /var/run/google.startup.script
[INPUT]
Name tail
Parser syslog
Path /var/log/startupscript.log
DB /var/run/google-fluentbit/pos-files/startupscript.db
Alias startupscript
Tag startupscript
Read_from_Head True
# Logs from anetd for policy action
[INPUT]
Name tail
Parser network-log
Alias policy-action
Tag policy-action
Path /var/log/network/policy_action.log
DB /var/run/google-fluentbit/pos-files/policy-action.db
Skip_Long_Lines On
Refresh_Interval 5
Read_from_Head True
# Example:
# I1118 21:26:53.975789 6 proxier.go:1096] Port "nodePort for kube-system/default-http-backend:http" (:31429/tcp) was open before and is still needed
[INPUT]
Name tail
Alias kube-proxy
Tag kube-proxy
Path /var/log/kube-proxy.log
DB /var/run/google-fluentbit/pos-files/kube-proxy.db
Buffer_Max_Size 1MB
Mem_Buf_Limit 1MB
Parser glog
Read_from_Head True
# Logs from systemd-journal for interesting services.
[INPUT]
Name systemd
Alias docker
Tag docker
Systemd_Filter _SYSTEMD_UNIT=docker.service
Path /var/log/journal
DB /var/run/google-fluentbit/pos-files/docker.db
Buffer_Max_Size 1MB
Mem_Buf_Limit 1MB
[INPUT]
Name systemd
Alias container-runtime
Tag container-runtime
Systemd_Filter _SYSTEMD_UNIT=containerd.service
Path /var/log/journal
DB /var/run/google-fluentbit/pos-files/container-runtime.db
Buffer_Max_Size 1MB
Mem_Buf_Limit 1MB
[INPUT]
Name systemd
Alias kubelet
Tag kubelet
Systemd_Filter _SYSTEMD_UNIT=kubelet.service
Path /var/log/journal
DB /var/run/google-fluentbit/pos-files/kubelet.db
Buffer_Max_Size 1MB
Mem_Buf_Limit 1MB
# kube-node-installation, kube-node-configuration, and kube-logrotate are
# oneshots, but it's extremely valuable to have their logs on Stackdriver
# as they can diagnose critical issues with node startup.
[INPUT]
Name systemd
Alias kube-node-installation
Tag kube-node-installation
Systemd_Filter _SYSTEMD_UNIT=kube-node-installation.service
Path /var/log/journal
DB /var/run/google-fluentbit/pos-files/kube-node-installation.db
Buffer_Max_Size 1MB
Mem_Buf_Limit 1MB
[INPUT]
Name systemd
Alias kube-node-configuration
Tag kube-node-configuration
Systemd_Filter _SYSTEMD_UNIT=kube-node-configuration.service
Path /var/log/journal
DB /var/run/google-fluentbit/pos-files/kube-node-configuration.db
Buffer_Max_Size 1MB
Mem_Buf_Limit 1MB
[INPUT]
Name systemd
Alias kube-logrotate
Tag kube-logrotate
Systemd_Filter _SYSTEMD_UNIT=kube-logrotate.service
Path /var/log/journal
DB /var/run/google-fluentbit/pos-files/kube-logrotate.db
Buffer_Max_Size 1MB
Mem_Buf_Limit 1MB
[INPUT]
Name systemd
Alias node-problem-detector
Tag node-problem-detector
Systemd_Filter _SYSTEMD_UNIT=node-problem-detector.service
Path /var/log/journal
DB /var/run/google-fluentbit/pos-files/node-problem-detector.db
Buffer_Max_Size 1MB
Mem_Buf_Limit 1MB
[INPUT]
Name systemd
Alias kube-container-runtime-monitor
Tag kube-container-runtime-monitor
Systemd_Filter _SYSTEMD_UNIT=kube-container-runtime-monitor.service
Path /var/log/journal
DB /var/run/google-fluentbit/pos-files/kube-container-runtime-monitor.db
Buffer_Max_Size 1MB
Mem_Buf_Limit 1MB
[INPUT]
Name systemd
Alias kubelet-monitor
Tag kubelet-monitor
Systemd_Filter _SYSTEMD_UNIT=kubelet-monitor.service
Path /var/log/journal
DB /var/run/google-fluentbit/pos-files/kubelet-monitor.db
Buffer_Max_Size 1MB
Mem_Buf_Limit 1MB
[INPUT]
Name systemd
Alias gcfsd
Tag gcfsd
Systemd_Filter _SYSTEMD_UNIT=gcfsd.service
Path /var/log/journal
DB /var/run/google-fluentbit/pos-files/gcfsd.db
Buffer_Max_Size 1MB
Mem_Buf_Limit 1MB
[INPUT]
Name systemd
Alias gcfs-snapshotter
Tag gcfs-snapshotter
Systemd_Filter _SYSTEMD_UNIT=gcfs-snapshotter.service
Path /var/log/journal
DB /var/run/google-fluentbit/pos-files/gcfs-snapshotter.db
Buffer_Max_Size 1MB
Mem_Buf_Limit 1MB
filters: |
[FILTER]
Name parser
Match kube.*
Key_Name log
Reserve_Data True
Parser docker
Parser containerd
[FILTER]
Name modify
Match *
Hard_rename log message
[FILTER]
Name parser
Match kube.*
Key_Name message
Reserve_Data True
Parser glog
Parser json
Parser logfmt
[FILTER]
Name modify
Match *
Copy level severity
[FILTER]
Name kubernetes
Match kube.*
Kube_Tag_Prefix kube.
Regex_Parser pod-tag-parser
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Merge_Log On
K8S-Logging.Parser On
K8S-Logging.Exclude On
outputs: |
[OUTPUT]
Name stackdriver
Match kube.*
Resource k8s_container
k8s_cluster_name sre-playground-cluster
k8s_cluster_location europe-west4
tag_prefix kube.
severity_key severity
[OUTPUT]
Name stackdriver
Match_Regex ^(?!kube).*
Resource global
k8s_cluster_name sre-playground-cluster
k8s_cluster_location europe-west4
customParsers: |
[PARSER]
Name docker
Format json
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L%z
[PARSER]
Name containerd
Format regex
Regex ^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L%z
[PARSER]
Name json
Format json
[PARSER]
Name logfmt
Format logfmt
[PARSER]
Name syslog
Format regex
Regex ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
Time_Key time
Time_Format %b %d %H:%M:%S
[PARSER]
Name glog
Format regex
Regex ^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source_file>[^ \]]+)\:(?<source_line>\d+)\]\s(?<message>.*)$
Time_Key time
Time_Format %m%d %H:%M:%S.%L
[PARSER]
Name network-log
Format json
Time_Key timestamp
Time_Format %Y-%m-%dT%H:%M:%S.%L%z
[PARSER]
Name pod-tag-parser
Format regex
Regex (?<namespace_name>[^\.]+)\.(?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)\.(?<container_name>.+)(?<docker_id>)
Of course, the service account is mounted into the pods:
env:
- name: GOOGLE_SERVICE_CREDENTIALS
value: "/secret/fluentbit/stackdriver/service-account.json"
About this issue
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 19 (7 by maintainers)
Commits related to this issue
- flb_oauth: add missing code to clear sds string Context: https://github.com/fluent/fluent-bit/issues/3267 — committed to hsmatulisgoogle/fluent-bit by hsmatulisgoogle 3 years ago
- flb_oauth: add missing code to clear sds string Context: https://github.com/fluent/fluent-bit/issues/3267 Signed-off-by: Henrique S Matulis <69014250+hsmatulisgoogle@users.noreply.github.com> — committed to hsmatulisgoogle/fluent-bit by hsmatulisgoogle 3 years ago
- oauth2: add missing code to clear sds string (#3291) Context: https://github.com/fluent/fluent-bit/issues/3267 Signed-off-by: Henrique S Matulis <69014250+hsmatulisgoogle@users.noreply.github.com> — committed to fluent/fluent-bit by hsmatulisgoogle 3 years ago
- oauth2: add missing code to clear sds string (#3291) Context: https://github.com/fluent/fluent-bit/issues/3267 Signed-off-by: Henrique S Matulis <69014250+hsmatulisgoogle@users.noreply.github.com> — committed to DrewZhang13/fluent-bit by hsmatulisgoogle 3 years ago
Thanks for the investigation! I think the issue is the following: https://github.com/fluent/fluent-bit/blob/491889b5601f7b353df31a46e0b867ee5464b376/src/flb_oauth2.c#L245 Since payload is an sds_t string, it has a header containing the string length, and that header did not get updated. After the first byte is cleared, new strings are therefore appended to the end of the old buffer contents rather than at the start. It looks like we can fix this by calling flb_sds_len_set(ctx->payload, 0). I added the draft PR https://github.com/fluent/fluent-bit/pull/3291 to attempt to fix this, but have not tested it yet.
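To illustrate the failure mode described above, here is a minimal sketch in C. It does not use the real fluent-bit sds code: `struct lp_string` and the `lp_*` helpers are hypothetical stand-ins for a string type that stores its length in a header. The sketch only shows why clearing the first byte without resetting the stored length makes later appends land after the stale contents.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-prefixed string (NOT the fluent-bit sds type).
 * The length lives in a header field, not in the NUL terminator. */
struct lp_string {
    size_t len;   /* bytes currently in use */
    size_t cap;   /* usable bytes in buf, excluding the terminator */
    char   buf[]; /* character data, kept NUL-terminated */
};

/* Allocate an empty string with room for cap bytes plus a terminator. */
static struct lp_string *lp_create(size_t cap)
{
    struct lp_string *s = malloc(sizeof(*s) + cap + 1);
    if (s == NULL) {
        return NULL;
    }
    s->len = 0;
    s->cap = cap;
    s->buf[0] = '\0';
    return s;
}

/* Append str at offset len. Returns 0 on success, -1 if it does not fit. */
static int lp_append(struct lp_string *s, const char *str)
{
    size_t n = strlen(str);
    if (n > s->cap - s->len) {
        return -1;
    }
    memcpy(s->buf + s->len, str, n);
    s->len += n;
    s->buf[s->len] = '\0';
    return 0;
}

/* Incomplete reset: only the first byte is cleared, the header keeps
 * the old length, so the next append starts after the stale data. */
static void lp_clear_buggy(struct lp_string *s)
{
    s->buf[0] = '\0';
}

/* Complete reset: clear the first byte AND the stored length. */
static void lp_clear_fixed(struct lp_string *s)
{
    s->buf[0] = '\0';
    s->len = 0;
}
```

With this sketch, appending a 12-byte field, calling `lp_clear_buggy`, and appending again leaves `len` at 24 while `buf[0]` is still NUL, so anything that reads the buffer as a C string from the start sees an empty value. That would be consistent with the empty `grant_type` in the error payload above, though this is an assumption and has not been checked against the actual request. With `lp_clear_fixed`, the second append starts at offset 0 as intended.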