vector: gcp_cloud_storage sink token refresh failing
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
Hi, I am running Vector as a DaemonSet on a large cluster with approximately 700+ nodes. When we deploy Vector, we initially see a rate-limiting error that appears to come from the Kubernetes API server; see the log below. We get these logs for roughly 3-4 minutes before the condition resolves and Vector starts working. There also appears to be no backoff, as we see a continuous stream of the error below.
2022-06-09T07:28:18.523823Z WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "429 Too Many Requests", message: "\"Too many requests, please try again later.\\n\"", reason: "Failed to parse error data", code: 429 }))
Once the above error resolves, Vector works without any issues for about 55 minutes and uploads logs to the GCS bucket correctly. Then we start seeing unauthorized errors on gcp_cloud_storage requests, and it never recovers. The error log is below.
2022-06-09T08:25:51.906165Z ERROR sink{component_kind="sink" component_id=cloud_storage component_type=gcp_cloud_storage component_name=cloud_storage}:request{request_id=803}: vector::sinks::util::retries: Not retriable; dropping the request. reason="response status: 401 Unauthorized"
Currently I suspect this issue is a side effect of the initial rate-limit error. I am deleting pods slowly so that new pods come up without their requests getting rate-limited, and will comment with my observations shortly.
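For reference, here is a minimal sketch (assuming a tokio runtime; `initial_list` is a hypothetical placeholder, not Vector's actual watcher or retry code) of the kind of bounded exponential backoff one would expect around that initial list, rather than the continuous 429 retry stream shown above:

```rust
use std::time::Duration;

// Hypothetical placeholder for the initial LIST request against the
// Kubernetes API server; here it always fails, to exercise the retry loop.
async fn initial_list() -> Result<(), String> {
    Err("429 Too Many Requests".to_string())
}

#[tokio::main]
async fn main() {
    let mut delay = Duration::from_millis(500);
    let max_delay = Duration::from_secs(30);

    for attempt in 1..=10 {
        match initial_list().await {
            Ok(()) => {
                println!("watch established");
                break;
            }
            Err(err) => {
                eprintln!("attempt {attempt} failed: {err}; retrying in {delay:?}");
                tokio::time::sleep(delay).await;
                // Double the delay up to a cap so hundreds of nodes don't
                // keep hammering the API server at full rate.
                delay = (delay * 2).min(max_delay);
            }
        }
    }
}
```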
Configuration
We're deploying it using the latest available Helm chart. The Vector image version is listed below.
```yaml
cloud_storage_analytics:
  type: gcp_cloud_storage
  inputs: [ test_logs ]
  compression: gzip
  batch:
    max_bytes: 5000000
    timeout_secs: 60
  encoding:
    codec: ndjson
  bucket: test-bucket
  key_prefix: 'analytics/dam/default/{{`{{ component_name }}`}}/year=%Y/month=%m/day=%d/hour=%H/%M_'
```
Version
0.22.0-distroless-libc
Debug Output
2022-06-09T07:28:18.523823Z WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "429 Too Many Requests", message: "\"Too many requests, please try again later.\\n\"", reason: "Failed to parse error data", code: 429 }))
2022-06-09T08:25:51.906165Z ERROR sink{component_kind="sink" component_id=cloud_storage component_type=gcp_cloud_storage component_name=cloud_storage}:request{request_id=803}: vector::sinks::util::retries: Not retriable; dropping the request. reason="response status: 401 Unauthorized"
Example Data
2022-06-09T07:28:18.523823Z WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "429 Too Many Requests", message: "\"Too many requests, please try again later.\\n\"", reason: "Failed to parse error data", code: 429 }))
2022-06-09T08:25:51.906165Z ERROR sink{component_kind="sink" component_id=cloud_storage component_type=gcp_cloud_storage component_name=cloud_storage}:request{request_id=803}: vector::sinks::util::retries: Not retriable; dropping the request. reason="response status: 401 Unauthorized"
Additional Context
No response
References
No response
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 3
- Comments: 16 (10 by maintainers)
Hi @spencergilbert, while trying to reproduce this issue we ran into another problem, but that one is pretty straightforward. Vector currently uses the /pods API to identify pods from resources: https://vector.dev/docs/reference/configuration/sources/kubernetes_logs/#kubernetes-api-access-control
Kubernetes has two APIs for getting cluster pod details:
- Endpoints API: kubectl get --raw '/api/v1/namespaces/default/pods?watch=1'
- EndpointSlices: kubectl get --raw /apis/discovery.k8s.io/v1/watch/namespaces/default/endpointslices
The default Endpoints API that is currently being used chokes the network at scale, and hence fails both to list pods and to fetch logs. A better alternative is EndpointSlices; more documentation is available here: https://kubernetes.io/blog/2020/09/02/scaling-kubernetes-networking-with-endpointslices/
I took a proper look at what was causing the above issue, and it seems the disabled healthcheck is the reason.
According to what I see here: https://github.com/vectordotdev/vector/blob/a4782f83b0fa3ac8d6ca1f56a4990687e338ed06/src/sinks/gcs_common/config.rs#L128-L144
The token regenerator is spawned only after the healthcheck response is received (worth mentioning: it is spawned regardless of the healthcheck status).
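To illustrate the pattern, here is a minimal sketch (assuming a tokio runtime; GcpAuthenticator, spawn_regenerate_token, and healthcheck below are hypothetical stand-ins, not Vector's actual types or code): the background token-refresh task is only started from inside the healthcheck future, so with the healthcheck disabled it is never spawned and the initial token eventually expires.

```rust
use std::time::Duration;

// Hypothetical stand-in for the GCP authenticator; not Vector's real type.
#[derive(Clone)]
struct GcpAuthenticator;

impl GcpAuthenticator {
    // Spawns a background task that periodically refreshes the OAuth token.
    fn spawn_regenerate_token(&self) {
        let auth = self.clone();
        tokio::spawn(async move {
            loop {
                // A real implementation would fetch a fresh token here.
                tokio::time::sleep(Duration::from_secs(30 * 60)).await;
                let _ = &auth;
            }
        });
    }
}

// The refresh task is only started from within the healthcheck future,
// after the healthcheck response comes back (pass or fail). If the sink's
// healthcheck never runs, the token is never refreshed and requests start
// failing with 401 once it expires (roughly an hour later).
async fn healthcheck(auth: GcpAuthenticator) -> Result<(), String> {
    let response_ok = true; // placeholder for a HEAD request to the bucket
    auth.spawn_regenerate_token();
    if response_ok {
        Ok(())
    } else {
        Err("healthcheck failed".to_string())
    }
}

#[tokio::main]
async fn main() {
    let auth = GcpAuthenticator;
    // With the healthcheck enabled, the refresh task gets spawned here.
    let _ = healthcheck(auth).await;
}
```

Under that reading, leaving the sink healthcheck at its default (enabled) means the refresh task is started as part of sink startup, which matches the observation below that enabling it avoids the 401s.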
So for me, enabling the healthcheck was the fix. It's quite obscure though: is this behaviour expected, or should we add a warning to the documentation or open a PR with a fix?
cc @spencergilbert @srinidhis94
👍 would you mind opening an issue to track this?