vector: gcp_cloud_storage sink token refresh failing
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
Hi, I am running Vector as a DaemonSet on a large cluster with approximately 700+ nodes. When we deploy Vector, we initially see a rate-limiting error that appears to come from the Kubernetes API server; see the log below. We get these logs for roughly 3-4 minutes before the condition resolves and Vector starts working. There also appears to be no backoff, as we see a continuous stream of the error below.
2022-06-09T07:28:18.523823Z WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "429 Too Many Requests", message: "\"Too many requests, please try again later.\\n\"", reason: "Failed to parse error data", code: 429 }))
Once the above error resolves, Vector works without any issues for about 55 minutes and uploads logs to the GCS bucket correctly. Then we start seeing unauthorized errors on gcp_cloud_storage requests, and it never recovers. The error log is below.
2022-06-09T08:25:51.906165Z ERROR sink{component_kind="sink" component_id=cloud_storage component_type=gcp_cloud_storage component_name=cloud_storage}:request{request_id=803}: vector::sinks::util::retries: Not retriable; dropping the request. reason="response status: 401 Unauthorized"
Currently I suspect this issue is a side effect of the initial rate-limit error. I am deleting pods slowly so that new pods come up without their requests getting rate-limited, and will comment with my observations shortly.
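For reference, here is a minimal sketch (assuming a tokio runtime; `initial_list` is a hypothetical placeholder, not Vector's actual watcher or retry code) of the kind of bounded exponential backoff one would expect around that initial list, rather than the continuous 429 retry stream shown above:

```rust
use std::time::Duration;

// Hypothetical placeholder for the initial LIST request against the
// Kubernetes API server; here it always fails, to exercise the retry loop.
async fn initial_list() -> Result<(), String> {
    Err("429 Too Many Requests".to_string())
}

#[tokio::main]
async fn main() {
    let mut delay = Duration::from_millis(500);
    let max_delay = Duration::from_secs(30);

    for attempt in 1..=10 {
        match initial_list().await {
            Ok(()) => {
                println!("watch established");
                break;
            }
            Err(err) => {
                eprintln!("attempt {attempt} failed: {err}; retrying in {delay:?}");
                tokio::time::sleep(delay).await;
                // Double the delay up to a cap so hundreds of nodes don't
                // keep hammering the API server at full rate.
                delay = (delay * 2).min(max_delay);
            }
        }
    }
}
```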
Configuration
We're deploying it using the latest available Helm chart. The Vector image version is listed below.
```yaml
cloud_storage_analytics:
  type: gcp_cloud_storage
  inputs: [ test_logs ]
  compression: gzip
  batch:
    max_bytes: 5000000
    timeout_secs: 60
  encoding:
    codec: ndjson
  bucket: test-bucket
  key_prefix: 'analytics/dam/default/{{`{{ component_name }}`}}/year=%Y/month=%m/day=%d/hour=%H/%M_'
```
Version
0.22.0-distroless-libc
Debug Output
2022-06-09T07:28:18.523823Z WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "429 Too Many Requests", message: "\"Too many requests, please try again later.\\n\"", reason: "Failed to parse error data", code: 429 }))
2022-06-09T08:25:51.906165Z ERROR sink{component_kind="sink" component_id=cloud_storage component_type=gcp_cloud_storage component_name=cloud_storage}:request{request_id=803}: vector::sinks::util::retries: Not retriable; dropping the request. reason="response status: 401 Unauthorized"
Example Data
2022-06-09T07:28:18.523823Z WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "429 Too Many Requests", message: "\"Too many requests, please try again later.\\n\"", reason: "Failed to parse error data", code: 429 }))
2022-06-09T08:25:51.906165Z ERROR sink{component_kind="sink" component_id=cloud_storage component_type=gcp_cloud_storage component_name=cloud_storage}:request{request_id=803}: vector::sinks::util::retries: Not retriable; dropping the request. reason="response status: 401 Unauthorized"
Additional Context
No response
References
No response
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 3
- Comments: 16 (10 by maintainers)
Hi @spencergilbert, while trying to reproduce this issue we ran into another problem, but that one is pretty straightforward. Vector currently uses the /pods API to identify pods from resources: https://vector.dev/docs/reference/configuration/sources/kubernetes_logs/#kubernetes-api-access-control
Kubernetes has two APIs for getting cluster pod details:
- Endpoints API: kubectl get --raw '/api/v1/namespaces/default/pods?watch=1'
- EndpointSlices: kubectl get --raw /apis/discovery.k8s.io/v1/watch/namespaces/default/endpointslices
The default Endpoints API that is currently being used chokes the network at scale, and hence fails both to list pods and to fetch logs. A better alternative is EndpointSlices; more documentation is available here: https://kubernetes.io/blog/2020/09/02/scaling-kubernetes-networking-with-endpointslices/
I took a proper look at what was causing the above issue, and it seems the disabled healthcheck is the reason.
According to what I see here: https://github.com/vectordotdev/vector/blob/a4782f83b0fa3ac8d6ca1f56a4990687e338ed06/src/sinks/gcs_common/config.rs#L128-L144
The token regenerator is spawned only after the healthcheck response is received (worth mentioning: it is spawned regardless of the healthcheck status).
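To illustrate the pattern, here is a minimal sketch (assuming a tokio runtime; GcpAuthenticator, spawn_regenerate_token, and healthcheck below are hypothetical stand-ins, not Vector's actual types or code): the background token-refresh task is only started from inside the healthcheck future, so with the healthcheck disabled it is never spawned and the initial token eventually expires.

```rust
use std::time::Duration;

// Hypothetical stand-in for the GCP authenticator; not Vector's real type.
#[derive(Clone)]
struct GcpAuthenticator;

impl GcpAuthenticator {
    // Spawns a background task that periodically refreshes the OAuth token.
    fn spawn_regenerate_token(&self) {
        let auth = self.clone();
        tokio::spawn(async move {
            loop {
                // A real implementation would fetch a fresh token here.
                tokio::time::sleep(Duration::from_secs(30 * 60)).await;
                let _ = &auth;
            }
        });
    }
}

// The refresh task is only started from within the healthcheck future,
// after the healthcheck response comes back (pass or fail). If the sink's
// healthcheck never runs, the token is never refreshed and requests start
// failing with 401 once it expires (roughly an hour later).
async fn healthcheck(auth: GcpAuthenticator) -> Result<(), String> {
    let response_ok = true; // placeholder for a HEAD request to the bucket
    auth.spawn_regenerate_token();
    if response_ok {
        Ok(())
    } else {
        Err("healthcheck failed".to_string())
    }
}

#[tokio::main]
async fn main() {
    let auth = GcpAuthenticator;
    // With the healthcheck enabled, the refresh task gets spawned here.
    let _ = healthcheck(auth).await;
}
```

Under that reading, leaving the sink healthcheck at its default (enabled) means the refresh task is started as part of sink startup, which matches the observation below that enabling it avoids the 401s.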
So for me, enabling the healthcheck was the fix. It's quite obscure though: is this behaviour expected, or should we add a warning to the documentation or open a PR with a fix?
cc @spencergilbert @srinidhis94
👍 would you mind opening an issue to track this?