vector: Datadog Agent Source Regression in v0.24.x
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
We noticed the following issue when upgrading from v0.23.3 to v0.24.0:
(1) CPU usage became spiky/less consistent
Charts in the original issue show CPU usage exploding while Datadog forwarder error rates and HAProxy 5xx rates climb.

(2) 504 error codes reported by the Datadog Agents writing to Vector, only for the /api/beta/sketches endpoint:
2022-11-15 22:52:12 UTC | CORE | ERROR | (pkg/forwarder/worker.go:184 in process) | Error while processing transaction: error "504 Gateway Time-out" while sending transaction to "http://vector-haproxy.vector.svc.cluster.local:6000/api/beta/sketches", rescheduling it: "<html><body><h1>504 Gateway Time-out</h1>\nThe server didn't respond in time.\n</body></html>\n"
This is also apparent in errors surfacing from HAProxy (deployed via the Vector Helm chart). HAProxy is using a leastconn balance strategy.
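For context, the Agents reach Vector through HAProxy at the URL in the log above. Below is a minimal sketch of how an Agent is typically pointed at Vector via the vector section of datadog.yaml; the reporter's actual Agent configuration wasn't shared, so treat this as an assumption, not their setup:

# datadog.yaml (sketch): route Agent metrics through HAProxy to the
# Vector datadog_agent source. The DD_VECTOR_METRICS_* environment
# variables are the equivalent knobs.
vector:
  metrics:
    enabled: true
    url: "http://vector-haproxy.vector.svc.cluster.local:6000"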
(3) Error logs from Vector about shutting down connections:
{"host":"vector-599576bd9b-w2bq7","message":"error shutting down IO: Transport endpoint is not connected (os error 107)","metadata":{"kind":"event","level":"DEBUG","module_path":"hyper::proto::h1::conn","target":"hyper::proto::h1::conn"},"pid":1,"source_ty
pe":"internal_logs","timestamp":"2022-11-16T20:11:46.533762489Z"}
{"host":"vector-599576bd9b-w2bq7","message":"connection error: error shutting down connection: Transport endpoint is not connected (os error 107)","metadata":{"kind":"event","level":"DEBUG","module_path":"hyper::server::server::new_svc","target":"hyper::se
rver::server::new_svc"},"pid":1,"source_type":"internal_logs","timestamp":"2022-11-16T20:11:46.533777847Z"}
Configuration
data_dir: /vector-data-dir
api:
  enabled: true
  address: 127.0.0.1:8686
  playground: false
sources:
  internal_logs:
    type: internal_logs
  # Datadog Agent telemetry
  datadog_agent:
    type: datadog_agent
    address: "0.0.0.0:6000"
    multiple_outputs: true # To automatically separate metrics and logs
sinks:
  console:
    type: console
    inputs:
      - internal_logs
    target: stdout
    encoding:
      codec: json
  # Datadog metrics output
  datadog_metrics:
    type: datadog_metrics
    inputs:
      - <inputs>...
    api_key: "${DATADOG_API_KEY}"
Version
0.24.0-distroless-libc
Debug Output
I can only recreate this issue in critical environments where I can't capture this debug output :(
Example Data
I’m not sure what the Datadog Agent is sending to this endpoint
Additional Context
We’re running in AWS EKS 1.21.
References
No response
About this issue
- State: closed
- Created 2 years ago
- Reactions: 3
- Comments: 74 (41 by maintainers)
🙇 Sorry for the delay @neuronull ! I appreciate backporting this 🙇
I’m definitely still planning to 2x the env and run the test, just got caught up in a migration and went heads down on that. I’ll be able to come back to this on Wednesday or Thursday of this week!
Sounds good! Yeah I was beginning to look at https://github.com/vectordotdev/vector/pull/13973 since we only see the errors come up for distribution data, but I need a Rust pro to help out. That said, if any tagged releases can be pushed out that have any changes to test in our env, I can easily deploy them.
See y’all next year!
So sorry for the delay @neuronull, but I’m working through the nightlies now. Will get you a report this week so it’s ready for the new year 🙇
Just correcting myself: that change I was pointing out was included in v0.23.0, so the whole line of inquiry that followed is invalid.
@jszwedko yeah, tried setting it to false and it didn’t change anything. That is the full configuration, other than the Vector configuration in the original issue body, which I’ll re-paste below:
Hey! We’d definitely be interested to know if this fix resolves it for you too. Would you be able to try the latest nightly build? It will include this change.
@neuronull 👋 I’ve picked up this bug from Jon Winton at Cash. Since it looks like there’s a fix out for this, would you be able to provide us with a test image containing the fix that we can demo?
@neuronull amazing! Thanks for this! I’m going oncall for our team tomorrow and will definitely test it out then 😬
Thanks a bunch @jonwinton ! This is essentially what we expected to see.
I’ll dive into the performance of that algorithm.
Ok! Working on this now!
Thanks! No worries.
Yes, the key being to over provision by roughly 2x. If it auto scales up beyond that that is ok but the idea is exactly like you said, see if the CPU usage and errors / metric hits return to normal with having 2x or more Vector instances.
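As an illustration of that over-provisioning suggestion, minReplicas on the HPA could be pinned to roughly twice the steady-state pod count. This is a hypothetical autoscaling/v1 manifest; the names and numbers are made up, not taken from the reporter's environment:

# Hypothetical HPA pinning a ~2x floor (assuming a steady state of ~8 pods).
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: vector        # hypothetical name
  namespace: vector
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vector
  minReplicas: 16     # ~2x the assumed steady state
  maxReplicas: 100
  targetCPUUtilizationPercentage: 80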
What release are you looking to have that backported into ?
v0.23? cc @jszwedko
@neuronull we use in-app Prometheus clients to generate metrics that are then collected with the Datadog Agent OpenMetrics integration (docs). The DD Agent then forwards them (though I’m not sure of the exact format) to Vector (docs).
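A minimal sketch of that collection path, assuming a standard OpenMetrics v2 check configuration; the endpoint and namespace here are hypothetical:

# conf.d/openmetrics.d/conf.yaml (sketch): scrape an in-app Prometheus
# endpoint; the Agent then forwards the resulting metrics upstream
# (to Vector, in this setup).
instances:
  - openmetrics_endpoint: "http://app.default.svc:9090/metrics"  # hypothetical
    namespace: "myapp"                                           # hypothetical
    metrics:
      - ".*"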
Not at all! Bring them on 👍
Of course! Let me check this now!
@neuronull the HAProxy config is here, but pasting below without the Vector config:
So this comes up in our staging environment where we have the following:
When we deploy any SHA/version beyond v0.23.3, the HPA on Vector tries to run 100+ Vector pods, and we still see requests consistently failing even as the HPA scales out.
Great to hear that the build had the expected outcome!
Yes, the next steps are for us to figure out what is wrong with that commit 😃
We’ll definitely want to maintain the fix functionality. Good to know you also require that. Will keep this thread posted on progress~
Oh dang, looking at that PR more, we’re interested in maintaining the fix for this bug: https://github.com/vectordotdev/vector/issues/13870
We’re also dealing with interval issues, and this would be helpful once we can safely upgrade 😬
@neuronull confirmed that we don’t see the same issue with this version! Thank you for pushing that version out 🙇
I guess next steps would be entirely on y’all’s end?
Perfect! I’ll test this now
@jonwinton, sorry for the spam! Ignore that last comment’s instructions (there are some incorrect bits). We’re working on improving the procedure to follow. In the meantime, I’ll create the image and push it to the vector repo for you.
@neuronull dang, nice digging! 🙇
A private image works, or if you can give me the build commands for generating the libc image I can push an image into our private ECR repo. I tried digging into the build pipeline a bit, but my lack of Rust familiarity is holding me back a bit.
Hey! Thanks for all the details! I’ll try and answer everything here and will come back with deeper answers for some things I need to retrieve/update log levels for.
Let me go get some of those a little later. Pre-planned DR game day going on so a little distracted 😬
🤦 I’m sad I didn’t think about doing this already. I will definitely do this.
Multiple times!
Yeah, we jumped onto vector in the 0.1x versions and slowly bumped up each version until 0.24.0
One other piece of context that might be helpful: this issue first appeared in our largest staging environment, so I’m wondering if it’s related to volume. Are y’all running load-testing benchmarks in CI? A screenshot in the original thread shows the load at which we first encounter the error.
@neuronull the version of the Datadog Agent has been locked to 7.39.1 for the duration of this test.
@spencergilbert I think we’re going to be stuck on the autoscaling/v1 API for the next 3-6 months, so if it’s possible to support those versions, that would be amazing 🙇
@neuronull here we go:
We’re looking into this, will keep you posted, @jonwinton !