vector: Datadog Agent Source Regression in v0.24.x

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We noticed the following issue when upgrading from v0.23.3 to v0.24.0:

(1) CPU became spiky/less consistent

The charts below show CPU exploding while Datadog forwarder error rates and HAProxy 5xx rates climb:

[Screenshot, 2022-11-17: CPU, Datadog forwarder error rate, and HAProxy 5xx charts]

(2) 504 error codes returned to the Datadog Agents writing to Vector, but only for the /api/beta/sketches endpoint

2022-11-15 22:52:12 UTC | CORE | ERROR | (pkg/forwarder/worker.go:184 in process) | Error while processing transaction: error "504 Gateway Time-out" while sending transaction to "http://vector-haproxy.vector.svc.cluster.local:6000/api/beta/sketches", rescheduling it: "<html><body><h1>504 Gateway Time-out</h1>\nThe server didn't respond in time.\n</body></html>\n"

This is also apparent in errors surfacing from HAProxy (deployed via the Vector Helm chart); the 504s line up with HAProxy's 10s server timeout firing when Vector doesn't respond in time. HAProxy is using a leastconn balance strategy.

(3) Debug-level logs from Vector reporting errors while shutting down connections

{"host":"vector-599576bd9b-w2bq7","message":"error shutting down IO: Transport endpoint is not connected (os error 107)","metadata":{"kind":"event","level":"DEBUG","module_path":"hyper::proto::h1::conn","target":"hyper::proto::h1::conn"},"pid":1,"source_ty
pe":"internal_logs","timestamp":"2022-11-16T20:11:46.533762489Z"}
{"host":"vector-599576bd9b-w2bq7","message":"connection error: error shutting down connection: Transport endpoint is not connected (os error 107)","metadata":{"kind":"event","level":"DEBUG","module_path":"hyper::server::server::new_svc","target":"hyper::se
rver::server::new_svc"},"pid":1,"source_type":"internal_logs","timestamp":"2022-11-16T20:11:46.533777847Z"}
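
Note: both lines are DEBUG-level hyper logs, so they only surface when Vector's log level is raised above the default. With the Helm chart, one way to do that (a hypothetical values snippet, not our exact setup) is:

env:
  - name: VECTOR_LOG
    value: debug  # default is "info"; debug is needed to see the hyper lines above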

Configuration

data_dir: /vector-data-dir

api:
  enabled: true
  address: 127.0.0.1:8686
  playground: false

sources:
  internal_logs:
    type: internal_logs

  # Datadog Agent telemetry
  datadog_agent:
    type: datadog_agent
    address: "0.0.0.0:6000"
    multiple_outputs: true # To automatically separate metrics and logs

sinks:
  console:
    type: console
    inputs:
      - internal_logs
    target: stdout
    encoding:
      codec: json

  # Datadog metrics output
  datadog_metrics:
    type: datadog_metrics
    inputs:
      - <inputs>...
    api_key: "${DATADOG_API_KEY}"

Version

0.24.0-distroless-libc

Debug Output

I can only reproduce this issue in critical environments where I can't capture this debug output :(

Example Data

I’m not sure what the Datadog Agent is sending to this endpoint

Additional Context

We're running in AWS EKS 1.21.

References

No response

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 3
  • Comments: 74 (41 by maintainers)

Most upvoted comments

🙇 Sorry for the delay @neuronull! I appreciate you backporting this 🙇

I’m definitely still planning to 2x the env and run the test, just got caught up in a migration and went heads down on that. I’ll be able to come back to this on Wednesday or Thursday of this week!

Sounds good! Yeah, I was beginning to look at https://github.com/vectordotdev/vector/pull/13973 since we only see the errors come up for distribution data, but I need a Rust pro to help out. That said, if any tagged releases with changes to test can be pushed out, I can easily deploy them in our env.

See y’all next year!

So sorry for the delay @neuronull, but I’m working through the nightlies now. Will get you a report this week so it’s ready for the new year 🙇

v0.24.0 included #13406, which contained a fix in the datadog_metrics sink.

Just correcting myself: that change I was pointing out was actually included in v0.23.0, so the whole line of inquiry that followed is invalid.

@jszwedko yeah, tried setting store_api_key to false and it didn't change anything.

That is the full configuration, other than the Vector configuration from the original issue body, which I'll re-paste below:

data_dir: /vector-data-dir

api:
  enabled: true
  address: 127.0.0.1:8686
  playground: false

sources:
  internal_logs:
    type: internal_logs

  # Datadog Agent telemetry
  datadog_agent:
    type: datadog_agent
    address: "0.0.0.0:6000"
    multiple_outputs: true # To automatically separate metrics and logs
    store_api_key: false

sinks:
  console:
    type: console
    inputs:
      - internal_logs
    target: stdout
    encoding:
      codec: json

  # Datadog metrics output
  datadog_metrics:
    type: datadog_metrics
    inputs:
      - <inputs>...
    api_key: "${DATADOG_API_KEY}"

@neuronull 👋 I’ve picked up this bug from Jon Winton at Cash. Since it looks like there’s a fix out for this, would you be able to provide us with a test image containing the fix that we can demo?

Hey! We’d definitely be interested to know if this fix resolves it for you too. Would you be able to try the latest nightly build? It will include this change.

@neuronull amazing! Thanks for this! I’m going oncall for our team tomorrow and will definitely test it out then 😬

Thanks a bunch @jonwinton! This is essentially what we expected to see.

I’ll dive into the performance of that algorithm.

Ok! Working on this now!

yeah I can do that, but it might need to wait until tomorrow.

Thanks! No worries.

> Just to confirm the test: we want to run it scaled up at steady state for a set amount of time, to see if the CPU usage and error rates return to normal?

Yes, the key being to overprovision by roughly 2x. If it autoscales up beyond that, that's OK, but the idea is exactly as you said: see if the CPU usage and error/metric-hit rates return to normal with 2x or more Vector instances.
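
For the record, one way to pin that ~2x overprovisioning during the test window is to raise the autoscaler floor; a hypothetical values tweak (the numbers assume the ~20-pod steady state described elsewhere in this thread):

autoscaling:
  enabled: true
  minReplicas: 40  # ~2x the usual ~20 Vector pods, so the HPA can't drop below the test floor
  targetCPUUtilizationPercentage: 80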

Also, would it be possible to backport the interval fix (https://github.com/vectordotdev/vector/issues/15292#issuecomment-1372477689) while we continue to work on this? The interval issue is causing a lot of issues for us and it would be great to have a fix for that 🤞

What release are you looking to have that backported into? v0.23? cc @jszwedko

@neuronull we use in-app Prometheus clients to generate metrics that are then collected by the Datadog Agent OpenMetrics integration (docs). The DD Agent then forwards them (though I'm not sure of the exact format) to Vector (docs).
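
To make that pipeline concrete: the Agent's OpenMetrics check scrapes the apps' Prometheus endpoints, and with histogram_buckets_as_distributions enabled it submits Prometheus histograms as distributions, which is the data that ends up on /api/beta/sketches. A hypothetical app-pod annotation (container name and port are illustrative), mirroring the haproxy check shown elsewhere in this thread:

podAnnotations:
  ad.datadoghq.com/my-app.checks: |
    {
      "openmetrics": {
        "init_config": {},
        "instances": [{
          "openmetrics_endpoint": "http://%%host%%:9102/metrics",
          "histogram_buckets_as_distributions": true,
          "collect_counters_with_distributions": true
        }]
      }
    }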

Not at all! Bring them on 👍

Of course! Let me check this now!

@neuronull the HAProxy config is here, but pasting below without the Vector config:

haproxy:
  enabled: true
  image:
    tag: &version 2.6.6
  resources:
    requests:
      cpu: 1000m
      memory: 1Gi
    limits:
      cpu: 1000m
      memory: 1Gi

  autoscaling:
    enabled: true
    minReplicas: 9
    maxReplicas: 50

  podLabels:
    tags.datadoghq.com/service: vector-haproxy
    tags.datadoghq.com/version: *version

  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - vector
            topologyKey: failure-domain.beta.kubernetes.io/zone
          weight: 100

  podAnnotations:
    ad.datadoghq.com/haproxy.checks: |
      {
        "haproxy": {
          "init_config": {},
          "instances": [{
            "use_openmetrics": true,
            "openmetrics_endpoint": "http://%%host%%:1024/metrics",
            "histogram_buckets_as_distributions": true,
            "collect_counters_with_distributions": true
          }]
        }
      }

  customConfig: |
    global
      log stdout format raw local0
      maxconn 4096
      stats socket /tmp/haproxy
      hard-stop-after {{ .Values.haproxy.terminationGracePeriodSeconds }}s

    defaults
      log     global
      option  dontlognull
      retries 10
      option  redispatch
      option  allbackups
      timeout client 10s
      timeout server 10s
      timeout connect 5s

    resolvers coredns
      nameserver dns1 kube-dns.kube-system.svc.cluster.local:53
      resolve_retries 3
      timeout resolve 2s
      timeout retry 1s
      accepted_payload_size 8192
      hold valid 10s
      hold obsolete 60s

    frontend stats
      mode http
      bind :::1024
      option httplog
      http-request use-service prometheus-exporter if { path /metrics }
      stats enable
      stats hide-version  # Hide HAProxy version
      stats realm Haproxy\ Statistics  # Title text for popup window
      stats uri /haproxy_stats  # Stats URI
      stats refresh 10s


    frontend datadog-agent
      mode http
      bind :::6000
      option httplog
      option dontlog-normal
      default_backend datadog-agent

    backend datadog-agent
      mode http
      balance leastconn
      option tcp-check
      server-template srv 10 _datadog-agent._tcp.{{ include "vector.fullname" $ }}.{{ $.Release.Namespace }}.svc.cluster.local resolvers coredns check
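
For anyone following along: that server-template line discovers backends through Kubernetes DNS SRV records for the named port datadog-agent. With a headless Service, each Vector pod gets its own record, which is what lets HAProxy balance leastconn across individual pods; the Service behind it looks roughly like this (a hypothetical sketch, not the chart's exact manifest):

apiVersion: v1
kind: Service
metadata:
  name: vector
  namespace: vector
spec:
  clusterIP: None            # headless, so DNS returns per-pod records
  selector:
    app.kubernetes.io/name: vector
  ports:
    - name: datadog-agent    # the port name forms the "_datadog-agent._tcp" SRV label
      port: 6000
      protocol: TCP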

> I'm just curious if you could describe how many instances you have of the Datadog Agent running, and how many instances of Vector running?

So this comes up in our staging environment where we have the following:

  • ~150 nodes, all running a Datadog Agent
  • ~20 Vector pods running on v0.23.3. This is usually fewer, but we're scaled out a bit for testing some transforms.
  • 9 HAProxy pods

When we deploy any SHA/version beyond v0.23.3, the HPA on Vector tries to run 100+ Vector pods and we still see requests consistently failing even when HPA scales out.

Great to hear that the build had the expected outcome!

Yes, the next steps are for us to figure out what is wrong with that commit 😃

We’ll definitely want to maintain the fix functionality. Good to know you also require that. Will keep this thread posted on progress~

Oh dang, looking at that PR more, we’re interested in maintaining the fix for this bug: https://github.com/vectordotdev/vector/issues/13870

We’re also dealing with interval issues and this would be helpful once we can safely upgrade 😬

@neuronull confirmed that we don’t see the same issue with this version! Thank you for pushing that version out 🙇

I guess next steps would be entirely on y’all’s end?

Perfect! I’ll test this now

@jonwinton, sorry for the spam: ignore that last comment's instructions (there are some incorrect bits). We're working on improving the procedure to follow. In the meantime, I'll create the image and push it to the vector repo for you.

@neuronull dang, nice digging! 🙇

A private image works, or if you can give me the build commands for generating the libc image, I can push one into our private ECR repo. I tried digging into the build pipeline, but my lack of Rust familiarity is holding me back a bit.

Hey! Thanks for all the details! I'll try to answer everything here, and will come back with deeper answers for the things where I need to retrieve logs or update log levels.


> Do you have any additional log snippets you could provide? For example, is Vector logging any errors (not debug logs)? And, perhaps the error logs from HAProxy?

Let me go get some of those a little later. Pre-planned DR game day going on so a little distracted 😬

> Since we're having trouble reproducing it, would you be open to trying some nightly builds between v0.23.3 and v0.24.0, and observing the first build which exhibits the issue? This could give us hints as to which commits might have caused the issue and additionally raise confidence in the issue.

🤦 I’m sad I didn’t think about doing this already. I will definitely do this.

> Have you attempted this upgrade from 0.23.3 to 0.24.0 multiple times or just once?

Multiple times!

> Have you performed similar upgrades of Vector in this environment before?

Yeah, we jumped onto Vector in the 0.1x versions and slowly bumped up each version until 0.24.0.


One other piece of context that might be helpful: this issue first appeared in our largest staging environment, so I'm wondering if it's related to volume. Are y'all running load-testing benchmarks in CI? Screenshot below shows the load where we first encounter the error.

[Screenshot, 2022-12-07: traffic volume at which the errors first appear]

Forgot to ask as well: what version of the Datadog Agent are you using? Was it also upgraded or did its version stay the same? Can you share the DD Agent config as well?

@neuronull the version of the Datadog Agent has been locked to 7.39.1 for the duration of this test.

We’d be happy to get an issue opened for that support if that would help you upgrade.

@spencergilbert I think we’re going to be stuck on the autoscaling/v1 API for the next 3-6 months, so if it’s possible to support those versions that would be amazing 🙇
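
For context, autoscaling/v1 only supports a CPU utilization target; memory and custom-metric targets need autoscaling/v2beta2 or v2, which is why chart support matters on older clusters. A hypothetical v1 manifest for the aggregator (names and numbers are illustrative):

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: vector
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment  # assuming the Stateless-Aggregator role renders a Deployment
    name: vector
  minReplicas: 20
  maxReplicas: 100
  targetCPUUtilizationPercentage: 80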

@neuronull here we go:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
              - vector
          topologyKey: failure-domain.beta.kubernetes.io/zone
        weight: 100
autoscaling:
  enabled: true
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 60
commonLabels:
  tags.datadoghq.com/service: vector
  tags.datadoghq.com/version: "0.24.0"
env:
- name: DATADOG_API_KEY
  valueFrom:
    secretKeyRef:
      key: ...
      name: ...
- name: DD_HOSTNAME
  valueFrom:
    fieldRef:
      fieldPath: spec.nodeName
- name: VECTOR_POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.name
- name: VECTOR_VERSION
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['tags.datadoghq.com/version']
extraVolumeMounts:
- mountPath: /mnt/secrets-store
  name: vector-secrets
  readOnly: true
extraVolumes:
- csi:
    driver: secrets-store.csi.k8s.io
    readOnly: true
    volumeAttributes:
      secretProviderClass: ...
  name: vector-secrets
podDisruptionBudget:
  enabled: true
  minAvailable: 10%
podLabels:
  tags.datadoghq.com/service: vector
podPriorityClassName: system-cluster-critical
role: Stateless-Aggregator
rollWorkload: true
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: <role id>
  create: true
updateStrategy:
  rollingUpdate:
    maxSurge: 0
    maxUnavailable: 1
  type: RollingUpdate
haproxy:
  enabled: true
  image:
    tag: &version 2.6.6
  resources:
    requests:
      cpu: 1000m
      memory: 1Gi
    limits:
      cpu: 1000m
      memory: 1Gi

  autoscaling:
    enabled: true
    minReplicas: 9
    maxReplicas: 50

  podLabels:
    tags.datadoghq.com/service: vector-haproxy
    tags.datadoghq.com/version: *version

  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - vector
            topologyKey: failure-domain.beta.kubernetes.io/zone
          weight: 100

  podAnnotations:
    ad.datadoghq.com/haproxy.checks: |
      {
        "haproxy": {
          "init_config": {},
          "instances": [{
            "use_openmetrics": true,
            "openmetrics_endpoint": "http://%%host%%:1024/metrics",
            "histogram_buckets_as_distributions": true,
            "collect_counters_with_distributions": true
          }]
        }
      }

  customConfig: |
    global
      log stdout format raw local0
      maxconn 4096
      stats socket /tmp/haproxy
      hard-stop-after {{ .Values.haproxy.terminationGracePeriodSeconds }}s

    defaults
      log     global
      option  dontlognull
      retries 10
      option  redispatch
      option  allbackups
      timeout client 10s
      timeout server 10s
      timeout connect 5s

    resolvers coredns
      nameserver dns1 kube-dns.kube-system.svc.cluster.local:53
      resolve_retries 3
      timeout resolve 2s
      timeout retry 1s
      accepted_payload_size 8192
      hold valid 10s
      hold obsolete 60s

    frontend stats
      mode http
      bind :::1024
      option httplog
      http-request use-service prometheus-exporter if { path /metrics }
      stats enable
      stats hide-version  # Hide HAProxy version
      stats realm Haproxy\ Statistics  # Title text for popup window
      stats uri /haproxy_stats  # Stats URI
      stats refresh 10s


    frontend datadog-agent
      mode http
      bind :::6000
      option httplog
      option dontlog-normal
      default_backend datadog-agent

    backend datadog-agent
      mode http
      balance leastconn
      option tcp-check
      server-template srv 10 _datadog-agent._tcp.{{ include "vector.fullname" $ }}.{{ $.Release.Namespace }}.svc.cluster.local resolvers coredns check

We’re looking into this, will keep you posted, @jonwinton !