vector: CPU increase and memory leak

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Hello Vector team, we see a CPU increase and a memory leak in only one of the agent pods. We run the agents as a DaemonSet (one Vector pod per node) and the aggregators as a StatefulSet. The agents send the Kubernetes logs to the aggregators. Is this a bug, or did we misconfigure something?

image

Configuration

  affinity: {}
  args:
  - -w
  - --config-dir
  - /etc/vector/
  - --log-format
  - json
  - -vv
  autoscaling:
    behavior: {}
    customMetric: {}
    enabled: false
    maxReplicas: 10
    minReplicas: 1
    targetCPUUtilizationPercentage: 80
  command: []
  commonLabels: {}
  containerPorts: []
  customConfig:
    api:
      enabled: false
    data_dir: /vector-data-dir
    sinks:
      prom_exporter:
        address: 0.0.0.0:9598
        inputs:
        - internal_metrics
        type: prometheus_exporter
      splunk_hec_dev_test:
        buffer:
          type: memory
        compression: gzip
        default_token: cf69e945-a1b2-a1b2-a1b2-be6777bea1b2
        encoding:
          codec: json
        endpoint: https://hec.a1b2.a234.log.cde.net.abc:443
        index: ugw_statistics
        inputs:
        - transform_remap_gateway_proxy_accesslog_test
        type: splunk_hec_logs
      vector_aggregator:
        address: abc-efgh-vector-aggregator.vector-aggregator.svc.cluster.local:7500
        inputs:
        - kubernetes_logs
        type: vector
    sources:
      internal_metrics:
        scrape_interval_secs: 2
        type: internal_metrics
      kubernetes_logs:
        glob_minimum_cooldown_ms: 1000
        max_read_bytes: 8192
        type: kubernetes_logs
    transforms:
      transform_remap_gateway_proxy_accesslog:
        inputs:
        - transform_route_gateway_proxy.access_log
        source: |
          ., err = parse_json(.message)
          if (err != null) {
            log("Remap gateway-proxy access_log, Unable to parse json: " + err, "error")
            abort
          }
          if (!exists(.targetLog) || .targetLog != "ABC-API-INSIGHTS-ACCESS-LOG") {
            abort
          }
          del(.targetLog)
          # verify insightsFields is not empty
          if (!exists(.insightsFields) || .insightsFields == null || .insightsFields == "") {
            log("Remap gateway-proxy access_log, insightsFields field is missing.")
            abort
          }
          result, err = merge(., .insightsFields)
          if (err != null) {
              log("Remap gateway-proxy access_log, Unable to merge insightsFields: " + err, "error")
          } else {
            . = result
          }
          del(.insightsFields)
          if (!exists(.insightsProfile) || .insightsProfile == null || .insightsProfile == "") {
            log("Remap gateway-proxy access_log, insightsProfile field is missing.")
            abort
          }
          path_parts, err = parse_regex(.path, r'(?P<pathname>[^?#]*)(?P<query>.*)')
          if (err != null) {
            log("Remap gateway-proxy access_log, Unable to parse regex: " + err, "error")
            abort
          }
          .path = path_parts.pathname
          if (.serviceTime != null) {
            .serviceTime = to_int!(.serviceTime)
          }
          if (.jwtZid != null) {
            .spcTenantId = .jwtZid
          }
          del(.jwtZid)
          if (.abcaaTenantId != null) {
            .spcTenantId = .abcaaTenantId
          }
          del(.abcaaTenantId)
          if (.jwtValStatus == null || .jwtSkipFailureReporting == true) {
            del(.jwtValStatus)
          }
          del(.jwtSkipFailureReporting)
          if (.rateLimitLimit == null) {
            del(.rateLimitLimit)
          }
          if (.rateLimitRemaining == null) {
            del(.rateLimitRemaining)
          }
          if (.rateLimitReset == null) {
            del(.rateLimitReset)
          }
          if (.rateLimitPolicy == null) {
            del(.rateLimitPolicy)
          }
        type: remap
      transform_remap_gateway_proxy_accesslog_all:
        inputs:
        - transform_remap_gateway_proxy_accesslog
        source: |
          if (.insightsProfile == "vector-e2e-test") {
            abort
          }
        type: remap
      transform_remap_gateway_proxy_accesslog_test:
        inputs:
        - transform_remap_gateway_proxy_accesslog
        source: |
          if (.insightsProfile != "vector-e2e-test") {
            abort
          }
          del(.insightsProfile)
        type: remap
      transform_route_gateway_proxy:
        inputs:
        - kubernetes_logs
        route:
          '*': .kubernetes.container_name == "gateway-proxy"
          access_log: starts_with(string!(.message), "{") && ends_with(string!(.message),
            "}")
        type: route
  dataDir: ""
  defaultVolumeMounts:
  - mountPath: /var/log/
    name: var-log
    readOnly: true
  - mountPath: /var/lib
    name: var-lib
    readOnly: true
  defaultVolumes:
  - hostPath:
      path: /var/log/
    name: var-log
  - hostPath:
      path: /var/lib/
    name: var-lib
  dnsConfig: {}
  dnsPolicy: ClusterFirst
  env: []
  envFrom: []
  existingConfigMaps: []
  extraContainers: []
  extraVolumeMounts: []
  extraVolumes:
  - name: secret-volume
    secret:
      secretName: vector-values
  fullnameOverride: ""
  global: {}
  haproxy:
    affinity: {}
    autoscaling:
      customMetric: {}
      enabled: false
      maxReplicas: 10
      minReplicas: 1
      targetCPUUtilizationPercentage: 80
    containerPorts: []
    customConfig: ""
    enabled: false
    existingConfigMap: ""
    extraContainers: []
    extraVolumeMounts: []
    extraVolumes: []
    image:
      pullPolicy: IfNotPresent
      pullSecrets: []
      repository: haproxytech/haproxy-alpine
      tag: 2.6.12
    initContainers: []
    livenessProbe:
      tcpSocket:
        port: 1024
    nodeSelector: {}
    podAnnotations: {}
    podLabels: {}
    podPriorityClassName: ""
    podSecurityContext: {}
    readinessProbe:
      tcpSocket:
        port: 1024
    replicas: 1
    resources: {}
    rollWorkload: true
    securityContext: {}
    service:
      annotations: {}
      externalTrafficPolicy: ""
      ipFamilies: []
      ipFamilyPolicy: ""
      loadBalancerIP: ""
      ports: []
      topologyKeys: []
      type: ClusterIP
    serviceAccount:
      annotations: {}
      automountToken: true
      create: true
    strategy: {}
    terminationGracePeriodSeconds: 60
    tolerations: []
  image:
    pullPolicy: IfNotPresent
    pullSecrets:
    - name: vector
    repository: build-releases-external.common.cdn.repositories.cloud.abc/timberio/vector
    tag: 0.32.1-distroless-libc
  ingress:
    annotations: {}
    className: ""
    enabled: false
    hosts: []
    tls: []
  initContainers: []
  lifecycle: {}
  livenessProbe: {}
  logLevel: info
  minReadySeconds: 0
  nameOverride: ""
  nodeSelector: {}
  persistence:
    accessModes:
    - ReadWriteOnce
    enabled: false
    existingClaim: ""
    finalizers:
    - kubernetes.io/pvc-protection
    hostPath:
      enabled: true
      path: /var/lib/vector
    selectors: {}
    size: 10Gi
  podAnnotations: {}
  podDisruptionBudget:
    enabled: false
    minAvailable: 1
  podHostNetwork: false
  podLabels:
    sidecar.istio.io/inject: "true"
    vector.dev/exclude: "true"
  podManagementPolicy: OrderedReady
  podMonitor:
    additionalLabels: {}
    enabled: false
    honorLabels: false
    honorTimestamps: true
    jobLabel: app.kubernetes.io/name
    metricRelabelings: []
    path: /metrics
    port: prom-exporter
    relabelings: []
  podPriorityClassName: ""
  podSecurityContext: {}
  psp:
    create: false
  rbac:
    create: true
  readinessProbe: {}
  replicas: 1
  resources:
    limits:
      cpu: 1000m
      memory: 2Gi
    requests:
      cpu: 10m
      memory: 128Mi
  role: Agent
  rollWorkload: false
  secrets:
    generic: {}
  securityContext: {}
  service:
    annotations: {}
    enabled: true
    externalTrafficPolicy: ""
    ipFamilies: []
    ipFamilyPolicy: ""
    loadBalancerIP: ""
    ports: []
    topologyKeys: []
    type: ClusterIP
  serviceAccount:
    annotations: {}
    automountToken: true
    create: true
  serviceHeadless:
    enabled: true
  terminationGracePeriodSeconds: 60
  tolerations:
  - effect: NoSchedule
    key: WorkGroup
    operator: Equal
    value: abproxy
  - effect: NoExecute
    key: WorkGroup
    operator: Equal
    value: abproxy
  topologySpreadConstraints: []
  ugwSecretConfigEnabled: false
  updateStrategy: {}
  workloadResourceAnnotations: {}

Version

0.32.1-distroless-libc

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

About this issue

  • Original URL
  • State: open
  • Created 8 months ago
  • Reactions: 6
  • Comments: 21 (9 by maintainers)

Most upvoted comments

Thanks, yes, an upgrade is certainly an option; we’ll try it.

No problem! The input of the transform is internal_metrics, and prometheus_exporter consumes it. You are correct that at this point the metric has already been collected, and I believe you are also correct that dropping it after internal_metrics does not reduce Vector’s memory usage. The only reason I did this is to limit the cardinality of these metrics stored in Prometheus after they have been scraped from the Vector pod.

I did some investigation into this, and my findings agree with @jszwedko’s comment above. I can also confirm that setting expire_metrics_secs fixed it for me on v0.31 (and presumably on other versions below 0.35).

In our case, a few nodes in the cluster had much heavier pod churn than the rest, and the Vector agents we saw with significantly increasing CPU and memory usage were exclusively on those nodes. One thing I noticed about internal_metrics is that the vector_component_received_*_total metrics include pod_name as a label. When the Vector agent is on a node with heavy pod churn, the cardinality of this metric grows as every new pod is created.

I expected this might cause extra load on Prometheus, but not on Vector itself. However, it appears Vector’s default behavior (at least in v0.31) is that metrics continue to be emitted for every pod that has ever existed on the node. I could see this reflected in the rate at which events were sent from internal_metrics to the pod’s prometheus_exporter sink: rather than being constant, this rate was increasing. I believe this is the cause of the memory (and CPU) growth over time; the amount of data sent from the internal_metrics source increases with every new pod creation, but never decreases when an old pod is deleted.

Screenshot 2024-01-25 at 11 13 44 AM

To fix: I set the (by default unset) global option expire_metrics_secs, as suggested by @jszwedko.
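
For reference, a minimal sketch of where that option goes with the Helm chart values layout used above; the root of customConfig maps to the top level of Vector’s configuration file, and the 300-second window is only an illustrative value, not one taken from this thread:

customConfig:
  # Global option (unset by default): expire metric contexts that have not
  # been updated for this many seconds, so series for deleted pods stop
  # being emitted. 300 is an illustrative value, not a recommendation.
  expire_metrics_secs: 300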

I also created a transform that drops the pod_name tag from the metrics that carry it (and an analogous transform for the metrics which have one file tag per pod). I did this because I am not currently using the pod_name label at all in my dashboards or alerts. This is not necessary for controlling CPU and memory growth once you enable expire_metrics_secs, but I had no use for that label and it helps reduce load on Prometheus.
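
A rough sketch of that kind of transform, using the same customConfig layout as the agent configuration above (the transform name is made up for illustration; on metric events, tags are accessible under .tags in VRL):

customConfig:
  transforms:
    internal_metrics_drop_pod_name:
      # Hypothetical name; drops the per-pod tag before export.
      type: remap
      inputs:
      - internal_metrics
      source: |
        # Metric events expose their tags under .tags in VRL.
        del(.tags.pod_name)
  sinks:
    prom_exporter:
      type: prometheus_exporter
      address: 0.0.0.0:9598
      inputs:
      - internal_metrics_drop_pod_name

The prometheus_exporter sink then consumes the transform’s output instead of reading from internal_metrics directly. This only bounds what is exported; expiring the stale series inside Vector itself is what expire_metrics_secs does.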

Mem utilization before and after: Screenshot 2024-01-29 at 9 46 30 AM

CPU utilization before and after: Screenshot 2024-01-29 at 9 54 14 AM

(In both of these graphs, the very steep line before the fix is the node with very high pod churn)

Do you see the cardinality of the metrics exposed by the prometheus_exporter sink growing over time? That’s one hunch I would have: that the sink is receiving ever more metric series.

This could be due to some components publishing telemetry with unbounded cardinality (like the file source which tagged internal metrics with a file tag). In v0.35.0, these “high cardinality” tags were removed or changed to be opt-in. Could you try that version?

There is also https://vector.dev/docs/reference/configuration/global-options/#expire_metrics_secs which you can use to expire stale metric contexts.

Hello @pront, we upgraded to version 0.34.1. We see an improvement, but can still see memory and CPU growth.

image

It seems like the upgrade to 0.34.1 resolved the issue with the memory leak and CPU increase. Screen Shot 2023-11-22 at 10 34 40 AM

We upgraded to version 0.34.0, but still see CPU and memory growth.

image

Hmm, I keyed on the Prometheus component in the config above and misread it as a source (I’m so used to reading sources first). I see you don’t have any metric sources other than internal_metrics, which should have effectively fixed cardinality for its metrics, so that was a red herring. However, there were a couple of buffer-related memory leaks that were addressed between versions 0.32.0 and 0.34.0. Are you able to upgrade to 0.34.0?