opentelemetry-collector-contrib: Performance Regression Upgrading from v0.71.0 -> v0.73.0
Component(s)
No response
What happened?
Description
Last week we rolled an upgraded version of OpenTelemetry Collector Contrib out to our production environment and started having memory and CPU problems, resulting in dropped metrics and high-CPU alarms.
We upgraded straight from v0.71.0 to v0.73.0, so we don't know which intermediate version introduced the regression.
Steps to Reproduce
Expected Result
Actual Result
You can see from the ECS metrics below that the point where CPU and memory drop is where we rolled back to v0.71.0.
The gap before those lines is where the sidecar was dropping data because it was overwhelmed.
Collector version
v0.73.0
Environment information
Environment
OS: ECS Fargate Linux
OpenTelemetry Collector configuration
---
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: sidecar
          scrape_interval: 60s
          metrics_path: /metrics
          scheme: http
          static_configs:
            - targets:
                - localhost:8080
        - job_name: otel
          scrape_interval: 60s
          metrics_path: /metrics
          scheme: http
          static_configs:
            - targets: ["localhost:8888"]
  fluentforward/net:
    endpoint: 0.0.0.0:34334
  fluentforward/sock:
    endpoint: unix://var/run/fluent.sock
  awsecscontainermetrics:
    collection_interval: 60s
  statsd:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  jaeger:
    protocols:
      grpc:
      thrift_http:
      thrift_compact:
      thrift_binary:
  zipkin:

exporters:
  logging:
    verbosity: basic
  sumologic:
    endpoint: <url>
    compress_encoding: "gzip"
    max_request_body_size: 1_048_576 # 1MB
    log_format: "json"
    metric_format: "prometheus"
    source_category: ${ENVIRONMENT}/${SERVICE_NAME}
    source_name: ${SERVICE_NAME}
    # source_host: ${TASK}
  loki:
    endpoint: <url>
    format: json
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
    labels:
      attributes:
        team:
        service:
      resource:
        account:
        region:
        environment:
  otlphttp/tempo:
    endpoint: <url>
  prometheusremotewrite/mimir:
    endpoint: <url>
    resource_to_telemetry_conversion:
      enabled: true
  prometheus:
    endpoint: 0.0.0.0:9273
    namespace: ${SERVICE_NAME}
    resource_to_telemetry_conversion:
      enabled: true

{{- $resourceDetection := map "aws.ecs.task.launch_type" "launch_type" "aws.ecs.task.version" "task_version" "cloud.account.id" "account" "cloud.region" "region" "cloud.subnet" "subnet" "cloud.vpc" "vpc" "cloud.availability_zone" "availability_zone" "cloud.availability_zone.id" "availability_zone_id" "aws.ecs.cluster.arn" "" "aws.ecs.task.arn" "" "aws.ecs.task.family" "task_family" "aws.ecs.task.revision" "task_revision" }}
{{- $metricAttributeRemap := map "aws.ecs.task.revision" "task_revision" "cloud.availability_zone" "availability_zone" "cloud.account.id" "account" "cloud.region" "region" }}
{{- $deleteFields := split "aws.ecs.task.pull.started_at,aws.ecs.task.known_status,aws.ecs.task.arn,aws.ecs.cluster.name,aws.ecs.task.pull_started_at,aws.ecs.task.pull_stopped_at,aws.ecs.service.name,container.id,aws.ecs.docker.name,aws.ecs.container.image.id,aws.ecs.container.exit_code,aws.ecs.container.created_at,aws.ecs.container.know_status,aws.ecs.container.image.id,aws.ecs.container.started_at,aws.ecs.container.finished_at,aws.ecs.launchtype" "," }}

processors:
  resource:
    attributes:
      - key: subnet
        action: insert
        value: ${SUBNET_ID}
      - key: environment
        action: upsert
        value: ${ENVIRONMENT}
      - key: cloud.availability_zone.id
        action: insert
        value: ${AVAILABILITY_ZONE_ID}
      - key: subnet
        action: insert
        value: ${SUBNET_ID}
      - key: vpc
        action: insert
        value: ${VPC}
      - key: aws.ecs.task.arn
        action: extract
        pattern: \/(?P<task_id>\w+)$
      - key: aws.ecs.task.id
        action: delete
      - key: aws.ecs.cluster.arn
        action: extract
        pattern: \/(?P<cluster_name>\w+)$
      - key: aws.ecs.cluster.arn
        action: delete
      # Remap resource detection fields
      {{- range $k, $v := $resourceDetection }}
      {{- if ne $v "" }}
      - key: {{ $v }}
        from_attribute: {{ $k }}
        action: upsert
      - key: {{ $k }}
        action: delete
      {{- end }}
      {{- end }}
      # Delete Resource Fields
      {{- range $k, $v := $deleteFields }}
      {{- if ne $v "" }}
      - key: {{ $v }}
        action: delete
      {{- end }}
      {{- end }}
  attributes:
    actions:
      - key: service
        action: upsert
        value: ${SERVICE_NAME}
      - key: team
        action: upsert
        value: ${BILLING_TEAM}
      {{- range $k, $v := $metricAttributeRemap }}
      {{- if ne $v "" }}
      - key: {{ $v }}
        from_attribute: {{ $k }}
        action: upsert
      - key: {{ $k }}
        action: delete
      {{- end }}
      {{- end }}
      {{- range $k, $v := $deleteFields }}
      {{- if ne $v "" }}
      - key: {{ $v }}
        action: delete
      {{- end }}
      {{- end }}
  resource/cloud:
    attributes:
      - key: cloud.availability_zone.id
        action: insert
        value: ${AVAILABILITY_ZONE_ID}
      - key: cloud.subnet.id
        action: insert
        value: ${SUBNET_ID}
      - key: cloud.vpc.id
        action: insert
        value: ${VPC}
      - key: route.environment
        action: insert
        value: ${ENVIRONMENT}
      - key: route.service
        action: insert
        value: ${SERVICE_NAME}
      - key: route.billing.team
        action: insert
        value: ${BILLING_TEAM}
      - key: aws.ecs.task.arn
        action: extract
        pattern: \/(?P<task_id>\w+)$
      - key: aws.ecs.task.id
        action: insert
        from_attribute: task_id
      - key: aws.ecs.cluster.arn
        action: extract
        pattern: \/(?P<cluster_name>\w+)$
      - key: aws.ecs.cluster.name
        action: insert
        from_attribute: cluster_name
      - key: aws.ecs.task.pull_started_at
        action: delete
      - key: aws.ecs.task.pull_stopped_at
        action: delete
      - key: aws.ecs.task.known_status
        action: delete
      - key: aws.ecs.launch_type
        action: delete
      - key: aws.ecs.container.created_at
        action: delete
      - key: aws.ecs.container.started_at
        action: delete
      - key: aws.ecs.container.finished_at
        action: delete
      - key: aws.ecs.container.know_status
        action: delete
      - key: aws.ecs.docker.name
        action: delete
      - key: aws.ecs.container.image.id
        action: delete
      - key: aws.ecs.container.exit_code
        action: delete
  attributes/cleanup:
    actions:
      - key: cluster_name
        action: delete
      - key: ecs_task_definition
        action: delete
      - key: fluent.tag
        action: delete
      - key: ecs_task_arn
        action: delete
      - key: task_id
        action: delete
  resourcedetection/ecs:
    detectors: [env, ecs]
    timeout: 2s
    override: false
    attributes:
      {{- range $k, $v := $resourceDetection }}
      - {{ $k }}
      {{- end }}
  probabilistic_sampler:
    hash_seed: 13
    sampling_percentage: 10
  probabilistic_sampler/tempo:
    hash_seed: 13
    sampling_percentage: 100
  memory_limiter:
    check_interval: 5s
    limit_mib: 128
    spike_limit_mib: 0
  batch:
    timeout: 200ms

extensions:
  memory_ballast:
    size_mib: 64

service:
  extensions: [memory_ballast]
  telemetry:
    logs:
      level: warn
  pipelines:
    metrics:
      receivers: [prometheus, statsd, otlp]
      processors: [memory_limiter, resourcedetection/ecs, resource/cloud, attributes/cleanup]
      exporters: [prometheus, sumologic]
    metrics/mimir:
      receivers: [awsecscontainermetrics, prometheus, statsd, otlp]
      processors: [memory_limiter, resourcedetection/ecs, resource, attributes, batch]
      exporters: [prometheusremotewrite/mimir]
    logs:
      receivers: [fluentforward/net, fluentforward/sock]
      processors: [memory_limiter, resourcedetection/ecs, resource/cloud, attributes/cleanup]
      exporters: [sumologic]
    logs/loki:
      receivers: [fluentforward/net, fluentforward/sock]
      processors: [memory_limiter, resourcedetection/ecs, resource, attributes, batch]
      exporters: [loki]
    traces:
      receivers: [otlp, zipkin, jaeger]
      processors: [memory_limiter, probabilistic_sampler, batch, resourcedetection/ecs, resource/cloud, attributes/cleanup]
      exporters: [logging]
    traces/tempo:
      receivers: [otlp, zipkin, jaeger]
      processors: [memory_limiter, probabilistic_sampler/tempo, resourcedetection/ecs, batch]
      exporters: [otlphttp/tempo]
Log output
No response
Additional context
I use a confd-based entrypoint in my container to render the config, hence the metadata templating pieces in the YAML above (an illustration of the rendered output is sketched below).
We use the APK published by the contrib project for Alpine.
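For readers unfamiliar with the templating: the Go-template blocks are expanded by the entrypoint before the collector starts, so the collector itself only ever sees plain YAML. As a rough illustration (not the exact rendered file), the $metricAttributeRemap entry mapping "cloud.region" to "region" expands under the attributes processor to something like:

      - key: region
        from_attribute: cloud.region
        action: upsert
      - key: cloud.region
        action: delete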
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 21 (14 by maintainers)
Ah, I mis-interpreted your comment. I’ve actually already tested that scenario and it works as expected. I’ll submit a PR.
@jpkrohling I’m pretty sure it’s actually the fluent forward receiver, as it was changed between these releases. I’m pulling some details from pprof.
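A minimal sketch of how that profiling can be wired up, assuming the contrib build’s pprof extension (the endpoint below is the extension’s default, not necessarily the exact setup used here):

extensions:
  pprof:
    endpoint: localhost:1777  # default; exposes the standard /debug/pprof handlers
  memory_ballast:
    size_mib: 64

service:
  extensions: [pprof, memory_ballast]

Heap and CPU profiles can then be pulled with go tool pprof http://localhost:1777/debug/pprof/heap and go tool pprof http://localhost:1777/debug/pprof/profile respectively.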