opentelemetry-collector-contrib: Performance Regression Upgrading from v0.71.0 -> v0.73.0
Component(s)
No response
What happened?
Description
Last week we rolled an upgraded version of OpenTelemetry Collector Contrib out to our production environment and started having memory and CPU problems, resulting in dropped metrics and high-CPU alarms.
We upgraded straight from v0.71.0 to v0.73.0, so we don't know which intermediate version introduced the regression.
Steps to Reproduce
Expected Result
Actual Result
You can see from the ECS metrics below that the point where CPU and memory drop is where we rolled back to v0.71.0.
The gap before those lines is where the sidecar was dropping data because it was overwhelmed.
Collector version
v0.73.0
Environment information
Environment
OS: ECS Fargate Linux
OpenTelemetry Collector configuration
---
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: sidecar
          scrape_interval: 60s
          metrics_path: /metrics
          scheme: http
          static_configs:
            - targets:
                - localhost:8080
        - job_name: otel
          scrape_interval: 60s
          metrics_path: /metrics
          scheme: http
          static_configs:
            - targets: ["localhost:8888"]
  fluentforward/net:
    endpoint: 0.0.0.0:34334
  fluentforward/sock:
    endpoint: unix://var/run/fluent.sock
  awsecscontainermetrics:
    collection_interval: 60s
  statsd:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  jaeger:
    protocols:
      grpc:
      thrift_http:
      thrift_compact:
      thrift_binary:
  zipkin:

exporters:
  logging:
    verbosity: basic
  sumologic:
    endpoint: <url>
    compress_encoding: "gzip"
    max_request_body_size: 1_048_576 # 1MB
    log_format: "json"
    metric_format: "prometheus"
    source_category: ${ENVIRONMENT}/${SERVICE_NAME}
    source_name: ${SERVICE_NAME}
    # source_host: ${TASK}
  loki:
    endpoint: <url>
    format: json
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
    labels:
      attributes:
        team:
        service:
      resource:
        account:
        region:
        environment:
  otlphttp/tempo:
    endpoint: <url>
  prometheusremotewrite/mimir:
    endpoint: <url>
    resource_to_telemetry_conversion:
      enabled: true
  prometheus:
    endpoint: 0.0.0.0:9273
    namespace: ${SERVICE_NAME}
    resource_to_telemetry_conversion:
      enabled: true

{{- $resourceDetection := map "aws.ecs.task.launch_type" "launch_type" "aws.ecs.task.version" "task_version" "cloud.account.id" "account" "cloud.region" "region" "cloud.subnet" "subnet" "cloud.vpc" "vpc" "cloud.availability_zone" "availability_zone" "cloud.availability_zone.id" "availability_zone_id" "aws.ecs.cluster.arn" "" "aws.ecs.task.arn" "" "aws.ecs.task.family" "task_family" "aws.ecs.task.revision" "task_revision" }}
{{- $metricAttributeRemap := map "aws.ecs.task.revision" "task_revision" "cloud.availability_zone" "availability_zone" "cloud.account.id" "account" "cloud.region" "region" }}
{{- $deleteFields := split "aws.ecs.task.pull.started_at,aws.ecs.task.known_status,aws.ecs.task.arn,aws.ecs.cluster.name,aws.ecs.task.pull_started_at,aws.ecs.task.pull_stopped_at,aws.ecs.service.name,container.id,aws.ecs.docker.name,aws.ecs.container.image.id,aws.ecs.container.exit_code,aws.ecs.container.created_at,aws.ecs.container.know_status,aws.ecs.container.image.id,aws.ecs.container.started_at,aws.ecs.container.finished_at,aws.ecs.launchtype" "," }}

processors:
  resource:
    attributes:
      - key: subnet
        action: insert
        value: ${SUBNET_ID}
      - key: environment
        action: upsert
        value: ${ENVIRONMENT}
      - key: cloud.availability_zone.id
        action: insert
        value: ${AVAILABILITY_ZONE_ID}
      - key: subnet
        action: insert
        value: ${SUBNET_ID}
      - key: vpc
        action: insert
        value: ${VPC}
      - key: aws.ecs.task.arn
        action: extract
        pattern: \/(?P<task_id>\w+)$
      - key: aws.ecs.task.id
        action: delete
      - key: aws.ecs.cluster.arn
        action: extract
        pattern: \/(?P<cluster_name>\w+)$
      - key: aws.ecs.cluster.arn
        action: delete
      # Remap resource detection fields
      {{- range $k, $v := $resourceDetection }}
      {{- if ne $v "" }}
      - key: {{ $v }}
        from_attribute: {{ $k }}
        action: upsert
      - key: {{ $k }}
        action: delete
      {{- end }}
      {{- end }}
      # Delete Resource Fields
      {{- range $k, $v := $deleteFields }}
      {{- if ne $v "" }}
      - key: {{ $v }}
        action: delete
      {{- end }}
      {{- end }}
  attributes:
    actions:
      - key: service
        action: upsert
        value: ${SERVICE_NAME}
      - key: team
        action: upsert
        value: ${BILLING_TEAM}
      {{- range $k, $v := $metricAttributeRemap }}
      {{- if ne $v "" }}
      - key: {{ $v }}
        from_attribute: {{ $k }}
        action: upsert
      - key: {{ $k }}
        action: delete
      {{- end }}
      {{- end }}
      {{- range $k, $v := $deleteFields }}
      {{- if ne $v "" }}
      - key: {{ $v }}
        action: delete
      {{- end }}
      {{- end }}
  resource/cloud:
    attributes:
      - key: cloud.availability_zone.id
        action: insert
        value: ${AVAILABILITY_ZONE_ID}
      - key: cloud.subnet.id
        action: insert
        value: ${SUBNET_ID}
      - key: cloud.vpc.id
        action: insert
        value: ${VPC}
      - key: route.environment
        action: insert
        value: ${ENVIRONMENT}
      - key: route.service
        action: insert
        value: ${SERVICE_NAME}
      - key: route.billing.team
        action: insert
        value: ${BILLING_TEAM}
      - key: aws.ecs.task.arn
        action: extract
        pattern: \/(?P<task_id>\w+)$
      - key: aws.ecs.task.id
        action: insert
        from_attribute: task_id
      - key: aws.ecs.cluster.arn
        action: extract
        pattern: \/(?P<cluster_name>\w+)$
      - key: aws.ecs.cluster.name
        action: insert
        from_attribute: cluster_name
      - key: aws.ecs.task.pull_started_at
        action: delete
      - key: aws.ecs.task.pull_stopped_at
        action: delete
      - key: aws.ecs.task.known_status
        action: delete
      - key: aws.ecs.launch_type
        action: delete
      - key: aws.ecs.container.created_at
        action: delete
      - key: aws.ecs.container.started_at
        action: delete
      - key: aws.ecs.container.finished_at
        action: delete
      - key: aws.ecs.container.know_status
        action: delete
      - key: aws.ecs.docker.name
        action: delete
      - key: aws.ecs.container.image.id
        action: delete
      - key: aws.ecs.container.exit_code
        action: delete
  attributes/cleanup:
    actions:
      - key: cluster_name
        action: delete
      - key: ecs_task_definition
        action: delete
      - key: fluent.tag
        action: delete
      - key: ecs_task_arn
        action: delete
      - key: task_id
        action: delete
  resourcedetection/ecs:
    detectors: [env, ecs]
    timeout: 2s
    override: false
    attributes:
      {{- range $k, $v := $resourceDetection }}
      - {{ $k }}
      {{- end }}
  probabilistic_sampler:
    hash_seed: 13
    sampling_percentage: 10
  probabilistic_sampler/tempo:
    hash_seed: 13
    sampling_percentage: 100
  memory_limiter:
    check_interval: 5s
    limit_mib: 128
    spike_limit_mib: 0
  batch:
    timeout: 200ms

extensions:
  memory_ballast:
    size_mib: 64

service:
  extensions: [memory_ballast]
  telemetry:
    logs:
      level: warn
  pipelines:
    metrics:
      receivers: [prometheus, statsd, otlp]
      processors: [memory_limiter, resourcedetection/ecs, resource/cloud, attributes/cleanup]
      exporters: [prometheus, sumologic]
    metrics/mimir:
      receivers: [awsecscontainermetrics, prometheus, statsd, otlp]
      processors: [memory_limiter, resourcedetection/ecs, resource, attributes, batch]
      exporters: [prometheusremotewrite/mimir]
    logs:
      receivers: [fluentforward/net, fluentforward/sock]
      processors: [memory_limiter, resourcedetection/ecs, resource/cloud, attributes/cleanup]
      exporters: [sumologic]
    logs/loki:
      receivers: [fluentforward/net, fluentforward/sock]
      processors: [memory_limiter, resourcedetection/ecs, resource, attributes, batch]
      exporters: [loki]
    traces:
      receivers: [otlp, zipkin, jaeger]
      processors: [memory_limiter, probabilistic_sampler, batch, resourcedetection/ecs, resource/cloud, attributes/cleanup]
      exporters: [logging]
    traces/tempo:
      receivers: [otlp, zipkin, jaeger]
      processors: [memory_limiter, probabilistic_sampler/tempo, resourcedetection/ecs, batch]
      exporters: [otlphttp/tempo]
Log output
No response
Additional context
I use a confd-based entrypoint in my container to render the config, hence the metadata templating pieces in the YAML above (an illustration of the rendered output is sketched below).
We use the APK published by the contrib project for Alpine.
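For readers unfamiliar with the templating: the Go-template blocks are expanded by the entrypoint before the collector starts, so the collector itself only ever sees plain YAML. As a rough illustration (not the exact rendered file), the $metricAttributeRemap entry mapping "cloud.region" to "region" expands under the attributes processor to something like:

      - key: region
        from_attribute: cloud.region
        action: upsert
      - key: cloud.region
        action: delete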
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 21 (14 by maintainers)
Ah, I mis-interpreted your comment. I’ve actually already tested that scenario and it works as expected. I’ll submit a PR.
@jpkrohling I’m pretty sure it’s actually the fluent forward receiver, as it was changed between these releases. I’m pulling some details from pprof.
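A minimal sketch of how that profiling can be wired up, assuming the contrib build’s pprof extension (the endpoint below is the extension’s default, not necessarily the exact setup used here):

extensions:
  pprof:
    endpoint: localhost:1777  # default; exposes the standard /debug/pprof handlers
  memory_ballast:
    size_mib: 64

service:
  extensions: [pprof, memory_ballast]

Heap and CPU profiles can then be pulled with go tool pprof http://localhost:1777/debug/pprof/heap and go tool pprof http://localhost:1777/debug/pprof/profile respectively.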