opentelemetry-collector-contrib: Understanding high memory usage of otelcol (prometheusreceiver / prometheusexporter)

Describe the bug

I am infrequently observing high memory usage of otelcol.exe on Windows systems. The memory usage is, however, within the bounds set by memory_limiter (see config below). I am not sure whether this memory usage is intended behavior, since there is negligible load on the systems. I am not experienced in Go, so please excuse my lack of knowledge.

My planned actions to counteract this behavior are to reduce the configured limit_mib and to fix the memory_ballast configuration.

Steps to reproduce

Unknown / not reliably reproducible. Full config: see below.

What did you expect to see?

otelcol should not consume “much” memory given the small workload, and/or the memory should eventually be reclaimed by garbage collection?

What did you see instead?

The otelcol process uses (?) up to 2 GB of memory. Currently, on my local machine, it sits at ~1.5 GB. Last stdout log messages from memory_limiter after the machine resumed from hibernation:

{"level":"info","ts":1652033885.9450645,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":197}
{"level":"info","ts":1652033885.8394392,"caller":"memorylimiterprocessor/memorylimiter.go:310","msg":"Memory usage is above soft limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":1623}

Is this the intended memory usage behavior for the process?

What are the recommended values for memory_limiter? The documentation is a little vague.

On my machine, the prometheus exporter currently emits ~19,362 metric lines, less than 3 MB in size. On our biggest instance, the prometheus exporter emits 37,654 metric lines, ~8 MB. Is this a lot?

The documentation on that page uses limit_mib: 4000, which seems rather huge for this kind of application?

After re-reading those docs just now, this line caught my attention:

limit_mib (default = 0): Maximum amount of memory, in MiB, targeted to be allocated by the process heap. Note that typically the total memory usage of process will be about 50MiB higher than this value. This defines the hard limit.

So it actually is the expected behavior of otelcol to stay around limit_mib indefinitely?
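
If staying near the hard limit is expected, my takeaway is to lower the hard limit and set the soft limit explicitly. A minimal sketch of what I am considering, assuming I read the memory_limiter docs right (soft limit = limit_mib - spike_limit_mib, with spike_limit_mib defaulting to 20% of limit_mib); the numbers are placeholders for our workload, not values from the docs:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 800        # hard limit (placeholder value)
    spike_limit_mib: 200  # soft limit would then be 800 - 200 = 600 MiB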

My trace queue size is currently at zero:

# HELP otelcol_exporter_queue_size Current size of the retry queue (in batches)
# TYPE otelcol_exporter_queue_size gauge
otelcol_exporter_queue_size{exporter="jaeger",service_instance_id="fc4ff71e-1a39-4b13-9990-db1ec97e9c9b",service_version="latest"} 0

I had the suspicion that the trace buffer was filling memory when the VPN was disconnected and the Jaeger server was not reachable, since the default queue size of 5000 (?) seemed rather high.

Looking at logs from last night, the traces pipeline was responsible for the memory increase:

2022-05-07T01:22:01+02:00	{"level":"info","ts":1651879321.2295015,"caller":"memorylimiterprocessor/memorylimiter.go:303","msg":"Memory usage back within limits. Resuming normal operation.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":182}
2022-05-07T01:22:01+02:00	{"level":"info","ts":1651879321.2295015,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":182}
2022-05-07T01:20:19+02:00	{"level":"info","ts":1651879219.4642153,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":1665}
2022-05-07T01:18:37+02:00	{"level":"info","ts":1651879117.4491055,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":1661}
2022-05-07T01:16:55+02:00	{"level":"info","ts":1651879015.4828515,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":1664}
2022-05-07T01:15:30+02:00	{"level":"info","ts":1651878930.5143886,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":1683}
2022-05-07T01:14:05+02:00	{"level":"info","ts":1651878845.4196262,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":1683}
2022-05-07T01:22:01+02:00	{"level":"warn","ts":1651879321.100273,"caller":"memorylimiterprocessor/memorylimiter.go:291","msg":"Memory usage is above hard limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":2059}
2022-05-07T01:20:19+02:00	{"level":"warn","ts":1651879219.098343,"caller":"memorylimiterprocessor/memorylimiter.go:291","msg":"Memory usage is above hard limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":2049}
2022-05-07T01:18:37+02:00	{"level":"warn","ts":1651879117.0985374,"caller":"memorylimiterprocessor/memorylimiter.go:291","msg":"Memory usage is above hard limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":2048}
2022-05-07T01:16:55+02:00	{"level":"warn","ts":1651879015.103772,"caller":"memorylimiterprocessor/memorylimiter.go:291","msg":"Memory usage is above hard limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":2012}
2022-05-07T01:15:30+02:00	{"level":"warn","ts":1651878930.103912,"caller":"memorylimiterprocessor/memorylimiter.go:291","msg":"Memory usage is above hard limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":2005}
2022-05-07T01:14:05+02:00	{"level":"warn","ts":1651878845.099606,"caller":"memorylimiterprocessor/memorylimiter.go:291","msg":"Memory usage is above hard limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":2061}

If the sending_queue queue_size of 5000 does not fit within the memory limits, what is going to happen? Will the oldest trace spans be dropped from the queue?
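
If queueing during VPN outages turns out to be the culprit, my plan would be to shrink the exporter queue. A sketch of what I have in mind, assuming the jaeger exporter exposes the standard exporterhelper sending_queue / retry_on_failure settings (the queue_size value is a placeholder):

exporters:
  jaeger:
    endpoint: "${AX_MONITORING_SERVER}:14250"
    tls:
      insecure: true
    sending_queue:
      enabled: true
      queue_size: 500   # placeholder; down from the (?) default of 5000 batches
    retry_on_failure:
      enabled: true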

I did, however, also have one machine where otelcol suddenly used 2 GB of memory while no traces were being queued. There were no warnings at the time, and unfortunately I have no info logs from that date (otelcol was still logging to the Windows application event log).

(screenshot)

There are no apps emitting traces on that machine yet, so I have no idea what happened there.

I have collected and attached various debug information from http://localhost:1777/debug/pprof/, in case the memory usage is not okay.

What version did you use?

.\otelcol.exe --version
otelcol version 0.48.0

What config did you use?

I am using two config files (currently three, for debugging) consisting of the following parts:

ConfigFile 1 (common and metrics):

exporters:
  prometheus:
    endpoint: 0.0.0.0:7299
    metric_expiration: 5m  # default = 5m
    send_timestamps: true

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: localmetrics
          scrape_interval: 17s

          tls_config:
            insecure_skip_verify: true

          static_configs:
            - targets: [localhost:8888] # Self diagnostic metrics of otelcol
              labels:
                from_app: otelcol
          file_sd_configs:
            - files:
              - "C:/Program Files/OpenTelemetry/OTEL Collector/metric-targets/*.yaml"

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2000
  batch:

extensions:
  health_check:
    endpoint: localhost:13133
  zpages:
    endpoint: localhost:42424
  memory_ballast:
    size_mib: 256

service:
  extensions: [health_check , zpages]
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter , batch]
      exporters: [prometheus]

  # Otel-Collector Self-Diagnostics
  telemetry:
    logs:
      level: info
      encoding: json
      output_paths:       ["stdout"]
      error_output_paths: ["stderr"]
    metrics:
      address: localhost:8888
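
For context, the file_sd_configs entry above points the prometheus receiver at per-application target files we drop into the metric-targets folder. A minimal sketch of one such file, assuming the usual Prometheus file_sd YAML format (the target port and from_app value here are made up):

# C:/Program Files/OpenTelemetry/OTEL Collector/metric-targets/someapp.yaml
- targets: ["localhost:9182"]
  labels:
    from_app: someapp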

I just noticed an error in my config… the memory_ballast extension is defined, but it is not listed under service.extensions, so it seems it is not actually used.
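
As far as I understand, extensions only take effect when they are also listed under service.extensions, so the fix should just be adding the ballast there; a sketch of the corrected service line (everything else unchanged):

service:
  extensions: [health_check , zpages , memory_ballast]  # memory_ballast added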

Second File used optionally (if traces are required):

exporters:
  jaeger:
    endpoint: "${AX_MONITORING_SERVER}:14250"
    tls:
      insecure: true

receivers:
  jaeger:
    protocols:
      thrift_compact:
        endpoint: localhost:6831
  otlp:
    protocols:
      grpc:
        endpoint: localhost:4317
      http:
        endpoint: localhost:4318

service:
  pipelines:
    traces:
      receivers: [otlp , jaeger]
      processors: [memory_limiter , batch]
      exporters: [jaeger]

Third config file currently used to add pprof:

extensions:
  pprof:
    endpoint: "127.0.0.1:1777"
    block_profile_fraction: 3
    mutex_profile_fraction: 5
service:
  extensions: [pprof,health_check,zpages]

Environment

OS: Windows 10 21H2, Windows Server 2019

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 50 (26 by maintainers)

Most upvoted comments

Well, I’ll be damned. I ran my test scenario with the vanilla release otelcol 0.57.2, and it looks like (for our scenario at least) the memory consumption has IMPROVED DRAMATICALLY and is, so to speak, fixed ✔

Big shoutout to @balintzs if #12765 was the golden change 👌🏻

On our biggest instance, after running my test, the otelcol process uses less memory than Prometheus in agent mode. I will try again with a non-minimalistic configuration and run it long-term, but this looks very promising.

>>>> Details on current config / Screenshots <<<<

Running Release file otelcol_0.57.2_windows_amd64.tar.gz

Compare with https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/9998#issuecomment-1190288098

(screenshot)

Config:

exporters:
  prometheus:
    endpoint: 0.0.0.0:7299
    metric_expiration: 1m  # default = 5m

receivers:
  prometheus:
    # disable_start_time: true
    config:
      scrape_configs:
        - job_name: localmetrics
          scrape_interval: 17s

          tls_config:
            insecure_skip_verify: true

          static_configs:
            - targets: [localhost:8888] # Self diagnostic metrics of otelcol
              labels:
                app: otelcol
          file_sd_configs:
            - files:
              - "C:/Program Files/OpenTelemetry/OTEL Collector/metric-targets/*.yaml"

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
    
extensions:
  pprof:
    endpoint: "127.0.0.1:1777"

service:
  extensions: [pprof]
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter]
      exporters: [prometheus]
      
  telemetry:
    logs:
      level: info
      encoding: json
      output_paths:       ["stdout"]
      error_output_paths: ["stderr"]
    metrics:
      address: localhost:8888

Give it a try @Doron-Bargo @kwiesmueller @holograph @RalphSu

If memory stays stable, what should happen with this issue? Close as solved I guess after more feedback from the community? @newly12 @jpkrohling

Gonna open some beers as soon as this PITA is closed 🍺

I’ve been tracking this for a couple of days and we’re seeing the exact same behavior here; our instance is running in K8s, so getting pprof dumps would be a major PITA. I just wanted to chime in and thank @Mario-Hofstaetter profusely for putting in the work 👍

@newly12 That’s quite a dramatic difference. I’m definitely supportive of being able to disable start time tracking in that case. cc @Aneurysm9.

Given that, I think we should consider disabling it by default in the future, or moving it to a separate processor or to exporterhelper. Speaking from Google’s perspective, we’ve had to reimplement start time tracking in our exporter regardless, since not all receivers follow the spec for handling unknown start time that the prom receiver implements.

TL;DR: Using disable_start_time: true looks promising

Memory stayed below 400 MiB while running the actions in our application that generate new metrics. Memory usage of otelcol is now below that of Prometheus in agent mode.

The memory footprint might also be especially low because the custom build only contains the components we actually use. I will re-run the test with the custom binary but with disable_start_time set to false to double-check.

Thank you @newly12 for the builder config; what a fail on my part to use the wrong branch in the replace directive for the fork.

Also, thank you so much for your fix. If this proves to be stable without issues, we can run the custom build and finally resolve this issue that has troubled us for months.

>>> Used builder config including pprof <<<<
exporters:
  - gomod: "github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter v0.55.0"

receivers:
  - gomod: "github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver v0.55.0"

processors:
  - import: go.opentelemetry.io/collector/processor/memorylimiterprocessor
    gomod: go.opentelemetry.io/collector v0.55.0

extensions:
  - gomod: "github.com/open-telemetry/opentelemetry-collector-contrib/extension/pprofextension v0.55.0"

replaces:
  # a list of "replaces" directives that will be part of the resulting go.mod
  - github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver => github.com/newly12/opentelemetry-collector-contrib/receiver/prometheusreceiver prom_receiver_mem

The compile worked despite forgetting to run

$ GO111MODULE=on go install go.opentelemetry.io/collector/cmd/builder@latest

…I have no idea what that environment variable does, but I will repeat the compile nevertheless.

>>>>Used otelcol config <<<<
exporters:
  prometheus:
    endpoint: 0.0.0.0:7299
    metric_expiration: 1m  # default = 5m

receivers:
  prometheus:
    disable_start_time: true
    config:
      scrape_configs:
        - job_name: localmetrics
          scrape_interval: 17s

          tls_config:
            insecure_skip_verify: true

          static_configs:
            - targets: [localhost:8888] # Self diagnostic metrics of otelcol
              labels:
                app: otelcol
          file_sd_configs:
            - files:
              - "C:/Program Files/OpenTelemetry/OTEL Collector/metric-targets/*.yaml"

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 3000
    
extensions:
  pprof:
    endpoint: "127.0.0.1:1777"

service:
  extensions: [pprof]
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter]
      exporters: [prometheus]
      
  telemetry:
    logs:
      level: info
      encoding: json
      output_paths:       ["stdout"]
      error_output_paths: ["stderr"]
    metrics:
      address: localhost:8888

I started the custom-built otelcol binary after restarting our application and re-ran my actions from yesterday.

(screenshot)

Edit: after some more time still ok

(screenshot)

Not sure why otelcol_process_runtime_total_sys_memory_bytes still increased slightly; I will watch it over a period of days.

Edit 2: Verify it wasn’t the custom build but disable_start_time

Running the same build (now also including the otlp and jaeger receivers/exporters and the batch processor, in case they are needed later), I re-ran the tests with disable_start_time: false and then true.

(screenshot)

Note: The number of metrics was still increasing at the end of the test, which is why memory usage was still slowly rising.

I will now let this otelcol process run for several days without restarts.

Edit: after running for ~ 1 week

Looking good.

(screenshot)