opentelemetry-collector-contrib: Understanding high memory usage of otelcol (prometheusreceiver / prometheusexporter)
Describe the bug
I am infrequently observing high memory usage of otelcol.exe on Windows systems.
The memory does stay within the bounds set by memory_limiter (see config below), but I am not sure whether this memory usage is intended behavior, since there is negligible load on the systems. I am not experienced in Go, so please excuse my lack of knowledge.
My planned actions to counteract this behavior are to reduce the configured limit_mib and to fix the memory_ballast extension.
Steps to reproduce
Unknown; the full config is included below.
What did you expect to see?
otelcol should not consume much memory given the small workload, and/or memory should eventually be garbage collected.
What did you see instead?
The otelcol process uses up to 2 GB of memory; currently on my local machine it sits at ~1.5 GB.
Last stdout log messages from memory_limiter after the machine was resumed from hibernation:
{"level":"info","ts":1652033885.9450645,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":197}
{"level":"info","ts":1652033885.8394392,"caller":"memorylimiterprocessor/memorylimiter.go:310","msg":"Memory usage is above soft limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":1623}
Is this the intended memory-usage behavior of the process?
What are the recommended values for memory_limiter? The documentation is a little vague.
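For anyone reading along, my current understanding of the README (please correct me if wrong): spike_limit_mib is subtracted from limit_mib to form the soft limit, and it defaults to 20% of limit_mib, which with limit_mib: 2000 matches the ~1600 MiB soft limit visible in the log above. A minimal sketch with illustrative values (not recommendations):
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512        # hard limit (illustrative value only)
    spike_limit_mib: 128  # soft limit = limit_mib - spike_limit_mib = 384 MiB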
On my machine, the prometheus exporter currently emits ~19362 lines of metrics, less than 3 MB in size.
On our biggest instance, the prometheus exporter has 37654 metric lines, ~8 MB.
Is this a lot?
The documentation on that page uses limit_mib: 4000, which seems rather huge for this kind of application?
After re-reading those docs just now, this line caught my attention:
limit_mib
(default = 0): Maximum amount of memory, in MiB, targeted to be allocated by the process heap. Note that typically the total memory usage of process will be about 50MiB higher than this value. This defines the hard limit.
So it actually is the expected behavior of otelcol to stay around limit_mib indefinitely?
My trace queue size is currently at zero:
# HELP otelcol_exporter_queue_size Current size of the retry queue (in batches)
# TYPE otelcol_exporter_queue_size gauge
otelcol_exporter_queue_size{exporter="jaeger",service_instance_id="fc4ff71e-1a39-4b13-9990-db1ec97e9c9b",service_version="latest"} 0
I had the suspicion that the trace buffer was filling up memory while the VPN was disconnected and the jaeger server was unreachable, since the default queue size of 5000 (?) seemed rather high.
Looking at the logs from last night, it was the traces pipeline that was responsible for the memory increase:
2022-05-07T01:22:01+02:00 {"level":"info","ts":1651879321.2295015,"caller":"memorylimiterprocessor/memorylimiter.go:303","msg":"Memory usage back within limits. Resuming normal operation.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":182}
2022-05-07T01:22:01+02:00 {"level":"info","ts":1651879321.2295015,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":182}
2022-05-07T01:20:19+02:00 {"level":"info","ts":1651879219.4642153,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":1665}
2022-05-07T01:18:37+02:00 {"level":"info","ts":1651879117.4491055,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":1661}
2022-05-07T01:16:55+02:00 {"level":"info","ts":1651879015.4828515,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":1664}
2022-05-07T01:15:30+02:00 {"level":"info","ts":1651878930.5143886,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":1683}
2022-05-07T01:14:05+02:00 {"level":"info","ts":1651878845.4196262,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":1683}
2022-05-07T01:22:01+02:00 {"level":"warn","ts":1651879321.100273,"caller":"memorylimiterprocessor/memorylimiter.go:291","msg":"Memory usage is above hard limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":2059}
2022-05-07T01:20:19+02:00 {"level":"warn","ts":1651879219.098343,"caller":"memorylimiterprocessor/memorylimiter.go:291","msg":"Memory usage is above hard limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":2049}
2022-05-07T01:18:37+02:00 {"level":"warn","ts":1651879117.0985374,"caller":"memorylimiterprocessor/memorylimiter.go:291","msg":"Memory usage is above hard limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":2048}
2022-05-07T01:16:55+02:00 {"level":"warn","ts":1651879015.103772,"caller":"memorylimiterprocessor/memorylimiter.go:291","msg":"Memory usage is above hard limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":2012}
2022-05-07T01:15:30+02:00 {"level":"warn","ts":1651878930.103912,"caller":"memorylimiterprocessor/memorylimiter.go:291","msg":"Memory usage is above hard limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":2005}
2022-05-07T01:14:05+02:00 {"level":"warn","ts":1651878845.099606,"caller":"memorylimiterprocessor/memorylimiter.go:291","msg":"Memory usage is above hard limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":2061}
If the sending_queue's queue_size of 5000 does not fit within the memory limits, what is going to happen?
Will the oldest trace spans be dropped from the queue?
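As far as I can tell from the exporterhelper docs, when the queue is full, newly arriving batches are rejected and dropped rather than the oldest entries being evicted. A sketch of how the queue and retry behavior could be capped on the jaeger exporter (the values here are illustrative assumptions, not tested recommendations):
exporters:
  jaeger:
    endpoint: "${AX_MONITORING_SERVER}:14250"
    tls:
      insecure: true
    sending_queue:
      enabled: true
      queue_size: 500          # illustrative; default is 5000 batches
    retry_on_failure:
      enabled: true
      max_elapsed_time: 300s   # illustrative; give up (and drop) after retrying this long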
However, I also had one machine where otelcol suddenly used 2 GB of memory while no traces were being queued. There were no warnings at the time, and unfortunately I have no info logs from that date (otelcol was still logging to the Windows application event log).
There are no apps emitting traces on that machine yet, so I have no idea what happened there.
I have collected and attached various debug information from http://localhost:1777/debug/pprof/, in case the memory usage is not okay.
What version did you use?
.\otelcol.exe --version
otelcol version **0.48.0**
What config did you use?
I am using two config files (currently three, to debug) consisting of the following parts:
Config file 1 (common and metrics):
exporters:
  prometheus:
    endpoint: 0.0.0.0:7299
    metric_expiration: 5m # default = 5m
    send_timestamps: true
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: localmetrics
          scrape_interval: 17s
          tls_config:
            insecure_skip_verify: true
          static_configs:
            - targets: [localhost:8888] # Self diagnostic metrics of otelcol
              labels:
                from_app: otelcol
          file_sd_configs:
            - files:
                - "C:/Program Files/OpenTelemetry/OTEL Collector/metric-targets/*.yaml"
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2000
  batch:
extensions:
  health_check:
    endpoint: localhost:13133
  zpages:
    endpoint: localhost:42424
  memory_ballast:
    size_mib: 256
service:
  extensions: [health_check, zpages]
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
  # Otel-Collector Self-Diagnostics
  telemetry:
    logs:
      level: info
      encoding: json
      output_paths: ["stdout"]
      error_output_paths: ["stderr"]
    metrics:
      address: localhost:8888
I just noticed an error in my config… the memory_ballast extension is not actually in use; it is defined but never listed under service.extensions.
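The fix should simply be to list it under service.extensions as well (sketch of the intended change, same ballast size as above):
extensions:
  memory_ballast:
    size_mib: 256
service:
  extensions: [health_check, zpages, memory_ballast]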
Second file, used optionally (if traces are required):
exporters:
  jaeger:
    endpoint: "${AX_MONITORING_SERVER}:14250"
    tls:
      insecure: true
receivers:
  jaeger:
    protocols:
      thrift_compact:
        endpoint: localhost:6831
  otlp:
    protocols:
      grpc:
        endpoint: localhost:4317
      http:
        endpoint: localhost:4318
service:
  pipelines:
    traces:
      receivers: [otlp, jaeger]
      processors: [memory_limiter, batch]
      exporters: [jaeger]
Third config file, currently used to add pprof:
extensions:
  pprof:
    endpoint: "127.0.0.1:1777"
    block_profile_fraction: 3
    mutex_profile_fraction: 5
service:
  extensions: [pprof, health_check, zpages]
Environment
OS: Windows 10 21H2, Windows Server 2019
About this issue
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 50 (26 by maintainers)
Well, I’ll be damned. I ran my test scenario with the release otelcol 0.57.2 vanilla, and it looks like (for our scenario at least) the memory consumption has IMPROVED DRAMATICALLY and is, so to speak, fixed ✔ Big shoutout to @balintzs if #12765 was the golden change 👌🏻
On our biggest instance, after running my test, the otelcol process uses less memory than prometheus in agent mode. Will try again with a non-minimalistic configuration and run it long-term, but this looks very promising.
>>>> Details on current config / Screenshots <<<<
Running release file otelcol_0.57.2_windows_amd64.tar.gz
Compare with https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/9998#issuecomment-1190288098
Config:
Give it a try @Doron-Bargo @kwiesmueller @holograph @RalphSu
If memory stays stable, what should happen with this issue? Close it as solved, I guess, after more feedback from the community? @newly12 @jpkrohling
Gonna open some beers as soon as this PITA is closed 🍺
I’ve been tracking this for a couple of days and we’re seeing the exact same behavior here; our instance is running in K8s so getting pprof dumps would be a major PITA, so I just wanted to chime in and thank @Mario-Hofstaetter profusely for putting in the work 👍
@newly12 That’s quite a dramatic difference. I’m definitely supportive of being able to disable start time tracking in that case. cc @Aneurysm9.
Given that, I think we should consider disabling it by default in the future, or moving it to a separate processor or to exporterhelper. Speaking from Google’s perspective, we’ve had to reimplement start time tracking in our exporter regardless, since not all receivers follow the spec for handling unknown start time that the prom receiver implements.
TL;DR: Using disable_start_time: true looks promising.
Memory stayed < 400 MiB while running the actions in our application that cause new metrics. Memory usage of otelcol is now below that of prometheus in agent mode.
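Roughly, the way I set it (sketch only; this option comes from @newly12’s fork, not from a released otelcol, and the exact key name and placement depend on the fork revision you build against):
receivers:
  prometheus:
    disable_start_time: true   # option from the fork; assumed placement, not available upstream at the time of writing
    config:
      scrape_configs:
        # ... unchanged scrape_configs from config file 1 ...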
The memory footprint could also be especially low because the custom build only contains the components we use? I will re-run the test with the custom binary but with disable_start_time set to false, to double-check.
Thank you @newly12 for the builder config; what a fail on my part, using the wrong branch in the replace for the fork. Also, thank you so much for your fix. If this proves to be stable without issues, we can run the custom build and solve this issue that has troubled us for months now.
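For reference, the part I got wrong was the replaces entry in the ocb builder config, which has to pin the prometheus receiver module to the fork. A rough sketch with placeholder fork path, versions, and dist settings; the actual builder config, including the pprof extension, is in the collapsed section below:
# builder-config.yaml (excerpt; SOME_USER / SOME_VERSION and the module versions are placeholders)
dist:
  name: otelcol-custom
  output_path: ./dist
receivers:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver v0.55.0
extensions:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/extension/pprofextension v0.55.0
replaces:
  # the replacement must resolve to the fork commit/branch that contains the fix
  - github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver => github.com/SOME_USER/opentelemetry-collector-contrib/receiver/prometheusreceiver SOME_VERSION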
>>> Used builder config including pprof <<<<
The compile worked despite forgetting to run … I have no idea what that environment variable does, but I will repeat the compile nevertheless.
>>>> Used otelcol config <<<<
I started the custom-built otelcol binary after restarting our application and re-ran my actions from yesterday.
Edit: after some more time, still OK.
Not sure why otelcol_process_runtime_total_sys_memory_bytes still increased slightly; I will watch it over a period of days.
Edit 2: Verify it wasn’t the custom build but disable_start_time.
Running the same build (now also including the otlp / jaeger receiver / exporter and the batch processor, in case they are needed later), I re-ran the tests with disable_start_time: false and true.
Note: The number of metrics was still increasing at the end, which is why the memory usage was still slowly rising.
I will now let this otelcol process run for several days without restarts.
Edit: after running for ~ 1 week
Looking good.