opentelemetry-go: Using multiple concurrent readers makes async metrics produce wrong results
Environment
- OS: Ubuntu
- Architecture: x86
- Go Version: 1.21.3
- opentelemetry-go version: v1.21.0
Steps To Reproduce
Hi, I’m using the go.opentelemetry.io/otel/exporters/prometheus package to implement a custom Prometheus exporter. I use the OTel SDK successfully to export synchronous instruments like counters and histograms. Today I started adding a Float64ObservableCounter to measure the uptime of the server, but this has led to invalid values being reported to the collector. I use the ClickHouse exporter in the OTel Collector. Here is the code of the uptime counter.
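The counter code itself isn’t inlined in the issue; a minimal sketch of such an uptime counter, assuming a startTime captured at process start (the helper name registerUptime is illustrative, the instrument name, description, and unit follow the exported metadata below):

```go
import (
	"context"
	"time"

	"go.opentelemetry.io/otel/metric"
)

// Sketch only: startTime and registerUptime are assumptions, not the
// exact code from prom_registry.go.
var startTime = time.Now()

func registerUptime(meter metric.Meter) error {
	uptime, err := meter.Float64ObservableCounter(
		"uptime",
		metric.WithDescription("The duration since the application started."),
		metric.WithUnit("s"),
	)
	if err != nil {
		return err
	}
	// The callback runs on every collection, i.e. whenever any reader
	// attached to the MeterProvider collects.
	_, err = meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		o.ObserveFloat64(uptime, time.Since(startTime).Seconds())
		return nil
	}, uptime)
	return err
}
```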
It seems that every time Prometheus scrapes the server, gather() (https://github.com/wundergraph/cosmo/blob/main/router/internal/metric/prom_registry.go#L26) is called, and therefore so is the callback of the Float64ObservableCounter. This somehow results in weird values: 27.59, 51.28, 117.59, 111.28.
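For context on why a scrape triggers the callback: the OTel Prometheus exporter is a metric.Reader that registers a prometheus.Collector on the registry it is given, so every Gather drives a full SDK collection. Roughly (this wiring is a sketch, not the actual code from prom_registry.go; newPromReader is a hypothetical helper):

```go
import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	otelprom "go.opentelemetry.io/otel/exporters/prometheus"
)

func newPromReader() (*otelprom.Exporter, error) {
	registry := prometheus.NewRegistry()

	// The exporter registers itself as a collector on this registry.
	promExp, err := otelprom.New(otelprom.WithRegisterer(registry))
	if err != nil {
		return nil, err
	}

	// Every scrape calls registry.Gather(), which drives the exporter's
	// Collect and, with it, every registered async callback.
	http.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
	return promExp, nil
}
```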
"{'service.name': 'cosmo-router', 'telemetry.sdk.language': 'go', 'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.21.0', 'host.name': 'pop-os', 'process.pid': '242383'}",https://opentelemetry.io/schemas/1.21.0,cosmo.router,0.0.1,{},0,"",uptime,The duration since the application started.,s,"{'wg.organization.id': 'c2437287-5bf1-403b-b3a9-84927c3a77d9', 'wg.federated_graph.id': '04f5cc26-42e3-4d29-b97b-b07358af0e29'}",2023-12-03 20:59:44.323340915,2023-12-03 20:59:59.302967180,27.59962459,0,[],[],[],[],[],2,true
"{'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.21.0', 'host.name': 'pop-os', 'process.pid': '242383', 'service.name': 'cosmo-router', 'telemetry.sdk.language': 'go'}",https://opentelemetry.io/schemas/1.21.0,cosmo.router,0.0.1,{},0,"",uptime,The duration since the application started.,s,"{'wg.organization.id': 'c2437287-5bf1-403b-b3a9-84927c3a77d9', 'wg.federated_graph.id': '04f5cc26-42e3-4d29-b97b-b07358af0e29'}",2023-12-03 20:59:44.323340915,2023-12-03 21:00:14.299107597,51.286032241,0,[],[],[],[],[],2,true
"{'host.name': 'pop-os', 'process.pid': '242383', 'service.name': 'cosmo-router', 'telemetry.sdk.language': 'go', 'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.21.0'}",https://opentelemetry.io/schemas/1.21.0,cosmo.router,0.0.1,{},0,"",uptime,The duration since the application started.,s,"{'wg.organization.id': 'c2437287-5bf1-403b-b3a9-84927c3a77d9', 'wg.federated_graph.id': '04f5cc26-42e3-4d29-b97b-b07358af0e29'}",2023-12-03 20:59:44.323340915,2023-12-03 21:00:29.298560178,117.595487252,0,[],[],[],[],[],2,true
"{'host.name': 'pop-os', 'process.pid': '242383', 'service.name': 'cosmo-router', 'telemetry.sdk.language': 'go', 'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.21.0'}",https://opentelemetry.io/schemas/1.21.0,cosmo.router,0.0.1,{},0,"",uptime,The duration since the application started.,s,"{'wg.federated_graph.id': '04f5cc26-42e3-4d29-b97b-b07358af0e29', 'wg.organization.id': 'c2437287-5bf1-403b-b3a9-84927c3a77d9'}",2023-12-03 20:59:44.323340915,2023-12-03 21:00:44.298351894,111.285146526,0,[],[],[],[],[],2,true
Correct would be 14.97, 29.97, 44.98, 59.97, because my export interval is 15s. I get those values after removing sdkmetric.WithReader(promExp) from the MeterProvider.
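The MeterProvider in question has both readers attached, roughly like this (a sketch; sdkmetric aliases go.opentelemetry.io/otel/sdk/metric, and otlpExp stands in for whatever push exporter the router actually uses):

```go
provider := sdkmetric.NewMeterProvider(
	// Push pipeline: a periodic reader exporting every 15s.
	sdkmetric.WithReader(sdkmetric.NewPeriodicReader(otlpExp,
		sdkmetric.WithInterval(15*time.Second))),
	// Pull pipeline: the Prometheus exporter acts as a second reader.
	// Removing this reader yields the correct values.
	sdkmetric.WithReader(promExp),
)
```

The rows below are from the run with the Prometheus reader removed: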
"{'process.pid': '244072', 'service.name': 'cosmo-router', 'telemetry.sdk.language': 'go', 'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.21.0', 'host.name': 'pop-os'}",https://opentelemetry.io/schemas/1.21.0,cosmo.router,0.0.1,{},0,"",uptime,The duration since the application started.,s,"{'wg.organization.id': 'c2437287-5bf1-403b-b3a9-84927c3a77d9', 'wg.federated_graph.id': '04f5cc26-42e3-4d29-b97b-b07358af0e29'}",2023-12-03 21:03:07.652430948,2023-12-03 21:03:22.630876613,14.978405556,0,[],[],[],[],[],2,true
"{'process.pid': '244072', 'service.name': 'cosmo-router', 'telemetry.sdk.language': 'go', 'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.21.0', 'host.name': 'pop-os'}",https://opentelemetry.io/schemas/1.21.0,cosmo.router,0.0.1,{},0,"",uptime,The duration since the application started.,s,"{'wg.organization.id': 'c2437287-5bf1-403b-b3a9-84927c3a77d9', 'wg.federated_graph.id': '04f5cc26-42e3-4d29-b97b-b07358af0e29'}",2023-12-03 21:03:07.652430948,2023-12-03 21:03:37.630596492,29.978150114,0,[],[],[],[],[],2,true
"{'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.21.0', 'host.name': 'pop-os', 'process.pid': '244072', 'service.name': 'cosmo-router', 'telemetry.sdk.language': 'go'}",https://opentelemetry.io/schemas/1.21.0,cosmo.router,0.0.1,{},0,"",uptime,The duration since the application started.,s,"{'wg.organization.id': 'c2437287-5bf1-403b-b3a9-84927c3a77d9', 'wg.federated_graph.id': '04f5cc26-42e3-4d29-b97b-b07358af0e29'}",2023-12-03 21:03:07.652430948,2023-12-03 21:03:52.636781141,44.984334753,0,[],[],[],[],[],2,true
"{'host.name': 'pop-os', 'process.pid': '244072', 'service.name': 'cosmo-router', 'telemetry.sdk.language': 'go', 'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.21.0'}",https://opentelemetry.io/schemas/1.21.0,cosmo.router,0.0.1,{},0,"",uptime,The duration since the application started.,s,"{'wg.organization.id': 'c2437287-5bf1-403b-b3a9-84927c3a77d9', 'wg.federated_graph.id': '04f5cc26-42e3-4d29-b97b-b07358af0e29'}",2023-12-03 21:03:07.652430948,2023-12-03 21:04:07.630888393,59.978436785,0,[],[],[],[],[],2,true
Expected behavior
I’d expect that the readers do not influence each other.
About this issue
- Original URL
- State: closed
- Created 7 months ago
- Comments: 17 (9 by maintainers)
Commits related to this issue
- reproduce #4741 in a unit test — committed to dashpole/opentelemetry-go by dashpole 7 months ago
I’ll investigate
Here is a reproducible example (sketched below). The values are fine as long as you don’t call the Prometheus metrics endpoint. After visiting http://localhost:2223/metrics multiple times and then waiting for the next cyclic export of the OTel exporter, the values grow far too fast (from 30 to 161).
Fix: https://github.com/open-telemetry/opentelemetry-go/pull/4742
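The linked example isn’t inlined above; a self-contained sketch along the same lines (the port matches the comment above; the stdout exporter stands in for the OTLP pipeline, and the meter and instrument names are assumptions):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"

	otelprom "go.opentelemetry.io/otel/exporters/prometheus"
	"go.opentelemetry.io/otel/exporters/stdout/stdoutmetric"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	startTime := time.Now()

	// Pull pipeline: Prometheus exporter registered on its own registry.
	registry := prometheus.NewRegistry()
	promExp, err := otelprom.New(otelprom.WithRegisterer(registry))
	if err != nil {
		log.Fatal(err)
	}

	// Push pipeline: stdout exporter standing in for OTLP, exporting every 15s.
	stdoutExp, err := stdoutmetric.New()
	if err != nil {
		log.Fatal(err)
	}

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(stdoutExp,
			sdkmetric.WithInterval(15*time.Second))),
		sdkmetric.WithReader(promExp),
	)
	defer provider.Shutdown(context.Background())

	meter := provider.Meter("repro")
	uptime, err := meter.Float64ObservableCounter("uptime", metric.WithUnit("s"))
	if err != nil {
		log.Fatal(err)
	}
	_, err = meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		o.ObserveFloat64(uptime, time.Since(startTime).Seconds())
		return nil
	}, uptime)
	if err != nil {
		log.Fatal(err)
	}

	// Scrape http://localhost:2223/metrics a few times, then compare the
	// uptime values in the periodic (stdout) export: they jump ahead.
	http.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":2223", nil))
}
```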
@dashpole great findings. I came to the same conclusion. So every reader should maintain its own aggregator state for asynchronous instruments? If this is confirmed, it is a major bug.
Reporting progress so far: I’ve figured out that when Collect is invoked on a single reader, measure is invoked for all readers, but we then call the aggregation function (e.g. precomputedSum.delta()) only for that reader’s aggregation. This means that on each reader’s collection, the callback increments the value for all readers, while only a single reader’s storage is measured and reset.
Here are a bunch of print statements to illustrate: 0xc000118b60 is the address of one of the sums’ valueMap, and 0xc000118b70 is the other. You can see two measure() calls for each precomputedSum.delta() call.
The scenario above is for a callback that always returns 50. So you can see that when the second reader reads the value, it was already incremented to 50 by the previous reader, and is (incorrectly) incremented again to 100.
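In plain Go, the suspected interaction can be modeled like this (an illustrative toy, not the SDK’s actual types): one measure fan-out writes into every reader’s storage, but each delta drains only the collecting reader’s storage.

```go
package main

import "fmt"

// Toy model of the suspected bug, not the SDK's real types: every
// collection runs the callback into all readers' storage, but drains
// only the collecting reader's storage.
type valueMap map[string]float64

func measure(storages []valueMap, v float64) {
	for _, m := range storages { // observation lands in every storage
		m["uptime"] += v
	}
}

func delta(m valueMap) float64 { // read-and-reset one reader's storage
	d := m["uptime"]
	m["uptime"] = 0
	return d
}

func main() {
	a, b := valueMap{}, valueMap{}
	readers := []valueMap{a, b}

	measure(readers, 50)  // reader A collects; callback observes 50
	fmt.Println(delta(a)) // 50: A's storage drained, B's still holds 50

	measure(readers, 50)  // reader B collects; B now holds 50 + 50
	fmt.Println(delta(b)) // 100: doubled, matching the reported values
}
```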
@MrAlias or @MadVikingGod may have a better intuition for how this could happen.
Can we ping anyone from the core team who can shed light on this?