opentelemetry-go: Using multiple concurrent readers makes async metrics produce wrong results
Environment
- OS: Ubuntu
- Architecture: x86
- Go Version: 1.21.3
- opentelemetry-go version: v1.21.0
Steps To Reproduce
Hi, I’m using the go.opentelemetry.io/otel/exporters/prometheus package to implement a custom Prometheus exporter. I use the OTel SDK successfully to export synchronous instruments like counters and histograms. Today I started adding a Float64ObservableCounter to measure the uptime of the server, but this has led to invalid values being reported to the collector. I use the ClickHouse exporter in the OTel Collector. Here is the code of the uptime counter.
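The counter code itself isn’t inlined in the issue; a minimal sketch of such an uptime counter, assuming a startTime captured at process start (the helper name registerUptime is illustrative, the instrument name, description, and unit follow the exported metadata below):

```go
import (
	"context"
	"time"

	"go.opentelemetry.io/otel/metric"
)

// Sketch only: startTime and registerUptime are assumptions, not the
// exact code from prom_registry.go.
var startTime = time.Now()

func registerUptime(meter metric.Meter) error {
	uptime, err := meter.Float64ObservableCounter(
		"uptime",
		metric.WithDescription("The duration since the application started."),
		metric.WithUnit("s"),
	)
	if err != nil {
		return err
	}
	// The callback runs on every collection, i.e. whenever any reader
	// attached to the MeterProvider collects.
	_, err = meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		o.ObserveFloat64(uptime, time.Since(startTime).Seconds())
		return nil
	}, uptime)
	return err
}
```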
It seems that every time Prometheus scrapes the server, gather() (https://github.com/wundergraph/cosmo/blob/main/router/internal/metric/prom_registry.go#L26) is called, and therefore so is the callback of the Float64ObservableCounter. This somehow results in weird values: 27.59, 51.28, 117.59, 111.28.
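For context on why a scrape triggers the callback: the OTel Prometheus exporter is a metric.Reader that registers a prometheus.Collector on the registry it is given, so every Gather drives a full SDK collection. Roughly (this wiring is a sketch, not the actual code from prom_registry.go; newPromReader is a hypothetical helper):

```go
import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	otelprom "go.opentelemetry.io/otel/exporters/prometheus"
)

func newPromReader() (*otelprom.Exporter, error) {
	registry := prometheus.NewRegistry()

	// The exporter registers itself as a collector on this registry.
	promExp, err := otelprom.New(otelprom.WithRegisterer(registry))
	if err != nil {
		return nil, err
	}

	// Every scrape calls registry.Gather(), which drives the exporter's
	// Collect and, with it, every registered async callback.
	http.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
	return promExp, nil
}
```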
"{'service.name': 'cosmo-router', 'telemetry.sdk.language': 'go', 'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.21.0', 'host.name': 'pop-os', 'process.pid': '242383'}",https://opentelemetry.io/schemas/1.21.0,cosmo.router,0.0.1,{},0,"",uptime,The duration since the application started.,s,"{'wg.organization.id': 'c2437287-5bf1-403b-b3a9-84927c3a77d9', 'wg.federated_graph.id': '04f5cc26-42e3-4d29-b97b-b07358af0e29'}",2023-12-03 20:59:44.323340915,2023-12-03 20:59:59.302967180,27.59962459,0,[],[],[],[],[],2,true
"{'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.21.0', 'host.name': 'pop-os', 'process.pid': '242383', 'service.name': 'cosmo-router', 'telemetry.sdk.language': 'go'}",https://opentelemetry.io/schemas/1.21.0,cosmo.router,0.0.1,{},0,"",uptime,The duration since the application started.,s,"{'wg.organization.id': 'c2437287-5bf1-403b-b3a9-84927c3a77d9', 'wg.federated_graph.id': '04f5cc26-42e3-4d29-b97b-b07358af0e29'}",2023-12-03 20:59:44.323340915,2023-12-03 21:00:14.299107597,51.286032241,0,[],[],[],[],[],2,true
"{'host.name': 'pop-os', 'process.pid': '242383', 'service.name': 'cosmo-router', 'telemetry.sdk.language': 'go', 'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.21.0'}",https://opentelemetry.io/schemas/1.21.0,cosmo.router,0.0.1,{},0,"",uptime,The duration since the application started.,s,"{'wg.organization.id': 'c2437287-5bf1-403b-b3a9-84927c3a77d9', 'wg.federated_graph.id': '04f5cc26-42e3-4d29-b97b-b07358af0e29'}",2023-12-03 20:59:44.323340915,2023-12-03 21:00:29.298560178,117.595487252,0,[],[],[],[],[],2,true
"{'host.name': 'pop-os', 'process.pid': '242383', 'service.name': 'cosmo-router', 'telemetry.sdk.language': 'go', 'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.21.0'}",https://opentelemetry.io/schemas/1.21.0,cosmo.router,0.0.1,{},0,"",uptime,The duration since the application started.,s,"{'wg.federated_graph.id': '04f5cc26-42e3-4d29-b97b-b07358af0e29', 'wg.organization.id': 'c2437287-5bf1-403b-b3a9-84927c3a77d9'}",2023-12-03 20:59:44.323340915,2023-12-03 21:00:44.298351894,111.285146526,0,[],[],[],[],[],2,true
Correct would be 14.97, 29.97, 44.98, 59.97, because my export interval is 15s. I get those values after removing sdkmetric.WithReader(promExp) from the MeterProvider.
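The MeterProvider in question has both readers attached, roughly like this (a sketch; sdkmetric aliases go.opentelemetry.io/otel/sdk/metric, and otlpExp stands in for whatever push exporter the router actually uses):

```go
provider := sdkmetric.NewMeterProvider(
	// Push pipeline: a periodic reader exporting every 15s.
	sdkmetric.WithReader(sdkmetric.NewPeriodicReader(otlpExp,
		sdkmetric.WithInterval(15*time.Second))),
	// Pull pipeline: the Prometheus exporter acts as a second reader.
	// Removing this reader yields the correct values.
	sdkmetric.WithReader(promExp),
)
```

The rows below are from the run with the Prometheus reader removed: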
"{'process.pid': '244072', 'service.name': 'cosmo-router', 'telemetry.sdk.language': 'go', 'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.21.0', 'host.name': 'pop-os'}",https://opentelemetry.io/schemas/1.21.0,cosmo.router,0.0.1,{},0,"",uptime,The duration since the application started.,s,"{'wg.organization.id': 'c2437287-5bf1-403b-b3a9-84927c3a77d9', 'wg.federated_graph.id': '04f5cc26-42e3-4d29-b97b-b07358af0e29'}",2023-12-03 21:03:07.652430948,2023-12-03 21:03:22.630876613,14.978405556,0,[],[],[],[],[],2,true
"{'process.pid': '244072', 'service.name': 'cosmo-router', 'telemetry.sdk.language': 'go', 'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.21.0', 'host.name': 'pop-os'}",https://opentelemetry.io/schemas/1.21.0,cosmo.router,0.0.1,{},0,"",uptime,The duration since the application started.,s,"{'wg.organization.id': 'c2437287-5bf1-403b-b3a9-84927c3a77d9', 'wg.federated_graph.id': '04f5cc26-42e3-4d29-b97b-b07358af0e29'}",2023-12-03 21:03:07.652430948,2023-12-03 21:03:37.630596492,29.978150114,0,[],[],[],[],[],2,true
"{'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.21.0', 'host.name': 'pop-os', 'process.pid': '244072', 'service.name': 'cosmo-router', 'telemetry.sdk.language': 'go'}",https://opentelemetry.io/schemas/1.21.0,cosmo.router,0.0.1,{},0,"",uptime,The duration since the application started.,s,"{'wg.organization.id': 'c2437287-5bf1-403b-b3a9-84927c3a77d9', 'wg.federated_graph.id': '04f5cc26-42e3-4d29-b97b-b07358af0e29'}",2023-12-03 21:03:07.652430948,2023-12-03 21:03:52.636781141,44.984334753,0,[],[],[],[],[],2,true
"{'host.name': 'pop-os', 'process.pid': '244072', 'service.name': 'cosmo-router', 'telemetry.sdk.language': 'go', 'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.21.0'}",https://opentelemetry.io/schemas/1.21.0,cosmo.router,0.0.1,{},0,"",uptime,The duration since the application started.,s,"{'wg.organization.id': 'c2437287-5bf1-403b-b3a9-84927c3a77d9', 'wg.federated_graph.id': '04f5cc26-42e3-4d29-b97b-b07358af0e29'}",2023-12-03 21:03:07.652430948,2023-12-03 21:04:07.630888393,59.978436785,0,[],[],[],[],[],2,true
Expected behavior
I’d expect that the readers do not influence each other.
About this issue
- Original URL
- State: closed
- Created 7 months ago
- Comments: 17 (9 by maintainers)
Commits related to this issue
- reproduce #4741 in a unit test — committed to dashpole/opentelemetry-go by dashpole 7 months ago
I’ll investigate
Here is a reproducible example (sketched below). The values are fine as long as you don’t call the Prometheus metrics endpoint. After visiting http://localhost:2223/metrics multiple times and then waiting for the next cyclic export of the OTel exporter, the values grow far too fast (from 30 to 161).
Fix: https://github.com/open-telemetry/opentelemetry-go/pull/4742
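The linked example isn’t inlined above; a self-contained sketch along the same lines (the port matches the comment above; the stdout exporter stands in for the OTLP pipeline, and the meter and instrument names are assumptions):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"

	otelprom "go.opentelemetry.io/otel/exporters/prometheus"
	"go.opentelemetry.io/otel/exporters/stdout/stdoutmetric"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	startTime := time.Now()

	// Pull pipeline: Prometheus exporter registered on its own registry.
	registry := prometheus.NewRegistry()
	promExp, err := otelprom.New(otelprom.WithRegisterer(registry))
	if err != nil {
		log.Fatal(err)
	}

	// Push pipeline: stdout exporter standing in for OTLP, exporting every 15s.
	stdoutExp, err := stdoutmetric.New()
	if err != nil {
		log.Fatal(err)
	}

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(stdoutExp,
			sdkmetric.WithInterval(15*time.Second))),
		sdkmetric.WithReader(promExp),
	)
	defer provider.Shutdown(context.Background())

	meter := provider.Meter("repro")
	uptime, err := meter.Float64ObservableCounter("uptime", metric.WithUnit("s"))
	if err != nil {
		log.Fatal(err)
	}
	_, err = meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		o.ObserveFloat64(uptime, time.Since(startTime).Seconds())
		return nil
	}, uptime)
	if err != nil {
		log.Fatal(err)
	}

	// Scrape http://localhost:2223/metrics a few times, then compare the
	// uptime values in the periodic (stdout) export: they jump ahead.
	http.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":2223", nil))
}
```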
@dashpole great findings. I came to the same conclusion. So every reader should maintain its own aggregator state for asynchronous instruments? If this is confirmed, it is a major bug.
Reporting progress so far: I’ve figured out that when Collect is invoked on a single reader, measure is invoked for all readers, but we then call the aggregation function (e.g. precomputedSum.delta()) only for that reader’s aggregation. This means that on each reader’s collection, the callback increments the value for all readers, while only a single reader’s storage is measured and reset.
Here are a bunch of print statements to illustrate: 0xc000118b60 is the address of one of the sums’ valueMap, and 0xc000118b70 is the other. You can see two measure() calls for each precomputedSum.delta() call.
The scenario above is for a callback that always returns 50. So you can see that when the second reader reads the value, it was already incremented to 50 by the previous reader, and is (incorrectly) incremented again to 100.
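In plain Go, the suspected interaction can be modeled like this (an illustrative toy, not the SDK’s actual types): one measure fan-out writes into every reader’s storage, but each delta drains only the collecting reader’s storage.

```go
package main

import "fmt"

// Toy model of the suspected bug, not the SDK's real types: every
// collection runs the callback into all readers' storage, but drains
// only the collecting reader's storage.
type valueMap map[string]float64

func measure(storages []valueMap, v float64) {
	for _, m := range storages { // observation lands in every storage
		m["uptime"] += v
	}
}

func delta(m valueMap) float64 { // read-and-reset one reader's storage
	d := m["uptime"]
	m["uptime"] = 0
	return d
}

func main() {
	a, b := valueMap{}, valueMap{}
	readers := []valueMap{a, b}

	measure(readers, 50)  // reader A collects; callback observes 50
	fmt.Println(delta(a)) // 50: A's storage drained, B's still holds 50

	measure(readers, 50)  // reader B collects; B now holds 50 + 50
	fmt.Println(delta(b)) // 100: doubled, matching the reported values
}
```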
@MrAlias or @MadVikingGod may have a better intuition for how this could happen.
Can we ping anyone from the core team who can shed light on this?