opentelemetry-collector-contrib: [exporter/prometheusremotewrite] "target" metric causes out-of-order samples errors on load balanced collector deployment

Describe the bug We operate a cluster of multiple opentelemetry-collector instances configured to be load balanced and autoscaled. We recently updated from 0.35 to 0.49. This update caused our backend to start complaining about out-of-order samples on the “target” metric:

2022-07-11T21:19:07.569Z        error   exporterhelper/queued_retry.go:149      Exporting failed. Try enabling retry_on_failure config option.  {"kind": "exporter", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = <nil>: user=[hidden]: err: out of order sample. timestamp=2022-07-11T21:19:00.673Z, series={__name__="target", http_scheme="http", instance="[hidden]:8080", job="cadvisor", net_host_name="[hidden]", net_host_port="8"}

We think the issue stems from how the timestamp of the “target” metric is generated. The timestamp of the sample is derived from the batch of metrics, and because batches are built concurrently in different instances of the collector, their timestamps may not align, so samples arrive at the backend out of order. In addition, because every instance of the collector emits the metric with the exact same set of labels, the series collide.
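
For illustration only (the label values and timestamps below are invented), two replicas that flush their batches at slightly different moments write the identical series with non-monotonic timestamps, and the backend rejects whichever sample arrives with the older timestamp:

# replica A flushes its batch and writes:
target{job="cadvisor", instance="host:8080", ...} 1 @ 2022-07-11T21:19:00.900Z
# replica B flushes a batch that was assembled earlier; same labels, older timestamp, rejected as out of order:
target{job="cadvisor", instance="host:8080", ...} 1 @ 2022-07-11T21:19:00.673Z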

Steps to reproduce

  1. Configure a deployment of at least two opentelemetry-collector instances with identical configuration (a rough sketch follows this list)
  2. Configure a load balancer in front of the collectors (in our case, an AWS ALB)
  3. Send metrics to the load-balanced endpoint.
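
To make the setup concrete, it boils down to several identical collector replicas behind one load-balanced endpoint. The manifest below is a rough sketch only; the names, replica count, image tag, and ports are placeholders rather than our actual resources:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 2                       # at least two replicas, all sharing the same collector config
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.49.0
          args: ["--config=/conf/collector.yaml"]   # the config shown further below
          ports:
            - containerPort: 4317                   # OTLP gRPC
---
# The replicas sit behind a single load-balanced endpoint (an AWS ALB in our case).
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317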

What did you expect to see? We would like an additional label just for the “target” metric. This label would allow a separate series for each instance of opentelemetry-collector. external_labels is not appropriate for our use case because it affects all metrics and causes duplication issues, cardinality explosion, and generally a big mess.
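
As a possible interim workaround (a sketch only: the collector.replica attribute name and the ${POD_NAME} environment variable are assumptions about the deployment, and it relies on non-identifying resource attributes being attached only to the generated “target” series rather than to every metric), each replica could stamp its identity onto the resource via the resource processor:

processors:
  resource:
    attributes:
      - key: collector.replica          # hypothetical attribute name, for illustration only
        value: "${POD_NAME}"            # assumes the pod name is injected as an environment variable
        action: upsert

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [resource]
      exporters: [prometheusremotewrite]

The caveat is that this changes the resource of all data passing through the pipeline, so whether the extra label stays confined to the “target” series depends on the exporter’s translation rules.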

What did you see instead? Many errors in the logs of opentelemetry-collector and multiple dropped samples of the “target” metric because of out-of-order sample errors.

What version did you use? 0.49

What config did you use?

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  prometheusremotewrite:
    endpoint: "${PROMETHEUS_REMOTE_WRITE_ENDPOINT}"
    timeout: 30s
    # external_labels:
    #   replica: "${NAME}"

service:
  extensions: []
  pipelines:
    metrics:
      receivers: [otlp]
      processors: []
      exporters: [prometheusremotewrite]

Environment Deployed in AWS EKS; the remote write endpoint is Amazon Managed Service for Prometheus.

Additional context The “target” metric was added in the following PR: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/8493. There is an interesting comment there about the choice of the sample timestamp.

I am willing to work on a PR.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 4
  • Comments: 25 (24 by maintainers)

Most upvoted comments

@badjware @dashpole can we close this issue now?

@mrblonde91 No, this shouldn’t impact any other metrics.

I think we are probably doing the right thing in the prometheus receiver… The labels config is supposed to add a label to every metric series. If there was an incoming target_info metric, it should add the label to that before it is converted to a resource attribute. https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/9967#issuecomment-1124318434 is also related, but I think there probably needs to be an opt-out for generating target_info in the exporter for cases like these.
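
If such an opt-out were added, the exporter configuration might look something like the sketch below; the option name and shape are assumptions for illustration, not an existing setting in the version discussed in this issue:

exporters:
  prometheusremotewrite:
    endpoint: "${PROMETHEUS_REMOTE_WRITE_ENDPOINT}"
    # hypothetical opt-out for the generated "target"/target_info metric
    target_info:
      enabled: false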