VictoriaMetrics: VMAgent sharding with streaming aggregation causes duplicate series and incorrect aggregation

Describe the bug

When vmagent is sharded and streaming aggregation is enabled, there is nothing to differentiate the output of one vmagent shard from another.

For example, say there is a metric requests_total that should be aggregated into requests_total:30s_without_instance_total. Each shard will output requests_total:30s_without_instance_total with no labels to differentiate the series.

Here is an example of what this ends up doing with 4 shards.

Two graphs, one using the aggregated data, the other using the original data; both delivered by vmagent shards:

sum(rate(input_events:30s_without_instance_pod_total[1m])) vs sum(rate(input_events[1m])) [graph screenshots omitted]

The rate calculation over the sharded aggregates consistently produces these “Charlie Brown” (zig-zag) shaped graphs.

A workaround is to add a distinguishing label to each shard. I did this with the following flag: --remoteWrite.label=vmagent=%{HOSTNAME}

After that, the original and aggregated data are aligned [screenshot omitted].
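
A minimal sketch of the same workaround expressed through the VMAgent CRD, assuming the operator passes extraArgs entries through as vmagent command-line flags and that the %{HOSTNAME} placeholder is expanded from the pod environment (see the maintainer comment below about ENV variable usage):

spec:
  extraArgs:
    # %{HOSTNAME} differs per pod, so every shard attaches a distinct
    # vmagent label to the series it remote-writes
    remoteWrite.label: "vmagent=%{HOSTNAME}"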

To Reproduce

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAgent
metadata:
  name: vmagent
  namespace: monitoring
spec:
  image:
    tag: v1.90.0
  selectAllByDefault: true
  scrapeInterval: "30s"
  replicaCount: 1
  shardCount: 4
  logFormat: json
  resources:
    requests:
      cpu: "2.5"
      memory: "3Gi"
  # Workaround (commented out to reproduce the issue):
  # extraArgs:
  #   remoteWrite.label: "vmagent=%{HOSTNAME}"
  remoteWrite:
    - url: "http://my_vm/insert/0/prometheus/api/v1/write"
      sendTimeout: "2m"
      streamAggrConfig:
        keepInput: false
        rules:
          - match: '{__name__=~"input_.+"}'
            interval: "30s"
            outputs: ["total","sum_samples"]
            without: ["pod", "instance"]

Version

v1.90.0

Logs

No response

Screenshots

No response

Used command-line flags

No response

Additional information

No response

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 16 (2 by maintainers)

Most upvoted comments

Hey there. The problem @Maybeee233 wrote about is indeed reproducible. The issue comes down to where remoteWrite.label currently sits in the metric’s relabeling lifecycle, highlighted in red in the diagram:

[diagram omitted: relabeling pipeline with the remoteWrite.label stage highlighted in red]

Because of this, the label is deleted at one of the later relabeling stages, and it is also removed from the aggregate if it is not specified in the by field.

Judging by the comments in the source code, this is justified by compatibility with Prometheus:

https://github.com/VictoriaMetrics/VictoriaMetrics/blob/c36259fca5ae9c8e58e9d6c56512cdcbedd091c3/app/vmagent/remotewrite/relabel.go#L90-L110

This refers to the following phrase in the Prometheus documentation:

[screenshot of the referenced Prometheus documentation omitted]

But the problem is that the above function doesn’t apply to extra_labels (from promscrape.config); it only applies to remoteWrite.label here:

https://github.com/VictoriaMetrics/VictoriaMetrics/blob/c36259fca5ae9c8e58e9d6c56512cdcbedd091c3/app/vmagent/remotewrite/remotewrite.go#L390

In my view, applying remoteWrite.label at this point looks like a mistake, but changing the behavior of this option now would be a breaking change.

Bottom line: In its current form, this workaround will only work if you additionally specify this label in the by field for every aggregation.
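
As a rough sketch of that bottom line, a rule that keeps the workaround label via by might look like this (the label set in by is an assumption for illustration; the exact dimensions to keep depend on your metrics):

streamAggrConfig:
  keepInput: false
  rules:
    - match: '{__name__=~"input_.+"}'
      interval: "30s"
      outputs: ["total", "sum_samples"]
      # include the per-shard label so aggregates from different shards
      # remain distinct series instead of colliding
      by: ["job", "vmagent"]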

@hagen1778 wdyt?

@Amper would you mind checking the correct usage of ENV variables in the config mentioned by @Maybeee233? I believe this case is related to the docs update we discussed earlier.

@mbrancato @hagen1778

Yes, the problem does occur. It stems from the fact that the same time series (because of the relabeling), with different intermediate values, is pushed from different vmagent shards and written to storage with different timestamps. As a result:

  • with de-duplication enabled, we lose data, because de-duplication keeps a sample from only one of the agents for each interval
  • with de-duplication disabled, we get incorrect data (the “Charlie Brown” shaped graphs)

Proposed solution:

  • by default, add a special label (e.g. vmagent_shard_num) with the shard number to time series produced by streaming aggregation (only on sharded agents)
  • add a configuration parameter to disable this behavior
  • in the release notes, it may be worth warning about a temporary increase in churn rate after the update

We will discuss this proposal with our colleagues and then decide whether we will implement it.


P.S. Exactly the same problem can occur without stream aggregation if relabeling is used in a way that results in the same series on different agent shards.
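
For illustration, a hedged sketch of the kind of relabeling that can trigger the same collision without stream aggregation - the label names are placeholders, and the exact section depends on where the relabeling is defined (e.g. a scrape job’s metric_relabel_configs or the VMAgent equivalent):

metric_relabel_configs:
  # dropping the per-target labels makes series from different shards
  # identical, so they collide in storage just like the aggregates do
  - action: labeldrop
    regex: "instance|pod"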

Both would ship a metric http_requests:30s_without_pod_instance_total, but the values would be disjoint in a time-series fashion. There was no way to tell that they came from two different sets of pods / different shards.

I’d expect VM to still summarize them correctly, even if they have identical labels. The problem would be if their timestamps matched identically as well; in that case, VM starts to deduplicate such data points. Thanks for the context - I’ll try to reproduce this.