telegraf: Outputs.Stackdriver `tags_as_resource_labels` config option doesn't appear to be working as expected

### Relevant telegraf.conf

## Relevant Agent Configuration

[inputs.mem]
  [inputs.mem.tags]
    job = "inputs.mem"

[inputs.processes]
  [inputs.processes.tags]
    job = "inputs.processes"

[[outputs.opentelemetry]]
  service_address = "server:4317"


## Relevant Server Config
## Server (values in {{}} are hydrated by a key-value store when
## Telegraf is deployed):

[global_tags]

# Configuration for telegraf agent
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 199
  metric_buffer_limit = {{.METRIC_BUFFER}}
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = {{.DEBUG}}
  quiet = false

# Receive OpenTelemetry traces, metrics, and logs over gRPC
[[inputs.opentelemetry]]
  ## Override the default (0.0.0.0:4317) destination OpenTelemetry gRPC service
  ## address:port
  service_address = "0.0.0.0:4317"

[[processors.regex]]
  # Rename metric names (measurement names) that don't match Prometheus requirements
  [[processors.regex.metric_rename]]
    pattern = "[^a-zA-Z0-9_]+"
    replacement = "_"

  # Rename tag keys that don't match Prometheus requirements
  [[processors.regex.tag_rename]]
    pattern = "[^a-zA-Z0-9_]+"
    replacement = "_"

  # Rename field keys that don't match Prometheus requirements
  [[processors.regex.field_rename]]
    pattern = "[^a-zA-Z0-9_]+"
    replacement = "_"

# Configuration for sending metrics to GMP
[[outputs.stackdriver]]
  project = "{{.PROJECT_ID}}"
  resource_type = "prometheus_target"
  metric_name_format = "official"
  metric_data_type = "double"
  metric_type_prefix = "prometheus.googleapis.com"
  tags_as_resource_label = ["instance", "job"]
  # Ignore metrics from inputs.internal
  namedrop = ["internal_*"]
  [outputs.stackdriver.resource_labels]
    cluster = "{{.CLUSTER_NAME}}"
    job = "Telegraf"
    instance = "{{.CLUSTER_NAME}}"
    location = "{{.LOCATION}}"
    namespace = "{{.NAMESPACE_LABEL}}"

### Logs from Telegraf

The logs are unremarkable, but when I output what the agent sends to outputs.stackdriver using outputs.file, I can see that the tags are set correctly.

Example: the memory metric has job=inputs.mem and the processes metric has job=inputs.processes:

mem,env=Production_MacOS,host=hostname,instance=hostname,job=inputs.mem active=965160960i,available=1003286528i,used_percent=94.1601037979126,available_percent=5.839896202087402,inactive=954462208i,wired=693956608i,total=17179869184i,used=16176582656i,free=48824320i 1694460660000000000
processes,env=Production_MacOS,host=hostname,instance=hostname,job=inputs.processes blocked=0i,zombies=1i,stopped=0i,running=3i,sleeping=428i,total=432i,unknown=0i,idle=0i 1694460660000000000

But when I look at the metrics in Google or Grafana, I notice that the metrics are sometimes tagged with another input's tag, e.g. job=inputs.mem on a processes metric. From what I can tell it's usually the busiest metric's tag. Additionally, whichever metric's tag value is sent last appears to be the one applied to the entire batch of metrics, including metrics that don't contain a job tag at all. In other words, the default value isn't applied; instead, whatever value the tag last had is used.

### System info

Telegraf 1.27.4

### Docker

_No response_

### Steps to reproduce

  1. Send metrics from a Telegraf client, with a tag called job configured for each input plugin, to a Telegraf server.
  2. Configure the Telegraf server so that it uses tags_as_resource_labels for the job tag.
  3. Send metrics to Google …

### Expected behavior

Each metric will have its job tag (or any other applicable tag) applied as a resource label.

### Actual behavior

The resource label appears to change depending on which tag value was sent last. As you can see in the screenshots, the job tag changed between inputs.mem and inputs.processes, even though the sending agent didn't change. Here's an example of a mismatched job label being applied to two different inputs:

[screenshot]

[screenshot]

What also seems weird is that if you stop sending the tag altogether, the last tag value that was sent continues to be applied instead of the default value:

[inputs.mem]
  # [inputs.mem.tags]
  #   job = "inputs.mem"

[inputs.processes]
  # [inputs.processes.tags]
  #   job = "inputs.processes"

Telegraf Client outputs.file output showing no job tag:

mem,env=Production_MacOS,host=hostname,instance=hostname total=17179869184i,available_percent=5.273199081420898,active=885673984i,free=24256512i,wired=659767296i,available=905928704i,used=16273940480i,used_percent=94.7268009185791,inactive=881672192i 1694532100000000000
[screenshot]

### Additional info

_No response_

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 41 (41 by maintainers)

Most upvoted comments

I lied, I couldn’t wait, lol.

Initial testing looks good. I will have to do a deep dive of the metrics, but so far so good!

Well the debugger might be our savior here. Makes me wonder who is actually using this plugin besides you 😉

# created and now adding time series
metric:{type:"test_mem_value/unknown"}  resource:{type:"global"  labels:{key:"job"  value:"mem"}  labels:{key:"project_id"  value:"projects/[PROJECT]"}}  metric_kind:GAUGE  points:{interval:{end_time:{seconds:1694386800}}  value:{double_value:100}}
map[job:disk project_id:projects/[PROJECT]]

# created and now adding time series
metric:{type:"test_disk_value/unknown"}  resource:{type:"global"  labels:{key:"job"  value:"disk"}  labels:{key:"project_id"  value:"projects/[PROJECT]"}}  metric_kind:GAUGE  points:{interval:{end_time:{seconds:1694386800}}  value:{double_value:42}}

# state of the time series right before we send:
[metric:{type:"test_mem_value/unknown"}  resource:{type:"global"  labels:{key:"job"  value:"disk"}  labels:{key:"project_id"  value:"projects/[PROJECT]"}}  metric_kind:GAUGE  points:{interval:{end_time:{seconds:1694386800}}  value:{double_value:100}}]
[metric:{type:"test_disk_value/unknown"}  resource:{type:"global"  labels:{key:"job"  value:"disk"}  labels:{key:"project_id"  value:"projects/[PROJECT]"}}  metric_kind:GAUGE  points:{interval:{end_time:{seconds:1694386800}}  value:{double_value:42}}]

Do you think this will make it into the Oct 2nd release?

yep!

ate lunch, came back

heh the best type of debugging 😃

I wondered, but nothing catches my eye. Once we create the time series we add it to a bucket for each metric. Then we split the buckets into batches of 200 and send them, see here.
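
Since the "see here" link isn't captured in this copy, here is a rough sketch of what splitting into batches of 200 looks like. The helper name and generic signature are illustrative, not the plugin's actual code; Cloud Monitoring's CreateTimeSeries accepts at most 200 time series per request.

// sendInBatches passes the accumulated time series to send in chunks of at
// most batchSize, mirroring the "batches of 200" behaviour described above.
// Illustrative only; the plugin's real types and error handling differ.
func sendInBatches[T any](items []T, batchSize int, send func([]T) error) error {
	for start := 0; start < len(items); start += batchSize {
		end := start + batchSize
		if end > len(items) {
			end = len(items)
		}
		if err := send(items[start:end]); err != nil {
			return err
		}
	}
	return nil
}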

Let me know what you hear.

See #13912 artifacts in 20-30mins