telegraf: Outputs.Stackdriver `tags_as_resource_labels` config option doesn't appear to be working as expected

### Relevant telegraf.conf

## Relevant Agent Configuration

[inputs.mem]
  [inputs.mem.tags]
    job = "inputs.mem"

[inputs.processes]
  [inputs.processes.tags]
    job = "inputs.processes"

[[outputs.opentelemetry]]
  service_address = "server:4317"


## Relevant Server Config
## Server (values in {{}} are hydrated by a key-value store when
## Telegraf is deployed):

[global_tags]

# Configuration for telegraf agent
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 199
  metric_buffer_limit = {{.METRIC_BUFFER}}
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = {{.DEBUG}}
  quiet = false

# Receive OpenTelemetry traces, metrics, and logs over gRPC
[[inputs.opentelemetry]]
  ## Override the default (0.0.0.0:4317) destination OpenTelemetry gRPC service
  ## address:port
  service_address = "0.0.0.0:4317"

[[processors.regex]]
  # Rename metric names (measurement names) that don't match Prometheus requirements
  [[processors.regex.metric_rename]]
    pattern = "[^a-zA-Z0-9_]+"
    replacement = "_"

  # Rename tag keys that don't match Prometheus requirements
  [[processors.regex.tag_rename]]
    pattern = "[^a-zA-Z0-9_]+"
    replacement = "_"

  # Rename field keys that don't match Prometheus requirements
  [[processors.regex.field_rename]]
    pattern = "[^a-zA-Z0-9_]+"
    replacement = "_"

# Configuration for sending metrics to GMP
[[outputs.stackdriver]]
  project = "{{.PROJECT_ID}}"
  resource_type = "prometheus_target"
  metric_name_format = "official"
  metric_data_type = "double"
  metric_type_prefix = "prometheus.googleapis.com"
  tags_as_resource_label = ["instance", "job"]
  # Ignore metrics from inputs.internal
  namedrop = ["internal_*"]
  [outputs.stackdriver.resource_labels]
    cluster = "{{.CLUSTER_NAME}}"
    job = "Telegraf"
    instance = "{{.CLUSTER_NAME}}"
    location = "{{.LOCATION}}"
    namespace = "{{.NAMESPACE_LABEL}}"

### Logs from Telegraf

The logs are unremarkable, but when I output what the agent sends to outputs.stackdriver using outputs.file, I can see that the tags are set correctly.

Example: the memory metric has job=inputs.mem and the processes metric has job=inputs.processes:

mem,env=Production_MacOS,host=hostname,instance=hostname,job=inputs.mem active=965160960i,available=1003286528i,used_percent=94.1601037979126,available_percent=5.839896202087402,inactive=954462208i,wired=693956608i,total=17179869184i,used=16176582656i,free=48824320i 1694460660000000000
processes,env=Production_MacOS,host=hostname,instance=hostname,job=inputs.processes blocked=0i,zombies=1i,stopped=0i,running=3i,sleeping=428i,total=432i,unknown=0i,idle=0i 1694460660000000000

But when I look at the metrics in Google or Grafana, I notice that the metrics are sometimes tagged with another input's tag, e.g. job=inputs.mem on a processes metric. From what I can tell it's usually the busiest metric's tag. Additionally, whichever metric's tag value is sent last appears to be the one applied to the entire batch of metrics, including metrics that don't contain a job tag at all. In other words, the default value isn't applied; instead, whatever value the tag last had is used.

### System info

Telegraf 1.27.4

### Docker

_No response_

### Steps to reproduce

  1. Send metrics from a Telegraf client, with a tag called job configured for each input plugin, to a Telegraf server.
  2. Configure the Telegraf server so that it uses tags_as_resource_labels for the job tag.
  3. Send metrics to Google …

### Expected behavior

Each metric will have its job tag (or any other applicable tag) applied as a resource label.

### Actual behavior

The resource label appears to change depending on which tag value was sent last. As you can see in the screenshots, the job tag changed between inputs.mem and inputs.processes, even though the sending agent didn't change. Here's an example of a mismatched job label being applied to two different inputs:

[screenshot]

[screenshot]

What also seems weird is that if you stop sending the tag altogether, the last tag value that was sent continues to be applied instead of the default value:

[inputs.mem]
  # [inputs.mem.tags]
  #   job = "inputs.mem"

[inputs.processes]
  # [inputs.processes.tags]
  #   job = "inputs.processes"

Telegraf Client outputs.file output showing no job tag:

mem,env=Production_MacOS,host=hostname,instance=hostname total=17179869184i,available_percent=5.273199081420898,active=885673984i,free=24256512i,wired=659767296i,available=905928704i,used=16273940480i,used_percent=94.7268009185791,inactive=881672192i 1694532100000000000
[screenshot]

### Additional info

_No response_

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 41 (41 by maintainers)

Most upvoted comments

I lied, I couldn’t wait, lol.

Initial testing looks good. I will have to do a deep dive of the metrics, but so far so good!

Well the debugger might be our savior here. Makes me wonder who is actually using this plugin besides you 😉

# created and now adding time series
metric:{type:"test_mem_value/unknown"}  resource:{type:"global"  labels:{key:"job"  value:"mem"}  labels:{key:"project_id"  value:"projects/[PROJECT]"}}  metric_kind:GAUGE  points:{interval:{end_time:{seconds:1694386800}}  value:{double_value:100}}
map[job:disk project_id:projects/[PROJECT]]

# created and now adding time series
metric:{type:"test_disk_value/unknown"}  resource:{type:"global"  labels:{key:"job"  value:"disk"}  labels:{key:"project_id"  value:"projects/[PROJECT]"}}  metric_kind:GAUGE  points:{interval:{end_time:{seconds:1694386800}}  value:{double_value:42}}

# state of the time series right before we send:
[metric:{type:"test_mem_value/unknown"}  resource:{type:"global"  labels:{key:"job"  value:"disk"}  labels:{key:"project_id"  value:"projects/[PROJECT]"}}  metric_kind:GAUGE  points:{interval:{end_time:{seconds:1694386800}}  value:{double_value:100}}]
[metric:{type:"test_disk_value/unknown"}  resource:{type:"global"  labels:{key:"job"  value:"disk"}  labels:{key:"project_id"  value:"projects/[PROJECT]"}}  metric_kind:GAUGE  points:{interval:{end_time:{seconds:1694386800}}  value:{double_value:42}}]

Do you think this will make it into the Oct 2nd release?

yep!

ate lunch, came back

heh the best type of debugging 😃

I wondered, but nothing catches my eye. Once we create the time series we add it to a bucket for each metric. Then we split the buckets into batches of 200 and send them, see here.
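
Since the "see here" link isn't captured in this copy, here is a rough sketch of what splitting into batches of 200 looks like. The helper name and generic signature are illustrative, not the plugin's actual code; Cloud Monitoring's CreateTimeSeries accepts at most 200 time series per request.

// sendInBatches passes the accumulated time series to send in chunks of at
// most batchSize, mirroring the "batches of 200" behaviour described above.
// Illustrative only; the plugin's real types and error handling differ.
func sendInBatches[T any](items []T, batchSize int, send func([]T) error) error {
	for start := 0; start < len(items); start += batchSize {
		end := start + batchSize
		if end > len(items) {
			end = len(items)
		}
		if err := send(items[start:end]); err != nil {
			return err
		}
	}
	return nil
}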

Let me know what you hear.

See #13912 artifacts in 20-30mins