telegraf: A Metric that is Rejected by Google API Blocks new Metrics from Being Sent
### Relevant telegraf.conf
#### Abbreviated Server Config
[global_tags]
# Configuration for telegraf agent
[agent]
interval = "10s"
round_interval = true
metric_batch_size = 199
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = ""
debug = true
quiet = false
[[outputs.stackdriver]]
project = "PROJECT"
resource_type = "prometheus_target"
metric_name_format = "official"
metric_data_type = "double"
metric_type_prefix = "prometheus.googleapis.com"
tags_as_resource_label = ["instance"]
[outputs.stackdriver.resource_labels]
cluster = "onprem"
job = "test"
instance = "COMPUTERNAME"
location = "us-east1-b"
namespace = "store"
# Receive OpenTelemetry traces, metrics, and logs over gRPC
[[inputs.opentelemetry]]
service_address = "0.0.0.0:443"
#### Client Agent Config
[global_tags]
env = "Production_MacOS"
# rack = "1a"
# Environment variables can be used as tags, and throughout the config file
# user = "$USER"
# Configuration for telegraf agent
[agent]
## Default data collection interval for all inputs
interval = "1m"
## Rounds collection interval to 'interval'
## ie, if interval="10s" then always collect on :00, :10, :20, etc.
round_interval = true
## Telegraf will send metrics to outputs in batches of at most
## metric_batch_size metrics.
## This controls the size of writes that Telegraf sends to output plugins.
metric_batch_size = 50
## For failed writes, telegraf will cache metric_buffer_limit metrics for each
## output, and will flush this buffer on a successful write. Oldest metrics
## are dropped first when this buffer fills.
## This buffer only fills when writes fail to output plugin(s).
metric_buffer_limit = 10000
## Collection jitter is used to jitter the collection by a random amount.
## Each plugin will sleep for a random time within jitter before collecting.
## This can be used to avoid many plugins querying things like sysfs at the
## same time, which can have a measurable effect on the system.
collection_jitter = "0s"
## Default flushing interval for all outputs. Maximum flush_interval will be
## flush_interval + flush_jitter
flush_interval = "10s"
## Jitter the flush interval by a random amount. This is primarily to avoid
## large write spikes for users running a large number of telegraf instances.
## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
flush_jitter = "0s"
## By default or when set to "0s", precision will be set to the same
## timestamp order as the collection interval, with the maximum being 1s.
## ie, when interval = "10s", precision will be "1s"
## when interval = "250ms", precision will be "1ms"
## Precision will NOT be used for service inputs. It is up to each individual
## service input to set the timestamp at the appropriate precision.
## Valid time units are "ns", "us" (or "µs"), "ms", "s".
precision = ""
## Logging configuration:
## Run telegraf with debug log messages.
debug = true
## Run telegraf in quiet mode (error log messages only).
quiet = false
## Specify the log file name. The empty string means to log to stderr.
logfile = ""
## The logfile will be rotated after the time interval specified. When set
## to 0 no time based rotation is performed.
logfile_rotation_interval = "24h"
## The logfile will be rotated when it becomes larger than the specified
## size. When set to 0 no size based rotation is performed.
logfile_rotation_max_size = "50MB"
## Maximum number of rotated archives to keep, any older logs are deleted.
## If set to -1, no archives are removed.
logfile_rotation_max_archives = 1
## Override default hostname, if empty use os.Hostname()
hostname = ""
## If set to true, do not set the "host" tag in the telegraf agent.
omit_hostname = false
[[inputs.disk]]
[[inputs.mem]]
[[outputs.opentelemetry]]
service_address = "telegrafServer:443"
### Logs from Telegraf
[2023-08-23T21:23:04Z E! [agent] Error writing to outputs.stackdriver: rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{location:us-central1-a,instance:COMPUTERNAME,namespace:store,cluster:st0000,job:test} timeSeries[0-71]: prometheus.googleapis.com/disk_inodes_total_gauge/gauge{mode:ro,fstype:apfs,path:/,env:Production_MacOS,host:COMPUTERNAME,device:disk3s1s1}
### System info
1.27.4
### Docker
_No response_
### Steps to reproduce
1. Buffer metrics on the client side (disk metrics seem to be more affected, but I am not sure about that).
2. Restore network connectivity to the client so that the buffer can flush.
3. Metrics reach the Telegraf server, but one or more are rejected by the Google API for whatever reason.
4. The buffer reaches its limit and starts dropping the oldest metrics until the offending metrics are gone, and then the buffer flushes the rest to Google (a simplified sketch of this oldest-first drop behavior follows after these steps).
...
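For illustration, here is a simplified Go sketch of the oldest-first eviction behavior described in step 4. The names are hypothetical and this is not Telegraf's actual buffer implementation; it only shows how a bounded buffer evicts its oldest entries once the limit is exceeded.

```go
package main

import "fmt"

// metricBuffer is a simplified stand-in for Telegraf's per-output buffer:
// it holds at most `limit` metrics and drops the oldest entries on overflow.
type metricBuffer struct {
	limit   int
	metrics []string // the real buffer holds telegraf.Metric values
}

// add appends a metric, evicting the oldest entries once the limit is exceeded.
func (b *metricBuffer) add(m string) {
	b.metrics = append(b.metrics, m)
	if over := len(b.metrics) - b.limit; over > 0 {
		fmt.Printf("dropping %d oldest metric(s)\n", over)
		b.metrics = b.metrics[over:]
	}
}

func main() {
	b := &metricBuffer{limit: 3}
	for _, m := range []string{"disk-1", "disk-2", "mem-1", "mem-2"} {
		b.add(m)
	}
	fmt.Println(b.metrics) // [disk-2 mem-1 mem-2]: the oldest entry was dropped
}
```

In the scenario above, the rejected batch at the head of the buffer keeps failing to write, so nothing behind it can flush until those oldest entries are eventually evicted this way.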
### Expected behavior
Metrics buffer on the client, the connection to the server is restored, and metrics that are rejected more than a few times are dropped so the buffer can flush normally.
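For illustration only, a minimal Go sketch of the behavior requested above, assuming a hypothetical per-batch failure counter; Telegraf does not currently work this way.

```go
package main

import "fmt"

// batch is a hypothetical buffered batch with a count of failed write attempts.
type batch struct {
	metrics  []string
	failures int
}

// maxRetries is an assumed threshold: give up on a batch after a few rejections.
const maxRetries = 3

// flush reports whether the batch should be removed from the buffer, either
// because the write succeeded or because it was rejected too many times,
// so that newer metrics queued behind it can still be flushed.
func flush(b *batch, write func([]string) error) bool {
	if err := write(b.metrics); err != nil {
		b.failures++
		if b.failures >= maxRetries {
			fmt.Printf("dropping %d metric(s) after %d failed writes: %v\n",
				len(b.metrics), b.failures, err)
			return true // give up on this batch
		}
		return false // keep it and retry on the next flush interval
	}
	return true // written successfully
}

func main() {
	b := &batch{metrics: []string{"disk_inodes_total ..."}}
	alwaysRejected := func([]string) error { return fmt.Errorf("InvalidArgument") }
	for i := 0; i < maxRetries; i++ {
		if flush(b, alwaysRejected) {
			break // batch dropped after repeated rejection; the buffer keeps draining
		}
	}
}
```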
### Actual behavior
Metrics continue to buffer on the server until the rejected metrics are dropped from the buffer; only then does the server flush normally again.
### Additional info
During lunch my laptop was sending metrics to a Telegraf server in Kubernetes. While I was away my computer went to sleep and I lost my VPN connection. I was monitoring the Telegraf server in Grafana and noticed that the metrics-written counter wasn't increasing (the server is scraped by Prometheus, so collection of Telegraf health metrics is out of band from standard collection). Once I realized the Telegraf agent on my laptop was buffering, I reconnected the VPN and all the metrics flushed from my agent's buffer to the server.
When I checked the logs for the Telegraf server, I noticed the following error message:
[2023-08-23T21:23:04Z E! [agent] Error writing to outputs.stackdriver: rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{location:us-central1-a,instance:COMPUTERNAME,namespace:store,cluster:st0000,job:test} timeSeries[0-71]: prometheus.googleapis.com/disk_inodes_total_gauge/gauge{mode:ro,fstype:apfs,path:/,env:Production_MacOS,host:COMPUTERNAME,device:disk3s1s1}
This single metric seems to be causing all metrics from the plugin to fail to flush.
Additionally, I was sending both Memory and Disk metrics, but only the disk metrics appeared to have this problem.
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Comments: 31 (31 by maintainers)
Commits related to this issue
- fix(outputs.stackdriver): Update error messages to drop metrics fixes: #13826 — committed to powersj/telegraf by powersj 10 months ago
- fix(outputs.stackdriver): Drop metrics on any invalid argument fixes: #13826 — committed to powersj/telegraf by powersj 10 months ago
Thank you for the testing; it looks like we are headed in the right direction.
Sven and I have discussed this message and behavior before. As-is, we say we successfully wrote metrics whenever we do not get an error. Ideally, we would have two types of errors, retryable and non-retryable, but that is not something we have gotten around to.
This is why I changed the message in stackdriver from a debug message, which most would miss, to a warning so that the logs at least show that, no, you did not write successfully, and we won’t be retrying.
You are on the same page as me. I wanted to think about this over the weekend as I hate parsing error messages like this. I briefly wondered if we could drop metrics anytime we got an error that was not related to connectivity, but that is too broad.
Let me think about a better way to handle this and get back to you early next week.
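For reference, the fix commits above point toward treating a gRPC InvalidArgument response as non-retryable. A minimal Go sketch of that idea, with hypothetical names (this is not the actual outputs.stackdriver code):

```go
package main

import (
	"log"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// writeBatch is a hypothetical output write path; send stands in for the call
// that pushes a batch of time series to the Cloud Monitoring API.
func writeBatch(send func() error) error {
	err := send()
	if err == nil {
		return nil
	}
	// InvalidArgument means the API has permanently rejected the request
	// (for example, points written more frequently than the metric's maximum
	// sampling period), so retrying the same batch will never succeed. Log a
	// warning and report success so the batch leaves the buffer instead of
	// blocking newer metrics.
	if st, ok := status.FromError(err); ok && st.Code() == codes.InvalidArgument {
		log.Printf("W! dropping batch rejected by the API: %v", err)
		return nil
	}
	// Anything else (Unavailable, DeadlineExceeded, ...) is assumed to be
	// transient; return it so the agent keeps the batch and retries later.
	return err
}

func main() {
	// A send that fails with InvalidArgument is dropped (nil returned);
	// a transient error would be propagated and retried instead.
	err := writeBatch(func() error {
		return status.Error(codes.InvalidArgument, "points written too frequently")
	})
	log.Println("writeBatch returned:", err) // nil: batch dropped, buffer cleared
}
```

The design choice here is to report success for a batch the API has permanently rejected, so the buffer is cleared and newer metrics are not blocked, while genuinely transient errors still go through the normal retry path.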
Thank you! You too!