telegraf: Startup failure with caching on 1.19

Relevant telegraf.conf:

###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################
  
# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  ## The full HTTP or UDP endpoint URL for your InfluxDB instance.
  ## Multiple urls can be specified as part of the same cluster,
  ## this means that only ONE of the urls will be written to each interval.
  # urls = ["udp://localhost:8089"] # UDP endpoint example
  urls = ["http://hostname:8086"] # required
  ## The target database for metrics (telegraf will create it if not exists).
  database = "metrics" # required
  
  ## Retention policy to write to. Empty string writes to the default rp.
  retention_policy = ""
  ## Write consistency (clusters only), can be: "any", "one", "quorum", "all"
  write_consistency = "any"
  
  ## Write timeout (for the InfluxDB client), formatted as a string.
  ## If not provided, will default to 5s. 0s means no timeout (not recommended).
  timeout = "5s"
  username=""
  password=""

  ## Set the user agent for HTTP POSTs (can be useful for log differentiation)
  # user_agent = "telegraf"
  ## Set UDP payload size, defaults to InfluxDB UDP Client default (512 bytes)
  # udp_payload = 512
  
  ## Optional SSL Config
  # ssl_ca = "/etc/telegraf/ca.pem"
  # ssl_cert = "/etc/telegraf/cert.pem"
  # ssl_key = "/etc/telegraf/key.pem"
  ## Use SSL but skip chain & host verification
  # insecure_skip_verify = false
  
  
###############################################################################
#                            PROCESSOR PLUGINS                                #
###############################################################################


# # Print all metrics that pass through this filter.
# [[processors.printer]]



###############################################################################
#                            AGGREGATOR PLUGINS                               #
###############################################################################

# # Keep the aggregate min/max of each metric passing through.
# [[aggregators.minmax]]
#   ## General Aggregator Arguments:
#   ## The period on which to flush & clear the aggregator.
#   period = "30s"
#   ## If true, the original metric will be dropped by the
#   ## aggregator and will not get sent to the output plugins.
#   drop_original = false



###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################

# Read metrics about cpu usage
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics.
  collect_cpu_time = false
  
# Read metrics about system load & uptime
[[inputs.system]]
  # no configuration
  
# Read metrics about swap memory usage
[[inputs.swap]]
  # no configuration

# Read metrics about memory usage
[[inputs.mem]]
  # no configuration

[[inputs.net]]
  interfaces = ["ens5"]

# Read metrics about system load & uptime
# Statsd Server
[[inputs.statsd]]
  ## Address and port to host UDP listener on
  service_address = ":8125"
  
  ## The following configuration options control when telegraf clears its cache
  ## of previous values. If set to false, then telegraf will only clear its
  ## cache when the daemon is restarted.
  ## Reset gauges every interval (default=true)
  delete_gauges = true
  ## Reset counters every interval (default=true)
  delete_counters = true
  ## Reset sets every interval (default=true)
  delete_sets = true
  ## Reset timings & histograms every interval (default=true)
  delete_timings = true
  
  ## Percentiles to calculate for timing & histogram stats
  percentiles = [90]
  
  ## separator to use between elements of a statsd metric
  metric_separator = "."
  
  ## Parses datadog extensions to the statsd format
  datadog_extensions = true

  ## Statsd data translation templates, more info can be read here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md#graphite
  # templates = [
  #     "cpu.* measurement*"
  # ]

  ## Number of UDP messages allowed to queue up, once filled,
  ## the statsd server will start dropping packets
  allowed_pending_messages = 10000
  
  ## Number of timing/histogram values to track per-measurement in the
  ## calculation of percentiles. Raising this limit increases the accuracy
  ## of percentiles but also increases the memory usage and cpu time.
  percentile_limit = 1000
  
# Generic socket listener capable of handling multiple socket types.
[[inputs.socket_listener]]
  ## URL to listen on
  service_address = "udp://:8126"
  
  ## Maximum socket buffer size (in bytes when no unit specified).
  ## For stream sockets, once the buffer fills up, the sender will start backing up.
  ## For datagram sockets, once the buffer fills up, metrics will start dropping.
  ## Defaults to the OS default.
  read_buffer_size = "16MiB"
    
  ## Data format to consume.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
  data_format = "influx"
  
  tag_keys = [
    "server",
    "file_transfer_state",
    "file_transfer_error"
  ] 

System info:

Telegraf 1.19, Ubuntu 18.04

Steps to reproduce:

  1. Install telegraf 1.18 with above config using influx apt repository
  2. Things work
  3. Upgrade to telegraf 1.19
  4. Look for error in /var/log/syslog

Expected behavior:

Telegraf should continue to run as it did

Actual behavior:

failed to open. Ignored. open /etc/telegraf/.cache/snowflake/ocsp_response_cache.json: no such file or directory\n" func="gosnowflake.(*defaultLogger).Errorf" file="log.go:120"

Reverting to 1.18 fixes the problem.

About this issue

  • State: closed
  • Created 3 years ago

Most upvoted comments

Guys… as someone with a Linux administration background:

  1. /etc/telegraf probably shouldn’t be the default home for telegraf user, especially if you want to keep cache files there. Depending on the stuff you want to keep in there I would vote for something like: /var/lib/telegraf/

  2. It would be great to have the .cache/ directory independent and configurable in telegraf.conf; the default should be something like /var/cache/telegraf/ or /var/lib/telegraf/cache

Looks like this is caused by the addition of the SQL output plugin in #9280.

The gosnowflake library uses a directory for caching OCSP responses. The directory is expected to follow the XDG Base Directory Specification, but rather than reading the cache dir path from an environment variable, the library hard-codes the default path of $HOME/.cache.

This then conflicts with the preinstall scripts in the packaged versions of telegraf, which create the telegraf user with the home dir set to /etc/telegraf, which is not writable (and nor should it be).

...
if ! id telegraf &>/dev/null; then
    useradd -r -M telegraf -s /bin/false -d /etc/telegraf -g telegraf
fi
...
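
To check whether a particular install is affected, something like the following rough sketch might help. It assumes, as described above, that the cache path is simply <home>/.cache/snowflake; the function names are made up for illustration and are not part of telegraf or gosnowflake.

```shell
#!/bin/sh
# Hypothetical helpers. The <home>/.cache/snowflake layout is taken from the
# description in this thread, not from the library's documentation.

# Print the home directory recorded for a user in the passwd database.
home_of() {
    getent passwd "$1" | cut -d: -f6
}

# Print the OCSP cache directory that would be derived from a home directory.
snowflake_cache_dir() {
    printf '%s/.cache/snowflake\n' "$1"
}

# Succeed if the directory exists and is writable, or if its nearest existing
# ancestor is a writable directory (so the missing parts could be created).
cache_dir_usable() {
    dir=$1
    while [ ! -e "$dir" ]; do
        dir=$(dirname "$dir")
    done
    [ -d "$dir" ] && [ -w "$dir" ]
}
```

The writability test applies to whoever runs it, so this would need to be saved as a script and run as the telegraf user (for example via sudo -u telegraf sh) to say anything about that account. With the packaged home dir, cache_dir_usable "$(snowflake_cache_dir "$(home_of telegraf)")" would be expected to fail.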

So what is the solution to this problem? Just change the access rights for the telegraf path?

I have changed the config from percentiles = [90] to percentiles = [90.0] and indeed it does start up now. Thank you for the suggestion.

I’m facing the same problem with Telegraf 1.19.3 (git: HEAD a799489f). The daemon hangs in a loop and is not able to send metrics.

Aug 28 00:16:13 <hostname> systemd[1]: telegraf.service: Failed with result 'exit-code'.
Aug 28 00:16:13 <hostname> systemd[1]: telegraf.service: Scheduled restart job, restart counter is at 4.
Aug 28 00:16:13 <hostname> systemd[1]: Stopped The plugin-driven server agent for reporting metrics into InfluxDB.
Aug 28 00:16:13 <hostname> systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Aug 28 00:16:13 <hostname> telegraf[35470]: time="2021-08-28T00:16:13+02:00" level=error msg="failed to create cache directory. /etc/telegraf/.cache/snowflake, err: mkdir /etc/telegraf/.cache: permission denied. ignored\n" func="gosnowflake.(*defaultLogger).Errorf" file="log.go:120"
Aug 28 00:16:13 <hostname> telegraf[35470]: time="2021-08-28T00:16:13+02:00" level=error msg="failed to open. Ignored. open /etc/telegraf/.cache/snowflake/ocsp_response_cache.json: no such file or directory\n" func="gosnowflake.(*defaultLogger).Errorf" file="log.go:120"
Aug 28 00:16:13 <hostname> telegraf[35470]: 2021-08-27T22:16:13Z I! Starting Telegraf 1.19.3
Aug 28 00:16:13 <hostname> telegraf[35470]: 2021-08-27T22:16:13Z E! [telegraf] Error running agent: Error loading config file /etc/telegraf/telegraf.conf: open /etc/telegraf/telegraf.conf: permission denied
Aug 28 00:16:13 <hostname> systemd[1]: telegraf.service: Main process exited, code=exited, status=1/FAILURE
Aug 28 00:16:13 <hostname> systemd[1]: telegraf.service: Failed with result 'exit-code'.
Aug 28 00:16:14 <hostname> systemd[1]: telegraf.service: Scheduled restart job, restart counter is at 5.
Aug 28 00:16:14 <hostname> systemd[1]: Stopped The plugin-driven server agent for reporting metrics into InfluxDB.
Aug 28 00:16:14 <hostname> systemd[1]: telegraf.service: Start request repeated too quickly.
Aug 28 00:16:14 <hostname> systemd[1]: telegraf.service: Failed with result 'exit-code'.
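
In the log above, the gosnowflake lines are logged at error level but say "ignored", while the line right before the process exits with status=1 is the "E! [telegraf] Error running agent" one. A small filter like this sketch can help separate the two when reading the journal; the function name is hypothetical and the matching is based only on the line formats shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper: keep lines carrying Telegraf's own "E!" error marker
# and drop anything emitted through the gosnowflake logger.
telegraf_errors() {
    grep -F ' E! ' | grep -vF 'gosnowflake.'
}
```

Possible usage: journalctl -u telegraf --no-pager | telegraf_errors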

Glad to hear this is not preventing telegraf from starting. It looks like the gosnowflake errors are a nuisance but not something that prevents existing .deb users from upgrading to 1.19.0. Since the extra errors aren’t critical I won’t remove snowflake. I’ll start looking for a solution to use snowflake in a way that doesn’t cause errors, maybe by providing another writable directory or disabling the OCSP caching.

It looks like in 1.19.0 the statsd plugin changed the type of the percentiles in the TOML config. This isn’t a backward-compatible change, so it’s a bug. It was changed in #8969 as part of a cleanup refactor. I filed a separate issue for the statsd percentile problem and I’ll leave this issue open to remind us to fix the gosnowflake errors.
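
For anyone wanting to check configs before upgrading, here is a rough sketch that flags percentiles lines still using integer values (the thread reports that changing [90] to [90.0] lets 1.19 start). The function name is made up, and the pattern only handles single-line arrays that are not commented out.

```shell
#!/bin/sh
# Hypothetical helper: print, with line numbers, uncommented single-line
# "percentiles = [...]" settings that contain at least one integer-only
# element, e.g. [90] or [50.0, 99]. Values like [90.0] are not reported.
find_integer_percentiles() {
    grep -nE '^[[:space:]]*percentiles[[:space:]]*=[[:space:]]*\[([^]]*,)?[[:space:]]*[0-9]+[[:space:]]*[],]' "$1"
}
```

Possible usage: find_integer_percentiles /etc/telegraf/telegraf.conf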