kapacitor: Simple TICKscript that will always error

Duplicate of https://github.com/influxdata/telegraf/issues/2444

Bug report

Hey. For some reason, the system measurement (load1/5/15, uptime, etc.) is sent by Telegraf as two distinct lines.

system,host=TICKAlerta load1=0,load5=0.03,load15=0.03,n_users=1i,n_cpus=2i 1487579590000000000
system,host=TICKAlerta uptime_format=" 0:13",uptime=807i 1487579590000000000

This essentially breaks stream processing of the system measurement with Kapacitor, as the relevant field is missing 50% of the time.

E! error evaluating expression for level CRITICAL: no field or tag exists for load1

This can be verified by looking at the data points as seen by Kapacitor:

{
 "Name": "system",
 "Database": "telegraf",
 "RetentionPolicy": "default",
 "Group": "host=host1.ex",
 "Dimensions": {
   "ByName": false,
   "TagNames": [
     "host"
   ]
 },
 "Tags": {
   "environment": "offsite",
   "host": "host1.ex",
   "osname": "Ubuntu",
   "virtual": "physical"
 },
 "Fields": {
   "load1": 0,
   "load15": 0.05,
   "load5": 0.01,
   "n_cpus": 4,
   "n_users": 0
 },
 "Time": "2017-02-14T12:38:40Z"
}
{
 "Name": "system",
 "Database": "telegraf",
 "RetentionPolicy": "default",
 "Group": "host=host1.ex",
 "Dimensions": {
   "ByName": false,
   "TagNames": [
     "host"
   ]
 },
 "Tags": {
   "environment": "offsite",
   "host": "host1.ex",
   "osname": "Ubuntu",
   "virtual": "physical"
 },
 "Fields": {
   "uptime": 5278035,
   "uptime_format": "61 days,  2:07"
 },
 "Time": "2017-02-14T12:38:40Z"
}

Relevant telegraf.conf:

[global_tags]

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = ""
  hostname = ""
  omit_hostname = false

[[outputs.file]]
  files = ["stdout", "/tmp/metrics.out"]
  data_format = "influx"

[[inputs.system]]
  # no configuration

System info:

As far as I know, this is present in Telegraf versions 1.0, 1.1, and 1.2. Tested on Ubuntu and Debian LTS versions (precise, trusty, xenial, jessie).

Steps to reproduce:

Telegraf

Use the included telegraf config file.

telegraf --config telegraf.conf --debug
cat /tmp/metrics.out

Kapacitor

var warn_threshold = 4
var crit_threshold = 10

var period = 1h
var every = 1m

var data = stream
 |from()
   .database('telegraf')
   .retentionPolicy('default')
   .measurement('system')
   .groupBy('host')
 |log()
 |window()
   .period(period)
   .every(every)
 |last('load1')
   .as('stat')
Then check the Kapacitor log:

grep load1 /var/log/kapacitor/kapacitor.log

Expected behavior:

A single line for the system measurement.
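Based on the sample lines above, that would look something like:

system,host=TICKAlerta load1=0,load5=0.03,load15=0.03,n_users=1i,n_cpus=2i,uptime_format=" 0:13",uptime=807i 1487579590000000000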

Actual behavior:

Two distinct lines for the system measurement.

Since the data for a series is split into two lines, InfluxDB forwards it to Kapacitor as two lines, and therefore it's possible to write a TICKscript that will perpetually fail:

var data = stream
    |from()
        .measurement('system')
    |eval(lambda: "load1" / "uptime")
        .as('ratio')

Defined and enabled with

kapacitor define example -tick example.tick -dbrp telegraf.autogen -type stream
kapacitor enable example

it will perpetually error out, even though all of the data does exist.
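A partial Kapacitor-side workaround, sketched here on the assumption that the running Kapacitor version supports isPresent() in lambda expressions, is to filter out the half-points that lack the field before using it. This only helps when an expression needs fields from one half of the split; an expression like "load1" / "uptime" needs both halves in a single point and still requires the data to be merged upstream.

// Sketch only: assumes isPresent() is available in this Kapacitor version.
// Drop the uptime-only points so that nodes referencing load1 never see a
// point where the field is missing.
var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('default')
        .measurement('system')
        .groupBy('host')
    |where(lambda: isPresent("load1"))
    |window()
        .period(1h)
        .every(1m)
    |last('load1')
        .as('stat')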

As I see it, there are four ways to solve this problem.

  1. Kapacitor can do deduplication of incoming lines with a common series key and timestamp.
  2. InfluxDB can do deduplication of the data it reports in subscriptions for lines with a common series key and timestamp.
  3. Telegraf can report the data as a single line (see the config sketch after this list).
  4. We can provide documentation that notes this as a rough edge and gives users a workaround.
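For option 3, a Telegraf-side sketch: later Telegraf releases (well after this issue was opened) ship a merge aggregator that combines metrics sharing a measurement, tag set, and timestamp into a single metric before output. Assuming such a version is available, the relevant config would look roughly like this:

# Sketch: merge metrics that share a measurement, tag set, and timestamp
# into a single line before output. Assumes a Telegraf version that ships
# the merge aggregator (it did not exist when this issue was opened).
[[aggregators.merge]]
  # Emit only the merged metric, not the original split metrics.
  drop_original = true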

I had initially directed @markuskont to open an issue on Telegraf, but there’s definitely more than one way to solve this issue.

About this issue

  • State: open
  • Created 7 years ago
  • Reactions: 4
  • Comments: 21 (10 by maintainers)

Most upvoted comments

I think there’s a larger question here about how and where we should talk about these kinds of issues that have cross-platform implications.

IMO, buffering data like this should not be the default behavior. It should be opt-in: we could add a dedupe node that takes a stream and produces a stream, with a specific window of time during which the data is buffered.
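For illustration only, opting in might look something like this in TICKscript; dedupe() and its buffer property are hypothetical and do not exist in Kapacitor today:

// Hypothetical syntax only: dedupe() is not an existing Kapacitor node;
// the node name and its buffer property are invented for illustration.
// The idea: hold points briefly and merge those that share a series key
// and timestamp before they reach downstream nodes.
var data = stream
    |from()
        .measurement('system')
    |dedupe()
        .buffer(10s)
    |eval(lambda: "load1" / "uptime")
        .as('ratio')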