influxdb: non_negative_derivative and group by sizing causing large spikes
Hi, We’re using InfluxDB 0.12.0 and Grafana 3.0.0 beta3 (build 1460390657). We’re monitoring systems (Lustre) that output continually increasing counters, and the metrics are collected by Telegraf every 10s.
For smaller time ranges the graphs are accurate. However, for larger time ranges InfluxDB sends back data with a very large spike at the beginning. This seems to stem from the larger GROUP BY time() values that Grafana selects when you view data over a longer period.
If I manually select GROUP BY time(10s), the data is always correct. If I let Grafana auto-select with GROUP BY time($interval), I see the issue. Note that the issue is intermittent: sometimes the larger interval works and sometimes it doesn’t. I’m guessing this is because I’m using now() in my queries, so at certain moments the query window happens to align with the bucket boundaries.
So, to take Grafana out of the equation, I ran the queries manually to illustrate the issue. If I use GROUP BY time(10s), this query always returns correct data no matter how many times I run it:
> SELECT non_negative_derivative(sum("read_bytes"), 1s) FROM "lustre2" WHERE time > now() - 6h GROUP BY time(10s) fill(null) LIMIT 10;
name: lustre2
-------------
time non_negative_derivative
1461153160000000000 0
1461153170000000000 0
1461153180000000000 0
1461153190000000000 0
1461153200000000000 0
1461153210000000000 0
1461153220000000000 0
1461153230000000000 0
1461153240000000000 0
1461153250000000000 0
However, if I increase to GROUP BY time(30s) and run the same query twice, less than a second apart, I get different results:
> SELECT non_negative_derivative(sum("read_bytes"), 1s) FROM "lustre2" WHERE time > now() - 6h GROUP BY time(30s) fill(null) LIMIT 10;
name: lustre2
-------------
time non_negative_derivative
1461153210000000000 0
1461153240000000000 0
1461153270000000000 0
1461153300000000000 0
1461153330000000000 0
1461153360000000000 0
1461153390000000000 0
1461153420000000000 0
1461153450000000000 0
1461153480000000000 0
> SELECT non_negative_derivative(sum("read_bytes"), 1s) FROM "lustre2" WHERE time > now() - 6h GROUP BY time(30s) fill(null) LIMIT 10;
name: lustre2
-------------
time non_negative_derivative
1461153210000000000 4.387031397146e+11
1461153240000000000 0
1461153270000000000 0
1461153300000000000 0
1461153330000000000 0
1461153360000000000 0
1461153390000000000 0
1461153420000000000 0
1461153450000000000 0
1461153480000000000 0
In fact, if I run this query every second, in seconds 1-10 the first value is 0, in seconds 11-20 the first value is a very large number, and in seconds 21-30 the first value is an even larger number (almost exactly double). Then the first value returns to 0 and the cycle starts again. This is because the sum values differ (three queries run about 10s apart):
> SELECT sum("read_bytes") FROM "lustre2" WHERE time > now() - 6h GROUP BY time(30s) LIMIT 10;
name: lustre2
-------------
time sum
1461153630000000000 13161094191438
1461153660000000000 39483282574314
1461153690000000000 39483282574314
1461153720000000000 39483282574314
1461153750000000000 39483282574314
1461153780000000000 39483282574314
1461153810000000000 39483282574314
1461153840000000000 39483282574314
1461153870000000000 39483282574314
1461153900000000000 39483282574314
> SELECT sum("read_bytes") FROM "lustre2" WHERE time > now() - 6h GROUP BY time(30s) LIMIT 10;
name: lustre2
-------------
time sum
1461153660000000000 26322188382876
1461153690000000000 39483282574314
1461153720000000000 39483282574314
1461153750000000000 39483282574314
1461153780000000000 39483282574314
1461153810000000000 39483282574314
1461153840000000000 39483282574314
1461153870000000000 39483282574314
1461153900000000000 39483282574314
1461153930000000000 39483282574314
> SELECT sum("read_bytes") FROM "lustre2" WHERE time > now() - 6h GROUP BY time(30s) LIMIT 10;
name: lustre2
-------------
time sum
1461153660000000000
1461153690000000000 39483282574314
1461153720000000000 39483282574314
1461153750000000000 39483282574314
1461153780000000000 39483282574314
1461153810000000000 39483282574314
1461153840000000000 39483282574314
1461153870000000000 39483282574314
1461153900000000000 39483282574314
1461153930000000000 39483282574314
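These sums are consistent with the first bucket simply containing fewer samples than the rest: each 10s sample appears to sum to 13,161,094,191,438 across all series, so a complete 30s bucket holding three sample times sums to 3 × 13,161,094,191,438 = 39,483,282,574,314, while a first bucket that happens to catch only one or two sample times sums to 13,161,094,191,438 or 26,322,188,382,876. The spike of 4.387031397146e+11 in the earlier derivative query is exactly (39,483,282,574,314 - 26,322,188,382,876) / 30 = 438,703,139,714.6 bytes/s, i.e. the jump from a two-sample first bucket to a full three-sample bucket.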
When I look at the raw data points they are staying constant as expected since there is no activity right now (using one host as an example):
> SELECT read_bytes FROM "lustre2" WHERE "host"='hpcleo00' GROUP BY "name" LIMIT 10;
name: lustre2
tags: name=lustree-OST0000
time read_bytes
---- ----------
1459977920000000000 39325212672
1459977930000000000 39325212672
1459977940000000000 39325212672
1459977950000000000 39325212672
1459977960000000000 39325212672
1459977970000000000 39325212672
1459977980000000000 39325212672
1459977990000000000 39325212672
1459978000000000000 39325212672
1459978010000000000 39325212672
name: lustre2
tags: name=lustree-OST0001
time read_bytes
---- ----------
1459977920000000000 39718395904
1459977930000000000 39718395904
1459977940000000000 39718395904
1459977950000000000 39718395904
1459977960000000000 39718395904
1459977970000000000 39718395904
1459977980000000000 39718395904
1459977990000000000 39718395904
1459978000000000000 39718395904
1459978010000000000 39718395904
name: lustre2
tags: name=lustree-OST0002
time read_bytes
---- ----------
1459977920000000000 59411320832
1459977930000000000 59411320832
1459977940000000000 59411320832
1459977950000000000 59411320832
1459977960000000000 59411320832
1459977970000000000 59411320832
1459977980000000000 59411320832
1459977990000000000 59411320832
1459978000000000000 59411320832
1459978010000000000 59411320832
name: lustre2
tags: name=lustree-OST0003
time read_bytes
---- ----------
1459977920000000000 59857276928
1459977930000000000 59857276928
1459977940000000000 59857276928
1459977950000000000 59857276928
1459977960000000000 59857276928
1459977970000000000 59857276928
1459977980000000000 59857276928
1459977990000000000 59857276928
1459978000000000000 59857276928
1459978010000000000 59857276928
name: lustre2
tags: name=lustree-OST0004
time read_bytes
---- ----------
1459977920000000000 39734607872
1459977930000000000 39734607872
1459977940000000000 39734607872
1459977950000000000 39734607872
1459977960000000000 39734607872
1459977970000000000 39734607872
1459977980000000000 39734607872
1459977990000000000 39734607872
1459978000000000000 39734607872
1459978010000000000 39734607872
Note that I have also tried all the fill options (none, null, 0, previous) and they do not help.
I’ve read through https://github.com/influxdata/influxdb/issues/4237, https://github.com/influxdata/influxdb/pull/4292, and all the linked issues, but I think this is different.
Thanks, Shawn
About this issue
- Original URL
- State: open
- Created 8 years ago
- Reactions: 19
- Comments: 38 (3 by maintainers)
Are there plans to fix this? I have to add offset 1 to all queries doing non_negative_derivative(sum("value"), 10s) to not get the spike as the first value.
We are hitting this, too. It’s frustrating; we have hundreds of graphs of rates (non_negative_derivative), and this bug is rendering them fairly useless. Can we get a sense of whether this will be fixed?
Still present and still problematic
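Applied to the original query in this issue, the offset workaround mentioned above would look roughly like the sketch below (untested; OFFSET 1 just drops the first, possibly partial, bucket from each series; in Grafana you would use time($interval) instead of the fixed 30s):
SELECT non_negative_derivative(sum("read_bytes"), 1s) FROM "lustre2" WHERE time > now() - 6h GROUP BY time(30s) fill(null) OFFSET 1
Note this only trims the leading spike; with non_negative_derivative the trailing partial bucket is usually suppressed anyway, since its derivative is negative and the function drops negative values.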
I’ve been looking over this problem as a user, and it doesn’t appear that this is an actual bug. What’s happening is that when you set upper/lower bounds on the time range using TIME() plus or minus some interval, and that time range doesn’t 100% align with the buckets in your GROUP BY interval, SUM() ends up summing fewer points into the first and last buckets than you expect – this is actually the case regardless of whether derivatives are involved. As a result, both the first and last data points being differenced are smaller than expected. When you use DERIVATIVE(), this manifests as a huge jump between that first bucket value, based on incomplete data, and the second bucket value, based on complete data (and vice versa at the end). As a result, you start the series with an enormous positive spike and end it with an enormous negative one.
To get the results we want, what we really need is a way to normalize the time boundaries on the query so they align with the bucket width. This is probably a task best left to Grafana or another dashboard.
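One way to do that normalization by hand, sketched here rather than taken from any comment in this thread, is to use absolute time bounds that are exact multiples of the GROUP BY interval, so the first and last buckets each cover a full 30s of samples. The epoch-second bounds below are examples derived from the bucket timestamps above (both are multiples of 30s and span 6h):
SELECT non_negative_derivative(sum("read_bytes"), 1s) FROM "lustre2" WHERE time >= 1461153210s AND time < 1461174810s GROUP BY time(30s) fill(null)
A dashboard would have to compute such bounds itself, e.g. by rounding the start and end of the selected range down to the nearest multiple of the interval, which is effectively what the comment above is asking Grafana to do.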
Still seeing the issue. Spikes at start and end of graphs are still visible when doing a group by on time.
Hitting this issue as well in InfluxDB 1.2.4.
non_negative_derivative with sum() as an aggregator creates a huge spike at the beginning of the graph and makes the rest of the chart unreadable. Based on @jgoldschrafe’s comment above, it seems like this might be related to #5943 and/or #6878.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@hardiksondagar There you go
I was encountering this same issue in InfluxDB v2 using the Flux language as well. I know the limit() function was introduced to counter this but it did not solve the issues I was seeing described here.
What ended up fixing it for me is changing the order in which I group and then run the derivative function.
|> filter(fn: (r) => r "some data")|> derivative(unit: 1s, nonNegative: true, columns: ["_value"], timeColumn: "_time")|> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)|> yield(name: "mean")Previously my derivative and aggregatewindow function were swapped order wise and that resulted in the large spike in a revolving 10 second window. Now after swapping them this has seemingly fixed this.
I know this thread is primarily about InfluxQL rather than Flux, but I ran into it while trying to solve the problem in Flux, and there was some great info in here which led me to the solution. Hope this helps someone!
I have the same problem. @panzerdev, thanks for the offset idea.
One way to solve this in Grafana, at least for difference() queries, is to have a nested query, like this:
SELECT sum("Entries") FROM (SELECT difference(sum("titanEdgeViewEntryCount")) AS "Entries" FROM "titanEdgeViewTable" WHERE ("sysName" =~ /^$hostname$/) AND $timeFilter GROUP BY time(1m), "titanEdgeViewName", "sysName" fill(previous) offset 1) group by time($int), "titanEdgeViewName", "sysName"You group by the default time interval in the inner query, and you can then sum the differences and group by another time interval in the outer query. I have not tested it on derivate() or non_negative_derivate() queries.I managed to get rid of those spikes by using percentile(95) selector instead of mean() aggregation. So if this makes sense for your data you can use this workaround.
A lot of useful information here! My current workaround for this is to apply a specific time range in Grafana. Usually the auto interval for longer time ranges is 5 minutes. Then, to ensure the time range does not cross the intervals, we use /h to round it to whole hours. Unfortunately it also means that from 7:00:01 to 8:00:00 the latest data you see is from 6:59:59, which could be a serious issue for some of you. It would be cool to have something like /5m or /$__interval in Grafana.
I have the same problem. Setting offset 1 when the GROUP BY time() interval is greater than 20 solves it. Thanks for the offset idea, @panzerdev.