influxdb: non_negative_derivative and group by sizing causing large spikes
Hi, We’re using InfluxDB 0.12.0 and Grafana 3.0.0 beta3 (build 1460390657). We’re monitoring systems (Lustre) that output continually increasing counters, and the metrics are collected by Telegraf every 10s.
For smaller time ranges the graphs are accurate. However, for larger time ranges InfluxDB sends back data with a very large spike at the beginning. This seems to stem from the larger GROUP BY time() values that Grafana selects when you view data over a longer period.
If I manually select GROUP BY time(10s), the data is always correct. If I let Grafana auto-select with GROUP BY time($interval), I see the issue. Note that the issue is intermittent: sometimes the larger interval works and sometimes it doesn’t. I’m guessing this is because I’m using now() in my queries, so at certain moments the query window happens to align with the bucket boundaries.
So, to take Grafana out of the equation, I ran the queries manually to illustrate the issue. If I use GROUP BY time(10s), this query always returns correct data no matter how many times I run it:
> SELECT non_negative_derivative(sum("read_bytes"), 1s) FROM "lustre2" WHERE time > now() - 6h GROUP BY time(10s) fill(null) LIMIT 10;
name: lustre2
-------------
time non_negative_derivative
1461153160000000000 0
1461153170000000000 0
1461153180000000000 0
1461153190000000000 0
1461153200000000000 0
1461153210000000000 0
1461153220000000000 0
1461153230000000000 0
1461153240000000000 0
1461153250000000000 0
However, if I increase to GROUP BY time(30s) and run the same query twice, less than a second apart, I get different results:
> SELECT non_negative_derivative(sum("read_bytes"), 1s) FROM "lustre2" WHERE time > now() - 6h GROUP BY time(30s) fill(null) LIMIT 10;
name: lustre2
-------------
time non_negative_derivative
1461153210000000000 0
1461153240000000000 0
1461153270000000000 0
1461153300000000000 0
1461153330000000000 0
1461153360000000000 0
1461153390000000000 0
1461153420000000000 0
1461153450000000000 0
1461153480000000000 0
> SELECT non_negative_derivative(sum("read_bytes"), 1s) FROM "lustre2" WHERE time > now() - 6h GROUP BY time(30s) fill(null) LIMIT 10;
name: lustre2
-------------
time non_negative_derivative
1461153210000000000 4.387031397146e+11
1461153240000000000 0
1461153270000000000 0
1461153300000000000 0
1461153330000000000 0
1461153360000000000 0
1461153390000000000 0
1461153420000000000 0
1461153450000000000 0
1461153480000000000 0
In fact, if I run this query every second, in seconds 1-10 the first value is 0, in seconds 11-20 the first value is a very large number, and in seconds 21-30 the first value is an even larger number (almost exactly double). Then the first value returns to 0 and the cycle starts again. This is because the sum values differ (three queries run about 10s apart):
> SELECT sum("read_bytes") FROM "lustre2" WHERE time > now() - 6h GROUP BY time(30s) LIMIT 10;
name: lustre2
-------------
time sum
1461153630000000000 13161094191438
1461153660000000000 39483282574314
1461153690000000000 39483282574314
1461153720000000000 39483282574314
1461153750000000000 39483282574314
1461153780000000000 39483282574314
1461153810000000000 39483282574314
1461153840000000000 39483282574314
1461153870000000000 39483282574314
1461153900000000000 39483282574314
> SELECT sum("read_bytes") FROM "lustre2" WHERE time > now() - 6h GROUP BY time(30s) LIMIT 10;
name: lustre2
-------------
time sum
1461153660000000000 26322188382876
1461153690000000000 39483282574314
1461153720000000000 39483282574314
1461153750000000000 39483282574314
1461153780000000000 39483282574314
1461153810000000000 39483282574314
1461153840000000000 39483282574314
1461153870000000000 39483282574314
1461153900000000000 39483282574314
1461153930000000000 39483282574314
> SELECT sum("read_bytes") FROM "lustre2" WHERE time > now() - 6h GROUP BY time(30s) LIMIT 10;
name: lustre2
-------------
time sum
1461153660000000000
1461153690000000000 39483282574314
1461153720000000000 39483282574314
1461153750000000000 39483282574314
1461153780000000000 39483282574314
1461153810000000000 39483282574314
1461153840000000000 39483282574314
1461153870000000000 39483282574314
1461153900000000000 39483282574314
1461153930000000000 39483282574314
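These sums are consistent with the first bucket simply containing fewer samples than the rest: each 10s sample appears to sum to 13,161,094,191,438 across all series, so a complete 30s bucket holding three sample times sums to 3 × 13,161,094,191,438 = 39,483,282,574,314, while a first bucket that happens to catch only one or two sample times sums to 13,161,094,191,438 or 26,322,188,382,876. The spike of 4.387031397146e+11 in the earlier derivative query is exactly (39,483,282,574,314 - 26,322,188,382,876) / 30 = 438,703,139,714.6 bytes/s, i.e. the jump from a two-sample first bucket to a full three-sample bucket.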
When I look at the raw data points they are staying constant as expected since there is no activity right now (using one host as an example):
> SELECT read_bytes FROM "lustre2" WHERE "host"='hpcleo00' GROUP BY "name" LIMIT 10;
name: lustre2
tags: name=lustree-OST0000
time read_bytes
---- ----------
1459977920000000000 39325212672
1459977930000000000 39325212672
1459977940000000000 39325212672
1459977950000000000 39325212672
1459977960000000000 39325212672
1459977970000000000 39325212672
1459977980000000000 39325212672
1459977990000000000 39325212672
1459978000000000000 39325212672
1459978010000000000 39325212672
name: lustre2
tags: name=lustree-OST0001
time read_bytes
---- ----------
1459977920000000000 39718395904
1459977930000000000 39718395904
1459977940000000000 39718395904
1459977950000000000 39718395904
1459977960000000000 39718395904
1459977970000000000 39718395904
1459977980000000000 39718395904
1459977990000000000 39718395904
1459978000000000000 39718395904
1459978010000000000 39718395904
name: lustre2
tags: name=lustree-OST0002
time read_bytes
---- ----------
1459977920000000000 59411320832
1459977930000000000 59411320832
1459977940000000000 59411320832
1459977950000000000 59411320832
1459977960000000000 59411320832
1459977970000000000 59411320832
1459977980000000000 59411320832
1459977990000000000 59411320832
1459978000000000000 59411320832
1459978010000000000 59411320832
name: lustre2
tags: name=lustree-OST0003
time read_bytes
---- ----------
1459977920000000000 59857276928
1459977930000000000 59857276928
1459977940000000000 59857276928
1459977950000000000 59857276928
1459977960000000000 59857276928
1459977970000000000 59857276928
1459977980000000000 59857276928
1459977990000000000 59857276928
1459978000000000000 59857276928
1459978010000000000 59857276928
name: lustre2
tags: name=lustree-OST0004
time read_bytes
---- ----------
1459977920000000000 39734607872
1459977930000000000 39734607872
1459977940000000000 39734607872
1459977950000000000 39734607872
1459977960000000000 39734607872
1459977970000000000 39734607872
1459977980000000000 39734607872
1459977990000000000 39734607872
1459978000000000000 39734607872
1459978010000000000 39734607872
Note that I have also tried all the fill options (none, null, 0, previous) and they do not help.
I’ve read through https://github.com/influxdata/influxdb/issues/4237, https://github.com/influxdata/influxdb/pull/4292, and all the linked issues, but I think this is different.
Thanks, Shawn
About this issue
- Original URL
- State: open
- Created 8 years ago
- Reactions: 19
- Comments: 38 (3 by maintainers)
Are there plans to fix this? I have to add offset 1 to all queries doing non_negative_derivative(sum("value"), 10s) to not get the spike as the first value.
We are hitting this, too. It’s frustrating; we have hundreds of graphs of rates (non_negative_derivative), and this bug is rendering them fairly useless. Can we get a sense of whether this will be fixed?
Still present and still problematic
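Applied to the original query in this issue, the offset workaround mentioned above would look roughly like the sketch below (untested; OFFSET 1 just drops the first, possibly partial, bucket from each series; in Grafana you would use time($interval) instead of the fixed 30s):
SELECT non_negative_derivative(sum("read_bytes"), 1s) FROM "lustre2" WHERE time > now() - 6h GROUP BY time(30s) fill(null) OFFSET 1
Note this only trims the leading spike; with non_negative_derivative the trailing partial bucket is usually suppressed anyway, since its derivative is negative and the function drops negative values.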
I’ve been looking over this problem as a user, and it doesn’t appear that this is an actual bug. What’s happening is that when you set upper/lower bounds on the time range using TIME() plus or minus some interval, and that time range doesn’t 100% align with the buckets in your GROUP BY interval, SUM() ends up summing fewer points into the first and last buckets than you expect – this is actually the case regardless of whether derivatives are involved. As a result, both the first and last data points being differenced are smaller than expected. When you use DERIVATIVE(), this manifests as a huge jump between that first bucket value, based on incomplete data, and the second bucket value, based on complete data (and vice versa at the end). As a result, you start the series with an enormous positive spike and end it with an enormous negative one.
To get the results we want, what we really need is a way to normalize the time boundaries on the query so they align with the bucket width. This is probably a task best left to Grafana or another dashboard.
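One way to do that normalization by hand, sketched here rather than taken from any comment in this thread, is to use absolute time bounds that are exact multiples of the GROUP BY interval, so the first and last buckets each cover a full 30s of samples. The epoch-second bounds below are examples derived from the bucket timestamps above (both are multiples of 30s and span 6h):
SELECT non_negative_derivative(sum("read_bytes"), 1s) FROM "lustre2" WHERE time >= 1461153210s AND time < 1461174810s GROUP BY time(30s) fill(null)
A dashboard would have to compute such bounds itself, e.g. by rounding the start and end of the selected range down to the nearest multiple of the interval, which is effectively what the comment above is asking Grafana to do.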
Still seeing the issue. Spikes at start and end of graphs are still visible when doing a group by on time.
Hitting this issue as well in InfluxDB 1.2.4.
non_negative_derivative with sum() as an aggregator creates a huge spike at the beginning of the graph and makes the rest of the chart unreadable. Based on @jgoldschrafe’s comment above, it seems like this might be related to #5943 and/or #6878.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@hardiksondagar There you go
I was encountering this same issue in InfluxDB v2 using the Flux language as well. I know the limit() function was introduced to counter this but it did not solve the issues I was seeing described here.
What ended up fixing it for me is changing the order in which I group and then run the derivative function.
|> filter(fn: (r) => r "some data")|> derivative(unit: 1s, nonNegative: true, columns: ["_value"], timeColumn: "_time")|> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)|> yield(name: "mean")Previously my derivative and aggregatewindow function were swapped order wise and that resulted in the large spike in a revolving 10 second window. Now after swapping them this has seemingly fixed this.
I know this thread is primarily about InfluxQL rather than Flux, but I ran into it while trying to solve the problem in Flux, and there was some great info in here which led me to the solution. Hope this helps someone!
I have the same problem. @panzerdev, thanks for the offset idea.
One way to solve this in Grafana, at least for difference() queries, is to have a nested query, like this:
SELECT sum("Entries") FROM (SELECT difference(sum("titanEdgeViewEntryCount")) AS "Entries" FROM "titanEdgeViewTable" WHERE ("sysName" =~ /^$hostname$/) AND $timeFilter GROUP BY time(1m), "titanEdgeViewName", "sysName" fill(previous) offset 1) group by time($int), "titanEdgeViewName", "sysName"You group by the default time interval in the inner query, and you can then sum the differences and group by another time interval in the outer query. I have not tested it on derivate() or non_negative_derivate() queries.I managed to get rid of those spikes by using percentile(95) selector instead of mean() aggregation. So if this makes sense for your data you can use this workaround.
A lot of useful information here! My current workaround for this is to apply a specific time range in Grafana. Usually the auto interval for longer time ranges is 5 minutes. Then, to ensure the time range does not cross the intervals, we use /h to round it to whole hours. Unfortunately it also means that from 7:00:01 to 8:00:00 the latest data you see is from 6:59:59, which could be a serious issue for some of you. It would be cool to have something like /5m or /$__interval in Grafana.
I have the same problem. Setting offset 1 when the GROUP BY time() interval is greater than 20 solves it. Thanks for the offset idea, @panzerdev.