influxdb: Points missing after compaction

Bug report

System info: InfluxDB version 1.2.0, branch master, commit b7bb7e8359642b6e071735b50ae41f5eb343fd42
Docker image: influxdb:1.2.0-alpine (image id 3a859cac1ae5)
Docker is running inside Docker for Mac 1.13.1 (15353), build 94675c5a76, on a host running macOS 10.12.3

Steps to reproduce:

  1. Start a fresh container: docker run --name influxdb -d -p 8086:8086 -v $PWD:/var/lib/influxdb influxdb:1.2.0-alpine
  2. Create a database named “test”
  3. Create batches (BatchPoints) of 5,000 points with timestamps starting at 0 and incrementing up to 1,000,000. Configure the batches with write consistency = quorum. Each point has 2 tags and 5 fields.
  4. Start 5 threads and send the batches as quickly as possible. I used the influxdb-java client (a simplified sketch of the client code follows this list).
  5. Once all requests are complete, query for a count of the points inserted.
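
For reference, the write path of my test looks roughly like this. This is a simplified sketch against influxdb-java 2.x: the measurement, tag, and field names are placeholders rather than the jordan|app1|0|1 series seen in the logs, and the millisecond timestamp precision is inferred from the per-bucket counts further down.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.BatchPoints;
import org.influxdb.dto.Point;
import org.influxdb.dto.Query;

public class MissingPointsRepro {
    static final int TOTAL_POINTS = 1_000_000;
    static final int BATCH_SIZE = 5_000;
    static final int THREADS = 5;

    public static void main(String[] args) throws Exception {
        InfluxDB influx = InfluxDBFactory.connect("http://localhost:8086", "root", "root");
        influx.createDatabase("test");

        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (int start = 0; start < TOTAL_POINTS; start += BATCH_SIZE) {
            final int batchStart = start;
            pool.submit(() -> {
                // One batch of 5,000 points, written with quorum consistency (step 3).
                BatchPoints.Builder batch = BatchPoints.database("test")
                        .consistency(InfluxDB.ConsistencyLevel.QUORUM);
                for (int i = batchStart; i < batchStart + BATCH_SIZE; i++) {
                    batch.point(Point.measurement("repro")      // placeholder measurement name
                            .time(i, TimeUnit.MILLISECONDS)     // timestamps 0..999,999
                            .tag("t1", "a").tag("t2", "b")      // 2 tags
                            .addField("value", (double) i)      // 5 fields
                            .addField("f2", 0L).addField("f3", 0L)
                            .addField("f4", 0L).addField("f5", 0L)
                            .build());
                }
                influx.write(batch.build());
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);

        Query count = new Query("SELECT count(\"value\") FROM \"repro\"", "test");
        System.out.println(influx.query(count)); // expect count = 1,000,000

        // Wait long enough for compactions to run, then count again.
        Thread.sleep(TimeUnit.MINUTES.toMillis(5));
        System.out.println(influx.query(count)); // sometimes slightly lower after compaction
    }
}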

Expected behavior: Query returns a count of 1,000,000.

Actual behavior: Query sometimes returns 1,000,000, but often returns a slightly lower count (10-100 fewer on average).

Additional info:

I’ve observed the first query return a count of 1,000,000 and a subsequent query, issued after compaction messages appeared in the logs, return a lower count.

Here are the logs from a bad run where the first query returned the correct count and a query after compaction returned a lower count.

Client logs:

Generated and wrote 1000000 points.
Querying for how many were persisted:
QueryResult [results=[Result [series=[Series [name=jordan|app1|0|1, tags=null, columns=[time, count], values=[[0.0, 1000000.0]]]], error=null]], error=null]

Querying again after waiting a bit:
QueryResult [results=[Result [series=[Series [name=jordan|app1|0|1, tags=null, columns=[time, count], values=[[0.0, 999947.0]]]], error=null]], error=null]

InfluxDB logs: https://gist.github.com/jganoff/db12b62ece9f7dcb9b2e90ff5a73b4bc#file-influxdb-log

Here’s the requested debug information: https://gist.github.com/jganoff/db12b62ece9f7dcb9b2e90ff5a73b4bc

It looks like the missing points are from somewhere near the middle this time, though they can be scattered around, so don’t over-index on their locality here:

> select count("value") from "jordan|app1|0|1" where time >= 0 group by time(1m) fill(none)
name: jordan|app1|0|1
time         count
----         -----
0            60000
60000000000  60000
120000000000 60000
180000000000 60000
240000000000 60000
300000000000 60000
360000000000 59970
420000000000 59977
480000000000 60000
540000000000 60000
600000000000 60000
660000000000 60000
720000000000 60000
780000000000 60000
840000000000 60000
900000000000 60000
960000000000 40000

At higher resolution:

> select count("value") from "jordan|app1|0|1" where time >= 0 group by time(5s) fill(none)
time         count
----         -----
...
385000000000 5000
390000000000 5000
395000000000 4994
400000000000 4994
405000000000 4994
410000000000 4994
415000000000 4994
420000000000 4997
425000000000 4996
430000000000 4996
435000000000 4996
440000000000 4996
445000000000 4996
450000000000 5000
455000000000 5000
460000000000 5000
...
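
If it helps narrow things down, the same drill-down can be pushed further; for example, a 1s-resolution query over the affected window (output omitted here):

> select count("value") from "jordan|app1|0|1" where time >= 395000000000 and time < 450000000000 group by time(1s) fill(none)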

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

1,000 iterations completed without a failure. Looks like you nailed it. Thanks @jwilder!

I’m running 1,000 iterations of my test against branch jw-8084 (commit af46b29257cfb055d4f23548e89390f4167201ac) now. I’ll let you know how it goes.

I’m able to repro it now.