influxdb: [2.x] InfluxDB bucket stops reading+writing every couple of days

Steps to reproduce:

  1. Run influxdb2
  2. Insert metrics (with telegraf)
  3. Wait for some time

Expected behavior: Things keep working

Actual behavior: The main InfluxDB 2 bucket (“telegraf”) stops serving reads and writes. Other buckets work fine, including one that holds 5m aggregates of the raw telegraf data (though obviously it stops receiving new data while the source bucket is stuck).

We have had this happen randomly in the past, but in the last few weeks it has happened every few days. Previously it seemed to occur at 00:00 UTC when InfluxDB did some internal DB maintenance, but now it happens at random times.

Environment info:

  • System info: Linux 3.10.0-1160.66.1.el7.x86_64 x86_64
  • InfluxDB version: InfluxDB v2.3.0+SNAPSHOT.090f681737 (git: 090f681737) build_date: 2022-06-16T19:33:50Z
  • Other relevant environment details: CentOS 7 on vmware - lots of spare IO, CPU, memory.

Our database is 170GB, mostly metrics inserted every 60s, with some inserted every 600s. storage_writer_ok_points sits around 2.5k/s for 7 minutes, then jumps to ~25k/s for 3 minutes during the every-600s burst.

The VM has 32G of RAM (28G of which is in buffers/cache) and 4 cores, and typically sits at around 90% idle, with ~24 IOPS and 8MiB/s of disk traffic.
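
For reference, the storage_writer_ok_points figure above can be checked by hand against the server's Prometheus endpoint. The following is only a sketch: it assumes the default /metrics endpoint on port 8086 and that the counter is exported there under that name.

# sample the counter twice, 60s apart, and compute the average write rate
a=$(curl -sf http://localhost:8086/metrics | awk '/^storage_writer_ok_points/ {s += $2} END {printf "%.0f\n", s}')
sleep 60
b=$(curl -sf http://localhost:8086/metrics | awk '/^storage_writer_ok_points/ {s += $2} END {printf "%.0f\n", s}')
echo "points/s: $(( (b - a) / 60 ))"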

Config:

bolt-path = "/var/lib/influxdb/influxd.bolt"
engine-path = "/var/lib/influxdb/engine"
flux-log-enabled = "true"

We have enabled flux-log to see whether specific queries are causing this, but it doesn’t seem to be the case.
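
One rough way to correlate logged queries with the time the bucket locks up is to grep the request log out of the journal. This is just a sketch; it assumes influxd runs under systemd as influxdb.service, and it matches on the request-log format shown in the comments below.

# list today's logged Flux query requests (adjust the unit name if needed)
journalctl -u influxdb.service --since today | grep 'path=/api/v2/query'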

Performance:

I captured a 10s pprof which I will attach.

I also have a core dump and a 60s dump of debug/pprof/trace. I’m not sure whether the trace contains sensitive info, but I can share it privately; the core dump certainly does.

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 3
  • Comments: 41 (8 by maintainers)

Most upvoted comments

@MaxOOOOON - I’m glad I found your post; I have been struggling with the same issue since moving from v1.8 and thought I was alone. I am running on Windows and was contemplating moving to a Linux install to see if it would help. It seems like something triggers at 00:00 UTC and puts Influx into a bad state where it continually eats memory. During this bad state nothing works, and Influx responds to clients with “internal error”. I even gave the server gobs and gobs of memory (>100GB, hot-added because it’s a VM) to see if it was just a process it needed to work through, but that didn’t seem to help. Every time, I’ve had to restart the service (or wait for it to crash on its own due to OOM).

I’m going to try tweaking the same parameters too and see if it helps my situation.

We have the same problem. The influx service freezes after 00:00 UTC, sometimes after 2 days, sometimes after a week. The only thing in the logs is:

lvl=debug msg=Request log_id=0giXx1PW000 service=http method=POST host=xxxxx path=/api/v2/write query="bucket=xxxx" proto=HTTP/1.1 status_code=499 response_size=90 content_length=-1 referrer= remote=ip:port user_agent=Telegraf took=10019.286ms error="internal error" error_code="internal error"

and 400 errors on flux requests:

lvl=debug msg=Request log_id=0giXx1PW000 service=http method=POST host=xxxx:xxxxx path=/api/v2/query query="org=ORG" proto=HTTP/1.1 status_code=400 response_size=52 content_length=576 referrer= remote=xxxxxx user_agent=influxdb-client-go took=11.008ms error=invalid error_code=invalid body="{\"dialect\":{\"annotations\":[\"datatype\",\"group\",\"default\"],\"delimiter\":\",\",\"header\":true},\"query\":\"xxxxxxxxx\",\"type\":\"flux\"}"

After freezing, Influx is not available for writing, and after roughly 1 hour, when the RAM runs out, the OOM killer kills the process and the influx service restarts.

  • System info: Linux elka2023-influxdb 5.10.0-18-cloud-amd64 #1 SMP Debian 5.10.140-1 (2022-09-02) x86_64 GNU/Linux
  • InfluxDB version: InfluxDB v2.4.0 (git: de247bab08) build_date: 2022-08-18T19:41:15Z
  • OS: Debian 11

config:

bolt-path = "/opt/influxdb-data/influxdb/influxd.bolt"
engine-path = "/opt/influxdb-data/influxdb/engine"

flux-log-enabled = "true"
http-read-timeout = "15s"
http-write-timeout = "15s"
reporting-disabled = "true"
query-queue-size = 10
query-memory-bytes = 20485760
storage-compact-throughput-burst = 1331648
query-concurrency = 10
log-level = "debug"

I don’t use Flux and the queries are only in InfluxQL, but we still have this problem.

Hey @MaxOOOOON, thanks for sending those logs over. Your issue behaves slightly differently from the issues above in that you are getting timeouts on writes rather than a full bucket lockout. Can you generate profiles when you’re seeing those writes fail? You can watch the logs for status_code=499 to know when the writes are timing out, and profiles can be generated with:

curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=30s"
iostat -xd 1 30 > iostat.txt

We don’t have any deletes; it’s only writes and reads, except of course the bucket retention, which is 30d.

Thanks for the reminder, @zekena2. @jeffreyssmith2nd, I’ve now uploaded one of our core files to the SFTP you provided. Let me know if there’s more you need.

We have worked around this issue with a script that restarts influxdb if there are no writes for a few minutes, but we can probably update that script to generate new core files if that would be useful.
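
A minimal sketch of that kind of watchdog (not the actual script) might look like the following. It assumes the Prometheus /metrics endpoint on port 8086 exports storage_writer_ok_points as a counter, that the systemd unit is named influxdb.service, and that it runs from cron every few minutes; the state-file path is likewise just an example.

#!/usr/bin/env bash
# Watchdog sketch: restart influxdb if the write counter has not advanced
# since the previous run. The metrics URL, state file path, and service unit
# below are assumptions and may need adjusting.

METRICS_URL="http://localhost:8086/metrics"
STATE_FILE="/var/tmp/influx-watchdog.last"

current=$(curl -sf --max-time 10 "$METRICS_URL" | awk '/^storage_writer_ok_points/ {s += $2} END {printf "%.0f\n", s}')
previous=$(cat "$STATE_FILE" 2>/dev/null || echo 0)
echo "$current" > "$STATE_FILE"

# A stalled (or unreachable) server leaves the counter at or below the last
# sample; only act once a meaningful previous value has been recorded.
if [ "$previous" -gt 0 ] && [ "$current" -le "$previous" ]; then
    logger -t influx-watchdog "no points written since last check, restarting influxdb"
    systemctl restart influxdb.service
    echo 0 > "$STATE_FILE"   # counter resets after restart; avoid an immediate second restart
fi

Capturing a core file before the restart (e.g. with gcore from gdb) could be added right before the systemctl line if that turns out to be useful for debugging.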