InfluxDB: High CPU utilization in v1.8.1 & v1.8.2
Experienced performance issues with InfluxDB after upgrading from v1.8.0 to v1.8.1 or v1.8.2. Currently the problem is temporarily handled by downgrading back to 1.8.0.
Steps to reproduce: In this environment, nothing more than the following:
- Upgrade InfluxDB version to 1.8.1 or 1.8.2
Expected behavior: InfluxDB updates to a newer version and continues to work without issues.
Actual behavior: CPU utilization spikes to ~100%; some databases are queried successfully, whereas at least one of the largest databases (~65 GB) returns HTTP POST 500 / timeout. Other metrics, such as memory and disk, are not drastically affected.
Screenshots from Grafana visualizing the CPU utilization after upgrading to v1.8.1.
Environment info:
- System info: Linux 3.10.0-1127.18.2.el7.x86_64 x86_64
- InfluxDB version: InfluxDB v1.8.0 (git: 1.8 781490d)
- Host VM specs: CentOS 7, 8 vCPUs, 52 GB RAM, 500 GB SSD disk
- Other environment details: No other heavy workloads running on the servers other than InfluxDB. Grafana front-end. ~65 databases, ~150 GB of data (~85% raw data, ~15% downsampled). Config settings at defaults other than some directory settings and TSI indexing turned on. WAL and data directories are located on the same storage device.
Logs: Example error line from journalctl:
Sep 10 10:01:04 influxdb02 influxd[22667]: ts=2020-09-10T07:01:04.768040Z lvl=error msg="[500] - \"timeout\"" log_id=0P9h7VtW000 service=httpd
Example error line from HTTP access log:
x.x.x.x - telegraf [10/Sep/2020:10:01:00 +0300] "POST /write?db=all_operated HTTP/1.1" 500 20 "-" "Go-http-client/1.1" 62720985-f333-11ea-ac27-42010ae8030c 10609871
Other notes:
CPU load is also constantly high; these issues are most likely linked. There were memory issues caused by CQs too, but disabling CQs on the largest database resolved them. CPU utilization and load were not affected by this. Retention policies in use:
> show retention policies
name duration shardGroupDuration replicaN default
---- -------- ------------------ -------- -------
autogen 0s 168h0m0s 1 false
raw 336h0m0s 24h0m0s 1 true
agg 9600h0m0s 168h0m0s 1 false
Data is downsampled from “raw” to “agg” RP with continuous queries.
> show continuous queries
name: <database>
name query
---- -----
cq_aggregate CREATE CONTINUOUS QUERY cq_aggregate ON <database> BEGIN SELECT mean(*) INTO <database>.agg.:MEASUREMENT FROM <database>.raw./.*/ GROUP BY time(5m), * END
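For reference, the retention-policy layout above corresponds to statements along these lines; this is a reconstruction from the SHOW output rather than the commands actually used, autogen is created automatically, and <database> is a placeholder as in the CQ:
  CREATE RETENTION POLICY "raw" ON "<database>" DURATION 336h REPLICATION 1 SHARD DURATION 24h DEFAULT
  CREATE RETENTION POLICY "agg" ON "<database>" DURATION 9600h REPLICATION 1 SHARD DURATION 168h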
About this issue
- Original URL
- State: open
- Created 4 years ago
- Comments: 28
Commits related to this issue
- Update image - disable monitor https://github.com/influxdata/influxdb/issues/19543 — committed to VictorRobellini/K8s-homelab-pub by VictorRobellini 3 years ago
Yes. To make sure anyone else reading this thread understands what is going on, the two approaches have exactly the same effect:
The configuration file approach:
  influxdb.conf → [monitor] → store-enabled = false
If you're running InfluxDB as a "native" install where you can easily get to influxdb.conf, then this approach is appropriate.
The environment variable approach:
  INFLUXDB_MONITOR_STORE_ENABLED=FALSE
If you're running InfluxDB as a Docker container, it's a little trickier to get to the influxdb.conf file, so this approach is more useful. There's a 1:1 relationship between config file settings and environment variables, so you can always do what you want. In any contest between the two approaches, the environment variable prevails.
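To make both options concrete, here is a minimal sketch of each; the docker run details (container name, published port, volume) are illustrative assumptions rather than anything taken from this thread.
In influxdb.conf:
  [monitor]
    store-enabled = false
Or pass the equivalent environment variable to a containerized install:
  docker run -d --name influxdb -p 8086:8086 \
    -e INFLUXDB_MONITOR_STORE_ENABLED=FALSE \
    -v influxdb_data:/var/lib/influxdb \
    influxdb:1.8
In both cases, restart influxd (or recreate the container) for the change to take effect.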
Had the same problem on a 4GB Pi4. This recommendation from the documentation helped:
To disable the _internal database, set "store-enabled" to "false" under the "[monitor]" section of your influxdb.conf.
I tried all the mentioned ways, i.e.
To disable the _internal database, set "store-enabled" to "false" under the "[monitor]" section of your influxdb.conf
and also downgrading InfluxDB to 1.7.x, but no luck. After reading about the InfluxDB debug and CPU profiling HTTP API I was able to pin down the issue: the problem was in the way I was making the query; my query involved more complex functions and also a GROUP BY tag. I also tried query analysis using the EXPLAIN ANALYZE <query> command to check how much time a query takes to execute. I resolved that and noticed a huge improvement in CPU load. Basically I can suggest the following (concrete commands are sketched below):
- curl -o <file name> http://localhost:8086/debug/pprof/all?cpu=true and collect the result.
- EXPLAIN ANALYZE <query> and try to improve the query format.
- SHOW SERIES CARDINALITY
I was having high CPU problems while testing my InfluxDB instance, but it drastically dropped and stayed stable as soon as I enabled authentication. Config file:
  [http]
    auth-enabled = true
Don’t know why, just passing on the experience.
Version 1.8
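Translating the profiling and query-analysis suggestions above into concrete commands, as a sketch only: the time range and GROUP BY shape merely mirror the CQ from the original report, and all_operated is the database seen in the access log.
  # Collect CPU and runtime profiles while the load is high
  curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=true"
  # See where a suspect query spends its time
  influx -database 'all_operated' -execute 'EXPLAIN ANALYZE SELECT mean(*) FROM raw./.*/ WHERE time > now() - 1h GROUP BY time(5m), *'
  # Check whether series cardinality is the real culprit
  influx -database 'all_operated' -execute 'SHOW SERIES CARDINALITY'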
I think I figured this one out. I’m running a MING stack under Docker on a 4GB Raspberry Pi 4 (via SensorsIot/IOTstack) and had CPU utilisation from InfluxDB routinely sitting around 85%.
I have two identical RPis, one “live”, one “test”. The live RPi was ingesting data from MQTT via Node-Red at the rate of 600 insertions per hour. No continuous queries of my own. No retention policies of my own other than “keep everything”. A few Grafana dashboards but even when none of those were open, I was still seeing numbers like 85%.
This was on both RPis, even though the “test” RPi wasn’t ingesting any data nor responding to Grafana queries.
It made no sense.
It turns out to be the "_internal" database. For some reason, this takes a HUGE amount of resources. It did seem to have a 24H retention policy, so maybe it was all that gardening.
I added this environment key:
  INFLUXDB_MONITOR_STORE_ENABLED=FALSE
then recreated the InfluxDB container and, bingo, CPU on the “live” RPi is in the 1…5% range unless I do something like ask Grafana to show me a graph of something “for the last year”. That’ll get a quick spike to 250% of CPU for InfluxDB but it will drop back to negligible as soon as I close the browser window.
CPU on the “test” RPi has numbers like 0.3%.
Also turns out that it is safe to drop database _internal. If you ever need to re-enable the setting, the _internal database gets auto-recreated. I'll leave it up to People Who Know to figure out why the internal monitoring database is such a resource hog.
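A quick sketch of that cleanup in the influx shell, assuming the monitor store has already been disabled as described above (otherwise the monitor service simply recreates the database):
  > DROP DATABASE "_internal"
  > SHOW DATABASES
Re-enabling store-enabled, or removing the environment variable, and restarting brings _internal back automatically, as noted above.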