InfluxDB: High CPU utilization in v1.8.1 & v1.8.2
Experienced performance issues with InfluxDB after upgrading from v1.8.0 to v1.8.1 or v1.8.2. Currently the problem is temporarily handled by downgrading back to 1.8.0.
Steps to reproduce: In this environment, nothing more than the following:
- Upgrade InfluxDB version to 1.8.1 or 1.8.2
Expected behavior: InfluxDB updates to a newer version and continues to work without issues.
Actual behavior: CPU utilization spikes to ~100%; some databases are queried successfully, whereas at least one of the largest databases (~65 GB) returns HTTP POST 500 / timeout. Other metrics, such as memory and disk, are not drastically affected.
Screenshots from Grafana visualizing the CPU utilization after upgrading to v1.8.1.
Environment info:
- System info: Linux 3.10.0-1127.18.2.el7.x86_64 x86_64
- InfluxDB version: InfluxDB v1.8.0 (git: 1.8 781490d)
- Host VM specs: CentOS 7, 8 vCPUs, 52 GB RAM, 500 GB SSD disk
- Other environment details: No other heavy workloads running on the servers other than InfluxDB. Grafana front-end. ~65 databases, ~150 GB of data (~85% raw data, ~15% downsampled). Config settings at defaults other than some directory settings and TSI indexing turned on. WAL and data directories are located on the same storage device.
Logs: Example error line from journalctl:
Sep 10 10:01:04 influxdb02 influxd[22667]: ts=2020-09-10T07:01:04.768040Z lvl=error msg="[500] - \"timeout\"" log_id=0P9h7VtW000 service=httpd
Example error line from HTTP access log:
x.x.x.x - telegraf [10/Sep/2020:10:01:00 +0300] "POST /write?db=all_operated HTTP/1.1" 500 20 "-" "Go-http-client/1.1" 62720985-f333-11ea-ac27-42010ae8030c 10609871
Other notes:
CPU load is also constantly high; these issues are most likely linked. There were memory issues caused by CQs too, but disabling CQs on the largest database resolved them. CPU utilization and load were not affected by this. Retention policies in use:
> show retention policies
name duration shardGroupDuration replicaN default
---- -------- ------------------ -------- -------
autogen 0s 168h0m0s 1 false
raw 336h0m0s 24h0m0s 1 true
agg 9600h0m0s 168h0m0s 1 false
Data is downsampled from “raw” to “agg” RP with continuous queries.
> show continuous queries
name: <database>
name query
---- -----
cq_aggregate CREATE CONTINUOUS QUERY cq_aggregate ON <database> BEGIN SELECT mean(*) INTO <database>.agg.:MEASUREMENT FROM <database>.raw./.*/ GROUP BY time(5m), * END
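For reference, the retention-policy layout above corresponds to statements along these lines; this is a reconstruction from the SHOW output rather than the commands actually used, autogen is created automatically, and <database> is a placeholder as in the CQ:
  CREATE RETENTION POLICY "raw" ON "<database>" DURATION 336h REPLICATION 1 SHARD DURATION 24h DEFAULT
  CREATE RETENTION POLICY "agg" ON "<database>" DURATION 9600h REPLICATION 1 SHARD DURATION 168h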
About this issue
- Original URL
- State: open
- Created 4 years ago
- Comments: 28
Commits related to this issue
- Update image - disable monitor https://github.com/influxdata/influxdb/issues/19543 — committed to VictorRobellini/K8s-homelab-pub by VictorRobellini 3 years ago
Yes. To make sure anyone else reading this thread understands what is going on, the two approaches have exactly the same effect:
The configuration file approach:
  influxdb.conf → [monitor] → store-enabled = false
If you're running InfluxDB as a "native" install where you can easily get to influxdb.conf, then this approach is appropriate.
The environment variable approach:
  INFLUXDB_MONITOR_STORE_ENABLED=FALSE
If you're running InfluxDB as a Docker container, it's a little trickier to get to the influxdb.conf file, so this approach is more useful. There's a 1:1 relationship between config file settings and environment variables, so you can always do what you want. In any contest between the two approaches, the environment variable prevails.
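To make both options concrete, here is a minimal sketch of each; the docker run details (container name, published port, volume) are illustrative assumptions rather than anything taken from this thread.
In influxdb.conf:
  [monitor]
    store-enabled = false
Or pass the equivalent environment variable to a containerized install:
  docker run -d --name influxdb -p 8086:8086 \
    -e INFLUXDB_MONITOR_STORE_ENABLED=FALSE \
    -v influxdb_data:/var/lib/influxdb \
    influxdb:1.8
In both cases, restart influxd (or recreate the container) for the change to take effect.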
Had the same problem on a 4GB Pi4. This recommendation from the documentation helped:
To disable the _internal database, set "store-enabled" to "false" under the "[monitor]" section of your influxdb.conf.
I tried all the mentioned ways, i.e.
To disable the _internal database, set "store-enabled" to "false" under the "[monitor]" section of your influxdb.conf
and also downgrading InfluxDB to 1.7.x, but no luck. After reading about the InfluxDB debug and CPU profiling HTTP API I was able to pin down the issue: the problem was in the way I was making the query; my query involved more complex functions and also a GROUP BY tag. I also tried query analysis using the EXPLAIN ANALYZE <query> command to check how much time a query takes to execute. I resolved that and noticed a huge improvement in CPU load. Basically I can suggest the following (concrete commands are sketched below):
- curl -o <file name> http://localhost:8086/debug/pprof/all?cpu=true and collect the result.
- EXPLAIN ANALYZE <query> and try to improve the query format.
- SHOW SERIES CARDINALITY
I was having high CPU problems while testing my InfluxDB instance, but it drastically dropped and stayed stable as soon as I enabled authentication. Config file:
  [http]
    auth-enabled = true
Don’t know why, just passing on the experience.
Version 1.8
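Translating the profiling and query-analysis suggestions above into concrete commands, as a sketch only: the time range and GROUP BY shape merely mirror the CQ from the original report, and all_operated is the database seen in the access log.
  # Collect CPU and runtime profiles while the load is high
  curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=true"
  # See where a suspect query spends its time
  influx -database 'all_operated' -execute 'EXPLAIN ANALYZE SELECT mean(*) FROM raw./.*/ WHERE time > now() - 1h GROUP BY time(5m), *'
  # Check whether series cardinality is the real culprit
  influx -database 'all_operated' -execute 'SHOW SERIES CARDINALITY'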
I think I figured this one out. I’m running a MING stack under Docker on a 4GB Raspberry Pi 4 (via SensorsIot/IOTstack) and had CPU utilisation from InfluxDB routinely sitting around 85%.
I have two identical RPis, one “live”, one “test”. The live RPi was ingesting data from MQTT via Node-Red at the rate of 600 insertions per hour. No continuous queries of my own. No retention policies of my own other than “keep everything”. A few Grafana dashboards but even when none of those were open, I was still seeing numbers like 85%.
This was on both RPis, even though the “test” RPi wasn’t ingesting any data nor responding to Grafana queries.
It made no sense.
It turns out to be the "_internal" database. For some reason, this takes a HUGE amount of resources. It did seem to have a 24H retention policy, so maybe it was all that gardening.
I added this environment key:
  INFLUXDB_MONITOR_STORE_ENABLED=FALSE
then recreated the InfluxDB container and, bingo, CPU on the “live” RPi is in the 1…5% range unless I do something like ask Grafana to show me a graph of something “for the last year”. That’ll get a quick spike to 250% of CPU for InfluxDB but it will drop back to negligible as soon as I close the browser window.
CPU on the “test” RPi has numbers like 0.3%.
Also turns out that it is safe to drop database _internal. If you ever need to re-enable the setting, the _internal database gets auto-recreated. I'll leave it up to People Who Know to figure out why the internal monitoring database is such a resource hog.
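A quick sketch of that cleanup in the influx shell, assuming the monitor store has already been disabled as described above (otherwise the monitor service simply recreates the database):
  > DROP DATABASE "_internal"
  > SHOW DATABASES
Re-enabling store-enabled, or removing the environment variable, and restarting brings _internal back automatically, as noted above.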