core: MemoryError and crash of python and homeassistant docker container when using kef integration
The problem
The python3 process running homeassistant died. The “homeassistant” container crashed.
Looking at the home-assistant log (/mnt/data/supervisor/homeassistant/home-assistant.log from HassOS) the final entry is at midnight local time (15:00 UTC) and this is reflected in the last-modified timestamp of the file.
2020-06-03 00:00:18 WARNING (zeroconf-Engine-1848242208) [zeroconf] Exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/zeroconf/__init__.py", line 1292, in handle_read
MemoryError
Even though the MemoryError is written here, and the process doesn’t fully die until an hour later, the high memory consumption and CPU usage spikes about an hour prior.
At around 13:55:11 UTC is where memory and CPU utilization begins to climb.
With some debug logging enabled, I also see that even “DEBUG” log entries from homeassistant.core cease. Final DEBUG entry is at 22:55:01 local time (13:55:01 UTC) just prior to . The debug entries are only regular sensor update information, “Bus:Handling <Event state_changed[L]> …”.
Over the following hour, for “python3 -m homeassistant --config /config” the CPU is pegged at 100% and memory utilization climbs from a typical mere 335 MB VIRT (200 MB RES) to 1169 MB VIRT (~700 MB RES) in just 5 minutes (13:59:52 UTC). CPU utilization hangs back a bit (possibly because the SWAP isn’t being hammered.)
It kicks up again at 14:05:27 UTC, with 100% CPU and gradually climbing memory usage. It plateaus again at 14:06:33 UTC with 1384 MB VIRT (740 MB RES).
This remains stable until 14:19:12 UTC, going up further to 1652 MB VIRT by 14:20:34 UTC. Rinse and repeat at 14:37:15 UTC, climbing to 1987 MB VIRT by 14:39:02 UTC. 15:00:02 UTC we climb again, to 2029 MB VIRT by 15:00:20 UTC. This is around the time that MemoryError happens. The python3 process lives for another hour at the same memory utilization and approximately 15% CPU utilization on average. Then it dies and goes away.
Environment
- Home Assistant Core release with the issue: I’ve had similar memory exhaustion issues with home assistant since first installing. 0.109, and several 0.110.x. Lack of information in logs makes it impossible to determine the cause. Most recently 0.110.4, with HassOS 3.13.
- Last working Home Assistant Core release (if known): N/A
- Operating environment (Home Assistant/Supervised/Docker/venv): Raspberry Pi 3, HassOS based installation from recommended 32-bit image
- Integration causing this issue: N/A
- Link to integration documentation on our website: N/A
Problem-relevant configuration.yaml
First happened with default configuration.yaml, only loaded my Google Home units (4x Home Mini, 1x Nest Mini, 1x Chromecast, 1x Chromecast Audio, 1x JBL speaker) and Philips Hue bridge with one light. Since then I’ve added more, but with the lack of logs I cannot say if it’s related to any specific configuration or the base Home Assistant core system.
Traceback/Error logs
DEBUG was active on homeassistant.core and it stops producing any logs about an hour prior to MemoryError being output into the home-assistant.log. See the detailed flow of events above. The memory utilization of python starts to grow at the same time as the homeassistant.core DEBUG logging stops.
MemoryError
Additional information
I’ve tried to collect as many logs as possible but still cannot see what is triggering this runaway memory usage which results in a crash.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 30 (27 by maintainers)
That was the old implementation by the way. I just don’t understand how this could lead to the errors.
The rationale behind it code is that every time after communicating with the device, that the connection is broken off after two seconds. I do this because when there is a connection, one cannot use the KEF app for example (also reconnecting is cheap).
I am not sure why the memory error or high CPU usage happens.
Before
aiokef
did a very cheap check 2 times per second to see if the connection with the speaker should be closed. When it was actually closed, the check continued. Now it schedules a task to close the connection at a specific time and extends that time if a new operation is performed.Again, I am not sure why this would actually be a problem, but since this resulted in many calls I decided to fix it (which seems to work).
I’ve had py-spy recording while everything froze up. This is the output, where I see that aiokef (used in the
kef
integration and my own creation 😅) mostly occupied the event-loop. I’ve fixed this in https://github.com/home-assistant/core/pull/36574.Since I’ve made that change, I have not seen the problem anymore. I’ve also updated to
core==0.111.0b4
.@logan893, do you use the
kef
integration by any chance?