core: Client unable to keep up with pending messages
See latest update here: https://github.com/home-assistant/core/issues/68030#issuecomment-1153241799
The problem
Client unable to keep up with pending messages Errors in my log. They occur sporadically over the day, including times no client apps are open. I think it’s related to the last update to supervisor to 2022.03.3 I noticed these errors since that update. My CPU load is Low, avg 5% spikes below 40%
What version of Home Assistant Core has the issue?
Core-2022.3.3
What was the last working version of Home Assistant Core?
Core-2022.3.1
What type of installation are you running?
Home Assistant OS
Integration causing the issue
websocket_api
Link to integration documentation on our website
https://www.home-assistant.io/integrations/websocket_api/
Diagnostics information
No response
Example YAML snippet
No response
Anything in the logs that might be useful for us?
Logger: homeassistant.components.websocket_api.http.connection
Source: components/websocket_api/http.py:126
Integration: Home Assistant WebSocket API (documentation, issues)
First occurred: 00:15:09 (7 occurrences)
Last logged: 09:06:25
[547244200192] Client unable to keep up with pending messages. Stayed over 512 for 5 seconds
Additional information
No response
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 11
- Comments: 55 (15 by maintainers)
Analyzing the problem
After some testing, it is evident that the problem is significant CPU overhead, syscall overhead, or I/O overload. Since this is triggered by just about anything causing the overload, I will only outline the most common case below.
When an event storm happens (think many events happen nearly the same time), the system gets behind, so the messages get queued. The queue processor can’t keep up and eventually gives up. Adding 5000 power sensors with a modified https://github.com/bdraco/scaletest quickly triggers the issue.
Even if all of the below gets implemented, this issue is usually just a symptom of an overload problem. It is essential to find the cause and reduce the event storms from happening in the first place. If all else fails, throwing hardware at it is one way to prevent this from happening.
Solving the I/O + syscall issue
When designing the live logbook API, we added event coalescing to reduce the risk the system wouldn’t be able to keep up. Arguably the risk is a lot lower here since its unlikely multiple live logbook stream are happening at once.
There are two WebSocket APIs that generate the most traffic subscribe_events and subscribe_entities. The newer
subscribe_entitiesgenerates much smaller payloads, so it is less likely to happen since we added that. Thesubscribe_eventsendpoint generates larger data structures which are much more expensive to serialize to JSON. Currently, the frontend uses the newsubscribe_entitesendpoint, but software like appdaemon etc, still uses the oldersubscribe_eventsendpoint, which is more likely to trigger the issue. Even with the new fastersubscribe_entitiesAPI with 5000 power sensors, we still get to this state.Adding support for coalescing multiple messages should significantly reduce the system call, and I/O overhead since the websocket won’t be in a read/write state as much. It reduces the serialize overhead (just a bit) since we wouldn’t have to pay overhead to serialize many small messages. The downside is we currently cache the serialize here, so this isn’t something we can do yet because the JSON serialize overhead is one of the most CPU-expensive paths in Home Assistant now.
Implementing event coalescing is a relatively large undertaking since all the frontend APIs would have to support message coalescing. There currently aren’t any plans to implement this, but I’m documenting it here in case someone has the time to do it in the future.
Solving the CPU bound case
As mentioned above, serializing to JSON is currently relatively expensive since it is all done with python’s built-in JSON serializer, which isn’t very speedy. A PR is open here to switch to using
orjson, which is much faster: https://github.com/home-assistant/core/pull/72847. Once merged, it will further reduce the risk that the overload can happen if it’s a CPU-bound case. It will likely help with the I/O + syscall case I outlined above as well but will probably not solve all the instances.If you still see this message, disable the integration that is overloading the system or upgrade to a faster cpu. Start by disabling custom integration if you don’t know which one.
There isn’t much more we can do to improve the efficiency of the system at this point.
I saw hundreds of this message today after auto-Supervisor-Update to 2022.04.0.
After a Core-reboot afterwards, I didn’t see this again now.
debugpypy-spyrecording https://community.home-assistant.io/t/instructions-to-install-py-spy-on-haos/480473Inspect -> Network(reload the page and watch thewebsocketstream)profiler.startserviceAdding
py-spyto the HA distribution with some simple controls to create a dump under development tools would be great. These dumps are used by the more experienced developers to analyse the performance culprits, but the process to create them is too complex for many of us. Best EricI had this errors since remember, ~4years running HA
Running as vm =>virtualbox =>Ubuntu (lots of hardware resources available)
Go ahead and use it. It’s unlikely to change at this point as it’s been proven over multiple releases. We do need to add some docs for it
Just a thought – how about a queue viewer? Or more descriptive log entries detailing which messages were rejected/delayed and the source?
@bdraco - Appreciate your response but I’m going to step in here and say this – In troubleshooting this issue, I have run my primary instance of HA on a VMware cluster, to which I’ve dedicated/prioritized 12 physical cores and 48GB of RAM, on super fast SSD storage and 10GbE. Local and hosted in known/validated datacenter hardware.
I’ve run down the addons and integrations that delay startup and are noisy with errors in the logs (especially with debug logging enabled). This issue comes and goes no matter the resources and no matter the integration idiosyncrasies.
So please don’t automatically dismiss reports of issues that seem to have obvious solutions so quickly – this issue has remained open and vexing because of the nebulous nature.
Thanks for your reply! I hope we can refocus our efforts together to get this resolved. At the very least, we should push to have the resources to troubleshoot this more efficiently – though the startup time monitoring was a big step in the right direction (https://dummylabs.com/posts/2021-03-16-home-assistant-startup-time-sensor/). Good place to start but the issue remains for users that have validated their installations.
No cpu bump likely means you hit the I/O limit of how much data can be processed down the websocket connection at once. We would have to implement the event coalescing to solve that since its likely the number of message and not the contents that are the problem.
the same here from time to time
I have this as well:
Could it be related to webscraper “scrape”, which seems flaky since update to 2022.4.x ?
This issue was once active and closed #40443
Many feel the need to reopen that - or this is just as good…
Personally, I never had this issue, it started when I updated to 2022.4.3