core: Client unable to keep up with pending messages

See latest update here: https://github.com/home-assistant/core/issues/68030#issuecomment-1153241799

The problem

Client unable to keep up with pending messages Errors in my log. They occur sporadically over the day, including times no client apps are open. I think it’s related to the last update to supervisor to 2022.03.3 I noticed these errors since that update. My CPU load is Low, avg 5% spikes below 40%

What version of Home Assistant Core has the issue?

Core-2022.3.3

What was the last working version of Home Assistant Core?

Core-2022.3.1

What type of installation are you running?

Home Assistant OS

Integration causing the issue

websocket_api

Link to integration documentation on our website

https://www.home-assistant.io/integrations/websocket_api/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

Logger: homeassistant.components.websocket_api.http.connection
Source: components/websocket_api/http.py:126 
Integration: Home Assistant WebSocket API (documentation, issues) 
First occurred: 00:15:09 (7 occurrences) 
Last logged: 09:06:25

[547244200192] Client unable to keep up with pending messages. Stayed over 512 for 5 seconds

Additional information

No response

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 11
  • Comments: 55 (15 by maintainers)

Most upvoted comments

Analyzing the problem

After some testing, it is evident that the problem is significant CPU overhead, syscall overhead, or I/O overload. Since this is triggered by just about anything causing the overload, I will only outline the most common case below.

When an event storm happens (think many events happen nearly the same time), the system gets behind, so the messages get queued. The queue processor can’t keep up and eventually gives up. Adding 5000 power sensors with a modified https://github.com/bdraco/scaletest quickly triggers the issue.

Even if all of the below gets implemented, this issue is usually just a symptom of an overload problem. It is essential to find the cause and reduce the event storms from happening in the first place. If all else fails, throwing hardware at it is one way to prevent this from happening.

Solving the I/O + syscall issue

When designing the live logbook API, we added event coalescing to reduce the risk the system wouldn’t be able to keep up. Arguably the risk is a lot lower here since its unlikely multiple live logbook stream are happening at once.

There are two WebSocket APIs that generate the most traffic subscribe_events and subscribe_entities. The newer subscribe_entities generates much smaller payloads, so it is less likely to happen since we added that. The subscribe_events endpoint generates larger data structures which are much more expensive to serialize to JSON. Currently, the frontend uses the new subscribe_entites endpoint, but software like appdaemon etc, still uses the older subscribe_events endpoint, which is more likely to trigger the issue. Even with the new faster subscribe_entities API with 5000 power sensors, we still get to this state.

Adding support for coalescing multiple messages should significantly reduce the system call, and I/O overhead since the websocket won’t be in a read/write state as much. It reduces the serialize overhead (just a bit) since we wouldn’t have to pay overhead to serialize many small messages. The downside is we currently cache the serialize here, so this isn’t something we can do yet because the JSON serialize overhead is one of the most CPU-expensive paths in Home Assistant now.

Implementing event coalescing is a relatively large undertaking since all the frontend APIs would have to support message coalescing. There currently aren’t any plans to implement this, but I’m documenting it here in case someone has the time to do it in the future.

Solving the CPU bound case

As mentioned above, serializing to JSON is currently relatively expensive since it is all done with python’s built-in JSON serializer, which isn’t very speedy. A PR is open here to switch to using orjson, which is much faster: https://github.com/home-assistant/core/pull/72847. Once merged, it will further reduce the risk that the overload can happen if it’s a CPU-bound case. It will likely help with the I/O + syscall case I outlined above as well but will probably not solve all the instances.

If you still see this message, disable the integration that is overloading the system or upgrade to a faster cpu. Start by disabling custom integration if you don’t know which one.

There isn’t much more we can do to improve the efficiency of the system at this point.

I saw hundreds of this message today after auto-Supervisor-Update to 2022.04.0.

After a Core-reboot afterwards, I didn’t see this again now.

@bdraco Would you mind discussing how you investigates these systems? I have been digging through logs, disabling integrations, whatever I can think of to get my instance working again and with the unstable UI (due to the WS issue) it is extremely time consuming and frustrating. I have power sensors (not what I think is a lot, and had them for more than a year now) and some integrations (mostly disabled now) but I still get this error. Just looking for some suggestions on how to look for these. I see the WS messages in the browser but not really sure how to correlate those to when things break.

Adding py-spy to the HA distribution with some simple controls to create a dump under development tools would be great. These dumps are used by the more experienced developers to analyse the performance culprits, but the process to create them is too complex for many of us. Best Eric

I had this errors since remember, ~4years running HA

Running as vm =>virtualbox =>Ubuntu (lots of hardware resources available)

I saw the subscribe_entities call being mentioned here. I cannot find that in the websocket API documentation so I did not use it for my custom addon.

Is this an internal method which is not supposed to used? Or is just the documentation missing?

Go ahead and use it. It’s unlikely to change at this point as it’s been proven over multiple releases. We do need to add some docs for it

Just a thought – how about a queue viewer? Or more descriptive log entries detailing which messages were rejected/delayed and the source?

@bdraco - Appreciate your response but I’m going to step in here and say this – In troubleshooting this issue, I have run my primary instance of HA on a VMware cluster, to which I’ve dedicated/prioritized 12 physical cores and 48GB of RAM, on super fast SSD storage and 10GbE. Local and hosted in known/validated datacenter hardware.

I’ve run down the addons and integrations that delay startup and are noisy with errors in the logs (especially with debug logging enabled). This issue comes and goes no matter the resources and no matter the integration idiosyncrasies.

So please don’t automatically dismiss reports of issues that seem to have obvious solutions so quickly – this issue has remained open and vexing because of the nebulous nature.

Thanks for your reply! I hope we can refocus our efforts together to get this resolved. At the very least, we should push to have the resources to troubleshoot this more efficiently – though the startup time monitoring was a big step in the right direction (https://dummylabs.com/posts/2021-03-16-home-assistant-startup-time-sensor/). Good place to start but the issue remains for users that have validated their installations.

fwiw I’ve noticed this error on occasion and found one in yesterday’s log. No cpu spike on my server, but the timing coincided with my wife running an errand - I had just updated her mobile client and re-added our Nabu Casa URL to see if her location would update (stopped working a while back). Her location did indeed update just fine, but the timing of the error may be suspect. Or not - thought I’d drop the breadcrumb in any case. Running HA Container / core-2022.7.5. LMK if more/different detail would help.

No cpu bump likely means you hit the I/O limit of how much data can be processed down the websocket connection at once. We would have to implement the event coalescing to solve that since its likely the number of message and not the contents that are the problem.

the same here from time to time

Logger: homeassistant.components.websocket_api.http.connection
Source: components/websocket_api/http.py:130
Integration: Home Assistant WebSocket API ([documentation](https://www.home-assistant.io/integrations/websocket_api), [issues](https://github.com/home-assistant/home-assistant/issues?q=is%3Aissue+is%3Aopen+label%3A%22integration%3A+websocket_api%22))
First occurred: 15 czerwca 2022, 10:09:28 (4 occurrences)
Last logged: 13:41:57

[546576232256] Client unable to keep up with pending messages. Stayed over 512 for 5 seconds
[545397552176] Client unable to keep up with pending messages. Stayed over 512 for 5 seconds
[543838832336] Client unable to keep up with pending messages. Stayed over 512 for 5 seconds

I have this as well:

[140619212881488] Client unable to keep up with pending messages. Stayed over 512 for 5 seconds
7:41:05 PM – (ERROR) Home Assistant WebSocket API - message first occurred at 5:35:05 PM and shows up 263 times
[140619212881488] Client exceeded max pending messages [2]: 2048
7:41:00 PM – (ERROR) Home Assistant WebSocket API - message first occurred at 5:35:00 PM and shows up 682 times

Could it be related to webscraper “scrape”, which seems flaky since update to 2022.4.x ?

System:

version: core-2022.4.7
installation_type: Home Assistant OS
dev: false
hassio: true
docker: true
user: root
virtualenv: false
python_version: 3.9.9
os_name: Linux
os_version: 5.10.108
arch: x86_64


GitHub API: ok
GitHub Content: ok
GitHub Web: ok
GitHub API Calls Remaining: 4259
Installed Version: 1.24.5
Stage: running
Available Repositories: 1095
Downloaded Repositories: 26


can_reach_server: ok
remaining_requests: 18


logged_in: false
can_reach_cert_server: ok
can_reach_cloud_auth: ok
can_reach_cloud: ok


component_version: 0.9.42
reach_easee_cloud: ok
connected2stream: true


host_os: Home Assistant OS 7.6
update_channel: stable
supervisor_version: supervisor-2022.04.0
docker_version: 20.10.9
disk_total: 116.7 GB
disk_used: 25.8 GB
healthy: true
supported: true
board: generic-x86-64
supervisor_api: ok
version_api: ok
installed_addons: AirSonos (3.3.0), AppDaemon (0.8.2), InfluxDB (4.4.1), Grafana (7.5.2), MariaDB (2.4.0), Portainer (2.0.0), SSH & Web Terminal (10.1.1), phpMyAdmin (0.7.1), FTP (4.5.0), Glances (0.15.0), File editor (5.3.3), deCONZ (6.12.0), ESPHome (2022.3.1), Mosquitto broker (6.0.1), Log Viewer (0.13.0), NetDaemon V3 - beta (22.14.0), Node-RED (11.1.1)


dashboards: 4
resources: 17
views: 12
mode: storage


api_endpoint_reachable: ok

This issue was once active and closed #40443

Many feel the need to reopen that - or this is just as good…

Personally, I never had this issue, it started when I updated to 2022.4.3