operating-system: HassOS 8.1 breaks add-on
Describe the issue you are experiencing
I’m the maintainer of a niche Home Assistant add-on that adds an MQTT interface to a web-based burglar alarm (https://github.com/maisken/Paradox_IP150). It’s mostly python, wrapped into an Home Assistant add-on.
The code has been stable (no updates) for the last 2+ years at v1.1, working reliably up to and including HassOS 8.0.
With the update to HassOS 8.1, both myself and a few users have experienced an erratic behavior of the add-on. It seems that, at random intervals, outgoing network connections to the web UI of the burglar alarm go on timeout (at which point the add-on quits – see add-on logs below).
I know the link between the add-on behavior and the HassOS 8.1 update seems far fetched, but I ran several tests, and I wouldn’t be opening this issue if (to my own surprise) I hadn’t identified this as the cause of the erratic behavior. See below for all the tests I carried out which led me to this conclusion.
I’ve also just released v1.2 of the add-on, where I have refreshed all python dependencies, as well as rebuilt the add-on image against the latest Home Assistant add-on base image. But the problem persists.
Downgrading to HassOS 8.0 (ha os update --version 8.0) fixes the erratic behavior, and the add-on (both v1.1 and v1.2) works reliably as usual. Also, running the python code on a local machine, or running the add-on on the VS devcontainer deployment of Home Assistant (which, crucially, doesn’t use HassOS) works reliably. It’s just using HassOS 8.1 that seems problematic.
I’ve inspected the commits between 8.0 and 8.1:
- some add support for additional wireless network interfaces, or deal with NTP – these shouldn’t be relevant
- guest agents have been bumped: I tried to run 8.1 with agent support disabled on the host (so the guest agent would not start), but the same problem persist. So I tend to assume this is not the cause
- Kernel version has been bumped: I would be shocked if this was causing the issue, but I don’t know how to try more systematically
- Other items like: buildroot; grub2 alignment, RPI-RF-MOD etc: I’m not sure I understand their potential impact on the matter
The rest of my Home Assistant production deployment seems to operate correctly with 8.1 – just the add-on breaks this way.
If not a solution, would you have any insight on further tests I could perform, or how to test one-by-one the different commits that go from 8.0 to 8.1, maybe?
What operating system image do you use?
ova (for Virtual Machines)
What version of Home Assistant Operating System is installed?
8.1
Did you upgrade the Operating System.
Yes
Steps to reproduce the issue
The add-on can be installed via this repository, however it won’t start if it can’t connect to the expected burglar alarm web interface. Reproducing the issue may be tricky, but I’d be happy to run more tests and share the results as instructed.
Here I can provide evidence showing the update to HassOS 8.1 is causing the issue. In the image below, you see the history for the add-on MQTT alarm status. The red circle represents a 8.0 -> 8.1 update
Before the update, you can see the MQTT alarm status is reliable (Disarmed, then Armed Home). After the 8.1 update, see the several grey bits, where the add-on becomes “Unavailable”. This is a network connection timing out, the add-on quitting, then watchdog restarting it, making it available again. Notice the grey slots are frequent, but not regular.
Similarly, the image below shows a 8.1 -> 8.0 downgrade.
Before downgrading, you can see the “grey slot” (Unavailable) pattern. After the downgrade (red circle), the add-on becomes reliable again.
When running under 8.1, the add-on logs show “time out” errors like (taken from add-on v1.1; the v1.2 would use python 3.9, but the error is the same):
Exception in thread Thread-1: Traceback (most recent call last): File “/usr/lib/python3.8/site-packages/urllib3/connection.py”, line 156, in _new_conn conn = connection.create_connection( File “/usr/lib/python3.8/site-packages/urllib3/util/connection.py”, line 84, in create_connection raise err File “/usr/lib/python3.8/site-packages/urllib3/util/connection.py”, line 74, in create_connection sock.connect(sa) TimeoutError: [Errno 110] Operation timed out During handling of the above exception, another exception occurred: Traceback (most recent call last): File “/usr/lib/python3.8/site-packages/urllib3/connectionpool.py”, line 665, in urlopen httplib_response = self._make_request( File “/usr/lib/python3.8/site-packages/urllib3/connectionpool.py”, line 387, in _make_request conn.request(method, url, **httplib_request_kw) File “/usr/lib/python3.8/http/client.py”, line 1255, in request self._send_request(method, url, body, headers, encode_chunked) File “/usr/lib/python3.8/http/client.py”, line 1301, in _send_request self.endheaders(body, encode_chunked=encode_chunked) File “/usr/lib/python3.8/http/client.py”, line 1250, in endheaders self._send_output(message_body, encode_chunked=encode_chunked) File “/usr/lib/python3.8/http/client.py”, line 1010, in _send_output self.send(msg) File “/usr/lib/python3.8/http/client.py”, line 950, in send self.connect() File “/usr/lib/python3.8/site-packages/urllib3/connection.py”, line 184, in connect conn = self._new_conn() File “/usr/lib/python3.8/site-packages/urllib3/connection.py”, line 168, in _new_conn raise NewConnectionError( urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fb5f6b50940>: Failed to establish a new connection: [Errno 110] Operation timed out During handling of the above exception, another exception occurred: Traceback (most recent call last): File “/usr/lib/python3.8/site-packages/requests/adapters.py”, line 439, in send resp = conn.urlopen( File “/usr/lib/python3.8/site-packages/urllib3/connectionpool.py”, line 719, in urlopen retries = retries.increment( File “/usr/lib/python3.8/site-packages/urllib3/util/retry.py”, line 436, in increment raise MaxRetryError(_pool, url, error or ResponseError(cause)) urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host=‘10.0.100.20’, port=80): Max retries exceeded with url: /keep_alive.html?msgid=1 (Caused by NewConnectionError(‘<urllib3.connection.HTTPConnection object at 0x7fb5f6b50940>: Failed to establish a new connection: [Errno 110] Operation timed out’)) During handling of the above exception, another exception occurred: Traceback (most recent call last): File “/usr/lib/python3.8/threading.py”, line 932, in _bootstrap_inner self.run() File “/ip150.py”, line 29, in run self._one_keepalive() File “/ip150.py”, line 24, in _one_keepalive requests.get(‘{}/keep_alive.html’.format( File “/usr/lib/python3.8/site-packages/requests/api.py”, line 75, in get return request(‘get’, url, params=params, **kwargs) File “/usr/lib/python3.8/site-packages/requests/api.py”, line 60, in request return session.request(method=method, url=url, **kwargs) File “/usr/lib/python3.8/site-packages/requests/sessions.py”, line 533, in request resp = self.send(prep, **send_kwargs) File “/usr/lib/python3.8/site-packages/requests/sessions.py”, line 646, in send r = adapter.send(request, **kwargs) File “/usr/lib/python3.8/site-packages/requests/adapters.py”, line 516, in send raise ConnectionError(e, request=request) requests.exceptions.ConnectionError: HTTPConnectionPool(host=‘10.0.100.20’, port=80): Max retries exceeded with url: /keep_alive.html?msgid=1 (Caused by NewConnectionError(‘<urllib3.connection.HTTPConnection object at 0x7fb5f6b50940>: Failed to establish a new connection: [Errno 110] Operation timed out’))
Anything in the Supervisor logs that might be useful for us?
I've extensively looked at the Supervisor logs, but didn't find anything apparently relevant. Please let me know if there's any part of that log you'd like me to report here.
Anything in the Host logs that might be useful for us?
I've extensively looked at the Host logs, but didn't find anything apparently relevant. Please let me know if there's any part of that log you'd like me to report here.
System Health information
System Health
| version | core-2022.5.5 |
|---|---|
| installation_type | Home Assistant OS |
| dev | false |
| hassio | true |
| docker | true |
| user | root |
| virtualenv | false |
| python_version | 3.9.9 |
| os_name | Linux |
| os_version | 5.15.41 |
| arch | x86_64 |
| timezone | Europe/Madrid |
Home Assistant Community Store
| GitHub API | ok |
|---|---|
| GitHub Content | ok |
| GitHub Web | ok |
| GitHub API Calls Remaining | 5000 |
| Installed Version | 1.25.5 |
| Stage | running |
| Available Repositories | 1045 |
| Downloaded Repositories | 4 |
Home Assistant Cloud
| logged_in | false |
|---|---|
| can_reach_cert_server | ok |
| can_reach_cloud_auth | ok |
| can_reach_cloud | ok |
Home Assistant Supervisor
| host_os | Home Assistant OS 8.1 |
|---|---|
| update_channel | stable |
| supervisor_version | supervisor-2022.05.3 |
| docker_version | 20.10.14 |
| disk_total | 9.1 GB |
| disk_used | 6.4 GB |
| healthy | true |
| supported | true |
| board | ova |
| supervisor_api | ok |
| version_api | ok |
| installed_addons | Terminal & SSH (9.4.0), File editor (5.3.3), Node-RED (10.1.1), Mosquitto broker (6.1.2), Paradox IP150 MQTT Adapter (1.2) |
Dashboards
| dashboards | 1 |
|---|---|
| resources | 2 |
| views | 8 |
| mode | storage |
Additional information
No response
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 29 (7 by maintainers)
I believe I figured out the core issue, and now sorted it somewhat in the add-on itself. In a nutshell, the add-on HTTP client is re-using source TCP ports too early, before the TIME_WAIT timeout expires on the Paradox web server. Since there are several TCP source ports, usually the kernel picks a fresh one (or re-uses one after a long time), and the HTTP connection goes well. However, sometimes the kernel picks the same source TCP port that was just used a few seconds before. In that case, the Paradox web server (which is still in its TIME_WAIT state) ignores the new TCP SYN packets, leading to the timeout on the client.
The problem was fundamentally always there, but it became exacerbated with kernel 5.15.41 (introduced in HassOS 8.1, and several other recent Linux distributions), when the source TCP port picking algorithm was changed, making early port re-use more likely.
The add-on now uses a very aggressive timeout on HTTP polling GET requests. If the timeout expires, it likely means we’re facing an early TCP source port re-use; so the GET request is re-issued (up to 5 attempts).
More background, for completeness:
That may explain other timeouts at the OS level, as reported above in this thread.
Hope this helps and thanks for the support!
Oh well, indeed! Also thanks for all the support, and patience with this issue!
@djsomi thanks for sharing. Please let’s move to the add-on thread for support about running the add-on.
Its already published, seems OK for me from yesterday, sensors not showing offline status. I only get this in the logs: