operating-system: HassOS 8.1 breaks add-on

Describe the issue you are experiencing

I’m the maintainer of a niche Home Assistant add-on that adds an MQTT interface to a web-based burglar alarm (https://github.com/maisken/Paradox_IP150). It’s mostly python, wrapped into an Home Assistant add-on.

The code has been stable (no updates) for the last 2+ years at v1.1, working reliably up to and including HassOS 8.0.

With the update to HassOS 8.1, both myself and a few users have experienced an erratic behavior of the add-on. It seems that, at random intervals, outgoing network connections to the web UI of the burglar alarm go on timeout (at which point the add-on quits – see add-on logs below).

I know the link between the add-on behavior and the HassOS 8.1 update seems far fetched, but I ran several tests, and I wouldn’t be opening this issue if (to my own surprise) I hadn’t identified this as the cause of the erratic behavior. See below for all the tests I carried out which led me to this conclusion.

I’ve also just released v1.2 of the add-on, where I have refreshed all python dependencies, as well as rebuilt the add-on image against the latest Home Assistant add-on base image. But the problem persists.

Downgrading to HassOS 8.0 (ha os update --version 8.0) fixes the erratic behavior, and the add-on (both v1.1 and v1.2) works reliably as usual. Also, running the python code on a local machine, or running the add-on on the VS devcontainer deployment of Home Assistant (which, crucially, doesn’t use HassOS) works reliably. It’s just using HassOS 8.1 that seems problematic.

I’ve inspected the commits between 8.0 and 8.1:

some add support for additional wireless network interfaces, or deal with NTP – these shouldn’t be relevant
guest agents have been bumped: I tried to run 8.1 with agent support disabled on the host (so the guest agent would not start), but the same problem persist. So I tend to assume this is not the cause
Kernel version has been bumped: I would be shocked if this was causing the issue, but I don’t know how to try more systematically
Other items like: buildroot; grub2 alignment, RPI-RF-MOD etc: I’m not sure I understand their potential impact on the matter

The rest of my Home Assistant production deployment seems to operate correctly with 8.1 – just the add-on breaks this way.

If not a solution, would you have any insight on further tests I could perform, or how to test one-by-one the different commits that go from 8.0 to 8.1, maybe?

What operating system image do you use?

ova (for Virtual Machines)

What version of Home Assistant Operating System is installed?

8.1

Did you upgrade the Operating System.

Yes

Steps to reproduce the issue

The add-on can be installed via this repository, however it won’t start if it can’t connect to the expected burglar alarm web interface. Reproducing the issue may be tricky, but I’d be happy to run more tests and share the results as instructed.

Here I can provide evidence showing the update to HassOS 8.1 is causing the issue. In the image below, you see the history for the add-on MQTT alarm status. The red circle represents a 8.0 -> 8.1 update Screen Shot 2022-05-29 at 15 30 26 Before the update, you can see the MQTT alarm status is reliable (Disarmed, then Armed Home). After the 8.1 update, see the several grey bits, where the add-on becomes “Unavailable”. This is a network connection timing out, the add-on quitting, then watchdog restarting it, making it available again. Notice the grey slots are frequent, but not regular.

Similarly, the image below shows a 8.1 -> 8.0 downgrade. Screen Shot 2022-05-31 at 19 08 50 Before downgrading, you can see the “grey slot” (Unavailable) pattern. After the downgrade (red circle), the add-on becomes reliable again.

When running under 8.1, the add-on logs show “time out” errors like (taken from add-on v1.1; the v1.2 would use python 3.9, but the error is the same):

Exception in thread Thread-1: Traceback (most recent call last): File “/usr/lib/python3.8/site-packages/urllib3/connection.py”, line 156, in _new_conn conn = connection.create_connection( File “/usr/lib/python3.8/site-packages/urllib3/util/connection.py”, line 84, in create_connection raise err File “/usr/lib/python3.8/site-packages/urllib3/util/connection.py”, line 74, in create_connection sock.connect(sa) TimeoutError: [Errno 110] Operation timed out During handling of the above exception, another exception occurred: Traceback (most recent call last): File “/usr/lib/python3.8/site-packages/urllib3/connectionpool.py”, line 665, in urlopen httplib_response = self._make_request( File “/usr/lib/python3.8/site-packages/urllib3/connectionpool.py”, line 387, in _make_request conn.request(method, url, **httplib_request_kw) File “/usr/lib/python3.8/http/client.py”, line 1255, in request self._send_request(method, url, body, headers, encode_chunked) File “/usr/lib/python3.8/http/client.py”, line 1301, in _send_request self.endheaders(body, encode_chunked=encode_chunked) File “/usr/lib/python3.8/http/client.py”, line 1250, in endheaders self._send_output(message_body, encode_chunked=encode_chunked) File “/usr/lib/python3.8/http/client.py”, line 1010, in _send_output self.send(msg) File “/usr/lib/python3.8/http/client.py”, line 950, in send self.connect() File “/usr/lib/python3.8/site-packages/urllib3/connection.py”, line 184, in connect conn = self._new_conn() File “/usr/lib/python3.8/site-packages/urllib3/connection.py”, line 168, in _new_conn raise NewConnectionError( urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fb5f6b50940>: Failed to establish a new connection: [Errno 110] Operation timed out During handling of the above exception, another exception occurred: Traceback (most recent call last): File “/usr/lib/python3.8/site-packages/requests/adapters.py”, line 439, in send resp = conn.urlopen( File “/usr/lib/python3.8/site-packages/urllib3/connectionpool.py”, line 719, in urlopen retries = retries.increment( File “/usr/lib/python3.8/site-packages/urllib3/util/retry.py”, line 436, in increment raise MaxRetryError(_pool, url, error or ResponseError(cause)) urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host=‘10.0.100.20’, port=80): Max retries exceeded with url: /keep_alive.html?msgid=1 (Caused by NewConnectionError(‘<urllib3.connection.HTTPConnection object at 0x7fb5f6b50940>: Failed to establish a new connection: [Errno 110] Operation timed out’)) During handling of the above exception, another exception occurred: Traceback (most recent call last): File “/usr/lib/python3.8/threading.py”, line 932, in _bootstrap_inner self.run() File “/ip150.py”, line 29, in run self._one_keepalive() File “/ip150.py”, line 24, in _one_keepalive requests.get(‘{}/keep_alive.html’.format( File “/usr/lib/python3.8/site-packages/requests/api.py”, line 75, in get return request(‘get’, url, params=params, **kwargs) File “/usr/lib/python3.8/site-packages/requests/api.py”, line 60, in request return session.request(method=method, url=url, **kwargs) File “/usr/lib/python3.8/site-packages/requests/sessions.py”, line 533, in request resp = self.send(prep, **send_kwargs) File “/usr/lib/python3.8/site-packages/requests/sessions.py”, line 646, in send r = adapter.send(request, **kwargs) File “/usr/lib/python3.8/site-packages/requests/adapters.py”, line 516, in send raise ConnectionError(e, request=request) requests.exceptions.ConnectionError: HTTPConnectionPool(host=‘10.0.100.20’, port=80): Max retries exceeded with url: /keep_alive.html?msgid=1 (Caused by NewConnectionError(‘<urllib3.connection.HTTPConnection object at 0x7fb5f6b50940>: Failed to establish a new connection: [Errno 110] Operation timed out’))

Anything in the Supervisor logs that might be useful for us?

I've extensively looked at the Supervisor logs, but didn't find anything apparently relevant. Please let me know if there's any part of that log you'd like me to report here.

Anything in the Host logs that might be useful for us?

I've extensively looked at the Host logs, but didn't find anything apparently relevant. Please let me know if there's any part of that log you'd like me to report here.

System Health information

System Health

version	core-2022.5.5
installation_type	Home Assistant OS
dev	false
hassio	true
docker	true
user	root
virtualenv	false
python_version	3.9.9
os_name	Linux
os_version	5.15.41
arch	x86_64
timezone	Europe/Madrid

Home Assistant Community Store

GitHub API	ok
GitHub Content	ok
GitHub Web	ok
GitHub API Calls Remaining	5000
Installed Version	1.25.5
Stage	running
Available Repositories	1045
Downloaded Repositories	4

Home Assistant Cloud

logged_in	false
can_reach_cert_server	ok
can_reach_cloud_auth	ok
can_reach_cloud	ok

Home Assistant Supervisor

host_os	Home Assistant OS 8.1
update_channel	stable
supervisor_version	supervisor-2022.05.3
docker_version	20.10.14
disk_total	9.1 GB
disk_used	6.4 GB
healthy	true
supported	true
board	ova
supervisor_api	ok
version_api	ok
installed_addons	Terminal & SSH (9.4.0), File editor (5.3.3), Node-RED (10.1.1), Mosquitto broker (6.1.2), Paradox IP150 MQTT Adapter (1.2)

Dashboards

dashboards	1
resources	2
views	8
mode	storage

Additional information

No response

About this issue

Original URL
State: closed
Created 2 years ago
Comments: 29 (7 by maintainers)

Most upvoted comments

I believe I figured out the core issue, and now sorted it somewhat in the add-on itself. In a nutshell, the add-on HTTP client is re-using source TCP ports too early, before the TIME_WAIT timeout expires on the Paradox web server. Since there are several TCP source ports, usually the kernel picks a fresh one (or re-uses one after a long time), and the HTTP connection goes well. However, sometimes the kernel picks the same source TCP port that was just used a few seconds before. In that case, the Paradox web server (which is still in its TIME_WAIT state) ignores the new TCP SYN packets, leading to the timeout on the client.

The problem was fundamentally always there, but it became exacerbated with kernel 5.15.41 (introduced in HassOS 8.1, and several other recent Linux distributions), when the source TCP port picking algorithm was changed, making early port re-use more likely.

The add-on now uses a very aggressive timeout on HTTP polling GET requests. If the timeout expires, it likely means we’re facing an early TCP source port re-use; so the GET request is re-issued (up to 5 attempts).

More background, for completeness:

Review of TCP WAIT states
Linux 5.15.41 Changelog: commit b763fce193b42048444afd85d066b136288ad2c8 (and related) explain the change in TCP source port allocation
TCP source port fingerprinting attack, motivating the changes in 5.15.41

That may explain other timeouts at the OS level, as reported above in this thread.

Hope this helps and thanks for the support!

alfredopironti on Nov 13, 2022

Oh well, indeed! Also thanks for all the support, and patience with this issue!

alfredopironti on Nov 16, 2022

@djsomi thanks for sharing. Please let’s move to the add-on thread for support about running the add-on.

alfredopironti on Nov 15, 2022

Its already published, seems OK for me from yesterday, sensors not showing offline status. I only get this in the logs:

_WARNING:root:Error getting Paradox status update. Doing clean shutdown. Error details: HTTPConnectionPool(host='192.168.1.138', port=80): Read timed out.
WARNING:root:Error getting Paradox status update. Doing clean shutdown. Error details: HTTPConnectionPool(host='192.168.1.138', port=80): Read timed out.
WARNING:root:Error getting Paradox status update. Doing clean shutdown. Error details: HTTPConnectionPool(host='192.168.1.138', port=80): Read timed out.
WARNING:root:Error getting Paradox status update. Doing clean shutdown. Error details: HTTPConnectionPool(host='192.168.1.138', port=80): Read timed out.
WARNING:root:Error getting Paradox status update. Doing clean shutdown. Error details: HTTPConnectionPool(host='192.168.1.138', port=80): Read timed out.
WARNING:root:Error getting Paradox status update. Doing clean shutdown. Error details: HTTPConnectionPool(host='192.168.1.138', port=80): Read timed out.
WARNING:root:Error getting Paradox status update. Doing clean shutdown. Error details: HTTPConnectionPool(host='192.168.1.138', port=80): Read timed out.
WARNING:root:Error getting Paradox status update. Doing clean shutdown. Error details: HTTPConnectionPool(host='192.168.1.138', port=80): Read timed out.
WARNING:root:Error getting Paradox status update. Doing clean shutdown. Error details: HTTPConnectionPool(host='192.168.1.138', port=80): Read timed out.
Traceback (most recent call last):
  File "//ip150_mqtt.py", line 153, in <module>
    ip_mqtt.loop_forever()
  File "//ip150_mqtt.py", line 135, in loop_forever
    self.ip.login(self._cfg['PANEL_CODE'], self._cfg['PANEL_PASSWORD'])
  File "/ip150.py", line 149, in login
    raise Paradox_IP150_Error(
ip150.Paradox_IP150_Error: Wrong page fetcehd. Did you connect to the right server and port? Server returned: <!DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 4.01 Transitional//EN' 'http://www.w3.org/TR/html4/loose.dtd'>
<html>
<head>
	<meta http-equiv='content-type' content='text/html; charset=UTF-8' />
	<link rel='stylesheet' type='text/css' href='webstyles.css' />
	<script type='text/javascript' src='langue.js'></script><script type='text/javascript' src='commun.js'></script></head><body><div id = 'ERROR' style='margin-top:60px;'></div><title>RiasztÃ³ - Paradox</title><script type='text/javascript'>;var sre = "<table><tr><td align='middle'><div class='box' style='width:460px;margin:80px;'><div class='box-title strips'>" + top.ln_cant[1] + "</div><div class='box-content'><div class='msg' id='MESSAGE'></div></div></div></td></tr></table>";document.getElementById('ERROR').innerHTML = sre; document.getElementById('MESSAGE').innerHTML = top.cant('System Master');</script></body></html>_

djsomi on Nov 15, 2022