core: Broadlink Component - Handling Of Communication Errors

The problem

It appears that the Broadlink component will not process commands after the remote device is marked as unavailable.

Environment

Home Assistant Core release with the issue: 0.112.0
Last working Home Assistant Core release (if known): Unknown.
Operating environment (OS/Container/Supervised/Core): Supervised
Integration causing this issue: homeassistant.components.broadlink
Link to integration documentation on our website: https://www.home-assistant.io/integrations/broadlink/

Problem-relevant `configuration.yaml`

remote:
- platform: broadlink
  host: 192.168.2.144
  mac: blah
  type: rm4_mini
  name: Servers Broadlink
sensor:
- platform: broadlink
  host: 192.168.2.144
  mac: blah
  type: rm4_mini
  name: Servers Broadlink
  scan_interval: 60
  monitored_conditions:
    - temperature
    - humidity

Traceback/Error logs

cat home-assistant.log | grep broad
2020-07-01 21:19:34 INFO (SyncWorker_36) [homeassistant.loader] Loaded broadlink from homeassistant.components.broadlink
2020-07-01 21:19:34 INFO (MainThread) [homeassistant.components.remote] Setting up remote.broadlink
2020-07-01 21:19:37 INFO (MainThread) [homeassistant.components.sensor] Setting up sensor.broadlink
2020-07-01 22:01:21 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired
2020-07-01 22:02:17 WARNING (MainThread) [homeassistant.components.broadlink.device] Connected to device at 192.168.2.144
2020-07-01 23:31:47 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired
2020-07-01 23:32:39 WARNING (MainThread) [homeassistant.components.broadlink.device] Connected to device at 192.168.2.144
2020-07-02 01:02:10 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired
2020-07-02 01:02:10 ERROR (MainThread) [homeassistant.components.broadlink.remote] Failed to send 'fan only/server ac': The device is offline
2020-07-02 02:32:25 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired
2020-07-02 02:33:20 WARNING (MainThread) [homeassistant.components.broadlink.device] Connected to device at 192.168.2.144
2020-07-02 05:23:55 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired
2020-07-02 05:24:51 WARNING (MainThread) [homeassistant.components.broadlink.device] Connected to device at 192.168.2.144
2020-07-02 09:10:22 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired
2020-07-02 09:11:15 WARNING (MainThread) [homeassistant.components.broadlink.device] Connected to device at 192.168.2.144

Additional information

Note that the sensor module still operates every 60 seconds, which suggests the device was accessible after 2020-07-02 01:02:10 (as can be seen by the WARNING logs, which is a known issue with minimal impact).
Restarting the device has no effect.
Restarting HA seems the only mitigation.
HA reports the device was unavailable since 2020-07-02 01:02:10, which suggests to me that being unavailable was the cause. Manually instructing HA to send commands has no impact when the device is unavailable.
Checking the logs of the Wifi Access Point, the broadlink disconnects and reconnects 20 times an hour. This appears to be normal (roaming from one access point to another).
This is reproducible on demand.
- Send command to verify connectivity.
- Disconnect Broadlink device from power.
- Send command (which will fail).
- Reconnect and allow Broadlink device to power up.
- Send command and note that the command is never attempted (note that no communication error is logged and the device is marked as unavailable).
- Restart HA and note that sending commands is now possible.

About this issue

State: closed
Created 4 years ago
Comments: 167 (72 by maintainers)

Most upvoted comments

I cracked it.

The heartbeat message that comes in via the cloud is message type 0x01. The RM3 doesn’t actually care if this comes from the cloud, via WiFi unicast, broadcast, or whatever. So, as long as you send a packet type 0x01 at least once every 3 minutes, even via broadcast on the WiFi, the devices will think they’re connected to the cloud and stop rebooting. It doesn’t even care about the packet checksum.

So, pending better integration of this into python-broadlink and/or HA, the quick fix is sticking this into your /etc/crontab:

* * * * * root echo -ne '\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x01\0\0\0\0\0\0\0\0\0' | nc -bu 192.168.7.255 80 -w 0

This broadcasts the heartbeat message every minute (substitute 192.168.7.255 with the broadcast address of your IoT network, of course). I think there is an additional timeout on top of the 3 minute cloud timeout, so I’m currently checking to see if we can afford to send it less often. (Edit: nope, needs to be every ~3 minutes; 4 minutes after stopping sending the packets both of my RM3s rebooted on exactly the same second.)
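
For anyone who prefers Python over cron + nc, the same packet can be sketched as below. This is only an illustration of the payload described above (48 bytes, all zero except 0x01 at offset 0x26, i.e. byte 38); the function and constant names are made up for this example and are not part of python-broadlink.

```python
import socket

HEARTBEAT_LEN = 0x30          # 48 bytes, same length as the echo payload above
HEARTBEAT_TYPE_OFFSET = 0x26  # byte 38 carries the message type
HEARTBEAT_TYPE = 0x01


def build_heartbeat() -> bytes:
    """Return the 48-byte packet: all zeros except type 0x01 at offset 0x26."""
    packet = bytearray(HEARTBEAT_LEN)
    packet[HEARTBEAT_TYPE_OFFSET] = HEARTBEAT_TYPE
    return bytes(packet)


def send_heartbeat(broadcast_addr: str, port: int = 80) -> None:
    """Broadcast one heartbeat over UDP (roughly what `nc -bu` does above)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Required by the OS before sending to a broadcast address.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_heartbeat(), (broadcast_addr, port))
```

Calling `send_heartbeat("192.168.7.255")` once a minute from whatever scheduler you like should be equivalent to the crontab line; substitute your own broadcast address.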

Filed mjg59/python-broadlink#458 for adding support to python-broadlink.

(As for @felipediel: unless he provides a packet log to prove there are no reconnects/DHCP requests, or otherwise shows something else he’s doing to trigger the device keepalive code, I’m just going to assume his devices are either successfully hitting the cloud, or rebooting like everyone else’s, and he just isn’t aware.)

marcan on Oct 23, 2020

FWIW, I’m seeing reconnects every 3 minutes here too, on two RM3 minis. I suspected the “I can’t talk to the cloud so I’ll restart” cause too… I’m now trying things to see if I can convince it to give up.

So far:

- The RM3 tries to use the DNS server provided via DHCP, or, if none, OpenDNS.
- It tries to resolve 10039activ.ibroadlink.com.
- If resolution times out, it reassociates after 3 minutes.
- If it gets an NXDOMAIN, it continues retrying every 10 seconds and restarts after 3 minutes.
- If it gets 127.0.0.1, it continues retrying too.
- If I give it a real IP I control, it tries to connect on port 80.
- If that connection times out (TCP port not open), it retries, sending SYN packets for 20 seconds, then retrying the whole thing before reconnecting to the WiFi again after 3 minutes.
- If I reply with ICMP port unreachable, it just ignores those replies. It does not implement ICMP properly. Same as dropping the packets.
- If I go with TCP RST rejections, it retries much more frequently, more than once a second. Still gives up and reassociates after 3m.
- If I set up an HTTP server that just returns 404s, it still retries more than once a second. Also, it does not use the domain for the Host header, but the IP, so you need to set the IP as the vhost name if your web server serves multiple domains.
- Returning empty 200s also does not work.

At this point I’m going to have to let them talk to their cloud service to see what they actually want, but it’s clear that none of the obvious blocking solutions are working here.

marcan on Oct 22, 2020

For the time being I’m limited to remote access to the Raspberry Pi running Home Assistant. I intend to try the code once I’m able to, but in the meantime I have tried running the nc-command and running your code in a custom HA addon, but none of the solutions seem to work. The broadlink is still disconnecting every hour.

I noticed that busybox nc does not include the -b flag, but I’ve tried this without any success: echo -ne '\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x01\0\0\0\0\0\0\0\0\0' | nc -uv 192.168.1.80 80 -w 1

I also made these three files, put them into the addons/ directory and started the custom addon, but that also didn’t seem to work. As far as I can tell this is the same as running your code above: config.json:

{
  "name": "Broadlink Keep Alive",
  "version": "1",
  "slug": "broadlink_keep_alive",
  "description": "Keep Broadlink from resetting",
  "arch": ["armhf", "armv7", "aarch64", "amd64", "i386"],
  "startup": "application",
  "boot": "auto",
  "options": {},
  "schema": {}
}

Dockerfile:

ARG BUILD_FROM
FROM $BUILD_FROM

ENV LANG C.UTF-8

# Install requirements for add-on
RUN apk add --no-cache python3
    
# Copy data for add-on
WORKDIR /
COPY main.py /

CMD [ "python3", "-u", "main.py" ]

main.py:

import socket
from time import sleep

while True:
    print('Sending keep alive')

    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as conn:
        # SO_BROADCAST only matters if the target is a broadcast address
        conn.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        # Heartbeat: 0x30 zero bytes with message type 0x01 at offset 0x26
        packet = bytearray(0x30)
        packet[0x26] = 1
        conn.sendto(packet, ('192.168.1.80', 80))  # Static IP of my Broadlink
    sleep(120)  # Resend every 2 minutes

IDmedia on Nov 18, 2020

Ok, things make more sense now. I actually had patched python-broadlink months back to fix the discovery issue with the incorrect binding, which as @litinoveweedle said is not a VLAN issue but just a general multiple interfaces issue (I was actually only using discovery on my laptop during configuration with the BroadlinkProv AP method, so I didn’t have to use the Broadlink app, and my laptop doesn’t use VLANs). The devices themselves only have one interface so this kind of problem does not apply.

@felipediel my devices recover quickly (2-3s) during reboots; I’m actually on a somewhat old HA/integration version so I’m not saying the current version can’t work around it with retries and timeouts. But of course it’s not ideal that the devices do this, which is why I wanted to find a way to solve the root cause. To me it seemed that you were saying that your devices weren’t rebooting at all and that this somehow had to do with VLANs.

Even with retries and timeouts and such, I expect there will be impossible-to-fix race conditions (for example, if you send it a command and it acknowledges it, but then reboots and cuts off the IR transmission; not sure if this specific one can happen, but this kind of edge case is quite likely), so it is much preferable to stop them from rebooting.

2 seconds every 3 minutes is about 1% packet loss; you seem to be getting about 0.25% packet loss, so maybe your devices recover faster than mine and this is why the problem affects you less than other people. Of course for some people the devices will recover more slowly due to their router/AP/DHCP server being slower, or the WiFi being more congested, or even maybe something like the Broadlinks scanning WiFi channels in order so that people on channel 1 get faster reconnects than people on channel 11 😃 (in fact, it pretty much needs to spend at least 100-200ms per channel during a channel scan to catch a beacon, so for 11 channels that’s… 1-2 seconds!) Edit: and yeah, I’m on channel 11.
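
The back-of-the-envelope numbers above can be checked with a couple of lines of Python. The helper names are invented for this example; the inputs (2 s outage, 3 min period, 11 channels, 100-200 ms per channel) are the ones quoted in the comment.

```python
def downtime_fraction(outage_s: float, period_s: float) -> float:
    """Fraction of time the device is unreachable if it is down for
    outage_s seconds once every period_s seconds."""
    return outage_s / period_s


def scan_time_s(channels: int, dwell_ms: float) -> float:
    """Time to scan all channels in order, spending dwell_ms on each."""
    return channels * dwell_ms / 1000.0


# 2 seconds of downtime every 3 minutes
print(f"{downtime_fraction(2, 180):.2%}")          # 1.11%

# 11 channels at 100 ms and at 200 ms per channel
print(scan_time_s(11, 100), scan_time_s(11, 200))  # 1.1 2.2
```

So "about 1%" and "1-2 seconds" are both in the right ballpark.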

marcan on Oct 24, 2020

There is a misunderstanding here. I don’t have access to the logs. I don’t know if my devices are reconnecting every 3 minutes. I am using a simple ISP router with a poor interface. What I said is: I blocked outbound connections from these devices with a silent drop rule (the only option I have) and I don’t miss any updates in Home Assistant. They are completely isolated and everything works fine. I can also ping them rock-solid for a long time:

--- 192.168.0.17 ping statistics ---
46033 packets transmitted, 45932 received, +24 errors, 0% packet loss, time 46104326ms
rtt min/avg/max/mdev = 1.352/22.035/826.071/83.565 ms, pipe 4

I am not saying the VLAN is the main problem. The root of the problem is definitely the successive reconnections when the device cannot reach the cloud. What I was trying to understand here is why my devices recover quickly and yours don’t.

@litinoveweedle I recently fixed an issue involving Broadlink devices and VLANs. It was related to socket.bind(). This is why I am insisting on this. Perhaps we are binding the socket to the wrong interface when we try to reach an offline device. But if you tested and the VLAN is not the problem, great, thank you. I believe in you. Now we are cooperating again. Did you check the logs in Home Assistant? This is what I wanted you to test.

I know that testing can be boring at times. I usually do my own tests, but when I don’t have the same tech as the person I’m helping, I have to ask. I don’t like when I ask for a test and people write a thousand line text saying that I’m stupid and the test won’t work. It is not meant to work, we are just gathering information. I usually do the logical thinking after the tests. So I think that’s why we started off bad. I had a stressful week, so I’m sorry if I was rude at some point. I am just tired.

@marcan I was starting to find you annoying, but this is really great news. I think I underestimated you. Thank you, you did a great job and earned my respect. I am adding this to the list of things I will bring to Home Assistant 🚀

felipediel on Oct 23, 2020

I’m following your #36914 pr with excitement. 😹

Is there currently any logic to recheck devices? In this case 9 hours passed without any re-attempts (commands to the device were executing every 10 minutes).

Silvenga on Jul 2, 2020

Sorry to add more to the thread, but to add to what was already discussed, it appears the newest rm3 minis on the market right now will constantly reset internally to try and connect to the cloud when they are being blocked by a firewall. Thus you see these messages - only my new rm3 mini does this. The other (2) older models don’t behave like this.

2021-05-12 16:09:30 ERROR (MainThread) [homeassistant.components.broadlink.updater] Error fetching Marks Office - IR Remote #2 (RM mini 3 at 192.168.0.126) data: [WinError 10054] An existing connection was forcibly closed by the remote host
2021-05-12 16:27:33 ERROR (MainThread) [homeassistant.components.broadlink.updater] Error fetching Marks Office - IR Remote #2 (RM mini 3 at 192.168.0.126) data: [WinError 10054] An existing connection was forcibly closed by the remote host
2021-05-12 16:45:36 ERROR (MainThread) [homeassistant.components.broadlink.updater] Error fetching Marks Office - IR Remote #2 (RM mini 3 at 192.168.0.126) data: [WinError 10054] An existing connection was forcibly closed by the remote host

I’ll see if allowing it to the cloud stops the errors, but then I’m concerned it won’t work with HA (which was my original issue). It seems the new rm3 minis suffer from this particular issue.

MarkHofmann11 on May 12, 2021

An update for the support of the last versions of Apple tvOS took something like 5-6 months to end up in an official Home Assistant release, this is why I asked… It will probably be the same for your new feature, let’s just be patient about it.

Regarding the workaround you’re talking about, I do indeed have this configuration with 2 routers. Router A already had UPnP disabled, while router B has it enabled (so Home Assistant can gather the network throughput data), and router B denies all internet access to the Broadlink RM Mini 3 (which I would like not to change). The problem is still there, so the heartbeat seems to me the best workaround, short of modifying the Broadlink firmware to remove the rebooting process.

As it is not a too painful problem, I’ll just wait for your code to be added to the main branch, I’d prefer avoiding tweaking my Home Assistant install too much, in case of a possible future reinstall. But thank you very much for your quick answer!

Castelvielh on May 7, 2021

Felipe, I’ll try the update once it becomes available. It may just be that I did something wrong. Let me know when it’s part of the main code, I’ll update and try again.

wipeout666 on Apr 17, 2021

@IDmedia Just out of curiosity, if you disable UPnP in your router settings, does the problem persist?

Yes, I can confirm that this issue persists with AND without UPnP. I’ve even tried multiple RM Pro+ units I have lying around and they act the same. No other device has any noticeable issue on the network. It’s in my parents’ home, so I’m limited by the ISP’s router with barely any settings, and for the time being it’s hard to devote time to debugging the issue further, unfortunately.

IDmedia on Nov 17, 2020

I don’t have UPnP enabled anywhere 😃

(non-bold config tree entries are empty/nonexistent, and by default that means off for these services).

I did some grepping and didn’t see any mention of FastCon in the RM3 firmware, so I get the feeling it isn’t supported in these devices.

I’ve looked through the firmware codepath for the watchdog, and I didn’t notice any branch that could stop it from triggering other than the aforementioned keepalive packet (which is how I found it).

The main network handling thread (including packet rx, reconnections, and the watchdog) is a big infinite while loop that ends like this:

The only thing that sets the app network status to 8 is receiving the keepalive packet, and that gets cleared by a different branch of the code earlier after another timeout.

marcan on Oct 30, 2020

The first problem is a design issue. When they made the firmware they didn’t foresee that users might want to block the internet access intentionally. So the devices have no logic to handle icmp-admin-prohibited packets. When the device cannot reach the cloud, it “thinks” that something went wrong and tries to reconfigure the network in order to reach the cloud, even when the user’s intention to block access is explicit.

Just to restate this somewhat, it’s a watchdog timer. The purpose is to reboot the whole device if something goes wrong and the cloud connection is down for too long. This makes perfect sense for people who want to use that feature, and I’m sure fixed a bunch of problems those users had. Treating icmp-admin-prohibited as “go away” wouldn’t be the most reliable way of fixing this for local users like us, because that’s still a protocol feature that can be accidentally triggered due to misconfiguration/etc, even transiently. IMO what they should’ve done is have an outright persistent configuration setting (e.g. as part of the Wi-Fi association packet) that just turns off the cloud stuff - not just the watchdog, but the connection attempts too, to save power and Wi-Fi traffic. But then again we all know Broadlink isn’t interested in people not using their app/cloud stuff… so we’re on our own here.

(Not arguing with you, just giving my opinion on how I’d handle this; it won’t happen anyway so it’s kind of moot 😃 ).

The recovery mechanism is not only a “reconnect to the WiFi” thing. The flow changes depending on the environment. I don’t have access to the firmware, so I don’t know exactly what is going on behind the scenes. All I know is: when there is a single network, the recovery is fast. When there are sub-networks, the devices take a long time to recover. The experiment is 100% reproducible.

I hate to hit on this point again, but to the Broadlink, there is a single network here. My Broadlink devices are attached to an SSID that has a single IP subnet on it. As far as the devices are concerned it is not different from any other dumb router with a single SSID. The fact that there are VLANs on the wire behind the AP, or that other SSIDs are being broadcast too, is not something the device can differentiate. It is not different than two neighbors with disconnected, unrelated SSIDs.

What the Broadlink devices do is simple. They reboot. The same function gets called that also gets called when there is a fault or some other unrecoverable condition. It’s the same thing as unplugging them and plugging them back in. I’m staring at the decompiled firmware here 😃

Multiple networks / Blocked -> Slow recovery / Errors

Complex network topology. I kept the previous network and created another network at 192.168.31.1 with an AX3600 router.

What do you mean by “created another network”? Created an isolated, standalone Wi-Fi network with a different SSID? Or bridged together two subnets into the same L2 network via Ethernet? In the latter case, did you have one or two DHCP servers?

If you have two IP subnets on the same network with separate DHCP servers, then of course devices are going to fight over which one they get an IP from, which is going to cause failures and timeouts. This is natural, it’s not a device bug, that would just be a broken network setup.

Edit: Or maybe they’re doing some sort of network scanning that delays recovery. The truth is we can’t determine exactly what is going on without looking at the code, so I’ll stop trying to guess. I am just sharing what I know so you can draw your own conclusions.

If you mean creating a separate isolated SSID that the device has no knowledge of and ought to ignore, then the only reason that would slow things down is due to radio congestion. But then we’re back to this having nothing to do with binding or VLANs, it’s just a general “too much stuff on the air makes Wi-Fi slow” issue, which applies regardless of whether it’s the same person running two SSIDs, or just your neighbors. I mean, I run 2 SSIDs on 2.4GHz myself, but obviously my neighbors have some networks too 😃

Problem 3: Aggressive recovery mechanism when there are multiple WiFi networks configured

This is another problem I found while I was testing the recovery mechanism. I provisioned my RM mini 3 with a new WiFi network I’ve created with the AX3600 in order to do some tests. After the tests I provisioned the device back with the old network. But somehow when I disabled the access to the cloud, the device reconnected to the AX3600 network again. And guess what? My RM pro did it too! I didn’t tell my RM pro anything about that network. Did they share this information via Fastcon?

In this scenario, the network has changed, the IP addresses have changed, everything has changed when the devices failed to reach the cloud. So this is another example of how things can go wrong here.

This is surprising. I don’t think they would share info via any back channel, and I would think they only support a single set of Wi-Fi configs, but maybe they can have multiple? In that case it would make sense that after a reboot, they would end up associating to a network at random, of all the configured ones.

I’ve never connected my units to a network that isn’t the single IoT one they are supposed to be on, so this doesn’t apply to me…

Edit: did you use SmartConfig to set up the new WiFi network, or AP mode? SmartConfig is a broadcast based solution, so it is possible that devices that you didn’t intend to configure picked up the network details too…

Problem 4: Poor network conditions can trigger the recovery mechanism too

These are @IDmedia’s conditions. He didn’t block the devices on the firewall, but something was destabilizing his connection periodically, so the devices didn’t receive “keep-alives” from the cloud and triggered the recovery mechanism. Now he picked another DNS and the problem has been remedied.

Certainly, this is possible. It could also be an unrelated issue though (not the devices rebooting, but the bad DNS somehow causing something else to make HA not be able to talk to the devices - bad DNS tends to have many weird effects). We’d need packet logs to be sure of what happened…

Really, this is why I keep hammering on packet logs - because it’s very easy to end up drawing strange conclusions from just doing trial and error experiments and observing whether the devices are stable or not. But once you have a packet log, you know exactly what is going on.

Conclusion

Our side is clean. The problem is the recovery mechanism (in the firmware). The best we can do is the non-invasive keep-alive mechanism proposed by @marcan. So I will just do it.

There’s actually an advantage to doing this. This way the watchdog timer benefits us too. In other words, if the devices crash or the Wi-Fi AP does something weird or connectivity breaks for some reason, the keep-alives will cease to arrive and the device will reboot… which is exactly what you want. We will be taking advantage of the watchdog mechanism to increase reliability and automatic failure recovery in HA setups too.
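
To make the watchdog argument concrete, here is a toy model of the behaviour described in this thread. It is not the firmware's code: the `CloudWatchdog` class, its method names and the `first_reboot_time` helper are invented for illustration, and the 180-second timeout is simply the "about 3 minutes" observed above.

```python
class CloudWatchdog:
    """Toy model: reboot if no keep-alive arrived within timeout_s seconds."""

    def __init__(self, timeout_s: float = 180.0, now: float = 0.0):
        self.timeout_s = timeout_s
        self.last_keepalive = now

    def on_keepalive(self, now: float) -> None:
        # Any type-0x01 packet resets the timer, wherever it came from.
        self.last_keepalive = now

    def should_reboot(self, now: float) -> bool:
        return now - self.last_keepalive > self.timeout_s


def first_reboot_time(keepalive_times, end_s, timeout_s=180.0):
    """Step one second at a time up to end_s; return the second at which
    the watchdog would fire, or None if it never does."""
    watchdog = CloudWatchdog(timeout_s)
    pending = set(keepalive_times)
    for now in range(int(end_s) + 1):
        if now in pending:
            watchdog.on_keepalive(now)
        if watchdog.should_reboot(now):
            return now
    return None


# Keep-alive every minute for 10 minutes: the watchdog never fires.
print(first_reboot_time(range(0, 601, 60), 600))   # None
# Keep-alives stop after t=120 s: the model reboots once 180 s have passed.
print(first_reboot_time(range(0, 121, 60), 600))   # 301
```

In this model, the same timer that causes the unwanted reboots also recovers a device that has genuinely lost contact with HA, which is the advantage described above.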

marcan on Oct 29, 2020

Small update. To my surprise I found out that I can change the IP/DNS now from the Home Assistant UI. I’ve tried changing my DNS from my ISP to 1.1.1.1 and for some strange reason it seems to be way more stable:

IDmedia on Oct 27, 2020

I’ve tried upgrading to fw 57 using the Broadlink app and now it disconnects every hour at 38 minutes instead, but it’s only unavailable for 1 minute instead of 5-6. Still annoying, but I haven’t had any time to debug this further.

IDmedia on Oct 26, 2020

@litinoveweedle I recently fixed an issue involving Broadlink devices and VLANs. It was related to socket.bind(). This is why I am insisting on this. Perhaps we are binding the socket to the wrong interface when we try to reach an offline device. But if you tested and the VLAN is not the problem, great, thank you. I believe in you. Now we are cooperating again. Did you check the logs in Home Assistant? This is what I wanted you to test.

@felipediel I actually know about that bug, as it was affecting me as well. But it was not about VLANs (i.e. multiple dedicated LAN segments = L2 level of ISO/OSI model), but it was about wrongly selected IP of local HA interface for outgoing packets (i.e. multiple IP interfaces on the HA and/or multiple IP subnets on the same interface on HA = L3 level of ISO/OSI model)

I know this previous issue was solved in version 0.15.0 of the python_broadlink library, since discovery of Broadlink devices has worked OK from that version on. But this problem is completely different, although I know it could be confusing.

Since I applied the keep-alive packet fix discovered by @marcan, my Broadlink devices no longer disconnect/reconnect, and therefore there are no more error messages in the HA log.

So to summarize this issue:

- You added a log message to report network problems / loss of connection to the Broadlink device.
- I reported that, with no internet access, Broadlink devices restart periodically, triggering these error messages in the HA log.
- @marcan discovered keep-alive packets that fool a Broadlink device into thinking it is connected to the cloud even without any connection to the internet, and he provided a hack to generate these fake keep-alive packets from HA.
- I applied that fix in my HA and now my Broadlink devices no longer restart, so there are no more error messages in the HA log.

Therefore I would like to ask you to accept the proposal in mjg59/python-broadlink#458: add code to the library to periodically generate keep-alive packets with the suggested payload. This would prevent Broadlink devices with no internet access from disconnecting/reconnecting/rebooting or whatever they do. 😃 Thank you.

litinoveweedle on Oct 24, 2020

@litinoveweedle if echo is troublesome for you, you can try perl:

perl -e 'print "\0" x 38 . "\1" . "\0" x 9' | nc <...>

marcan on Oct 23, 2020

You’re getting a literal \x01 in the packet, so the echo part is also different for you. Maybe you’re using a different shell? It’s supposed to be a 0x01 byte. There’s also a \n at the end and -ne at the beginning, so it looks like your version of echo doesn’t support the required options. I think the bash built-in echo should work, maybe this is a dash thing?
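
If you have a capture of what actually went out, a quick Python check along these lines can tell you which of those three things went wrong. This is just a sketch: `diagnose` is a made-up helper, and the "broken" sample is constructed to match the symptoms described above rather than taken from a real capture.

```python
# What the payload should be: 38 zero bytes, one 0x01 byte, 9 zero bytes.
EXPECTED = bytes(38) + b"\x01" + bytes(9)


def diagnose(payload: bytes) -> list:
    """Return a list of human-readable problems found in a captured payload."""
    problems = []
    if payload.startswith(b"-ne "):
        problems.append("echo printed its options: -n/-e were not understood")
    if b"\\x01" in payload:
        problems.append("literal '\\x01' text instead of a single 0x01 byte")
    if payload.endswith(b"\n"):
        problems.append("trailing newline: -n was not honoured")
    if not problems and payload != EXPECTED:
        problems.append("payload differs from the expected 48 bytes")
    return problems


# A payload showing all three symptoms at once.
broken = b"-ne " + bytes(38) + b"\\x01" + bytes(9) + b"\n"

print(diagnose(EXPECTED))  # []
for problem in diagnose(broken):
    print(problem)
```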

marcan on Oct 23, 2020

@felipediel “Disabling” my VLAN isn’t going to do anything because, as multiple people have told you several times, VLANs are completely transparent to WiFi devices and they can’t know nor care whether VLANs are in use or not.

Having one access point with VLANs connected to a host is literally equivalent in every way to having two separate access points with no VLANs connected to two network cards on the host. The WiFi devices cannot tell the difference. That’s how VLANs work. That’s the whole point of VLANs.

There is literally no way, shape, or form, for the Broadlink devices to know they are on a VLAN. They transmit and receive exactly the same packets. Every single bit. The same IP addresses. The same broadcast addresses. The same MAC addresses. VLANs make no difference. The only sides that are aware of VLANs are the wired devices that are VLAN-aware and used on tagged networks (which in my case includes all my switches, my AP, and my server).

If I could push a button and “disable” VLANs I would do it just to end this silly argument and prove that it doesn’t matter, but VLANs are a core part of how my home network works, and I can’t magically “disable” them. It’s not possible to do what I do without VLANs without literally sticking 5 ethernet cards into my server and having 5 times as many switches.
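
For anyone unsure what a VLAN tag actually is, here is a rough Python sketch of 802.1Q tagging on a wired Ethernet frame. It is simplified on purpose: the frame check sequence is ignored, the MAC addresses and payload are made up, and the helper names are invented for this example. The point is only that the 4-byte tag is added and removed by VLAN-aware wired gear, so an untagged endpoint sees the same bytes either way.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q tag


def add_vlan_tag(frame: bytes, vlan_id: int, priority: int = 0) -> bytes:
    """Insert a 4-byte 802.1Q tag after the destination and source MACs."""
    tci = (priority << 13) | (vlan_id & 0x0FFF)  # PCP (3 bits), DEI=0, VID (12 bits)
    tag = struct.pack("!HH", TPID_8021Q, tci)
    return frame[:12] + tag + frame[12:]


def strip_vlan_tag(frame: bytes) -> bytes:
    """Remove the 802.1Q tag if present, as an access port would do."""
    (tpid,) = struct.unpack("!H", frame[12:14])
    if tpid != TPID_8021Q:
        return frame
    return frame[:12] + frame[16:]


# Made-up untagged frame: dst MAC, src MAC, EtherType 0x0800 (IPv4), payload.
untagged = bytes.fromhex("ffffffffffff" "020000000001" "0800") + b"payload"
tagged = add_vlan_tag(untagged, vlan_id=30)

print(len(tagged) - len(untagged))         # 4
print(strip_vlan_tag(tagged) == untagged)  # True
```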

I am asking you to ask them to do this:

The obvious solution is implementing an additional logic to handle ICMP packets. If the device gets reject-with=icmp-admin-prohibited, the user blocked the connection intentionally, so the device should skip the reattempts. This can only be done in the firmware. Good luck with their support, I hope they fix it soon.

This is a simple and universal solution.

Yes, and I highly doubt Broadlink cares about users running Home Assistant and blocking their cloud service, so I’m not holding my breath that complaining to support will get us anywhere.

But what we’re trying to do here is find a solution that works today. You claim it works for you on v57. But instead of helping us by providing a packet log to show exactly what is necessary to make it work on v57, you are telling us the problem is “VLANs” without understanding how VLANs work. We can’t wave a magic wand and figure out what you’re doing to make it work. When something works in case A and not in case B then we need to understand what is different in both cases. You saw that everyone else happens to be using VLANs and wrongly concluded that they have anything to do with this. I am certain that you are wrong about VLANs, for the reasons I explained above, and which you can confirm if you study how VLANs work, how VLAN tags work, what a VLAN really does on the wire, and the fact that VLANs on the air over WiFi aren’t a thing that exists. So what I am asking of you now is, since we’re back to square 1 and we don’t know what works and what doesn’t, to help us by providing a packet log of your broadlink device, from cold startup through ~6 minutes, to show that it indeed doesn’t reboot, and figure out what data was exchanged that made it not do that.

marcan on Oct 23, 2020

@marcan I am not telling you to complain about VLANs. You can work around the problem by disabling your VLAN if you want.

I am asking you to ask them to do this:

The obvious solution is implementing an additional logic to handle ICMP packets. If the device gets reject-with=icmp-admin-prohibited, the user blocked the connection intentionally, so the device should skip the reattempts. This can only be done in the firmware. Good luck with their support, I hope they fix it soon.

This is a simple and universal solution.

felipediel on Oct 22, 2020

@Silvenga People deserve credit for their work, and that is completely tangential to being told they are wrong when they are. I’m sure @felipediel has put a bunch of time into this integration, but he isn’t being helpful right now by claiming the problem is something that makes no sense whatsoever.

That said, I’ve had improving the broadlink integration in my TODO list for a while now, in particular to specify a device-agnostic IR blasting mechanism to enable integration with complex-protocol/state-dump IR devices (e.g. aircons and my ceiling lights which work the same way), but having this kind of experience with the developers makes me lean towards just keeping it to myself rather than contributing…

@felipediel Look, I don’t know what to say any more. VLANs don’t have broadcast addresses. 802.1q VLANs are a way of putting multiple Ethernet networks into one physical cable. That is all they are. That is why they are called Virtual LANs. The only thing a VLAN does is make one cable behave like several separate cables. VLANs do not go over WiFi. Broadlink doesn’t care about VLANs. Support doesn’t care about VLANs. VLANs can’t cause broadcast address confusion. Just, please, read up on the subject and drop the idea that we need to complain to Broadlink support about some broadcast address issue related to VLANs.

The problem we have is the devices reset every 3-5m when they can’t hit the cloud. You claim yours does not. Please provide a packet dump if you are certain it is not doing that for you.

marcan on Oct 22, 2020

Thanks @Silvenga! I will create an options flow to configure polling in the future, so it will be easier for users to make adjustments without the need for a restart. After that, we can discuss what the best values are for each device and then define better defaults.

felipediel on Oct 22, 2020

@felipediel I think you’ve fixed this issue effectively. I don’t have issues, at least with my setup. So thanks a lot!

Maybe we should close this issue, and open a separate issue to gather more info on whether the poll interval/method should be configured/changed?

Silvenga on Oct 22, 2020

The question is can we get them to stop doing that, and/or can we make HA robust against those resets.

I see this as the standard “hardware is inherently unreliable” problem. I don’t think we need to argue over if it’s happening or why it’s happening. It’s going to happen as a function of being wifi based hardware.

We really need to figure out the scope of the problem, what it impacts, and figure out solutions.

@felipediel has spent a lot of effort and his time on this code, plus many weeks, going back and forth in reviews. I feel this thread has shifted to an argument, felipediel deserves respect at the very least, if not gratitude.

Silvenga on Oct 22, 2020

I already solved the issue. You just need to give their support team a link to this conversation. It’s not that I’m stupid, I just don’t have access to their firmware to fix it for you, got it?

You claim it works for you, yet it doesn’t for us. We’ve already tried everything you suggested to make it work. The next step in figuring this out is for you to give us a known-good reference. That means a packet capture.

Yes, I use VLANs too, and I am absolutely confident they have nothing whatsoever to do with this problem.

Now I am 100% sure this is the problem.

Now you’re just being unhelpful, and deliberately ignorant. We’re telling you that’s not how VLANs work. You can look it up if you want.

This is the expected behavior, but we are talking about a bug. They are binding the socket to the wrong interface.

There is no wrong interface. The device sees a single network. The device has one interface. The device does not have any idea what a VLAN is or what VLAN it’s on, because all it sees is a single 802.11 WiFi network and it is the access point’s job to deal with whatever is on the Ethernet wire behind it, be it plain Ethernet or 802.1q VLANs or an L2 tunnel over IP or anything else you might want to come up with. As far as the device is concerned it is on one network with one IP subnet and there is no confusion possible.

Asking support about VLANs isn’t going to go anywhere, because VLANs are completely irrelevant to these devices. You have latched on to the idea that us using VLANs is the problem without understanding how VLANs work, and all you’re doing now is derailing the conversation.

If you want to help us, please provide a full packet capture of everything your v57 device does on the network, so we can find out what to do to get it to stop rebooting itself after a cloud service timeout.

marcan on Oct 22, 2020

@felipediel he’s right, please stop making stuff up about VLANs. VLANs behave the same as any normal isolated Ethernet network. The only correlation here is that there is a big overlap between the kind of geek paranoid enough to firewall IoT devices from the Internet and the kind of geek who happens to know about VLANs, and they are an obvious solution to this problem.

I already solved the issue. You just need to give their support team a link to this conversation. It’s not that I’m stupid, I just don’t have access to their firmware to fix it for you, got it?

Yes, I use VLANs too, and I am absolutely confident they have nothing whatsoever to do with this problem.

Now I am 100% sure this is the problem.

Networking is one of my jobs, I know what I’m doing here.

I respect your job, but we are never too old to learn something new.

As far as the devices involved are concerned, the Broadlink devices and one Ethernet (sub)interface on my Home Assistant server (which also handles DHCP/DNS/routing duties for this segment) are on the same isolated network segment, and the fact that VLANs are involved is completely irrelevant to them.

This is the expected behavior, but we are talking about a bug. They are binding the socket to the wrong interface.

felipediel on Oct 22, 2020

If we’re playing the “simplest explanation” game… since apparently all of us are having this issue except @felipediel, my Occam’s Razor diagnosis is that his firewall might not be set up properly and he is, in fact, letting them talk to the broadlink cloud 😃

It’s clear these things really want to talk to the Internet; @felipediel if you truly believe it works fine for you and they don’t reconnect, then what we need to move forward is a complete packet log of a broadlink on wifi, from startup through 5-6 minutes, to see what it is that you’re doing that the rest of us aren’t that convinces it to not drop off. I’ve already tried everything I could think of (and have been looking at tcpdump as I did to prove I was doing what I thought I was).

Barring that, there’s two things to be done here:

Make sure HA’s integration has long enough cmd/response timeouts to ride out one of the reconnect phases without more impact than a delay
Reverse engineer the firmware to figure out if there is some explicit codepath to trigger it to stop doing this.
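
To make the first point concrete, here is a rough sketch of what "riding out" a reconnect phase could look like. It is not the integration's actual code: `DeviceOfflineError`, `send_with_retry`, and the `window`/`delay` values are made-up names and numbers for illustration, chosen so that the retry window comfortably exceeds the few seconds the device spends off the network.

```python
import time


class DeviceOfflineError(Exception):
    """Hypothetical stand-in for whatever error the real library raises."""


def send_with_retry(send, payload, window=15.0, delay=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call send(payload), retrying on DeviceOfflineError.

    Keeps retrying every `delay` seconds until `window` seconds have
    passed since the first attempt, then re-raises the last error.
    `sleep` and `clock` are injectable so the logic can be tested
    without actually waiting.
    """
    deadline = clock() + window
    while True:
        try:
            return send(payload)
        except DeviceOfflineError:
            # Give up only if another wait would overshoot the window.
            if clock() + delay > deadline:
                raise
            sleep(delay)
```

With something along these lines, a command that lands inside a short reconnect gap would be delayed by a few seconds instead of failing outright; a device that is really gone still produces an error once the window is used up.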

marcan on Oct 22, 2020

@felipediel he’s right, please stop making stuff up about VLANs. VLANs behave the same as any normal isolated Ethernet network. The only correlation here is that there is a big overlap between the kind of geek paranoid enough to firewall IoT devices from the Internet and the kind of geek who happens to know about VLANs, and they are an obvious solution to this problem. Yes, I use VLANs too, and I am absolutely confident they have nothing whatsoever to do with this problem. Networking is one of my jobs, I know what I’m doing here.

As far as the devices involved are concerned, the Broadlink devices and one Ethernet (sub)interface on my Home Assistant server (which also handles DHCP/DNS/routing duties for this segment) are on the same isolated network segment, and the fact that VLANs are involved is completely irrelevant to them.

marcan on Oct 22, 2020

There was definitely a smiley missing in my original reply, so please know there were no bad intentions on my side. 😃 And I am definitely happy for any help, especially since, as documented, I am not the only one affected by this. Thank you for everything… 😉

litinoveweedle on Oct 21, 2020

No problem.

@litinoveweedle What are the device types and firmware? I cannot reproduce this issue with my Broadlink RM mini 3 (0x2737) and RM pro (0x2787). They work fine locally, no error messages in the logs.

Regarding your info: I recently had to temporarily disable the firewall rule blocking my “smart devices” LAN subnet from the internet. It is possible that the RMs were pushed a firmware upgrade. 😦

It is definitely not WiFi: signal levels are OK, and other devices connected to the same virtual AP are not disconnecting. ONLY the Broadlink devices are disconnecting, and they do so exactly every 5 minutes. Hardly a coincidence. All my devices have static DHCP leases. There is also information from another user about this behavior when Broadlink devices cannot connect to the cloud.

03:06:18 wireless,info CX:XX:XX:XX:XX:X1@wlan4: disconnected, received deauth: sending station leaving (3) 
03:06:22 wireless,info CX:XX:XX:XX:XX:X1@wlan4: connected, signal strength -49 
03:06:30 wireless,info CX:XX:XX:XX:XX:XA@wlan4: disconnected, received deauth: sending station leaving (3) 
03:06:33 wireless,info CX:XX:XX:XX:XX:XA@wlan4: connected, signal strength -58 
03:07:51 wireless,info CX:XX:XX:XX:XX:X6@wlan4: disconnected, received deauth: sending station leaving (3) 
03:07:54 wireless,info CX:XX:XX:XX:XX:X6@wlan4: connected, signal strength -52 
03:11:23 wireless,info CX:XX:XX:XX:XX:X1@wlan4: disconnected, received deauth: sending station leaving (3) 
03:11:27 wireless,info CX:XX:XX:XX:XX:X1@wlan4: connected, signal strength -47 
03:11:35 wireless,info CX:XX:XX:XX:XX:XA@wlan4: disconnected, received deauth: sending station leaving (3) 
03:11:38 wireless,info CX:XX:XX:XX:XX:XA@wlan4: connected, signal strength -54 
03:12:56 wireless,info CX:XX:XX:XX:XX:X6@wlan4: disconnected, received deauth: sending station leaving (3) 
03:13:00 wireless,info CX:XX:XX:XX:XX:X6@wlan4: connected, signal strength -56 
03:16:28 wireless,info CX:XX:XX:XX:XX:X1@wlan4: disconnected, received deauth: sending station leaving (3) 
03:16:32 wireless,info CX:XX:XX:XX:XX:X1@wlan4: connected, signal strength -49 
03:16:40 wireless,info CX:XX:XX:XX:XX:XA@wlan4: disconnected, received deauth: sending station leaving (3) 
03:16:44 wireless,info CX:XX:XX:XX:XX:XA@wlan4: connected, signal strength -56 
03:18:02 wireless,info CX:XX:XX:XX:XX:X6@wlan4: disconnected, received deauth: sending station leaving (3) 
03:18:06 wireless,info CX:XX:XX:XX:XX:X6@wlan4: connected, signal strength -52 
03:21:33 wireless,info CX:XX:XX:XX:XX:X1@wlan4: disconnected, received deauth: sending station leaving (3) 
03:21:37 wireless,info CX:XX:XX:XX:XX:X1@wlan4: connected, signal strength -47 
03:21:45 wireless,info CX:XX:XX:XX:XX:XA@wlan4: disconnected, received deauth: sending station leaving (3) 
03:21:49 wireless,info CX:XX:XX:XX:XX:XA@wlan4: connected, signal strength -53 
03:23:08 wireless,info CX:XX:XX:XX:XX:X6@wlan4: disconnected, received deauth: sending station leaving (3) 
03:23:12 wireless,info CX:XX:XX:XX:XX:X6@wlan4: connected, signal strength -52 
03:26:38 wireless,info CX:XX:XX:XX:XX:X1@wlan4: disconnected, received deauth: sending station leaving (3) 
03:26:42 wireless,info CX:XX:XX:XX:XX:X1@wlan4: connected, signal strength -49 
03:26:50 wireless,info CX:XX:XX:XX:XX:XA@wlan4: disconnected, received deauth: sending station leaving (3) 
03:26:54 wireless,info CX:XX:XX:XX:XX:XA@wlan4: connected, signal strength -59 
03:28:14 wireless,info CX:XX:XX:XX:XX:X6@wlan4: disconnected, received deauth: sending station leaving (3) 
03:28:17 wireless,info CX:XX:XX:XX:XX:X6@wlan4: connected, signal strength -54

Anyway, it seems that the RMs work OK, except for these 3-5 second disconnect periods exactly every 5 minutes. If it were possible to set a higher timeout on the Broadlink integration in HA, I would rather wait for the device to reconnect when my command falls into this disconnected window than allow these black beans to connect to some # cloud.
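
The cadence can be checked directly from the AP log above rather than by eye. The following is a small sketch that assumes the Mikrotik line format shown in that log; the function name `analyse` and the regular expression are my own, and it does not handle a log that crosses midnight.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches lines such as:
# 03:06:18 wireless,info CX:XX:XX:XX:XX:X1@wlan4: disconnected, received deauth: ...
LINE = re.compile(
    r"^(\d\d:\d\d:\d\d) wireless,info ([^@\s]+)@([^:\s]+): (connected|disconnected)"
)


def analyse(lines):
    """Return (periods, outages), both keyed by MAC address.

    periods[mac]: seconds between consecutive disconnects.
    outages[mac]: seconds from each disconnect to the next connect.
    """
    last_drop = {}
    periods = defaultdict(list)
    outages = defaultdict(list)
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        mac, event = match.group(2), match.group(4)
        if event == "disconnected":
            if mac in last_drop:
                periods[mac].append(int((stamp - last_drop[mac]).total_seconds()))
            last_drop[mac] = stamp
        elif mac in last_drop:
            outages[mac].append(int((stamp - last_drop[mac]).total_seconds()))
    return dict(periods), dict(outages)


sample = [
    "03:06:18 wireless,info CX:XX:XX:XX:XX:X1@wlan4: disconnected, received deauth: sending station leaving (3)",
    "03:06:22 wireless,info CX:XX:XX:XX:XX:X1@wlan4: connected, signal strength -49",
    "03:11:23 wireless,info CX:XX:XX:XX:XX:X1@wlan4: disconnected, received deauth: sending station leaving (3)",
    "03:11:27 wireless,info CX:XX:XX:XX:XX:X1@wlan4: connected, signal strength -47",
    "03:16:28 wireless,info CX:XX:XX:XX:XX:X1@wlan4: disconnected, received deauth: sending station leaving (3)",
]
periods, outages = analyse(sample)
print(periods)
print(outages)
```

For the `X1` lines copied from the log, this gives 305 seconds between drops and 4 seconds offline each time, which matches the "every 5 minutes, for 3-5 seconds" description.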

litinoveweedle on Oct 21, 2020

I will make the updates optional, so users who are having problems can disable them.

felipediel on Oct 21, 2020

Just to shed some light on this issue: all RMs constantly disconnect from and reconnect to Wi-Fi IF they don’t have a connection to the internet. It seems to be some weird form of watchdog implementation on their side. As many users, for obvious reasons, don’t allow smart home components used locally by HA to communicate with the internet, this, together with the newly introduced feature, is very annoying and seems to affect many users.

Looking at my Mikrotik WLAN log, these reconnect attempts are rather short, about 3-5 seconds. It surely varies from AP to AP, but I would like to propose an improvement to the device state polling schedule.

litinoveweedle on Oct 20, 2020

I think that’s a great idea. For myself at least, having any recovery attempt would be the bee’s knees. If, say, I updated my access points or switch firmware (automated, say at 2am, takes maybe 5 minutes), I should not need to restart HA to reconnect to devices. I would also consider a command to mark the device as available a good mitigation (that way it can be automated).

In this case, the network error looks to be during the time the broadlink device was roaming between access points, so connectivity would be restored within seconds. I haven’t found any other possible issues, the sensor was still responding after all, so I don’t think a substantial network error occurred.
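
As a rough illustration of the kind of recovery being asked for, here is a minimal sketch in which a device marked unavailable keeps being polled and flips back to available on the first successful poll. This is not how the Broadlink integration is actually written; `PolledDevice` and its `poll` callable are hypothetical names, and `OSError` is used only as a generic stand-in for a network failure.

```python
class PolledDevice:
    """Tracks availability from the outcome of each poll (hypothetical)."""

    def __init__(self, poll):
        self._poll = poll        # callable that returns state or raises OSError
        self.available = True
        self.state = None

    def update(self):
        """Poll once; return True on success, False on a network error.

        A failure only marks the device unavailable. It does not stop
        future polls, so the next success restores availability
        without a restart.
        """
        try:
            self.state = self._poll()
        except OSError:
            self.available = False
            return False
        self.available = True
        return True
```

The point of the design is that "unavailable" is a state that gets re-evaluated on every scheduled update rather than a one-way switch, so a brief roaming event or an AP firmware update clears itself.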

Silvenga on Jul 2, 2020
