openhab-core: Serial ports getting blocked after some re-connecting
When using the Modbus binding with OH3, the serial port with the Modbus RS485 transciever cannot be opened anymore after some time.
When the error appears, the following is printed on stdout: initialise_event_info_struct: initialise failed!
This is a known bug NeuronRobotics/nrjavaserial#111.
After compiling the native libs with DISABLE_LOCKFILES
as suggested in above issue, the error is gone.
This has been discussed in https://community.openhab.org/t/modbus-binding-loses-its-serial-connection-every-few-days-without-a-reason-after-updating-to-2-5-0-release-build/89550/14
Maybe locking should be disabled in OH nrjavaserial again?
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 68 (18 by maintainers)
Commits related to this issue
- [modbus] Workaround for nrjavaserial issues: do not disconnect serial Workaround for https://github.com/openhab/openhab-core/issues/1842. Typically serial connections are disconnected and reconnected... — committed to ssalonen/openhab-core by ssalonen 3 years ago
- [modbus] Workaround for nrjavaserial issues: do not disconnect serial Workaround for https://github.com/openhab/openhab-core/issues/1842. Typically serial connections are disconnected and reconnected... — committed to ssalonen/openhab-core by ssalonen 3 years ago
- [modbus] Workaround for nrjavaserial issues: do not disconnect serial (#2272) Workaround for https://github.com/openhab/openhab-core/issues/1842. Typically serial connections are disconnected and re... — committed to openhab/openhab-core by ssalonen 3 years ago
- [modbus] Workaround for nrjavaserial issues: do not disconnect serial (#2272) Workaround for https://github.com/openhab/openhab-core/issues/1842. Typically serial connections are disconnected and re... — committed to fwolter/openhab-core by ssalonen 3 years ago
- Subject: [PATCH] [modbus] Workaround for nrjavaserial issues: do not disconnect serial (#2272) Workaround for https://github.com/openhab/openhab-core/issues/1842. Typically serial connections are di... — committed to opensmarthouse/opensmarthouse-core by ssalonen 3 years ago
- Subject: [PATCH] [modbus] Workaround for nrjavaserial issues: do not disconnect serial (#2272) Workaround for https://github.com/openhab/openhab-core/issues/1842. Typically serial connections are di... — committed to opensmarthouse/opensmarthouse-core by ssalonen 3 years ago
- Oh updates to 3.1.0M1 (#29) * Subject: [PATCH] Fixed NPE in ScriptError if INode is empty (#1985) Signed-off-by: Christoph Weitkamp <github@christophweitkamp.de> --- X-Backport-Id: 7ce96ac06adcb... — committed to opensmarthouse/opensmarthouse-core by cdjackson 3 years ago
- Upgrade nrjavaserial to fix file descriptor leak Uses an openHAB 5.2.1.OH1 build based on the latest changes in the nrjavaserial master branch (https://github.com/NeuronRobotics/nrjavaserial/tree/7aa... — committed to wborn/openhab-core by wborn 2 years ago
- Upgrade nrjavaserial to fix file descriptor leak (#2899) Uses an openHAB 5.2.1.OH1 build based on the latest changes in the nrjavaserial master branch (https://github.com/NeuronRobotics/nrjavaserial/... — committed to openhab/openhab-core by wborn 2 years ago
- Upgrade nrjavaserial to fix file descriptor leak (#2899) Uses an openHAB 5.2.1.OH1 build based on the latest changes in the nrjavaserial master branch (https://github.com/NeuronRobotics/nrjavaserial/... — committed to openhab/openhab-core by wborn 2 years ago
- [modbus] Workaround for nrjavaserial issues: do not disconnect serial (#2272) Workaround for https://github.com/openhab/openhab-core/issues/1842. Typically serial connections are disconnected and rec... — committed to ConnectorIO/copybara-hab-core by ssalonen 3 years ago
- Upgrade nrjavaserial to fix file descriptor leak (#2899) Uses an openHAB 5.2.1.OH1 build based on the latest changes in the nrjavaserial master branch (https://github.com/NeuronRobotics/nrjavaserial/... — committed to ConnectorIO/copybara-hab-core by wborn 2 years ago
Hey, as an NRJavaSerial contributor, I feel like I owe you folks my $0.02 on the underlying library issue at play here, and what I intend to do about it. My đ response to @fwolterâs comment back in November was supposed to indicate that I was aware of the issue, but as time wears on, I think that response comes off as a bit too glib.
NRJavaSerial leaks the file descriptors for its lockfiles. I havenât hunted down where, exactly, but it seems to
unlink(2)
(or equivalent) the lockfile paths withoutclose(2)
'ing the open descriptors. (In yourinotifywait
output above, you can see that some of the files are opened twice, but only closed once.) As youâre finding, this eventually gums up the works for itself and other file access done within the same process. Itâs been a while since I was last able to look at this, but IIRC, this manifests as an initialization error (initialise_event_info_struct: initialise failed!
) and not a file open error due to limitations surroundingselect(2)
andFD_SETSIZE
. The file can open: with an increased ulimit, you can open more than 1,024 (FD_SETSIZE
) files. But the files are opened with large file descriptor numbers (>1,024), and so when the library tries to use the serial port, it callsFD_SET()
with that larger-than-FD_SETSIZE
descriptor, and encounters an error. The library solution is for us to use the newerpoll(2)
instead ofselect(2)
â and to not leak file descriptors like a sieve in the first place.This is why building NRJavaSerial with locking disabled circumvents the problem: no leaked lockfiles means no high file descriptor numbers, means
select(2)
is happy, and means no random initialization crashes.Retaining/reusing old file descriptors (by reusing port objects) will avoid this problem, yes. Unfortunately, it can be really hard to distinguish when this is safe to do so: was the I/O error directly relating to the port hardware, e.g., a USB-to-serial dongle being disconnected, or was it a protocol error/unresponsive device on the far end? If you can rearchitect openHAB to continue to use old ports, youâll still eventually be unable to open new ports, but you can keep using the ones you already have open. It would be a pretty clumsy workaround for a problem you shouldnât have to deal with, but it should help.
As part of enumerating available ports (
CommPortIdentifier.getPortIdentifiers()
), NRJavaSerial tries to open each device in turn to confirm that the port is accessible to the current user. As far as I can tell, thereâs no valid reason for it to open the port to do the access check instead of just⌠performing an access check (i.e., viaaccess(2)
). But on top of that, it doesnât justopen(2)
the port â no, it goes through the entire usual rigmarole of locking/unlocking the port, too. So because port locking/unlocking leaks file descriptors, every port enumeration operation leaks file descriptors proportional to the number of serial ports attached to the system.This is pretty high on the list of things Iâd like to fix in NRJavaSerial. Unfortunately, the road to hell being paved with good intentions and all that, I havenât gotten to it yet. Hopefully I can soon.
What does the release cycle for look like, by the way? Do you issue updates on an as-available basis, or is there a release window you try to hit? Obviously youâd like this fixed in NRJavaSerial ASAP, but is there a time (other than âyesterday!â) when it would be optimal to receive a fix by?
My understanding was that these temporary files were created, written to, and then moved to
/var/lock/LCK..name
, and I thought thatâs where the leak was â the descriptor used to write to the lock files wasnât closed. But now that I look at the code (grep formktemp
andmkstemp
, because thatâs the filename format), these files are actually created bycheck_group_uucp()
. This function checks whether the current user has permissions to write to/var/lock
. Again, I donât know why this check is done by opening and writing to a file instead of just checkingaccess(2)
forW_OK
. But now that I know this is where itâs being done, the source of the leak is obvious:According to the man page for
mkstemp(3)
, the function returns a read/write descriptor â but the return value is only error-checked (incorrectly; the check should be for== -1
) and then discarded. The filename is thenfopen(3)
'd for writing, and that handle is correctly closed. I think whoever did this (pre-NRJavaSerial) meant to usemktemp(3)
, which only creates the file without opening it.Thanks for triggering that investigation, @jamietownsend. Iâll open a PR for this in a jiffy, seeing as it should be a one-character change. Then weâll need to re-test, etc. but it looks pretty straightforward. The other issues I outlined above are still valid, I think â access checking is inefficient, and
poll(2)
is preferable toselect(2)
â but probably donât need to be addressed in order to fix this.The fix will be in the next release, v5.3.0.
Snapshot seems to work. No problems until yet, USB-devices all still on and running đ Great work
Several of us are hoping 5.3 fixes our openHAB Debian issue with serial devices. For me itâs z-wave not reconnecting after reboot.
On Thu, Aug 19, 2021 at 9:21 AM farfade @.***> wrote:
Yes, https://github.com/NeuronRobotics/nrjavaserial/issues/111.
Yeah, thats what i was trying to say, that i donât have any other serial bindings, and yes, of course I read the first comment in this issue, thanks.
The only way this repeats for me is when my load profile changes, then its nearly 100% constant. This is at least my observation after trying this many, many times. Iâm sure its probably different for others. I have no idea why that is, but I can reproduce it reliability.
Iâm going to give that a shot as well. Iâm a little concerned that this ends up becoming a bigger issue for us now that it looks to affect z-wave and other serial binding users. Was there a bump in the nrjavaserial version we think caused this? It seems to me to be a recent development.
I understood you == me đ Iâd do that if the pain gets too big. Until now there is only one user beside me facing the problem. Maybe @MrDOS manages to re-work the locking mechanism in the next time.