openhab-core: Serial ports getting blocked after some re-connecting

When using the Modbus binding with OH3, the serial port with the Modbus RS485 transciever cannot be opened anymore after some time.

When the error appears, the following is printed on stdout: initialise_event_info_struct: initialise failed!

This is a known bug NeuronRobotics/nrjavaserial#111.

After compiling the native libs with DISABLE_LOCKFILES as suggested in above issue, the error is gone.

This has been discussed in https://community.openhab.org/t/modbus-binding-loses-its-serial-connection-every-few-days-without-a-reason-after-updating-to-2-5-0-release-build/89550/14

Maybe locking should be disabled in OH nrjavaserial again?

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 68 (18 by maintainers)

Commits related to this issue

Most upvoted comments

Hey, as an NRJavaSerial contributor, I feel like I owe you folks my $0.02 on the underlying library issue at play here, and what I intend to do about it. My 👀 response to @fwolter’s comment back in November was supposed to indicate that I was aware of the issue, but as time wears on, I think that response comes off as a bit too glib.

I noticed a lot of deleted lock files for an openhab process (example):

NRJavaSerial leaks the file descriptors for its lockfiles. I haven’t hunted down where, exactly, but it seems to unlink(2) (or equivalent) the lockfile paths without close(2)'ing the open descriptors. (In your inotifywait output above, you can see that some of the files are opened twice, but only closed once.) As you’re finding, this eventually gums up the works for itself and other file access done within the same process. It’s been a while since I was last able to look at this, but IIRC, this manifests as an initialization error (initialise_event_info_struct: initialise failed!) and not a file open error due to limitations surrounding select(2) and FD_SETSIZE. The file can open: with an increased ulimit, you can open more than 1,024 (FD_SETSIZE) files. But the files are opened with large file descriptor numbers (>1,024), and so when the library tries to use the serial port, it calls FD_SET() with that larger-than-FD_SETSIZE descriptor, and encounters an error. The library solution is for us to use the newer poll(2) instead of select(2) – and to not leak file descriptors like a sieve in the first place.

This is why building NRJavaSerial with locking disabled circumvents the problem: no leaked lockfiles means no high file descriptor numbers, means select(2) is happy, and means no random initialization crashes.

New behaviour

  • when I/O error is encountered, serial port is kept open

Retaining/reusing old file descriptors (by reusing port objects) will avoid this problem, yes. Unfortunately, it can be really hard to distinguish when this is safe to do so: was the I/O error directly relating to the port hardware, e.g., a USB-to-serial dongle being disconnected, or was it a protocol error/unresponsive device on the far end? If you can rearchitect openHAB to continue to use old ports, you’ll still eventually be unable to open new ports, but you can keep using the ones you already have open. It would be a pretty clumsy workaround for a problem you shouldn’t have to deal with, but it should help.

Not sure why it does something with ttyAMA0, as I only use ttyUSB0 for RS485 communication.

As part of enumerating available ports (CommPortIdentifier.getPortIdentifiers()), NRJavaSerial tries to open each device in turn to confirm that the port is accessible to the current user. As far as I can tell, there’s no valid reason for it to open the port to do the access check instead of just… performing an access check (i.e., via access(2)). But on top of that, it doesn’t just open(2) the port – no, it goes through the entire usual rigmarole of locking/unlocking the port, too. So because port locking/unlocking leaks file descriptors, every port enumeration operation leaks file descriptors proportional to the number of serial ports attached to the system.

This is pretty high on the list of things I’d like to fix in NRJavaSerial. Unfortunately, the road to hell being paved with good intentions and all that, I haven’t gotten to it yet. Hopefully I can soon.

What does the release cycle for look like, by the way? Do you issue updates on an as-available basis, or is there a release window you try to hit? Obviously you’d like this fixed in NRJavaSerial ASAP, but is there a time (other than “yesterday!”) when it would be optimal to receive a fix by?

These files have names of the form /var/lock/tmp, where is a 6-character combination.

My understanding was that these temporary files were created, written to, and then moved to /var/lock/LCK..name, and I thought that’s where the leak was – the descriptor used to write to the lock files wasn’t closed. But now that I look at the code (grep for mktemp and mkstemp, because that’s the filename format), these files are actually created by check_group_uucp(). This function checks whether the current user has permissions to write to /var/lock. Again, I don’t know why this check is done by opening and writing to a file instead of just checking access(2) for W_OK. But now that I know this is where it’s being done, the source of the leak is obvious:

	if ( 0 == mkstemp(testLockAbsFileName) )
	{
		free(testLockAbsFileName);
		report_error("check_group_uucp(): mktemp malformed string - \
			should not happen");

		return 1;
	}

According to the man page for mkstemp(3), the function returns a read/write descriptor – but the return value is only error-checked (incorrectly; the check should be for == -1) and then discarded. The filename is then fopen(3)'d for writing, and that handle is correctly closed. I think whoever did this (pre-NRJavaSerial) meant to use mktemp(3), which only creates the file without opening it.

Thanks for triggering that investigation, @jamietownsend. I’ll open a PR for this in a jiffy, seeing as it should be a one-character change. Then we’ll need to re-test, etc. but it looks pretty straightforward. The other issues I outlined above are still valid, I think – access checking is inefficient, and poll(2) is preferable to select(2) – but probably don’t need to be addressed in order to fix this.

The fix will be in the next release, v5.3.0.

Snapshot seems to work. No problems until yet, USB-devices all still on and running 😃 Great work

Several of us are hoping 5.3 fixes our openHAB Debian issue with serial devices. For me it’s z-wave not reconnecting after reboot.

On Thu, Aug 19, 2021 at 9:21 AM farfade @.***> wrote:

@kaikreuzer https://github.com/kaikreuzer , if I understand correctly, getting nrjavaserial 5.3 embedded into openhab 3 is blocking for updating the Debian underlying OS to Bullseye if we want to have a chance to get serial stuff working ?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/openhab/openhab-core/issues/1842#issuecomment-901957807, or unsubscribe https://github.com/notifications/unsubscribe-auth/AM4INWK3A4ENSJ6WC5I4URTT5UHO5ANCNFSM4T6WHNMA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

@digitaldan the thinking is actually that the issue is with nrjavaserial alone. All you need is one binding using serial. The way nrjavaserial is compiled for OH3 is different compared to OH2, something that might trigger this. #1842 (comment)

Yeah, thats what i was trying to say, that i don’t have any other serial bindings, and yes, of course I read the first comment in this issue, thanks.

My observation is, that it doesn’t seem to depend on the number of bindings.

The only way this repeats for me is when my load profile changes, then its nearly 100% constant. This is at least my observation after trying this many, many times. I’m sure its probably different for others. I have no idea why that is, but I can reproduce it reliability.

My setup works flawlessly for months now, after disabling the locks in nrjavaserial.

I’m going to give that a shot as well. I’m a little concerned that this ends up becoming a bigger issue for us now that it looks to affect z-wave and other serial binding users. Was there a bump in the nrjavaserial version we think caused this? It seems to me to be a recent development.

I understood you == me 😃 I’d do that if the pain gets too big. Until now there is only one user beside me facing the problem. Maybe @MrDOS manages to re-work the locking mechanism in the next time.