home_assistant_solarman: Invalid modbus packets accepted and give corrupted data

Well, not invalid as such - the checksum is validated as correct, but the packet is definitely decoded incorrectly. But something very strange is going on… I’ll try to explain…

I have one of the 23xxxx loggers (LSW3_15_FFFF_1.0.65) on a Sofar HYD 6000 and I have created a custom profile for it that works correctly. (If you’re interested, it’s just an extended version of the wifikit definition: sofar_hyd_x000.txt)

Every now and then (randomly, it can be 5 minutes, it can be 5 hours) an unusual response packet is received from the logger. It has a checksum that validates and the logger serial number is correct, but it looks like the logger is leaving part of the reply out - comparing it with a complete reply shows some similarities.

By enabling the debug logging (and adding some extra to show when a retry was made) I see:

Initial request for 0x200-0x255 (serial blanked) : a5170010450000XXXXXXXX020000000000000000000000000000010302000056c44cd015

‘Invalid’ reply:

a5a900101500fbXXXXXXXX0201414f05007a20000088eb46630103ac000200000000000000000000099b01d20000000000000000137dff8a020ff70e004b0017000100740076000009a600000b1103c4008f0637000000410000000c0000001d00000048039c026a00000018000000110002002000010f140d17febe099b0124000000000000000000000064003100270010000a1ffe00000000000000000000000003730000003a2ee02ee002f10001000000009815

… at this point, since the checksum is correct, parsing is attempted. This for me, without fail, blows up with an exception whilst parsing a string type value. This is caught, result=0, and the request (identical to the first) is made a second time and the following response is received:

a5d500101500f8XXXXXXXX0201064f05003e20000088eb4663000000000000000007c000010000078000030000cd270103ac000200000000000000000000099601ce00000000000000001385ff93020ff737004b0017fffd0070006d000009a200000b1103c4008f0635000000410000000c0000001d00000048039c026800000018000000110002002000010f070d18febc09960124000000000000000000000064003100270010fffb1ffe00000000000000000000000003720000003a2ee02ee002f1000100000000000000000000000007c000010000078000030000cd27c915

It is 44 bytes longer. This is parsed without any exceptions being thrown… however the numbers are way off… my 16 panels apparently generate hundreds of kW for a minute… or battery SoC is >100%… you get the idea!
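The size difference can be read straight from the frame headers. This is only a small sketch: it assumes the two bytes after the 0xA5 start byte hold the payload length in little-endian order, and `declared_payload_length` is a name made up for the illustration.

```python
def declared_payload_length(frame_prefix_hex: str) -> int:
    """Read the payload length field (bytes[1:3]) from the start of a frame.

    Assumption: the field is little-endian, so 'a9 00' means 0x00a9.
    """
    raw = bytes.fromhex(frame_prefix_hex)
    return int.from_bytes(raw[1:3], "little")


# First three bytes of the 'invalid' reply and of the retry reply above.
short_reply = declared_payload_length("a5a900")
long_reply = declared_payload_length("a5d500")

print(short_reply, long_reply, long_reply - short_reply)  # 169 213 44
```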

The invalid values always occur upon a retry. Thankfully, 99% of the time, a retry is not required and everything works correctly - however it really throws off HA’s energy panel, and due to the size of the numbers, the auto scaling on graphs makes them impossible to read.

Here is an example of a good reply (sans serial) that was sent without requiring a retry in the minute following the exchange above:

a5d500101500fcXXXXXXXX0201424f05007a20000088eb4663000000000000000007e400010000073f000200000b620103ac000200000000000000000000099b01d20000000000000000137dff8a020ff70e004b0017000100740076000009a600000b1103c4008f0637000000410000000c0000001d00000048039c026a00000018000000110002002000010f140d17febe099b0124000000000000000000000064003100270010000a1ffe00000000000000000000000003730000003a2ee02ee002f1000100000000000000000000000007e400010000073f000200000b620815

Now that I have all the packets out of the way, I have some additional info: In the ‘parsed incorrectly’ failure case above my first string of panels, PV1 (register 0x252) was generating 120,000kW(!). Now in my case power is scaled by 10, so I guess it can fit in a 16 bit register as 0x2ee0… the thing is I can find 0x2ee0 in both of the above packets, so I don’t immediately see how it would parse differently upon a retry.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 19 (8 by maintainers)

Most upvoted comments

I’ve discovered that the problem occurs because the modbus frame that is encapsulated within is itself invalid - it is likely that the logger doesn’t check the modbus CRC, and just sends the frame as is. This would explain why the outer, v5 logger checksum is correct, but the inner modbus CRC is invalid.
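To illustrate what the inner CRC check looks like, here is a minimal sketch of the standard Modbus RTU CRC-16, run against the modbus portion of the request quoted at the top of the issue (010302000056 followed by the CRC bytes c4 4c). The function name `crc16_modbus` is my own; it is not taken from the integration.

```python
def crc16_modbus(data: bytes) -> int:
    """Standard Modbus RTU CRC-16 (initial value 0xFFFF, reflected polynomial 0xA001)."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc


# Modbus part of the request from the issue: read 0x56 registers from 0x0200.
modbus_frame = bytes.fromhex("010302000056c44c")
body = modbus_frame[:-2]
# The CRC is transmitted low byte first, hence little-endian.
received_crc = int.from_bytes(modbus_frame[-2:], "little")

print(hex(crc16_modbus(body)), hex(received_crc))  # 0x4cc4 0x4cc4
```

A reply whose inner frame fails this comparison can still carry a valid outer checksum, which is the situation described above.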

After noticing the test scripts written with the pysolarmanv5 library are able to detect the invalid packets, I investigated it to see why. That library performs a number of additional checks on the received frame on top of validating the checksum. It checks:

  1. The length of the received frame (len(bytes)) == reported payload length (bytes[1:3]) + static frame header size (13)
  2. Start and end bytes for frame are as expected (start: 0xA5, end: 0x15)
  3. The frame checksum (bytes[len - 2]) matches (using the same algorithm used in the integration)
  4. The CRC of the encapsulated modbus frame (bytes[25: len-2]) is correct (this is the inner modbus CRC, not the outer frame checksum that the HA integration validates)
  5. The logger serial number (bytes[7:11]) is as expected
  6. The control code (bytes[3:5]) is as expected (0x1510) - a comment mentions that sometimes keep alives with a control code of 0x4710 are discovered.
  7. The frame type (bytes[11]) is 0x02
  8. The sequence number (bytes[5]) is as expected (the integration labels this field ‘SERIAL’, and leaves it zeroed. It is distinct from the logger serial, I guess it is the frame ‘serial’ number?)
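The eight checks above could be sketched roughly as follows. This is not the pysolarmanv5 code or the integration's code, just an illustration using the offsets from the list. Assumptions: multi-byte header fields are little-endian, the frame checksum is the low byte of the sum of every byte between the start byte and the checksum, and `check_v5_frame` and its problem labels are names invented for this example.

```python
from typing import List

V5_START, V5_END = 0xA5, 0x15
V5_OVERHEAD = 13           # static header plus checksum and end byte (check 1)
EXPECTED_CONTROL = 0x1510  # keep-alives may show up as 0x4710 instead


def crc16_modbus(data: bytes) -> int:
    """Standard Modbus RTU CRC-16."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc


def check_v5_frame(frame: bytes, serial: bytes, sequence: int) -> List[str]:
    """Return the failed checks; an empty list means the frame passed all eight."""
    if len(frame) < 31:  # too short to hold header, a 4-byte modbus frame and trailer
        return ["frame too short"]
    problems = []
    if len(frame) != int.from_bytes(frame[1:3], "little") + V5_OVERHEAD:
        problems.append("1: length mismatch")
    if frame[0] != V5_START or frame[-1] != V5_END:
        problems.append("2: bad start/end byte")
    if frame[-2] != sum(frame[1:-2]) & 0xFF:  # assumed checksum algorithm
        problems.append("3: bad frame checksum")
    modbus = frame[25:-2]
    if crc16_modbus(modbus[:-2]) != int.from_bytes(modbus[-2:], "little"):
        problems.append("4: bad modbus CRC")
    if frame[7:11] != serial:
        problems.append("5: unexpected logger serial")
    if int.from_bytes(frame[3:5], "little") != EXPECTED_CONTROL:
        problems.append("6: unexpected control code")
    if frame[11] != 0x02:
        problems.append("7: unexpected frame type")
    if frame[5] != sequence:
        problems.append("8: unexpected sequence number")
    return problems
```

Returning a list rather than raising on the first failure makes it easy to log exactly which checks a bad packet tripped before deciding to retry.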

After hacking these steps into the HA Solarman code, I’ve found that the bad packets either fail at step 4 - the validation of the modbus CRC, or step 6, an unexpected control code. This causes the retry to be sent and all works as expected!

I’ll leave this running for a while to confirm it works completely, then clean up my version of the implementation.

I only pull the data once per minute. Maybe you are overloading the poor logger! Like most commercial devices out there, it’s probably using the cheapest MCU they could get away with 😃

The reason I mentioned our different inverters is because I believe the logger is merely wrapping the unchecked modbus reply from the inverter, i.e. the inverter is the device that is producing the modbus frame, not the logger. As such, different inverters could indeed experience similar, but different problems because they each have a different modbus implementation.

If the changes I made hadn’t fixed the problem I saw, one other thing I considered was adding a small delay (say 250ms) between retries, e.g. in the current solarman.py:

from time import sleep
...
...
            if 0 == self.send_request(params, start, end, mb_fc):
                # retry once
                sleep(0.25)  # give the logger a moment before retrying
                if 0 == self.send_request(params, start, end, mb_fc):
...
...

Maybe that would help you? Obviously, too much of a delay will cause problems elsewhere!

I found it useful to add some additional log.debug() around that point so that you know when you are sending the request, a retry, or if the retry has also failed - you may wish to do that too.
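Something along these lines is what I mean, as a standalone sketch rather than a patch to solarman.py. `request_with_retry` is a made-up helper, and `send` stands in for a call such as `self.send_request(params, start, end, mb_fc)`, which I am assuming returns 0 on failure as in the snippet above.

```python
import logging
import time
from typing import Callable

log = logging.getLogger(__name__)


def request_with_retry(send: Callable[[], int], delay: float = 0.25) -> int:
    """Call send(); if it returns 0 (failure), pause briefly and retry once."""
    log.debug("sending request")
    result = send()
    if result != 0:
        return result
    log.debug("request failed, retrying in %.2f s", delay)
    time.sleep(delay)
    result = send()
    if result != 0:
        log.debug("retry succeeded")
    else:
        log.debug("retry also failed")
    return result
```

With debug logging enabled for the integration, the three messages make it obvious in the log whether a strange value arrived on a first attempt or on a retry.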

I will also keep a close eye on those superpower anomalies


This appears to be working nicely now - I haven’t experienced a single corrupted value, and I am getting nice graphs of the metrics within HA. I’ve cleaned up the code ready to submit.