esp-idf: ESP32 CAN controller delivers corrupted frames on RX FIFO overrun (IDFGH-2114)
Environment
- Development Kit: none / OVMS3
- Kit version (for WroverKit/PicoKit/DevKitC): none / OVMS3
- Module or chip used: ESP32-WROVER 16MB
- IDF version: all / doesn’t apply
- Build System: Make
- Compiler version: (crosstool-NG crosstool-ng-1.22.0-98-g4638c4f) 5.2.0
- Operating System: Linux, macOS
- Power Supply: USB, external 5V
Problem Description
On RX FIFO overrun, the ESP32 CAN controller delivers corrupted frames and false frame repetitions.
Expected Behavior
The ESP32 CAN controller is supposed to be SJA1000 compatible. We’re operating it with driver code derived from the original CAN driver by Thomas Barth (https://www.barth-dev.de/can-driver-esp32/), using the SJA1000 PeliCAN mode and fetching RX frames sequentially through the receive buffer.
Quoting from the SJA1000 spec sheet:
After reading the contents of the receive buffer, the CPU can release this memory space in the RXFIFO by setting the release receive buffer bit to logic 1. This may result in another message becoming immediately available within the receive buffer.
… the RXFIFO has space for 64 message bytes in total. It depends on the data length how many messages can fit in it at one time. If there is not enough space for a new message within the RXFIFO, the CAN controller generates a data overrun condition the moment this message becomes valid and the acceptance test was positive. A message which is partly written into the RXFIFO, when the data overrun situation occurs, is deleted.
The RMC register (CAN address 29) reflects the number of messages available within the RXFIFO. The value is incremented with each receive event and decremented by the release receive buffer command.
So according to the specs:
- If no space is left in the FIFO for a new frame coming in (and passing the acceptance filter), that frame should be discarded completely.
- It should not be counted.
- It should not be passed through the receive buffer.
- Only the overflow indicator should be set and the corresponding interrupt generated, so the driver knows a frame has been lost.
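To make the expected behaviour concrete, here is a minimal software model of the bookkeeping described in the data sheet quotes above. It is only a sketch: the type and function names (rxfifo_model_t, rxfifo_model_receive, rxfifo_model_release) are made up for illustration and do not correspond to any register layout or driver API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RXFIFO_SIZE 64  /* message bytes, per the data sheet quote above */

/* Hypothetical model of the RX FIFO state; not a register map. */
typedef struct {
    size_t   used;     /* bytes currently occupied in the FIFO */
    unsigned rmc;      /* message counter (RMC) */
    bool     overrun;  /* data overrun indicator */
} rxfifo_model_t;

/* Offer a frame of frame_len bytes (header + data) to the FIFO.
 * Returns true if the frame was stored, false if it was dropped. */
static bool rxfifo_model_receive(rxfifo_model_t *f, size_t frame_len)
{
    assert(f != NULL);
    if (frame_len > RXFIFO_SIZE - f->used) {
        /* Not enough space: the frame is deleted completely and not
         * counted; only the overrun condition is flagged. */
        f->overrun = true;
        return false;
    }
    f->used += frame_len;
    f->rmc++;
    return true;
}

/* Release the receive buffer: frees the oldest frame's space and
 * decrements the message counter. */
static void rxfifo_model_release(rxfifo_model_t *f, size_t frame_len)
{
    assert(f != NULL && f->rmc > 0 && f->used >= frame_len);
    f->used -= frame_len;
    f->rmc--;
}
```

With 11 byte frames, this model accepts five frames (55 bytes), rejects the sixth, leaves the counter at 5 and sets the overrun flag.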
Actual Behavior
- The frame causing the overflow is added to the FIFO partially (up to the FIFO border).
- It’s also counted both in the RMC register…
- …and indicated by RI and RBS as a valid frame when retrieving the FIFO contents.
- On fetching the FIFO contents, the controller delivers the partial frame + some trashed bytes up to the nominal frame length.
- After delivering the corrupted frame, the controller may continue delivering a number of false frames containing repetitions of the first frame in the FIFO.
Example:
A BMS delivering cell voltage & temperature readings sends blocks of 8 byte standard frames. On FIFO overflow, the CAN controller trashes bytes 7 & 8 on the sixth frame. A standard frame needs a 3 byte header + the data bytes in the FIFO, so the 6th frame exceeds the FIFO by two bytes. The first trash byte normally is “08”, the second “84” or “2a” or sometimes “ab”, possibly some internal SJA1000 data.
inv_msg: framecnt=13, invindex=6
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 24 04 00 00 11 40 10 22 37 55 00 37 | 4..?........$....@."7U.7
inv_msg: 25 04 00 00 0a 1b 44 ff fe 4e 01 26 | 4..?........%.....D..N.&
inv_msg: 54 05 00 00 37 37 37 37 37 37 37 00 | 4..?........T...7777777.
inv_msg: 56 05 00 00 31 63 14 31 53 14 31 4a | 4..?........V...1c.1S.1J
inv_msg: 57 05 00 00 31 43 14 31 53 15 08 2a | 4..?........W...1C.1S..*
^^^^^ trashed bytes here
… followed by 7 repetitions of the first frame:
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
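The byte count in the example can be checked with a few lines of C. This is just a sketch of the arithmetic, assuming equal-sized standard frames stored back to back from the start of the FIFO; the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define RXFIFO_SIZE    64  /* RX FIFO capacity in message bytes */
#define SFF_HEADER_LEN 3   /* header bytes of a standard frame in the FIFO */

/* Bytes a standard frame with dlc data bytes occupies in the FIFO. */
static size_t sff_fifo_len(size_t dlc)
{
    assert(dlc <= 8);  /* classic CAN carries at most 8 data bytes */
    return SFF_HEADER_LEN + dlc;
}

/* Number of bytes by which the frame_index-th frame (1-based) of a run of
 * equal-sized frames extends past the end of the FIFO; 0 if it still fits. */
static size_t fifo_overshoot(size_t frame_len, size_t frame_index)
{
    size_t end = frame_len * frame_index;
    return end > RXFIFO_SIZE ? end - RXFIFO_SIZE : 0;
}
```

For 8 data bytes, sff_fifo_len(8) is 11, five frames end at byte 55, and fifo_overshoot(11, 6) is 2, which matches the two trashed bytes at the end of the sixth frame.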
This behaviour (both the frame corruption and the false repetitions) applies to
all methods reading the standard receive buffer, i.e. using the RMC (as is
done by the current esp-idf can.c driver), checking the RBS indicator and
checking the RI interrupt flag.
The workaround I’ve done for our driver is adding up the message lengths read
during an RX fetch run and discarding all frames exceeding the 64 byte border.
See function ESP32CAN_rxframe() in esp32can.cpp:
https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/blob/master/vehicle/OVMS.V3/components/esp32can/src/esp32can.cpp#L92
I suggest applying this workaround to the esp-idf driver as well and fixing the hardware in the next ESP32 revision.
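For readers who don't want to dig through the linked source, the idea of the workaround can be sketched as follows. This is not the code from esp32can.cpp, just a simplified illustration with hypothetical names (rx_run_t, rx_run_begin, rx_run_accept); it assumes the fetch run starts with the first frame in the FIFO and that the driver knows each frame's length (header + data bytes) as stored in the FIFO.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RXFIFO_SIZE 64  /* RX FIFO capacity in message bytes */

/* Hypothetical per-run state, reset at the start of each RX fetch run. */
typedef struct {
    size_t bytes_read;  /* sum of the lengths of all frames accepted so far */
    bool   overflowed;  /* set once a frame crossed the 64 byte border */
} rx_run_t;

static void rx_run_begin(rx_run_t *run)
{
    assert(run != NULL);
    run->bytes_read = 0;
    run->overflowed = false;
}

/* Call once per frame fetched from the receive buffer.
 * Returns true if the frame lies completely within the first 64 bytes read
 * in this run; false if it (and everything after it) must be discarded. */
static bool rx_run_accept(rx_run_t *run, size_t frame_len)
{
    assert(run != NULL);
    if (run->overflowed) {
        return false;  /* everything after the border is untrustworthy */
    }
    if (frame_len > RXFIFO_SIZE - run->bytes_read) {
        run->overflowed = true;  /* this frame crosses the border */
        return false;
    }
    run->bytes_read += frame_len;
    return true;
}
```

Applied to the example above (13 frames of 11 bytes each), the first five frames are accepted and the remaining eight, i.e. the corrupted sixth frame and the seven false repetitions, are discarded.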
Steps to reproduce
It should be reproducible by connecting two units running the CAN example, with one of the units temporarily disabling interrupts to force the FIFO overrun.
Note: the bug may need specific circumstances to occur in addition to the overflow, maybe the overflow happening on a specific byte position in the FIFO – I haven’t tried to determine that.
Code to reproduce this issue
Use esp-idf CAN example.
Debug Logs
none
Other items if possible
none
Project origin
https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/
About this issue
- State: closed
- Created 5 years ago
- Comments: 21
Commits related to this issue
- ESP32CAN: RX handling rework according to new Espressif infos See comment. Info source: https://github.com/espressif/esp-idf/issues/4276#issuecomment-548753085 Further experimentation showed looping... — committed to openvehicles/Open-Vehicle-Monitoring-System-3 by dexterbg 5 years ago
- ESP32CAN: simplified FIFO reset (doesn't need a full controller reinit) See https://github.com/espressif/esp-idf/issues/4276#issuecomment-555521678 — committed to openvehicles/Open-Vehicle-Monitoring-System-3 by dexterbg 5 years ago
- First attempt at porting the first of four espressif TWAI/CAN errata fixes For reference, here is Michael's original issue: https://github.com/espressif/esp-idf/issues/4276#issuecomment-54875308... — committed to leres/Open-Vehicle-Monitoring-System-3 by leres 3 years ago
- Add sendContent overload that takes a const char* and a length (#4276) The web server currently lacks the ability to send a buffer. Only strings are supported. This PR adds an overload to sendCont... — committed to 0xFEEDC0DE64/esp-idf by nikeee 4 years ago
- TWAI: FIFO overrun handling and errata workarounds This commit adds handling for FIFO overruns and adds workarounds for HW errats on the ESP32. Closes https://github.com/espressif/esp-idf/issues/251... — committed to espressif/esp-idf by Dazza0 3 years ago
- Second of four espressif TWAI/CAN errata fixes: SW workaround for TX interrupt lost For reference, here is Michael's original issue: https://github.com/espressif/esp-idf/issues/4276#issuecomment... — committed to leres/Open-Vehicle-Monitoring-System-3 by leres 3 years ago
- TWAI: FIFO overrun handling and errata workarounds This commit adds handling for FIFO overruns and adds workarounds for HW erratas on the ESP32. Closes https://github.com/espressif/esp-idf/issues/25... — committed to espressif/esp-idf by Dazza0 2 years ago
@dexterbg @neorevx sorry for not responding earlier. I’ve tested the overflow behavior myself, and here are my findings:
When the RX FIFO is empty and begins receiving messages
When a message arrives with more bytes than can fit in the RX FIFO’s remaining space
When the RX FIFO is full but messages are still being received.
When RMC reaches 64, the RX FIFO becomes unrecoverable (due to an RTL bug).
The DOI interrupt and DOS status bits are both set when release buffer is called and the window rotates from a valid message to an overrun one.
Assuming that you are clearing the RX FIFO in a single sitting (i.e. in one continuous operation).
@dexterbg
Long critical sections or other same/higher priority interrupts are the usual culprit. Try reducing the length of your critical sections, or moving the CAN ISR to a less crowded core (basically call esp_intr_alloc() on whichever core you want the ISR registered on).