esp-idf: ESP32 CAN controller delivers corrupted frames on RX FIFO overrun (IDFGH-2114)

Environment

  • Development Kit: none / OVMS3
  • Kit version (for WroverKit/PicoKit/DevKitC): none / OVMS3
  • Module or chip used: ESP32-WROVER 16MB
  • IDF version: all / doesn’t apply
  • Build System: Make
  • Compiler version: (crosstool-NG crosstool-ng-1.22.0-98-g4638c4f) 5.2.0
  • Operating System: Linux, macOS
  • Power Supply: USB, external 5V

Problem Description

On RX FIFO overrun, the ESP32 CAN controller delivers corrupted frames and false frame repetitions.

Expected Behavior

The ESP32 CAN controller is supposed to be SJA1000 compatible. We’re operating it with driver code derived from the original CAN driver by Thomas Barth (https://www.barth-dev.de/can-driver-esp32/), using the SJA1000 PeliCAN mode and fetching RX frames sequentially through the receive buffer.

Quoting from the SJA1000 spec sheet:

After reading the contents of the receive buffer, the CPU can release this memory space in the RXFIFO by setting the release receive buffer bit to logic 1. This may result in another message becoming immediately available within the receive buffer.

… the RXFIFO has space for 64 message bytes in total. It depends on the data length how many messages can fit in it at one time. If there is not enough space for a new message within the RXFIFO, the CAN controller generates a data overrun condition the moment this message becomes valid and the acceptance test was positive. A message which is partly written into the RXFIFO, when the data overrun situation occurs, is deleted.

The RMC register (CAN address 29) reflects the number of messages available within the RXFIFO. The value is incremented with each receive event and decremented by the release receive buffer command.

So according to the specs:

  • If no space is left in the FIFO for a new frame coming in (and passing the acceptance filter), that frame should be discarded completely.
  • It should not be counted.
  • It should not be passed through the receive buffer.
  • Only the overrun indicator should be set and the corresponding interrupt generated, so the driver knows a frame has been lost.

Actual Behavior

  • The frame causing the overflow is added to the FIFO partially (up to the FIFO border).
  • It’s also counted both in the RMC register…
  • …and indicated by RI and RBS as a valid frame when retrieving the FIFO contents.
  • On fetching the FIFO contents, the controller delivers the partial frame + some trashed bytes up to the nominal frame length.
  • After delivering the corrupted frame, the controller may continue delivering a number of false frames containing repetitions of the first frame in the FIFO.

Example:

A BMS delivering cell voltage & temperature readings sends blocks of 8 byte standard frames. On FIFO overflow, the CAN controller trashes bytes 7 & 8 of the sixth frame. A standard frame needs a 3 byte header + the data bytes in the FIFO, i.e. 11 bytes here, so six frames need 66 bytes and the 6th frame exceeds the 64 byte FIFO by two bytes. The first trash byte normally is “08”, the second “84” or “2a” or sometimes “ab”, possibly some internal SJA1000 data.

inv_msg: framecnt=13, invindex=6
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 24 04 00 00 11 40 10 22 37 55 00 37 | 4..?........$....@."7U.7
inv_msg: 25 04 00 00 0a 1b 44 ff fe 4e 01 26 | 4..?........%.....D..N.&
inv_msg: 54 05 00 00 37 37 37 37 37 37 37 00 | 4..?........T...7777777.
inv_msg: 56 05 00 00 31 63 14 31 53 14 31 4a | 4..?........V...1c.1S.1J
inv_msg: 57 05 00 00 31 43 14 31 53 15 08 2a | 4..?........W...1C.1S..*
                                       ^^^^^ trashed bytes here
… followed by 7 repetitions of the first frame:
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
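
The byte arithmetic behind this example can be sketched as follows. This is only an illustration: the function names are made up for this sketch, and the 5 byte header for extended frames is taken from the SJA1000 PeliCAN receive buffer layout, not from the log above.

```c
#define RX_FIFO_SIZE 64  /* RXFIFO capacity in bytes, per the SJA1000 spec quote */

/* Bytes one frame occupies in the RXFIFO in PeliCAN mode:
 * 1 frame info byte + 2 ID bytes (standard) or 4 ID bytes (extended)
 * + the data bytes. */
int fifo_frame_len(int extended, int data_bytes)
{
    return 1 + (extended ? 4 : 2) + data_bytes;
}

/* Number of frames of the given FIFO length that fit completely. */
int frames_fitting(int frame_len)
{
    return RX_FIFO_SIZE / frame_len;
}

/* Number of bytes of the first overflowing frame that do not fit. */
int overflow_bytes(int frame_len)
{
    int used = frames_fitting(frame_len) * frame_len;
    return frame_len - (RX_FIFO_SIZE - used);
}
```

For the BMS frames, fifo_frame_len(0, 8) is 11, frames_fitting(11) is 5 and overflow_bytes(11) is 2, which matches the two trashed bytes at the end of the sixth frame.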

This behaviour (both the frame corruption and the false repetitions) applies to all methods reading the standard receive buffer, i.e. using the RMC (as is done by the current esp-idf can.c driver), checking the RBS indicator and checking the RI interrupt flag.

The workaround I’ve done for our driver is adding up the message lengths read during an RX fetch run and discarding all frames exceeding the 64 byte border. See function ESP32CAN_rxframe() in esp32can.cpp: https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/blob/master/vehicle/OVMS.V3/components/esp32can/src/esp32can.cpp#L92
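
A minimal sketch of the idea behind this workaround, not the actual ESP32CAN_rxframe() code: count_trusted_frames() is a hypothetical helper that takes the FIFO lengths (header + data bytes) of the frames read in one fetch run, in read order, and assumes the run started at the beginning of the FIFO contents.

```c
#include <stddef.h>

#define RX_FIFO_SIZE 64  /* RXFIFO capacity in bytes */

/* Return how many leading frames of a fetch run lie entirely within the
 * 64 byte FIFO. Frames from this index on should be discarded, because
 * they were either stored only partially or not stored at all. */
size_t count_trusted_frames(const int *frame_lens, size_t n)
{
    int consumed = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        consumed += frame_lens[i];
        if (consumed > RX_FIFO_SIZE)
            break;  /* this frame crosses the FIFO border */
    }
    return i;
}
```

For the log above (13 frames of 11 bytes each) this returns 5, so the sixth frame and everything after it would be dropped.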

I suggest applying this workaround to the esp-idf driver as well and fixing the hardware in the next ESP32 revision.

Steps to reproduce

It should be reproducible by connecting two units running the CAN example, with one of the units temporarily disabling interrupts to force the FIFO overrun.

Note: the bug may need specific circumstances to occur in addition to the overflow, maybe the overflow happening on a specific byte position in the FIFO – I haven’t tried to determine that.

Code to reproduce this issue

Use esp-idf CAN example.

Debug Logs

none

Other items if possible

none

Project origin

https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 21

Most upvoted comments

@dexterbg @neorevx sorry for not responding earlier. I’ve tested the overflow behavior myself, and here are my findings:

  • When the RX FIFO is empty and begins receiving messages

    • Bytes are filled into the RX FIFO, RMC is incremented for every message received
  • When a message arrives with more bytes than can fit in the RX FIFO’s remaining space

    • RMC is still incremented for the message
    • Whatever bytes of the message fit in the remaining space of the RX FIFO are written. The remaining bytes are discarded.
  • When the RX FIFO is full but messages are still being received

    • RMC is still incremented for each overrun message (up to 64).
    • None of the bytes of these overrun messages are written to RX FIFO because it is already full.
  • When RMC reaches 64, the RX FIFO becomes unrecoverable (due to an RTL bug).

    • The RX FIFO’s internal read pointer becomes out of sync, and subsequent calls to release the buffer may shift the buffer window by an incorrect amount, leading to corrupt messages.
    • Entering then exiting reset mode will reset the RX FIFO
    • If the RMC reaches 63, the RX FIFO is still recoverable.
  • The DOI interrupt and DOS status bits are both set when release buffer is called and the window rotates from a valid message to an overrun one.

    • DOI is cleared by reading the interrupt register, DOS is cleared by the CDO command
    • If the next message is also overrun, DOI and DOS will not be set again if release buffer is called. The two bits are only set on a transition of the buffer window from valid to overrun message.
  • Assuming that you are clearing the RX FIFO in a single sitting (i.e. in one continuous operation):

    • If RMC is 64, the buffer is unrecoverable. Enter and exit reset mode to reset the FIFO. Whatever valid messages were in the RX FIFO are lost.
    • If RMC is <64, the buffer is recoverable. Keep reading the valid messages and releasing the buffer until DOS or DOI is set (preferably DOS, because it isn’t auto cleared on a register read). The remaining messages are overrun, so release the buffer N times until RMC is 0.
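
The recovery procedure above can be sketched against a toy model of the FIFO state. Everything here is an assumption for illustration: sim_fifo_t is not a real register map, and sim_release() and sim_reset() only mimic the behavior described in the findings (release buffer command, reset mode).

```c
#include <stdbool.h>

#define RMC_LIMIT 64  /* at this count the FIFO is unrecoverable */

/* Toy model of the controller state, NOT the real registers. */
typedef struct {
    int  rmc;    /* message counter, includes overrun messages */
    int  valid;  /* completely stored messages at the head of the FIFO */
    bool dos;    /* data overrun status bit */
} sim_fifo_t;

/* Release receive buffer: drop the head message. DOS is raised only when
 * the window moves from a valid message onto an overrun one. */
void sim_release(sim_fifo_t *f)
{
    if (f->rmc == 0)
        return;
    f->rmc--;
    if (f->valid > 0) {
        f->valid--;
        if (f->valid == 0 && f->rmc > 0)
            f->dos = true;
    }
}

/* Enter and exit reset mode: all FIFO contents are lost. */
void sim_reset(sim_fifo_t *f)
{
    f->rmc = 0;
    f->valid = 0;
    f->dos = false;
}

/* Drain the FIFO in one run. Returns the number of valid frames read,
 * or -1 if the FIFO had to be reset. */
int drain_rx_fifo(sim_fifo_t *f)
{
    int frames = 0;

    if (f->rmc >= RMC_LIMIT) {
        sim_reset(f);
        return -1;
    }
    /* Read and release valid frames until DOS marks the overrun border. */
    while (f->rmc > 0 && !f->dos) {
        /* a real driver would copy the frame out of the receive buffer here */
        frames++;
        sim_release(f);
    }
    /* Everything left is overrun: release without reading, then clear DOS
     * (the CDO command on real hardware). */
    while (f->rmc > 0)
        sim_release(f);
    f->dos = false;
    return frames;
}
```

With the state from the original report (RMC 13, 5 complete frames), drain_rx_fifo() returns 5 and leaves RMC at 0. With RMC at 64 it resets the model and returns -1.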

@dexterbg

I’m now trying to figure out why the ISR is sometimes blocked / delayed so long a FIFO overrun can actually occur.

Long critical sections or other same/higher priority interrupts are the usual culprits. Try reducing the length of your critical sections, or moving the CAN ISR to a less crowded core (basically, call esp_intr_alloc() on whichever core you want the interrupt registered on).