hubris: SP-RoT link ends up in a persistent bad state

While updating the colo rack with the dogfood image (https://github.com/oxidecomputer/omicron/runs/15999510273), we updated both the SP and RoT images using mupdate / wicket.

(Normally, we skip updating the RoTs during mupdate because they haven’t changed in a while.)

Augustus noticed one RoT showing a failure to communicate after the initial update; a second update brought this number to three RoTs.

Here’s an example of an unhappy RoT: image(1)

For contrast, a happy RoT: image(2)

The ringbuf shows invalid data received by the SP when we try to request the RoT status:

humility: ring buffer drv_stm32h7_sprot_server::__RINGBUF in sprot:
 NDX LINE      GEN    COUNT PAYLOAD
   0  241        1        1 Sent(0xa)
   1  283        1        1 Received(0x8)
   2  437        1        1 RxBuf([ 0x1, 0x0, 0x2, 0xe4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
   3  447        1        1 Error(Protocol(InvalidCrc))
   4  241        1        1 Sent(0xa)
   5  283        1        1 Received(0x8)
   6  437        1        1 RxBuf([ 0x1, 0x0, 0x2, 0xe4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
   7  447        1        1 Error(Protocol(InvalidCrc))
   8  241        1        1 Sent(0xa)
   9  283        1        1 Received(0x8)
  10  437        1        1 RxBuf([ 0x1, 0x0, 0x2, 0xe4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  11  447        1        1 Error(Protocol(InvalidCrc))
  12  457        1        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }

We would expect the RxBuf messages to begin with [3, 0, 0, 0], because that’s the current protocol version serialized with u32::to_le_bytes().
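
As a quick check of that expectation, here is a minimal Rust sketch. The names `EXPECTED_VERSION`, `BAD_RXBUF`, and `starts_with_version` are made up for illustration and are not taken from the sprot code; only the value 3, the use of `u32::to_le_bytes()`, and the logged bytes come from this issue.

```rust
/// Protocol version as stated in this issue (hypothetical constant name).
const EXPECTED_VERSION: u32 = 3;

/// The 16 bytes the SP logged in `RxBuf` on an unhappy RoT (copied from the
/// ringbuf dump above).
const BAD_RXBUF: [u8; 16] = [
    0x1, 0x0, 0x2, 0xe4, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
];

/// Returns true if `rx` begins with the version serialized as little-endian
/// bytes, i.e. `[3, 0, 0, 0]` for version 3.
fn starts_with_version(rx: &[u8]) -> bool {
    rx.starts_with(&EXPECTED_VERSION.to_le_bytes())
}
```

Under these assumptions, `starts_with_version(&BAD_RXBUF)` is false, while the `Rx(0x3), Rx(0x0), Rx(0x0), Rx(0x0)` run seen in the spi_server_core ringbuf would pass the check.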

Looking at the spi_server_core ringbuf, we do see that pattern:

humility: ring buffer drv_stm32h7_spi_server_core::__RINGBUF in sprot:
 NDX LINE      GEN    COUNT PAYLOAD
  50  494        3        2 WaitISR(0x70012)
  51  464        3        1 Rx(0x0)
  52  494        3        2 WaitISR(0x60012)
  53  464        3        1 Rx(0x3)
  54  494        3        2 WaitISR(0x50012)
  55  464        3        1 Rx(0x0)
  56  494        3        2 WaitISR(0x40012)
  57  464        3        1 Rx(0x0)
  58  494        3        2 WaitISR(0x30012)
  59  464        3        1 Rx(0x0)
  60  494        3        2 WaitISR(0x20012)
  61  464        3        1 Rx(0x3)
  62  494        3        2 WaitISR(0x10012)
  63  464        3        1 Rx(0x0)
   0  322        4        1 Start(read, (0x0, 0x10))
   1  430        4        8 Tx(0x0)
   2  494        4        2 WaitISR(0x100000)
   3  464        4        1 Rx(0x1)
   4  430        4        1 Tx(0x0)
   5  494        4        1 WaitISR(0xf0000)
   6  494        4        1 WaitISR(0xf0002)
   7  464        4        1 Rx(0x0)
   8  430        4        1 Tx(0x0)
   9  494        4        1 WaitISR(0xe0000)
  10  494        4        1 WaitISR(0xe0002)
  11  464        4        1 Rx(0x2)
  12  430        4        1 Tx(0x0)
  13  494        4        1 WaitISR(0xd0000)
  14  494        4        1 WaitISR(0xd0002)
  15  464        4        1 Rx(0xe4)
  16  430        4        1 Tx(0x0)
  17  494        4        1 WaitISR(0xc0000)
  18  494        4        1 WaitISR(0xc0002)
  19  464        4        1 Rx(0x0)
  20  430        4        1 Tx(0x0)
  21  494        4        1 WaitISR(0xb0000)
  22  494        4        1 WaitISR(0xb0002)
  23  464        4        1 Rx(0x0)
  24  430        4        1 Tx(0x0)
  25  494        4        1 WaitISR(0xa0000)
  26  494        4        1 WaitISR(0xa0002)
  27  464        4        1 Rx(0x0)
  28  430        4        1 Tx(0x0)
  29  494        4        1 WaitISR(0x90000)
  30  494        4        1 WaitISR(0x90002)
  31  464        4        1 Rx(0x0)
  32  430        4        1 Tx(0x0)
  33  494        4        1 WaitISR(0x80010)
  34  494        4        1 WaitISR(0x80012)
  35  464        4        1 Rx(0x0)
  36  494        4        2 WaitISR(0x70012)
  37  464        4        1 Rx(0x0)
  38  494        4        2 WaitISR(0x60012)
  39  464        4        1 Rx(0x0)
  40  494        4        2 WaitISR(0x50012)
  41  464        4        1 Rx(0x0)
  42  494        4        2 WaitISR(0x40012)
  43  464        4        1 Rx(0x0)
  44  494        4        2 WaitISR(0x30012)
  45  464        4        1 Rx(0x0)
  46  494        4        2 WaitISR(0x20012)
  47  464        4        1 Rx(0x0)
  48  494        4        2 WaitISR(0x10012)
  49  464        4        1 Rx(0x0)

However, that pattern occurs before the Start entry in the ringbuf! That’s weird and surprising. Because we’re retrying three times, it’s likely from a previous message; unfortunately, the ringbuf isn’t long enough to record all three attempts.

This looks like a framing issue where the SP and RoT have gotten out of sync. There are FIFO buffers on each side’s SPI peripheral. Resetting the SP did not fix the issue, but power-cycling the whole sled (via Ignition) does. Because the problem survives an SP reset, we suspect the RoT’s Tx FIFO.
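
To illustrate the kind of framing slip suspected here, the following is a toy Rust model, not the real LPC55 or STM32H7 SPI hardware. `TxFifo` and `desynced_read` are hypothetical names, and the assumption that an empty FIFO clocks out `0x00` is only for the sake of the example. It shows how bytes left over from a partially read frame shift every later read until something flushes the FIFO.

```rust
use std::collections::VecDeque;

/// Toy stand-in for a peripheral's transmit FIFO (hypothetical, not the real
/// hardware).
struct TxFifo {
    bytes: VecDeque<u8>,
}

impl TxFifo {
    fn new() -> Self {
        Self { bytes: VecDeque::new() }
    }

    /// Queue a whole response frame for transmission.
    fn queue(&mut self, frame: &[u8]) {
        self.bytes.extend(frame.iter().copied());
    }

    /// Clock out `n` bytes; in this model an empty FIFO yields 0x00 filler.
    fn clock_out(&mut self, n: usize) -> Vec<u8> {
        (0..n).map(|_| self.bytes.pop_front().unwrap_or(0)).collect()
    }
}

/// Simulate one slip: the reader consumes only `consumed_first` bytes of a
/// frame, the sender queues the next frame without flushing, and the reader
/// then performs a full-length read.
fn desynced_read(frame: &[u8], consumed_first: usize) -> Vec<u8> {
    let mut fifo = TxFifo::new();
    fifo.queue(frame);
    let _partial = fifo.clock_out(consumed_first);
    fifo.queue(frame); // stale tail of the first frame is still queued
    fifo.clock_out(frame.len())
}
```

In this model, a frame of `[3, 0, 0, 0, 0xaa, 0xbb]` read after a two-byte partial read comes back as `[0, 0, 0xaa, 0xbb, 3, 0]`: the version header is present but in the wrong position, and nothing on the reader's side can correct that.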

(In particular, I’m wondering what happens if the RoT begins listening to SPI midway through an attempted transaction from the SP, since the SP will try talking to it unconditionally.)

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 21 (19 by maintainers)

Most upvoted comments

Recommendation from Matt: after the hot-path receive, the RoT checks whether CSn is still asserted. If it is, we have a synchronization error: the RoT waits for CSn to be deasserted, cleans up, queues a SYNC_ERROR message, and then asserts ROT_IRQ.
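
A rough Rust sketch of that sequence follows, written against a hypothetical `SprotHw` trait so that it can be exercised with a mock. None of these names come from the actual lpc55 driver, and treating "cleans up" as a FIFO flush is an assumption.

```rust
/// Hypothetical hardware abstraction; not the real driver's API.
trait SprotHw {
    fn csn_asserted(&mut self) -> bool;
    fn flush_fifos(&mut self);
    fn queue_response(&mut self, msg: Response);
    fn assert_rot_irq(&mut self);
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Response {
    SyncError,
}

#[derive(Debug, PartialEq, Eq)]
enum RxOutcome {
    InSync,
    Resynced,
}

/// Run after the hot-path receive completes.
fn after_hot_path_receive<H: SprotHw>(hw: &mut H) -> RxOutcome {
    if !hw.csn_asserted() {
        return RxOutcome::InSync;
    }
    // CSn is still asserted, so the SP and RoT disagree about framing.
    // Busy-wait for deassertion (a real driver might wait on an interrupt).
    while hw.csn_asserted() {}
    hw.flush_fifos(); // "cleans up"; assumed to mean flushing the FIFOs
    hw.queue_response(Response::SyncError);
    hw.assert_rot_irq();
    RxOutcome::Resynced
}

/// Mock that reports CSn asserted for a fixed number of polls and records
/// the order of the recovery steps.
struct MockHw {
    csn_polls_remaining: u32,
    log: Vec<&'static str>,
}

impl SprotHw for MockHw {
    fn csn_asserted(&mut self) -> bool {
        if self.csn_polls_remaining > 0 {
            self.csn_polls_remaining -= 1;
            true
        } else {
            false
        }
    }
    fn flush_fifos(&mut self) {
        self.log.push("flush");
    }
    fn queue_response(&mut self, _msg: Response) {
        self.log.push("queue SYNC_ERROR");
    }
    fn assert_rot_irq(&mut self) {
        self.log.push("assert ROT_IRQ");
    }
}
```

Ordering matters in this sketch: ROT_IRQ is asserted only after the SYNC_ERROR reply is queued, so the SP never sees the interrupt before there is something valid to read.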

Recommendation from our discussion: on start, the lpc55 sprot_server asserts ROT_IRQ until it has finished initializing FIFOs and SSA/SSD state. This does not fully cover the case where the RoT has been reset and the SP steps on FIFO initialization, but it does cover some of that window and makes it much easier to find these failures on a logic analyzer.
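
The startup ordering could look roughly like the Rust sketch below. The `StartupHw` trait, its method names, and the `RecordingHw` mock are hypothetical and exist only to make the ordering explicit; "asserted" here means the logical state of ROT_IRQ, whatever its electrical polarity.

```rust
/// Hypothetical startup-time hardware abstraction; not the real driver's API.
trait StartupHw {
    fn set_rot_irq(&mut self, asserted: bool);
    fn init_fifos(&mut self);
    fn clear_ssa_ssd(&mut self);
}

/// Hold ROT_IRQ asserted for the whole initialization window, so the window
/// is visible on a logic analyzer.
fn sprot_startup<H: StartupHw>(hw: &mut H) {
    hw.set_rot_irq(true);
    hw.init_fifos();
    hw.clear_ssa_ssd();
    hw.set_rot_irq(false);
}

/// Mock that records the order of operations.
struct RecordingHw {
    log: Vec<&'static str>,
}

impl StartupHw for RecordingHw {
    fn set_rot_irq(&mut self, asserted: bool) {
        self.log.push(if asserted { "ROT_IRQ on" } else { "ROT_IRQ off" });
    }
    fn init_fifos(&mut self) {
        self.log.push("init FIFOs");
    }
    fn clear_ssa_ssd(&mut self) {
        self.log.push("clear SSA/SSD");
    }
}
```

As noted above, this narrows the window but does not close it: the SP can still clock data while `init_fifos` is running.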