esp-idf: [TW#17765] I2C crashing - watchdog timeout (master & v3.0 branch)

I have a project that has three I2C slave devices on a single bus (running at 100kHz). For some time I was developing with ESP-IDF 2.1.1 and everything was working pretty well, except for a weird problem where the I2C master would freeze up after a few minutes. I did some research and it looks like this is a problem with the I2C master hardware state machine which has been addressed in more recent commits of ESP-IDF. So to make use of this fix I migrated my project to use master (595688a32ad653d8e6cb1c7682b813f96125853e). I had to make a few changes (remove references to FreeRTOS heap measurement commands, add nvs_flash_init() before initialising WiFi) but then everything seemed to work well. The slave devices are all being polled correctly and everything seems happy.

The project is here: https://github.com/DavidAntliff/esp32-poolmon/tree/ESP-IDF_master

I came back a little while later and the application is crashing over and over with the following console output shortly after boot:

Guru Meditation Error: Core  0 panic'ed (Interrupt wdt timeout on CPU0)
Register dump:
PC      : 0x400859f3  PS      : 0x00060034  A0      : 0x80084685  A1      : 0x3ffb0590  
0x400859f3: xQueueGenericSendFromISR at /Users/david/esp32/esp-idf-master/components/freertos/./queue.c:2037

A2      : 0x00000001  A3      : 0x00000000  A4      : 0x3ffb05b0  A5      : 0x00000002  
A6      : 0x3ffbc970  A7      : 0x00060021  A8      : 0x800859f3  A9      : 0x3ffb0570  
A10     : 0x00000000  A11     : 0x00000000  A12     : 0x00000002  A13     : 0x3ffbadc0  
A14     : 0x00000000  A15     : 0x400849fc  SAR     : 0x00000012  EXCCAUSE: 0x00000005  
0x400849fc: i2c_isr_handler_default at /Users/david/esp32/esp-idf-master/components/driver/./i2c.c:1023

EXCVADDR: 0x00000000  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0xffffffff  

Backtrace: 0x400859f3:0x3ffb0590 0x40084682:0x3ffb05b0 0x40084a89:0x3ffb05e0 0x40082ba5:0x3ffb0610 0x4000bfed:0x00000000
0x400859f3: xQueueGenericSendFromISR at /Users/david/esp32/esp-idf-master/components/freertos/./queue.c:2037

0x40084682: i2c_master_cmd_begin_static at /Users/david/esp32/esp-idf-master/components/driver/./i2c.c:1023

0x40084a89: i2c_isr_handler_default at /Users/david/esp32/esp-idf-master/components/driver/./i2c.c:1023

0x40082ba5: _xt_lowint1 at /Users/david/esp32/esp-idf-master/components/freertos/./xtensa_vectors.S:1105

Rebooting...

A software or on-board reset does not stop this endless reset loop; however, removing power for a short period of time does “fix” the issue. It is strange that a brief ESP32 reset does not clear it. (EDIT: but a long reset press does.)

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 78 (59 by maintainers)

Most upvoted comments

Just wanted to let you know that there is a new commit in master from today that seems to fix this issue. I have not tested it yet. 😃

@koobest I just found a way to reset the I2C peripherals that seems to do a power-on reset. It clears out all registers, resets the Bus_busy flags, and initializes the hardware state back to a power-on condition. I work on the Arduino branch; a patch containing just this reset is arduino-esp32 PR 1201.

Chuck.

@DavidAntliff i2c_hw_fsm_reset() is attempting to work around the problem that the FSM will see a glitch on SDA as a multi-master bus capture. Once the FSM sees ‘another’ master transacting on the bus, it will enter a Bus_Busy state until it sees a complete transaction with a valid STOP. The basis of this problem is the FSM’s interpretation of START:

  • A START condition is SCL HIGH, SDA HIGH, then SDA LOW, then SCL LOW.
  • The FSM interprets SCL HIGH, SDA HIGH, then SDA LOW, then SDA HIGH as a START. This low-going ‘glitch’ on SDA should not be interpreted as a START; it should just be ignored. The reason the protocol defines START as a voltage level sequence is to immunize the FSM from signal glitches.

Since all of the problems have been encountered by people using the ESP32 in a SINGLE-master I2C configuration, the FSM will hang forever waiting for the ‘other’ master to complete its transaction.
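To make the distinction concrete, here is a minimal host-side sketch in plain C of a detector that follows the START definition given above: a falling SDA while SCL is high is only a candidate, and it is confirmed as a START once SCL follows it low. If SDA returns high first, the event is dropped as a glitch. The type and function names (bus_sample_t, count_starts) are made up for illustration; this is not how the ESP32 hardware or the IDF driver is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One sample of the two bus lines (true = high). Hypothetical type. */
typedef struct { bool scl; bool sda; } bus_sample_t;

typedef enum { BUS_IDLE, BUS_START_PENDING, BUS_STARTED } bus_state_t;

/* Returns the number of START conditions confirmed in the sample stream. */
int count_starts(const bus_sample_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    bus_state_t state = BUS_IDLE;
    int starts = 0;

    for (size_t i = 1; i < n; i++) {
        bus_sample_t prev = s[i - 1], cur = s[i];
        switch (state) {
        case BUS_IDLE:
            /* SDA falls while SCL stays high: candidate START only. */
            if (prev.scl && cur.scl && prev.sda && !cur.sda)
                state = BUS_START_PENDING;
            break;
        case BUS_START_PENDING:
            if (cur.scl && cur.sda) {
                state = BUS_IDLE;      /* SDA came back up: glitch, ignore */
            } else if (!cur.scl && !cur.sda) {
                state = BUS_STARTED;   /* SCL followed SDA low: real START */
                starts++;
            }
            break;
        case BUS_STARTED:
            /* SDA rises while SCL is high: STOP, bus is free again. */
            if (prev.scl && cur.scl && !prev.sda && cur.sda)
                state = BUS_IDLE;
            break;
        }
    }
    return starts;
}
```

Feeding it the glitch sequence (SCL high throughout, SDA high, low, high) yields no START, while the full sequence ending with SCL low yields one.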

Using additional GPIO pins to act as another I2C master will solve the Bus_Busy problems; the actual TIMEOUT interrupt cascades I haven’t solved.

static esp_err_t i2c_master_clear_bus(i2c_port_t i2c_num) acts like an I2C master sending out a null transaction, except that there is no initial START (it drops SCL before SDA). So it is another illegal transaction. i2c_master_clear_bus:

    gpio_set_direction(scl_io, GPIO_MODE_OUTPUT_OD);
    gpio_set_direction(sda_io, GPIO_MODE_OUTPUT_OD);
    gpio_set_level(scl_io, 0);
    gpio_set_level(sda_io, 0);
    for (int i = 0; i < 9; i++) {
        gpio_set_level(scl_io, 1);
        gpio_set_level(scl_io, 0);
    }
    gpio_set_level(scl_io, 1);
    gpio_set_level(sda_io, 1);
    i2c_set_pin(i2c_num, sda_io, scl_io, 1, 1, I2C_MODE_MASTER);
    return ESP_OK;

This code should be changed to something like this:

    gpio_set_level(scl_io, 1); // initial condition: SCL needs to be high
    gpio_set_level(sda_io, 1); // initial condition: SDA needs to be high
    // a small delay (5us), 1/2 clock at 100kHz
    gpio_set_level(sda_io, 0); // issue START
    // a small delay (5us), 1/2 clock at 100kHz
    for (int i = 0; i < 9; i++) {
        gpio_set_level(scl_io, 0);
        // a small delay (5us), 1/2 clock at 100kHz
        gpio_set_level(scl_io, 1);
        // a small delay (5us), 1/2 clock at 100kHz
    }
    gpio_set_level(sda_io, 1); // issue STOP
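As a rough way to check that ordering off-target, the same sequence can be run on a host against recording stand-ins. In the sketch below, set_scl(), set_sda() and half_clock_delay() are hypothetical placeholders (on hardware they would map to gpio_set_level() and a roughly 5 us busy-wait); the trace simply records the pin levels after every change so the START and STOP edges can be inspected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side stand-ins for the pin and delay calls. */
typedef struct { int scl; int sda; } pin_state_t;

#define TRACE_MAX 64
static pin_state_t trace[TRACE_MAX]; /* pin levels after each change */
static size_t trace_len;
static pin_state_t pins;             /* current levels, both start low */
static unsigned elapsed_us;          /* total simulated delay */

static void record(void)
{
    assert(trace_len < TRACE_MAX);
    trace[trace_len++] = pins;
}
static void set_scl(int level) { pins.scl = level; record(); }
static void set_sda(int level) { pins.sda = level; record(); }
static void half_clock_delay(void) { elapsed_us += 5; } /* 1/2 clock at 100 kHz */

/* START, nine clock pulses, STOP, with a half-clock delay between edges. */
void clear_bus_sequence(void)
{
    set_scl(1);                 /* idle bus: SCL high */
    set_sda(1);                 /* idle bus: SDA high */
    half_clock_delay();
    set_sda(0);                 /* START: SDA falls while SCL is high */
    half_clock_delay();
    for (int i = 0; i < 9; i++) {
        set_scl(0);
        half_clock_delay();
        set_scl(1);
        half_clock_delay();
    }
    set_sda(1);                 /* STOP: SDA rises while SCL is high */
}
```

With these stand-ins the sequence makes 22 pin changes and accumulates 20 half-clock delays, so the whole clear takes on the order of 100 us at 100 kHz.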

The history of needing this function traces back to hardware glitches that occur when attaching the GPIO pins to the hardware peripheral. @ESP32DE and I solved these glitches for the Arduino environment proposal to i2c. I don’t know the equivalent pin assignment sequence for IDF. I don’t use ESP-IDF directly.

In my testing I no longer need to execute this function at every boot.

From a quick look through the IDF I2C code, I see a few design ideas I don’t support.

  • First, exiting through portYIELD_FROM_ISR() at every opportunity. portYIELD_FROM_ISR() should only be called at the end of the ISR. If one of the OS calls returns HPTaskAwoken == pdTRUE, then call portYIELD_FROM_ISR() at the END of the ISR instead of just exiting the ISR. HPTaskAwoken just means that the FOREGROUND process must yield, not the ISR. So, as soon as you complete your ISR (which should be short, deterministic, and have NO waits), call portYIELD_FROM_ISR() to task-switch to the higher-priority FOREGROUND task.

I think the multiple portYIELD_FROM_ISR() calls throughout the ISR are the cause of this issue. A single interrupt is not completing before the next byte moves. HPTaskAwoken only needs to be acted on ONCE, just before the end of the ISR. I would create a SINGLE HPTaskAwoken variable at the top of the ISR and pass it by reference to all sub-functions. The FreeRTOS ISR functions will not clear it, only SET it. So, if any of the OS functions set it, it will carry through until the last step before the ISR returns. I haven’t studied this ISR; in my version, I have a sub-function that does my exit operations, and it is the only place that can call portYIELD_FROM_ISR(). This code is from my ISR state machine:

    if (activeInt & I2C_TRANS_COMPLETE_INT_ST_M) {
        i2cIsrExit(p_i2c, EVENT_DONE, false);
        return; // no more work to do
    }

Inside i2cIsrExit(), all of my ISR cleanup, foreground notifications, and a potential portYIELD_FROM_ISR() are executed; the return; exits from my ISR back to the interrupted foreground task.

From my point of view, ISRs must complete. They are atomic. If an ISR can’t complete in a short FINITE timespan, it is coded wrong.
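The single-flag pattern described above can be sketched on a host with stand-ins for the RTOS calls. Everything named mock_* or handle_* below is hypothetical and exists only for illustration; in real FreeRTOS code the flag would be a BaseType_t compared against pdTRUE, and the final step would be portYIELD_FROM_ISR().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for portYIELD_FROM_ISR(): counts how often it runs. */
static int yield_calls;
static void mock_yield_from_isr(void) { yield_calls++; }

/* Mimics a ...FromISR() call: it may SET *woken but never clears it. */
static void mock_send_from_isr(bool unblocks_higher_prio_task, bool *woken)
{
    assert(woken != NULL);
    if (unblocks_higher_prio_task)
        *woken = true;
}

/* Sub-handlers receive the shared flag by reference and never yield. */
static void handle_rx(bool *woken)   { mock_send_from_isr(true, woken); }
static void handle_tx(bool *woken)   { mock_send_from_isr(false, woken); }
static void handle_done(bool *woken) { mock_send_from_isr(true, woken); }

void example_isr(void)
{
    bool woken = false;      /* ONE flag for the whole ISR */

    handle_rx(&woken);
    handle_tx(&woken);
    handle_done(&woken);

    /* Single exit point: yield at most once, after all ISR work is done. */
    if (woken)
        mock_yield_from_isr();
}
```

Even though two of the three sub-handlers wake a higher-priority task here, the yield stand-in runs exactly once per interrupt, and only after the ISR has finished its work.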

  • Second, I have discovered that the FSM.command[] list has some order rules. If the command I2C_CMD_END is used, then when the command[] buffer is refilled, the positions of command[] at and beyond where I2C_CMD_END was placed cannot be reused. In my implementation of I2C for Arduino, if I need to use I2C_CMD_END because of data block length, I only use command[15] for the end. The FSM allows each command to move up to 255 bytes, so before I need to use an ‘END’ a master write would have to send over 3060 bytes. A single read is a little more limited because of the last-byte NAK. I don’t see any consideration for these ‘rules’.

Chuck.

@luisonoff thank you for the alert!

I have tested ESP-IDF commit 391c3ff959f9eb1b2975cb0d7b29c0478f3b6a48 with my reproduction project on one of my “DOIT” boards and I can happily report that I am unable to reproduce my issue by rubbing SDA and SCL together rapidly. I spent maybe 4 minutes rapidly mechanically shorting them and did not see a single crash.

Then I reverted back to 2e7613b6560775b27c50eb81e81d5c3ff712b866 (just prior to the “fix”) and verified that the issue can be reproduced. In fact it was extremely easy to reproduce it, many times per minute.

So I can conclude from this that merge 892f3907fa2e074943e865b68f2fda3da600584b appears to resolve the issue, for me at least, on this board.

I’ll try it on my Wemos LoLin32 Lite next, and report back if the results are any different.

EDIT: looks good on the LoLin32 also - no crashes seen. Good work Espressif! Thank you.

Hi Luis, if there are many doubts, I suggest you test with 3K3 resistors for 3.3V; 4K7 resistors are for 5V. Did you scope the I2C interface with a logic analyzer?

@Gustavomurta thanks for the advice. Here’s the thing - I know that my I2C bus isn’t perfect, and it would be good if I could condition my signals to avoid an issue, but the problem is that there’s always a risk of errors on the bus due to noise. In the event of a failed I2C transaction, the bus will be in an error state, and that’s fine if the software can detect that and return an error code to the caller. The problem is that the ESP32 I2C peripheral has a bug that causes its internal finite-state-machine (FSM) to lock up if SDA or SCL are electrically affected in certain ways. This is a known issue and acknowledged by Espressif. There is a fix in the 3.0 stream that attempts an FSM reset when there is a transaction timeout and the hardware busy flag is still raised. I see this fix activate sometimes and it seems to work. The issues that I have documented here are related to this, I think, but take it further:

  1. There’s some sort of condition that occurs when the bus is in an errored state that results in the interrupt service routine starving the scheduler and causing a watchdog reset.
  2. Almost every single time this happens, the subsequent reset causes another crash because the FSM is in the locked state, and there’s no code in the I2C init function to reset the FSM state. The reset only occurs after a failed transaction, but at init time there are no transactions yet, so it doesn’t get cleared. I added an FSM reset and that appeared to avoid this second issue (a reset due to 1. no longer causes a chain of resets due to 2.).

Because 2. happens almost every single time 1. does, I suspect that 1. is related to the FSM failure. It may be a cause, or it may be incidental, I’m not sure. I don’t know enough about the FSM failure to know whether it can cause a flood of interrupts.

So my point is that although there’s a lot I can do to improve the I2C bus in my particular circuit, there’s a real issue with the ESP32 software interaction with the hardware at the moment that is causing I2C failures for multiple people, and Espressif are in the best possible place to investigate this now that there’s a way to reproduce it.

I am using external pull-ups BTW. The issue is also unrelated to bus speed. It happens at 10 kHz almost as often as it happens at 100 kHz.