esp-idf: W5500 fails after 20 minutes of operation (IDFGH-10018)

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v5.2-dev-321-ga8b6a70620

Operating System used.

Linux

How did you build your project?

VS Code IDE

If you are using Windows, please specify command line type.

None

Development Kit.

ESP32-C6-DevKit M1

Power Supply used.

USB

What is the expected behavior?

W5500 should be stable over a long time.

What is the actual behavior?

After 20 minutes, it stops operating. See debug log.

Steps to reproduce.

Use an ethernet example.

Debug Logs.

E (5516841) w5500.mac: emac_w5500_read_phy_reg(335): read PHY register failed
E (5516841) w5500.phy: w5500_update_link_duplex_speed(69): read PHYCFG failed
E (5516841) w5500.phy: w5500_get_link(112): update link duplex speed failed
E (5516901) w5500.mac: w5500_get_rx_received_size(152): read RX RSR failed
E (5518051) w5500.mac: w5500_get_rx_received_size(152): read RX RSR failed


More Information.

I'm using the C6-Devkit w/ a W5500 connected via SPI.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 27 (8 by maintainers)

Most upvoted comments

Thanks a lot. Unfortunately, vTaskDelay’s minimum delay is one millisecond (with CONFIG_FREERTOS_HZ=1000), and this is quite long: at a moderate CAN speed of 500 kbit/s I might receive at least 4 frames during that time, and at 1 Mbit/s it could be 8 frames, so we can’t really sleep.

That said, I have recently published esp-microsleep, which works around that issue.

On another note… applications like mine might be an interesting playground for FreeRTOS AMP – where time critical stuff (like CAN-FD in our case) happens on a non-FreeRTOS core.

As long as there are no sudden spikes of data that can swamp the thread, that might be fine. Personally, I’m a belt-and-suspenders kind of guy, so I would likely have added a very short vTaskDelay just in case. 😁 Good luck with your project!

@JimmyPedersen Yes, usually that is correct, but in this case it shouldn’t be necessary, because the receiver task is already blocked in a FreeRTOS-aware function waiting for the next message from the queue. This only becomes a problem if the IRQs arrive so fast that the system is starving in general, which isn’t a real problem in the field, since we rarely reach 70% bus load.
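The pattern described here can be sketched as follows (can_frame_t, can_rx_queue and handle_frame() are hypothetical names for illustration, not part of any published driver):

```c
#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"

static void receive_task(void *arg)
{
    QueueHandle_t can_rx_queue = (QueueHandle_t)arg;
    can_frame_t frame;   // hypothetical frame type

    for (;;) {
        // Blocks until the ISR posts the next frame. The task yields
        // the CPU inside xQueueReceive(), so no extra vTaskDelay()
        // is needed to let lower-priority tasks run.
        if (xQueueReceive(can_rx_queue, &frame, portMAX_DELAY) == pdTRUE) {
            handle_frame(&frame);   // hypothetical handler
        }
    }
}
```

Because the blocking happens in a FreeRTOS-aware call, starvation can only occur if frames arrive faster than the task can drain the queue.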

@kostaond

In an attempt to finally dive deeper into this issue, I took the weekend and carried out a lot of stress-tests with the following setup:

Our product is a custom ESP32-S3-based board with a W5500 (SPI3_HOST) and an MCP2518fd CAN controller (SPI2_HOST). The design follows the reference recommendations from Wiznet and Microchip very closely and was done by an experienced EE, so I’m pretty sure there are no hardware issues. The test software configures the SPI bus speeds to 20 MHz for the W5500 and 16 MHz for the MCP2518fd (slightly below its supported maximum of 17.5 MHz). The software is compiled with -Os and a log level of I.

The application on the device under test is based on examples/network/bridge, where WiFi and W5500 Ethernet are combined into one virtual interface running a DHCP server. The MCP2518fd driver simply echoes all frames it receives back onto the CAN bus.

To test the performance, there are three auxiliary systems attached:

  1. A Raspberry Pi 4 via a cross-over Ethernet cable to the W5500. On this machine, I’m running iperf -s:
iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
  2. A Raspberry Pi 4 with a CAN interface (gs_usb) attached to the MCP2518fd. The bus is properly terminated. This machine sends random CAN frames rapidly using cangen can0 -g 0.45, which leads to a 70% bus load (measured with canbusload -cbr can0@500000), almost saturating the ISR.

  3. A Linux PC (ThinkPad X1 Carbon 2021), which connects to the ESP32-S3 via WiFi. It runs the client part of iperf endlessly.

I let this run for several hours, at first with a CPU speed of 160MHz and DIO. Without CAN traffic I’m getting the following speeds:

[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.5857 sec  10.9 MBytes  8.62 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.3 port 52050
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.4410 sec  10.9 MBytes  8.74 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.3 port 48290
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.5355 sec  10.9 MBytes  8.66 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.3 port 38148
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.6031 sec  11.0 MBytes  8.70 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.3 port 57326
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.5154 sec  10.9 MBytes  8.68 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.3 port 56776
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.4982 sec  10.8 MBytes  8.59 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.3 port 41376
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.3880 sec  10.8 MBytes  8.68 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.3 port 59300
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.4865 sec  10.8 MBytes  8.60 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.3 port 42276
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.4605 sec  10.8 MBytes  8.62 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.3 port 51660
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.6646 sec  11.0 MBytes  8.65 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.3 port 35296

With CAN traffic I’m getting the following speeds:

[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.8194 sec  10.0 MBytes  7.75 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 59222
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.8526 sec  10.1 MBytes  7.83 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 41688
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.4855 sec  9.63 MBytes  7.70 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 39026
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.4220 sec  9.63 MBytes  7.75 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 47770
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.9810 sec  10.1 MBytes  7.73 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 59350
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.9278 sec  10.1 MBytes  7.77 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 48874
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.7493 sec  9.88 MBytes  7.71 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 42004
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.4291 sec  9.63 MBytes  7.74 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 51532

The slight drop in performance is probably a result of the higher system load, so I’m fine with it. For a while this went OK, aside from the occasional watchdog complaint…

E (1932971) task_wdt: Task watchdog got triggered. The following tasks/users did not reset the watchdog in time:
E (1932971) task_wdt:  - IDLE1 (CPU 1)
E (1932971) task_wdt: Tasks currently running:
E (1932971) task_wdt: CPU 0: wifi
E (1932971) task_wdt: CPU 1: ReceiveTask
E (1932971) task_wdt: Print CPU 1 backtrace

Backtrace: 0x4037875F:0x3FC9BCB0 0x40377295:0x3FC9BCD0 0x400559DD:0x3FCB7B10 0x4037E082:0x3FCB7B20 0x4037D841:0x3FCB7B40 0x4200E146:0x3FCB7B80 0x4200EA1B:0x3FCB7BA0 0x4200B055:0x3FCB7BD0 0x4037DDE9:0x3FCB7C00

…and the occasional dropped CAN frame:

I (1933011) ESP-MCP251XFD: RX FIFO Overflow. Sorry, but we lost at least one CAN frame

But then… after some more time, I got the following error and the system restarted:

***ERROR*** A stack overflow in task w5500_tsk has been detected.


Backtrace: 0x40375b5a:0x3fca6d60 0x4037d3ed:0x3fca6d80 0x4037e0fe:0x3fca6da0 0x4037f44f:0x3fca6e20 0x4037e230:0x3fca6e50 0x4037e226:0xa082f590 |<-CORRUPTED

I guess this happened because the w5500_tsk wanted to emit a warning or error message. I enlarged its stack size to 8192 and reran the test with 240MHz and QIO. Bandwidth increased slightly:

[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.5194 sec  11.6 MBytes  9.27 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 42884
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.4907 sec  12.0 MBytes  9.60 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 44268
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.2703 sec  11.8 MBytes  9.60 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 49958
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.3149 sec  11.8 MBytes  9.56 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 37384
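The stack enlargement mentioned above can be expressed through the esp_eth MAC configuration; a sketch, assuming the relevant task is the one sized via eth_mac_config_t.rx_task_stack_size (check the field name and the W5500 config macro against your IDF version):

```c
#include "esp_eth.h"

// Sketch: enlarge the W5500 receive task stack from the default.
// The exact ETH_W5500_DEFAULT_CONFIG() arguments differ between
// IDF releases, so the W5500-specific part is left commented out.
eth_mac_config_t mac_config = ETH_MAC_DEFAULT_CONFIG();
mac_config.rx_task_stack_size = 8192;   // default is considerably smaller
// esp_eth_mac_t *mac = esp_eth_mac_new_w5500(&w5500_config, &mac_config);
```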

I let this run for a while and launched a bunch of additional ping processes, changed the iperf parallelization, and even flood-pinged the device, but couldn’t make it crash. Even when I completely overloaded it…

[ 25] 0.0000-20.3036 sec   289 KBytes   116 Kbits/sec
connect failed: Connection refused
recv failed: Resource temporarily unavailable
recv failed: Resource temporarily unavailable
recv failed: Resource temporarily unavailable
recv failed: Resource temporarily unavailable
recv failed: Resource temporarily unavailable
[ 52] 0.0000-21.7129 sec   102 KBytes  38.4 Kbits/sec
connect failed: Connection refused
recv failed: Resource temporarily unavailable
[ 16] 0.0000-23.2300 sec   525 KBytes   185 Kbits/sec
connect failed: Connection refused
recv failed: Resource temporarily unavailable
recv failed: Resource temporarily unavailable
[ 23] 0.0000-23.8224 sec   106 KBytes  36.5 Kbits/sec
connect failed: Connection refused

…it always recovered.

So besides that one crash beforehand, everything went really well. But since I can still reproduce the W5500 hangups with my full application, it must be something else.

I’m afraid these tests didn’t help much with the bug report in question, although they increased my confidence that the hardware combination and the included drivers are generally solid.

I guess it makes the most sense to call it a day with regard to this bug report for now. When I opened it a year ago, it referred to my setup with jumper wires and a C6 devboard. Since I have different hardware now, I think it’s best to close this and continue to investigate what my application is doing that might destabilize the W5500 and/or the lwIP stack.

I will open or contribute to another issue report with further findings. Thanks for your attention so far!

If it helps: we have had our products in production for six months now, and I can confirm that spi_device_polling_transmit definitively solved the problem for us. Thanks to kostaond for his help.

@kostaond

I was using spi_device_transmit.

I replaced all my functions with spi_device_polling_transmit and ran a test again. With this configuration, I did not encounter any problem in 16 hours.

I don’t need to specifically use spi_device_transmit. So that solves the problem in my case.
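For reference, the two ESP-IDF calls have identical signatures, so the swap is mechanical. A minimal sketch (the helper name and the buffer handling are hypothetical; only the two spi_master API names are from ESP-IDF):

```c
#include "driver/spi_master.h"

// spi_device_polling_transmit() takes the same arguments as
// spi_device_transmit(), but the calling task busy-waits for the
// transfer to finish instead of blocking on the bus interrupt, so
// no task switch happens in the middle of the transaction.
static esp_err_t adc_transfer(spi_device_handle_t dev,
                              const uint8_t *tx, uint8_t *rx, size_t len)
{
    spi_transaction_t t = {
        .length = len * 8,      // transaction length is given in bits
        .tx_buffer = tx,
        .rx_buffer = rx,
    };
    // was: return spi_device_transmit(dev, &t);
    return spi_device_polling_transmit(dev, &t);
}
```

The trade-off: polling transactions hold the CPU for the duration of the transfer, but they avoid the scheduling gaps that can interleave transactions from two devices sharing one bus.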

@kostaond

I have pushed a project with minimal code to reproduce this error: https://github.com/Stay-Info/EspW5500FailDemo

While testing this project, I realized that the problem did not appear if no other SPI device was used on the same bus.

For the example I used an MCP3462 ADC, but I think spamming requests at any other SPI device would produce the same result.
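The shared-bus scenario described here can be set up with a sketch like the following (pin numbers, clock speeds, and queue sizes are placeholders; on real hardware the W5500 device is normally added by the esp_eth driver itself rather than by hand):

```c
#include "driver/spi_master.h"

// Two devices sharing one SPI host: this is the configuration under
// which the hangups were observed. All pin assignments are hypothetical.
void setup_shared_bus(void)
{
    spi_bus_config_t bus_cfg = {
        .mosi_io_num = 11, .miso_io_num = 13, .sclk_io_num = 12,
        .quadwp_io_num = -1, .quadhd_io_num = -1,
    };
    ESP_ERROR_CHECK(spi_bus_initialize(SPI2_HOST, &bus_cfg, SPI_DMA_CH_AUTO));

    spi_device_interface_config_t w5500_cfg = {
        .clock_speed_hz = 20 * 1000 * 1000,  // 20 MHz, as in the tests above
        .mode = 0, .spics_io_num = 10, .queue_size = 20,
    };
    spi_device_interface_config_t adc_cfg = {
        .clock_speed_hz = 1 * 1000 * 1000,   // placeholder for the MCP3462
        .mode = 0, .spics_io_num = 9, .queue_size = 4,
    };

    spi_device_handle_t w5500_dev, adc_dev;
    ESP_ERROR_CHECK(spi_bus_add_device(SPI2_HOST, &w5500_cfg, &w5500_dev));
    ESP_ERROR_CHECK(spi_bus_add_device(SPI2_HOST, &adc_cfg, &adc_dev));
}
```

With both devices registered, hammering the second device with transactions exercises the bus arbitration path that the linked demo project uses to trigger the failure.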