esp-idf: W5500 fails after 20 minutes of operation (IDFGH-10018)
Answers checklist.
- I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
- I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
- I have searched the issue tracker for a similar issue and not found a similar issue.
IDF version.
v5.2-dev-321-ga8b6a70620
Operating System used.
Linux
How did you build your project?
VS Code IDE
If you are using Windows, please specify command line type.
None
Development Kit.
ESP32-C6-DevKit M1
Power Supply used.
USB
What is the expected behavior?
W5500 should be stable over a long time.
What is the actual behavior?
After 20 minutes, it stops operating. See debug log.
Steps to reproduce.
Use an ethernet example.
Debug Logs.
E (5516841) w5500.mac: emac_w5500_read_phy_reg(335): read PHY register failed
E (5516841) w5500.phy: w5500_update_link_duplex_speed(69): read PHYCFG failed
E (5516841) w5500.phy: w5500_get_link(112): update link duplex speed failed
E (5516901) w5500.mac: w5500_get_rx_received_size(152): read RX RSR failed
E (5518051) w5500.mac: w5500_get_rx_received_size(152): read RX RSR failed
More Information.
I'm using the C6-Devkit w/ a W5500 connected via SPI.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 27 (8 by maintainers)
Thanks a lot. Unfortunately, vTaskDelay’s minimum delay is one millisecond (w/ CONFIG_FREERTOS_HZ=1000), and this is quite long, considering that at a (moderate) CAN speed of 500 kbit/s I might receive at least 4 frames during that time. At 1 Mbit/s, it could be 8 frames, so we can’t really sleep. That said, I have recently published esp-microsleep, which works around that issue.
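The frame counts above can be checked with a little arithmetic. The following is a rough sketch, not taken from the original comment: the function name and the two frame-length constants are assumptions. They model a classic CAN data frame with an 11-bit ID and 8 data bytes, counted with the 3-bit interframe space, once without stuff bits and once with worst-case bit stuffing.

```c
#include <stdint.h>

/* Assumed length of a classic CAN data frame (11-bit ID, 8 data bytes),
 * including the 3-bit interframe space, without any stuff bits. */
#define CAN_FRAME_BITS_NO_STUFF  111u
/* Assumed length of the same frame with worst-case bit stuffing. */
#define CAN_FRAME_BITS_MAX_STUFF 135u

/* Hypothetical helper: number of whole frames that can arrive back-to-back
 * on the bus during a single RTOS tick. */
uint32_t frames_per_tick(uint32_t bitrate_bps, uint32_t tick_hz,
                         uint32_t bits_per_frame)
{
    uint32_t bits_per_tick = bitrate_bps / tick_hz; /* bits on the wire per tick */
    return bits_per_tick / bits_per_frame;          /* integer division rounds down */
}
```

With a 1 ms tick, these assumptions give 3 to 4 full-length frames per tick at 500 kbit/s and 7 to 9 at 1 Mbit/s, so the figures quoted in the comment fall inside that range. Frames with fewer data bytes are shorter, which pushes the count higher.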
On another note… applications like mine might be an interesting playground for FreeRTOS AMP – where time-critical stuff (like CAN-FD in our case) happens on a non-FreeRTOS core.
As long as there are no sudden spikes of data that can swamp the thread, that might be fine. Personally I’m a belts and suspenders kind of guy, so I would likely have added a very short vTaskDelay just in case. 😁 Good luck on your project.
@JimmyPedersen Yes, usually this is correct, but in this case it shouldn’t be necessary, because the receiver task is already waiting in a FreeRTOS-aware function for the next message from the queue. This only becomes a problem if the IRQs are coming in so fast that the system is starving in general – which isn’t a real problem in the field, since we rarely reach 70% bus load.
@kostaond
In an attempt to finally dive deeper into this issue, I took the weekend and carried out a lot of stress-tests with the following setup:
Our product is a custom ESP32S3-based board with a W5500 (SPI3_HOST) and an MCP2518fd CAN controller (SPI2_HOST). The design follows the product recommendations from Wiznet and Microchip very closely and it has been done by an experienced EE, so I’m pretty sure there are no hardware issues. The test software configures the SPI bus speeds at 20MHz for the W5500 and at 16MHz for the MCP2518fd (slightly lower than the supported maximum of 17.5MHz). The software is compiled with `-Os` and a log level of `I`.

The application on the device under test is based on `examples/network/bridge`, where WiFi and the W5500 Ethernet are combined into one virtual interface running a DHCP server. The MCP2518fd driver simply echoes all the frames it receives back on the CAN bus.

To test the performance, there are three auxiliary systems attached:

- A Raspberry Pi 4 with a CAN (gs_usb) attached to the MCP2518fd. The bus is properly terminated. This machine sends random CAN bus frames very quickly using `cangen can0 -g 0.45`. This leads to a 70% bus load (measured with `canbusload -cbr can0@500000`), which is almost saturating the ISR.
- A Linux PC (Thinkpad X1 Carbon 2021), which connects to the ESP32S3 via WiFi. It runs the client part of `iperf` endlessly.

I let this run for several hours, at first with a CPU speed of 160MHz and DIO. Without CAN traffic I’m getting the following speeds:
With CAN traffic I’m getting the following speeds:
The slight drop in performance is probably due to the higher system load, so I’m fine with it. For a while this went OK, including the occasional watchdog complaint…
…and the occasional dropped CAN frame:
But then… after some more time, I got the following error and the system restarted:
I guess this happened because the `w5500_tsk` wanted to emit a warning or error message. I enlarged its stack size to 8192 and reran the test with 240MHz and QIO. Bandwidth slightly increased:

I let this run for a while and launched a bunch of ping processes in addition, changed the iperf parallelization and even flood-pinged the device, but couldn’t make it crash. Even if I completely overloaded it…
…it always recovered.
So besides that one crash beforehand, everything went really great. Since I still can produce the W5500 hangups with my full application though, it must be something else.
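For anyone wanting to try the same stack-size change, one way to do it is through the MAC configuration passed to the W5500 driver. This is only a sketch: the helper name `make_mac_config` is hypothetical, and the comment above does not say how the stack size was actually changed. It relies on the `rx_task_stack_size` field of `eth_mac_config_t` in the ESP-IDF Ethernet driver.

```c
#include "esp_eth.h"

/* Hypothetical helper: start from the default MAC configuration and
 * enlarge the stack of the driver's receive task. */
static eth_mac_config_t make_mac_config(void)
{
    eth_mac_config_t mac_config = ETH_MAC_DEFAULT_CONFIG();
    mac_config.rx_task_stack_size = 8192; /* value reported in the comment above */
    return mac_config;
}
```

The returned structure would then be handed to the MAC constructor in place of the default configuration.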
I’m afraid these tests didn’t help much for the bug report in question, although they increased my confidence that in general the hardware combination and the included drivers are really solid.
I guess it makes most sense to call it a day with regards to this bug report for now. When I opened it one year ago, it was referring to my work with jumper wires and a C6-devboard. Since I have different hardware now, I think it’s best to close this here and continue to inspect what my application is doing that might destabilize the W5500 and/or LWIP stack.
I will open or contribute to another issue report with further findings. Thanks for your attention so far!
If that can help, we have been using our products in production for 6 months now and I can confirm that `spi_device_polling_transmit` definitively solved the problem for us. Thanks to kostaond for his help.
@kostaond
I was using `spi_device_transmit`.
I replaced all my functions with `spi_device_polling_transmit` and ran a test again. With this configuration, I did not encounter any problem in 16 hours.
I don’t need to specifically use `spi_device_transmit`. So that solves the problem in my case.
@kostaond
I have pushed a project with minimal code to reproduce this error: https://github.com/Stay-Info/EspW5500FailDemo
While testing this project, I realized that the problem did not appear if no other SPI device was used on the same bus.
For the example I used an MCP3462 ADC, but I think spamming requests to any other SPI device would produce the same result.
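For readers applying the workaround reported in the comments above, here is a minimal sketch of what the swap might look like for a second device that shares the SPI bus with the W5500. The helper name `shared_bus_read` and its parameters are hypothetical. Only `spi_device_transmit`, `spi_device_polling_transmit`, and `spi_transaction_t` come from the ESP-IDF SPI master driver. The sketch assumes a full-duplex device that has already been added to the bus with `spi_bus_add_device`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include "driver/spi_master.h"
#include "esp_err.h"

/* Hypothetical helper: read `len` bytes from a device that shares the
 * SPI bus with the W5500. */
esp_err_t shared_bus_read(spi_device_handle_t dev, uint8_t *rx, size_t len)
{
    spi_transaction_t t;
    memset(&t, 0, sizeof(t));
    t.length = len * 8;   /* transaction length is given in bits */
    t.rx_buffer = rx;     /* received bytes are written here */

    /* Before the change: return spi_device_transmit(dev, &t); */
    return spi_device_polling_transmit(dev, &t);
}
```

A polling transaction keeps the calling task busy-waiting until the transfer has finished, so it trades some CPU time for skipping the interrupt-driven transaction path.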