zephyr: eth: stm32h747i_disco: sem timeout and hang on debug build
Describe the bug
When building the dumb http server with CONFIG_DEBUG enabled, for the stm32h747i_disco_m7 board, we very fast get a semaphore timeout on waiting for transmission complete callback from ST HAL layer. We can see
[00:00:02.719,000] <err> eth_stm32_hal: HAL_ETH_TransmitIT tx_int_sem take timeout
[00:00:02.719,000] <err> eth_stm32_hal: eth packet timeout
98 fa 9b 39 67 d7 02 80 e1 67 51 79 08 00 45 00 |...9g... .gQy..E.
00 28 00 00 00 00 40 06 64 a8 a9 fe 61 15 a9 fe |.(....@. d...a...
61 16 1f 90 ed 02 04 81 fc b9 99 ef e4 5e 50 10 |a....... .....^P.
05 00 08 90 00 00 |......
Often it hangs and doesn’t recover. Build without CONFIG_DEBUG and it works flawless. Increasing semaphore timeout time doesn’t do any difference.
What have you tried to diagnose or workaround this issue?
With instruction cache disabled, it works flawlessly with CONFIG_DEBUG enabled. I managed to track it down to
modules/hal/stm32/stm32cube/stm32h7xx/drivers/src/stm32h7xx_hal_eth.c: 2979. If we enable instruction cache before this line, timeout. If we enable it after, it works. If we insert a data barrier after, it works:
/* Mark it as LAST descriptor */
SET_BIT(dmatxdesc->DESC3, ETH_DMATXNDESCRF_LD);
__DSB();
I guess this is a fix, but really not that nice to be messing around in ST’s HAL, and I’m also wondering if this fix actually fixes a problem we are causing in the driver. We might be doing something wrong in our stm32h7 driver? Comparing with ST samples for STM32H743 it looks correct. I didn’t find anybody else having this issue with ST’s HAL on stm32h747i MCU. I’ve seen some people from ST contributing here, maybe somebody can take a look at this? Messing with buffer alignments doesn’t have any effect either; I tried alignment 256 bytes, with confirmation it was aligned. Please note that this issue was seen when making this driver as well (https://github.com/zephyrproject-rtos/zephyr/pull/27188#discussion_r475380472)
To Reproduce Steps to reproduce the behavior:
- add to
/boards/arm/stm32h747i_disco/stm32h747i_disco.dtsi
&mac {
status="okay";
pinctrl-0 = <ð_tx_en_pg11 ð_txd1_pg12
ð_txd0_pg13 ð_mdc_pc1
ð_mdio_pa2 ð_ref_clk_pa1
ð_crs_dv_pa7 ð_rxd0_pc4
ð_rxd1_pc5>;
};
- add to
samples/net/sockets/dumb_http_server/prj.conf
CONFIG_NET_L2_ETHERNET=y
CONFIG_ETH_STM32_HAL=y
CONFIG_DEBUG=y
west build -b stm32h747i_disco_m7 zephyr/samples/net/sockets/dumb_http_server/- west flash
- Open COM port and do repeated http requests to devkit over ethernet, for example
ab -n 100 -c 1 http://192.0.2.1:8080/ - The error will be printed in the terminal instantly, and it will hang subsequent transmissions
Expected behavior Ethernet tx complete semaphore should not timeout. Temporary fix proves this is possible.
Impact Showstopper if it hangs, which it appears to do. We realize it only happens when building for debug, but it’s not sustainable to not be able to debug properly. Something’s wrong
Logs and console output Please note I am using different IP’s than example.
*** Booting Zephyr OS build zephyr-v2.4.0-1314-gc3ac3027a17a ***
[00:00:00.006,000] <inf> net_config: Initializing network
[00:00:00.006,000] <inf> net_config: Waiting interface 1 (0x24004ae4) to be up...
Single-threaded dumb HTTP server waits for a connection on port 8080...
[00:00:02.000,000] <inf> net_config: Interface 1 (0x24004ae4) coming up
[00:00:02.000,000] <inf> net_config: IPv4 address: 169.254.97.21
[00:00:02.991,000] <err> eth_stm32_hal: HAL_ETH_TransmitIT tx_int_sem take timeout
[00:00:02.991,000] <err> eth_stm32_hal: eth packet timeout
98 fa 9b 39 67 d7 02 80 e1 97 e2 22 08 06 00 01 |...9g... ..."....
08 00 06 04 00 02 02 80 e1 97 e2 22 a9 fe 61 15 |........ ..."..a.
98 fa 9b 39 67 d7 a9 fe 61 16 |...9g... a.
Connection #0 from 169.254.97.22
Connection from 169.254.97.22 closed
Environment (please complete the following information):
- OS: Ubuntu 20.04
- Toolchain Zephyr SDK
- Commit SHA or Version used: c3ac3027a17a89fd30863db818ab320e5bf8b14e (v2.4.99)
Additional context Ethernet cable connected straight to computer
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 33 (21 by maintainers)
Commits related to this issue
- ethernet: stm32h7: fix tx semaphore timeout bug Incomplete memory write causes issues with ethernet transmission on stm32h747_disco_m7 board. TxCpltCallback is never called, causing waiting for tx_in... — committed to emillindq/zephyr by emillindq 4 years ago
- drivers: ethernet: stm32: SRAM3 / MPU configuration Fixes #29915. Implements the memory layout and MPU configuration for Ethernet buffers for STM32H7 controllers as recommended by ST. 16 KB of SRAM3... — committed to mjaun/zephyr by mjaun 3 years ago
- drivers: ethernet: stm32: SRAM3 / MPU configuration Fixes #29915. Implements the memory layout and MPU configuration for Ethernet buffers for STM32H7 controllers as recommended by ST. 16 KB of SRAM3... — committed to zephyrproject-rtos/zephyr by mjaun 3 years ago
- drivers: ethernet: stm32: SRAM3 / MPU configuration Fixes #29915. Implements the memory layout and MPU configuration for Ethernet buffers for STM32H7 controllers as recommended by ST. 16 KB of SRAM3... — committed to fancom/zephyr by mjaun 3 years ago
- ethernet: stm32h7: Move DMA buffers from sram3 to sram2 suffer from #29915. All H7 variants have sram2, so lets use that instead. Signed-off-by: Björn Stenberg <bjorn@haxx.se> — committed to zagor/zephyr by zagor a year ago
- ethernet: stm32h7: Move DMA buffers from sram3 to sram2 PR #30403 implemented nocache regions for ethernet DMA buffers in sram3. Unfortunately, chip variants STM32H742xx do not have any sram, so they... — committed to zagor/zephyr by zagor a year ago
- ethernet: stm32h7: Move DMA buffers from sram3 to sram2 PR #30403 implemented nocache regions for ethernet DMA buffers in sram3 to fix issue #29915. Unfortunately, chip variants STM32H742xx do not ha... — committed to zagor/zephyr by zagor a year ago
- ethernet: stm32h7: Move DMA buffers from sram3 to sram2 PR #30403 implemented nocache regions for ethernet DMA buffers in sram3 to fix issue #29915. Unfortunately, chip variants STM32H742xx do not ha... — committed to zagor/zephyr by zagor a year ago
- ethernet: stm32h7: Move DMA buffers from sram3 to sram2 PR #30403 implemented nocache regions for ethernet DMA buffers in sram3 to fix issue #29915. Unfortunately, chip variants STM32H742xx do not ha... — committed to zagor/zephyr by zagor a year ago
- ethernet: stm32h7: Move DMA buffers from sram3 to sram2 PR #30403 implemented nocache regions for ethernet DMA buffers in sram3 to fix issue #29915. Unfortunately, some STM32H7 variants do not have a... — committed to zagor/zephyr by zagor a year ago
- ethernet: stm32h7: Move DMA buffers from sram3 to sram2 PR #30403 implemented nocache regions for ethernet DMA buffers in sram3 to fix issue #29915. Unfortunately, some STM32H7 variants do not have a... — committed to zagor/zephyr by zagor a year ago
- ethernet: stm32h7: Move DMA buffers from sram3 to sram2 PR #30403 implemented nocache regions for ethernet DMA buffers in sram3 to fix issue #29915. Unfortunately, some STM32H7 variants do not have a... — committed to zephyrproject-rtos/zephyr by zagor a year ago
The -62 error is ENETDOWN, and it is returned if the network interface is still down (from network stack point of view) when feeding a packet to net stack. So in practice the device driver is receiving packets even if upper layers think that the interface is still down. This is probably related to timings when the device starts up.
https://github.com/STMicroelectronics/STM32CubeH7/issues/95 I just did!
Thank you @emillindq and @Nukersson for figuring all of this out ! I don’t have much experience with cache on H7 series, but it looks like it might be an issue in the future if we have to sync CACHE data each time we are dealing with RAM descriptors for peripherals and not registers.
Seems like you found out that if the cache data is not in sync with RAM when calling DMA (which is accessed using registers this time, so no caching is done right ?), this can lead to mismatch in configuration data and in the end, Ethernet IP is not set to fire the TX complete interrupt.
I never thought that this would be a real problem until now so thanks for the discovery 👍
Also great thanks for the fix ! Will try it out as soon as possible.
To answer your last thought, I tried to wrap
HAL_ETH_Transmit_IT()around bothk_sched_(un)lock()and__dis(en)able_irq(), still with timeout and hang.I ran the
abloop with test setup 3 for a longer time and it went on for almost 1 million connections without sem timeout or hang. I will make a PR with this change so we have something to look at.I just saw this happen to me again with my fix removed and with latest stm32 HAL version, so it seems like it’s a lot better but not completely fixed.
@gmarull Sure, I created #30403
@Nukersson Thanks. I can’t reproduce the semaphore timeout anymore. Occasionally I get this error on startup:
This was also the case before the fix so I suspect it is related to another issue. But the ethernet runs fine afterwards.
@Nukersson We are currently evaluating various TCP/IP solutions and I just noticed it because others like the example from ST which uses FreeRTOS+LwIP go for the approach with SRAM3 and the MPU. At least this worked for me.
On Zephyr I tried moving the descriptors/buffers to SRAM3 by using a custom linker script and a custom sections file as done in the code relocation sample (but not relocating everything, just to have the sections defined for the buffers). Unfortunately this didn’t work out of the box. I suspect that the required MPU configuration is conflicting with what Zephyr is configuring like @erwango pointed out. If I find time I’ll try to give it another go.
@emillindq, does the initial fix in ST HAL that you mentioned earlier actually fixes the issue in all cases ?
If there’s an actual bug in the HAL driver, it is not excluded to fix it there directly. You can report the issue there: https://github.com/STMicroelectronics/STM32CubeH7 (you can also directly propose a PR to fix it). The only issue with putting fixes in Zephyr replication of HAL is that they need to be maintained, which becomes a heavy process as far as the list grows. If the issue is acknowledged in the official repo, we can put a temporary fix in the Zephyr HAL replication waiting the official delivery.
Thank you for the encouragement 😃
At first sight it seems like
SCB_InvalidateICache()works but I am able to get it to timeout, albeit a lot harder. It appears to only happen just after reset; on average 1/5 times. I am runningabtest while I’m resetting and it happens just after network comes up. Increasing timeout to 100ms didn’t help. It’s not like that using__DSB()in the HAL. Trying with__DSB()afterHAL_ETH_Transmit_ITmakes it hang completely. The same is true for__ISB().I ran four tests (hint: case 3 & 4 interesting) timeout default (20ms), test duration 5min, reset every 7s, running
abtest continuously making GET requests**__DSB()in HAL. Result: About 700 requests made between reset, no hangs, no sem timeouts, 30 000 connections in totalSCB_InvalidateICache()in eth_stm32_hal.c afterHAL_ETH_Transmit_IT(). Result: About 700 requests between reset, 45 resets and ten sem timeouts, no hangs, about 30 000 connections in total. The timeout exclusively happens just after boot.__DSB()before and afterHAL_ETH_Transmit_IT(). Result: Same result as 1.__ISB()before and__DSB()afterHAL_ETH_Transmit_IT(). Result: Same result as 1.With nothing after
HAL_ETH_Transmit_IT(), always failure in all cases.So now we have a solution (3 or 4) that doesn’t involve messing around in the HAL; however it just feels a bit ghetto. I don’t know why it suddenly works with a sync barrier before and after function call. Is this good enough?