zephyr: eth: stm32h747i_disco: sem timeout and hang on debug build

Describe the bug When building the dumb http server with CONFIG_DEBUG enabled, for the stm32h747i_disco_m7 board, we very fast get a semaphore timeout on waiting for transmission complete callback from ST HAL layer. We can see

[00:00:02.719,000] <err> eth_stm32_hal: HAL_ETH_TransmitIT tx_int_sem take timeout
[00:00:02.719,000] <err> eth_stm32_hal: eth packet timeout
98 fa 9b 39 67 d7 02 80  e1 67 51 79 08 00 45 00 |...9g... .gQy..E.
00 28 00 00 00 00 40 06  64 a8 a9 fe 61 15 a9 fe |.(....@. d...a...
61 16 1f 90 ed 02 04 81  fc b9 99 ef e4 5e 50 10 |a....... .....^P.
05 00 08 90 00 00                                |......           

Often it hangs and doesn’t recover. Build without CONFIG_DEBUG and it works flawless. Increasing semaphore timeout time doesn’t do any difference. What have you tried to diagnose or workaround this issue? With instruction cache disabled, it works flawlessly with CONFIG_DEBUG enabled. I managed to track it down to modules/hal/stm32/stm32cube/stm32h7xx/drivers/src/stm32h7xx_hal_eth.c: 2979. If we enable instruction cache before this line, timeout. If we enable it after, it works. If we insert a data barrier after, it works:

/* Mark it as LAST descriptor */
SET_BIT(dmatxdesc->DESC3, ETH_DMATXNDESCRF_LD);
__DSB();

I guess this is a fix, but really not that nice to be messing around in ST’s HAL, and I’m also wondering if this fix actually fixes a problem we are causing in the driver. We might be doing something wrong in our stm32h7 driver? Comparing with ST samples for STM32H743 it looks correct. I didn’t find anybody else having this issue with ST’s HAL on stm32h747i MCU. I’ve seen some people from ST contributing here, maybe somebody can take a look at this? Messing with buffer alignments doesn’t have any effect either; I tried alignment 256 bytes, with confirmation it was aligned. Please note that this issue was seen when making this driver as well (https://github.com/zephyrproject-rtos/zephyr/pull/27188#discussion_r475380472)

To Reproduce Steps to reproduce the behavior:

  1. add to /boards/arm/stm32h747i_disco/stm32h747i_disco.dtsi
&mac {
	status="okay";
	pinctrl-0 = <&eth_tx_en_pg11 &eth_txd1_pg12
				 &eth_txd0_pg13 &eth_mdc_pc1
				 &eth_mdio_pa2 &eth_ref_clk_pa1
				 &eth_crs_dv_pa7 &eth_rxd0_pc4
				 &eth_rxd1_pc5>;
};
  1. add to samples/net/sockets/dumb_http_server/prj.conf
CONFIG_NET_L2_ETHERNET=y
CONFIG_ETH_STM32_HAL=y
CONFIG_DEBUG=y
  1. west build -b stm32h747i_disco_m7 zephyr/samples/net/sockets/dumb_http_server/
  2. west flash
  3. Open COM port and do repeated http requests to devkit over ethernet, for example ab -n 100 -c 1 http://192.0.2.1:8080/
  4. The error will be printed in the terminal instantly, and it will hang subsequent transmissions

Expected behavior Ethernet tx complete semaphore should not timeout. Temporary fix proves this is possible.

Impact Showstopper if it hangs, which it appears to do. We realize it only happens when building for debug, but it’s not sustainable to not be able to debug properly. Something’s wrong

Logs and console output Please note I am using different IP’s than example.

*** Booting Zephyr OS build zephyr-v2.4.0-1314-gc3ac3027a17a  ***
[00:00:00.006,000] <inf> net_config: Initializing network
[00:00:00.006,000] <inf> net_config: Waiting interface 1 (0x24004ae4) to be up...
Single-threaded dumb HTTP server waits for a connection on port 8080...
[00:00:02.000,000] <inf> net_config: Interface 1 (0x24004ae4) coming up
[00:00:02.000,000] <inf> net_config: IPv4 address: 169.254.97.21
[00:00:02.991,000] <err> eth_stm32_hal: HAL_ETH_TransmitIT tx_int_sem take timeout
[00:00:02.991,000] <err> eth_stm32_hal: eth packet timeout
98 fa 9b 39 67 d7 02 80  e1 97 e2 22 08 06 00 01 |...9g... ..."....
08 00 06 04 00 02 02 80  e1 97 e2 22 a9 fe 61 15 |........ ..."..a.
98 fa 9b 39 67 d7 a9 fe  61 16                   |...9g... a.      
Connection #0 from 169.254.97.22
Connection from 169.254.97.22 closed

Environment (please complete the following information):

  • OS: Ubuntu 20.04
  • Toolchain Zephyr SDK
  • Commit SHA or Version used: c3ac3027a17a89fd30863db818ab320e5bf8b14e (v2.4.99)

Additional context Ethernet cable connected straight to computer

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 33 (21 by maintainers)

Commits related to this issue

Most upvoted comments

The error is okay for the first time. I think it is much more question to @jukkar . But we would leave it as it is for now.

The -62 error is ENETDOWN, and it is returned if the network interface is still down (from network stack point of view) when feeding a packet to net stack. So in practice the device driver is receiving packets even if upper layers think that the interface is still down. This is probably related to timings when the device starts up.

@emillindq Have you already created PR in stm32_hal repo? I can confirm, that __DSB() approach (no matter if in HAL or driver without CONFIG_DEBUG) increases stability with my board and samples. I have observed it with my latest CivetWeb update (currently as PR).

https://github.com/STMicroelectronics/STM32CubeH7/issues/95 I just did!

Thank you @emillindq and @Nukersson for figuring all of this out ! I don’t have much experience with cache on H7 series, but it looks like it might be an issue in the future if we have to sync CACHE data each time we are dealing with RAM descriptors for peripherals and not registers.

Seems like you found out that if the cache data is not in sync with RAM when calling DMA (which is accessed using registers this time, so no caching is done right ?), this can lead to mismatch in configuration data and in the end, Ethernet IP is not set to fire the TX complete interrupt.

I never thought that this would be a real problem until now so thanks for the discovery 👍

Also great thanks for the fix ! Will try it out as soon as possible.

To answer your last thought, I tried to wrap HAL_ETH_Transmit_IT() around both k_sched_(un)lock() and __dis(en)able_irq() , still with timeout and hang.

I ran the ab loop with test setup 3 for a longer time and it went on for almost 1 million connections without sem timeout or hang. I will make a PR with this change so we have something to look at.

I just saw this happen to me again with my fix removed and with latest stm32 HAL version, so it seems like it’s a lot better but not completely fixed.

@gmarull Sure, I created #30403

@Nukersson Thanks. I can’t reproduce the semaphore timeout anymore. Occasionally I get this error on startup:

<err> eth_stm32_hal: Failed to enqueue frame into RX queue: -62

This was also the case before the fix so I suspect it is related to another issue. But the ethernet runs fine afterwards.

@reloZid Does your described solution helped you?

@Nukersson We are currently evaluating various TCP/IP solutions and I just noticed it because others like the example from ST which uses FreeRTOS+LwIP go for the approach with SRAM3 and the MPU. At least this worked for me.

On Zephyr I tried moving the descriptors/buffers to SRAM3 by using a custom linker script and a custom sections file as done in the code relocation sample (but not relocating everything, just to have the sections defined for the buffers). Unfortunately this didn’t work out of the box. I suspect that the required MPU configuration is conflicting with what Zephyr is configuring like @erwango pointed out. If I find time I’ll try to give it another go.

@emillindq, does the initial fix in ST HAL that you mentioned earlier actually fixes the issue in all cases ?

If there’s an actual bug in the HAL driver, it is not excluded to fix it there directly. You can report the issue there: https://github.com/STMicroelectronics/STM32CubeH7 (you can also directly propose a PR to fix it). The only issue with putting fixes in Zephyr replication of HAL is that they need to be maintained, which becomes a heavy process as far as the list grows. If the issue is acknowledged in the official repo, we can put a temporary fix in the Zephyr HAL replication waiting the official delivery.

Thank you for the encouragement 😃

Edit: Also putting SCB_InvalidateICache() instead of __DSB() worked for me. I would prefer this solution even more.

At first sight it seems like SCB_InvalidateICache() works but I am able to get it to timeout, albeit a lot harder. It appears to only happen just after reset; on average 1/5 times. I am running ab test while I’m resetting and it happens just after network comes up. Increasing timeout to 100ms didn’t help. It’s not like that using __DSB() in the HAL. Trying with __DSB() after HAL_ETH_Transmit_IT makes it hang completely. The same is true for __ISB().

I ran four tests (hint: case 3 & 4 interesting) timeout default (20ms), test duration 5min, reset every 7s, running ab test continuously making GET requests**

  1. __DSB() in HAL. Result: About 700 requests made between reset, no hangs, no sem timeouts, 30 000 connections in total
  2. SCB_InvalidateICache() in eth_stm32_hal.c after HAL_ETH_Transmit_IT(). Result: About 700 requests between reset, 45 resets and ten sem timeouts, no hangs, about 30 000 connections in total. The timeout exclusively happens just after boot.
  3. __DSB() before and after HAL_ETH_Transmit_IT(). Result: Same result as 1.
  4. __ISB() before and __DSB() after HAL_ETH_Transmit_IT(). Result: Same result as 1.

With nothing after HAL_ETH_Transmit_IT(), always failure in all cases.

So now we have a solution (3 or 4) that doesn’t involve messing around in the HAL; however it just feels a bit ghetto. I don’t know why it suddenly works with a sync barrier before and after function call. Is this good enough?