zephyr: subsys/mgmt/hawkbit: Unable to finish download if CPU blocking function (i.e. `flash_img_buffered_write`) is used

Update 2

The upgrade now works consistently after I enabled the SPI async API, which offloads writing the firmware from the CPU to the DMA and frees the CPU to handle the TCP packets.
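
For reference, a minimal configuration sketch of what enabling this might look like in prj.conf. CONFIG_SPI_ASYNC is the generic Zephyr option for the asynchronous SPI API. CONFIG_SPI_STM32_DMA is an assumption about how DMA was enabled on this STM32 board (it also needs matching DMA channel entries in the devicetree), since the exact options used are not stated here.

```conf
# Enable the asynchronous SPI API
CONFIG_SPI_ASYNC=y
# Assumption: DMA support in the STM32 SPI driver (needs DMA channels in devicetree)
CONFIG_SPI_STM32_DMA=y
```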

Update 1

CPU-blocking operations such as flash_img_buffered_write() or logging in immediate mode can cause this issue. It happens with the big_http_download sample as well.

Describe the bug

I’m currently trying the hawkbit sample for OTA updates on my custom board. For some reason the download takes extremely long and the gaps between incoming data grow longer and longer. The first handful of chunks arrive normally; after that the next chunk arrives ~30 s later, then ~1 min later, then almost exactly 2 min later (as if there’s a pattern), and eventually the download fails after about 20 kB (it usually fails at exactly 20 kB) or after ~17 minutes.

After trying a few things, I found that the download is slow only if it writes the downloaded buffer into the flash (the flash_img_buffered_write() call). If I comment that line out, the download is actually pretty fast and typically manages to download everything (251 kB) in under 40 seconds. If I replace that line with a k_msleep(60), it downloads just as fast.

To determine how long flash_img_buffered_write() takes to finish, I measured the elapsed time using k_uptime_get() before and after the function call, and the log shows that it took 6-41 ms. Now I’m pretty much lost and not sure how to debug this and get it working.

From my current understanding, hawkbit issues an HTTP GET request to the download link using http_client_req; the connection/socket is then opened and the image chunks are pushed continuously from the server without requiring an ACK from the client (I could be wrong here). The hawkbit client simply writes the received data into the flash until HTTP_DATA_FINAL is received. I don’t know how a function that returns practically immediately can affect the download process.

The setup that I use:

STM32G0B1RE custom board
Quectel EC21 LTE modem (gsm_ppp driver)
image-0 partition is 416 kB in the internal flash
image-1 partition is 416 kB in the external spi-nor flash

To Reproduce

Steps to reproduce the behavior:

Build and flash hawkbit sample

Expected behavior

I expect the download to finish in under 40 seconds.

Impact

Unable to use hawkbit

Logs and console output

hawkbit_log.txt

Environment:

OS: Windows 10
Toolchain: GNUARMEMB
Zephyr: v2.6.0-rc2-107-gebe282b02d9f
- I applied this commit manually.

Extra context

This is the log file when I replace the flash_img_buffered_write() with k_msleep(60): hawkbit_test_log.txt

About this issue

State: closed
Created 3 years ago
Comments: 32 (6 by maintainers)

Most upvoted comments

I believe this issue is now solved by #43018 and/or #39275. I’m able to download a 300+ kB firmware image and write it into flash using the SPI interrupt API. Previously this was only possible using SPI DMA.

ycsin on Mar 4, 2022

@ycsin Having HW flow control is quite critical, as it will suspend sending/receiving data stream when the other end is busy (e.g. processing previous packet). Without that, part of the frame can be easily dropped and there is no easy way to recover from that.

mniestroj on Nov 23, 2021

I worked around this by enabling SPI async. I wonder whether #39275 would be able to fix this for SoCs that don’t support SPI async by limiting the download speed to what the CPU can gracefully process?

After testing that PR a bit it turned out that recv window handling will require a bit more effort to implement than proposed in the PR. So this is still an open topic.

rlubos on Oct 14, 2021

What is the trade off if I continue to increase this delay?

I think it really depends on the driver/modem: if packets keep coming and they are not consumed due to a lack of RX buffers, the modem will eventually drop them. It could be that the respective driver/modem has buffering capabilities of its own, but they’re not unlimited.

So if we’re not talking about sporadic CPU busyness causing delays, but general CPU overload, we should rather think about limiting the data flow at the TCP level. Typically, you can limit the amount of data sent by the server by decreasing the Receive Window size sent in the ACK message. Unfortunately, from what I’ve seen, the TCP2 implementation does not implement the feature and sends a fixed value of the IPv6 MTU all the time, effectively placing no limit on the server. I suggest opening an enhancement issue for the feature, as IMO it’s quite valuable for constrained devices, and it seems to be a regression compared to the previous TCP implementation (where Receive Window handling was implemented).
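
To illustrate the receive-window idea, here is a minimal standalone sketch in plain C, not the Zephyr TCP stack. The helper names are hypothetical, and the value 1280 (the IPv6 minimum MTU) is only an assumption for the "fixed value" mentioned above. A receiver that advertises its actual free buffer space tells the server to slow down as the buffers fill, while a fixed window never does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: window to advertise, based on free RX buffer space. */
uint16_t advertised_window(size_t rx_capacity, size_t rx_queued)
{
    size_t free_bytes = (rx_queued >= rx_capacity) ? 0 : rx_capacity - rx_queued;

    /* Without window scaling, the TCP window field is 16 bits wide. */
    return (free_bytes > UINT16_MAX) ? UINT16_MAX : (uint16_t)free_bytes;
}

/* Fixed window for comparison. Assumption: 1280 bytes (IPv6 minimum MTU).
 * It ignores how much data is still waiting to be consumed. */
uint16_t fixed_window(size_t rx_capacity, size_t rx_queued)
{
    (void)rx_capacity;
    (void)rx_queued;
    return 1280;
}
```

With 4096 bytes of buffer space and 4096 bytes still queued, advertised_window() returns 0, which asks the sender to pause, whereas fixed_window() still invites another 1280 bytes that the receiver has nowhere to put.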

rlubos on Oct 5, 2021

It may be worth considering moving the calls to flash_img_buffered_write() to another thread with lower priority than the image collection thread, or to a workqueue (this might be done in the flash stream module as well). As the image you are collecting is in external flash, this should do the job.
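
The decoupling suggested here can be sketched with a small chunk queue between the download path and the flash writer. This is a plain-C illustration under stated assumptions, not hawkbit code: the type and function names, CHUNK_SIZE and QUEUE_DEPTH are all hypothetical, and the queue has no locking. In a real Zephyr application a kernel object such as k_msgq or a work item would take its place.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE  64  /* assumed chunk size, for illustration only */
#define QUEUE_DEPTH 4   /* assumed number of chunks buffered in RAM */

struct chunk {
    uint8_t data[CHUNK_SIZE];
    size_t len;
};

struct chunk_queue {
    struct chunk slots[QUEUE_DEPTH];
    size_t head;   /* next slot to read */
    size_t tail;   /* next slot to write */
    size_t count;  /* chunks currently queued */
};

/* Download path: copy the chunk and return immediately.
 * Returns false when the queue is full (or the chunk is too large), so the
 * caller can stop reading from the socket until the writer catches up. */
bool chunk_queue_put(struct chunk_queue *q, const uint8_t *data, size_t len)
{
    if (q->count == QUEUE_DEPTH || len > CHUNK_SIZE) {
        return false;
    }
    memcpy(q->slots[q->tail].data, data, len);
    q->slots[q->tail].len = len;
    q->tail = (q->tail + 1) % QUEUE_DEPTH;
    q->count++;
    return true;
}

/* Lower-priority writer: take the oldest chunk, if any, then write it
 * to flash outside of the download path. */
bool chunk_queue_get(struct chunk_queue *q, struct chunk *out)
{
    if (q->count == 0) {
        return false;
    }
    *out = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_DEPTH;
    q->count--;
    return true;
}
```

The trade-off is RAM: the queue only absorbs bursts, so if the flash writer is slower than the link on average, the queue fills and the download path still has to stop reading.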

nvlsianpu on Sep 14, 2021

@ycsin From the networking point of view, it could be the case, that due to increased CPU consumption the application thread is not able to consume the data fast enough, so we fill up all of the RX buffers. If the low-level network driver does not specify a timeout for the allocation, or the timeout is too small, it’d drop the packet, which would result in retransmission at TCP level and decreased performance.
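
The effect described here can be reproduced with a small standalone simulation in plain C. It is only a sketch under assumed numbers, not a model of the PPP driver: simulate_rx_drops() is a hypothetical helper, packets arrive at a fixed interval, the application needs a fixed time per packet, and a packet is dropped when all RX buffers are still in use.

```c
/* Count packets dropped when one packet arrives every arrival_ms, the
 * application needs service_ms to consume each packet, and only
 * pool_size RX buffers exist. All parameters are illustrative. */
unsigned int simulate_rx_drops(unsigned int packets, unsigned int arrival_ms,
                               unsigned int service_ms, unsigned int pool_size)
{
    unsigned int in_use = 0;     /* buffers holding unconsumed packets */
    unsigned int next_done = 0;  /* time the packet being consumed finishes */
    unsigned int drops = 0;

    for (unsigned int i = 0; i < packets; i++) {
        unsigned int now = i * arrival_ms;

        /* Free every buffer the application has finished with by 'now'. */
        while (in_use > 0 && next_done <= now) {
            in_use--;
            next_done += service_ms;
        }

        if (in_use == pool_size) {
            drops++;  /* no free buffer: the packet is lost */
            continue;
        }
        if (in_use == 0) {
            next_done = now + service_ms;  /* consumer was idle */
        }
        in_use++;
    }
    return drops;
}
```

For example, simulate_rx_drops(10, 10, 40, 2) counts 6 drops out of 10 packets, while the same traffic with a consumer that needs only 5 ms per packet loses none. A larger pool only delays the first drop when the consumer is slower than the arrival rate on average.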

We had a similar issue with one of the Ethernet drivers not long ago (see https://github.com/zephyrproject-rtos/zephyr/issues/36891#issuecomment-881503651) - the application did not catch up with the incoming traffic, which resulted in a packet drop. Adding a timeout for the net_pkt_rx_alloc_with_buffer() function in the Ethernet driver solved the problem.

Now, I’m not very familiar with the PPP implementation, but there seems to be a similar case with the PPP driver. It is probably worth checking whether it’s related.

rlubos on Sep 13, 2021