harvester: [BUG] Uploading qcow2 image from local file fails with Context Canceled

Describe the bug

Trying to migrate VMs from VMware ESXi to Harvester by converting the vmdk files to qcow2 and uploading the disk images to Harvester.

To Reproduce

  • Shut down VM in VMware
  • scp vmdk-files to local Linux machine
  • qemu-img convert -f vmdk -O qcow2 image_name.vmdk image_name.qcow2 -p
  • Go to Images tab
  • Click Create button
  • Upload the image file from a local file (screenshots attached)
  • Wait for it to complete…
  • The upload stops with the following error (see screenshot)

Expected behavior

The qcow2 files should upload successfully

Environment

  • Harvester ISO version: v1.1.1
  • Underlying infrastructure: Baremetal with Dell PowerEdge R450

Support bundle: A support bundle has been created and can be provided if needed.

Additional context: The system is running in a semi-restricted environment and can only access internet resources through a proxy that I have no control over.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 59 (27 by maintainers)

Most upvoted comments

Status Update This issue has been resolved by PR #3910; it can be verified with a self-built ISO that includes PR #3910 or a master release since 2023/05/19.

Root Cause and Solution This issue is caused by the following:

  1. While uploading a big image file, the checksum calculation blocks on the Longhorn (LH) side. Once the blocking exceeds 60 seconds (the Nginx default), Nginx considers the connection dead, and when the Virtual Machine Image controller sees this error it stops updating the status, even though the LH backend is still working. This is resolved by LH PR #105, which changes the checksum calculation to non-blocking mode.

Verification Steps

  • Prerequisites: Create a 70GB qcow2 file
    • $ qemu-img create -o preallocation=full -f qcow2 70g.qcow2 70G
  • Failed Case
    • Build Harvester v1.1.2 or a master release up to 2023/05/18 (inclusive)
    • Upload the 70GB qcow2 image
    • The upload should fail with "Context Canceled" (screenshot: img_upload_fail)
  • Successful Case
    • Build Harvester from a self-built ISO that includes PR #3910, or use a master release since 2023/05/19
    • Upload the 70GB qcow2 image
    • The VM image upload should succeed (screenshot: img_upload_success)

It seems that when the WebUI shows 99% with "Context canceled", the backing image is still being transferred and eventually reaches 100%, but there is an http: proxy error in the longhorn-manager-6df7d log, which appears to be the root cause at the moment.

2023/02/21 22:12:09 http: proxy error: context canceled

192.168.122.1 - - [21/Feb/2023:22:06:44 +0000] "POST /v1/backingimages/default-image-pdp4k?action=upload&size=51547734016 HTTP/1.1" 502 0 "https://192.168.122.206/dashboard/harvester/c/local/harvesterhci.io.virtualmachineimage/create" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"


kk -n longhorn-system logs longhorn-manager-6df7d

time="2023-02-21T22:06:38Z" level=debug msg="Start monitoring" backingImageManager=backing-image-manager-d7ad-08b1 controller=longhorn-backing-image-manager diskUUID=08b16349-6d0e-46c9-b047-cbc5d3eab9a9 node=harv2 nodeID=harv2
time="2023-02-21T22:06:40Z" level=info msg="Start monitoring" backingImageDataSource=default-image-pdp4k controller=longhorn-backing-image-data-source diskUUID=08b16349-6d0e-46c9-b047-cbc5d3eab9a9 node=harv2 nodeID=harv2 parameters="map[]" sourceType=upload
time="2023-02-21T22:06:40Z" level=warning msg="Stop monitoring since backing image data source default-image-pdp4k current IP is empty" backingImageDataSource=default-image-pdp4k controller=longhorn-backing-image-data-source diskUUID=08b16349-6d0e-46c9-b047-cbc5d3eab9a9 node=harv2 nodeID=harv2 parameters="map[]" sourceType=upload
time="2023-02-21T22:06:44Z" level=info msg="Forwarding backing image upload request to URL http://10.52.0.33:8000/v1/file?action=upload&size=51547734016"
W0221 22:11:13.303607       1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
time="2023-02-21T22:11:21Z" level=error msg="failed to get default-image-pdp4k info from backing image data source server: get failed, err: Get \"http://10.52.0.33:8000/v1/file\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" backingImageDataSource=default-image-pdp4k controller=longhorn-backing-image-data-source diskUUID=08b16349-6d0e-46c9-b047-cbc5d3eab9a9 node=harv2 nodeID=harv2 parameters="map[]" sourceType=upload
time="2023-02-21T22:11:34Z" level=error msg="failed to get default-image-pdp4k info from backing image data source server: get failed, err: Get \"http://10.52.0.33:8000/v1/file\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" backingImageDataSource=default-image-pdp4k controller=longhorn-backing-image-data-source diskUUID=08b16349-6d0e-46c9-b047-cbc5d3eab9a9 node=harv2 nodeID=harv2 parameters="map[]" sourceType=upload
time="2023-02-21T22:11:47Z" level=error msg="failed to get default-image-pdp4k info from backing image data source server: get failed, err: Get \"http://10.52.0.33:8000/v1/file\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" backingImageDataSource=default-image-pdp4k controller=longhorn-backing-image-data-source diskUUID=08b16349-6d0e-46c9-b047-cbc5d3eab9a9 node=harv2 nodeID=harv2 parameters="map[]" sourceType=upload
W0221 22:11:48.304447       1 warnings.go:70] batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob
time="2023-02-21T22:12:00Z" level=error msg="failed to get default-image-pdp4k info from backing image data source server: get failed, err: Get \"http://10.52.0.33:8000/v1/file\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" backingImageDataSource=default-image-pdp4k controller=longhorn-backing-image-data-source diskUUID=08b16349-6d0e-46c9-b047-cbc5d3eab9a9 node=harv2 nodeID=harv2 parameters="map[]" sourceType=upload

2023/02/21 22:12:09 http: proxy error: context canceled

192.168.122.1 - - [21/Feb/2023:22:06:44 +0000] "POST /v1/backingimages/default-image-pdp4k?action=upload&size=51547734016 HTTP/1.1" 502 0 "https://192.168.122.206/dashboard/harvester/c/local/harvesterhci.io.virtualmachineimage/create" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"


time="2023-02-21T22:12:13Z" level=error msg="failed to get default-image-pdp4k info from backing image data source server: get failed, err: Get \"http://10.52.0.33:8000/v1/file\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" backingImageDataSource=default-image-pdp4k controller=longhorn-backing-image-data-source diskUUID=08b16349-6d0e-46c9-b047-cbc5d3eab9a9 node=harv2 nodeID=harv2 parameters="map[]" sourceType=upload
time="2023-02-21T22:12:23Z" level=debug msg="Start to fetch the data source file from the backing image data source work directory /tmp/" backingImage=default-image-pdp4k backingImageManager=backing-image-manager-d7ad-08b1 controller=longhorn-backing-image-manager diskUUID=08b16349-6d0e-46c9-b047-cbc5d3eab9a9 node=harv2 nodeID=harv2
time="2023-02-21T22:12:23Z" level=debug msg="Fetched the first file from BackingImageDataSource" backingImage=default-image-pdp4k backingImageManager=backing-image-manager-d7ad-08b1 controller=longhorn-backing-image-manager diskUUID=08b16349-6d0e-46c9-b047-cbc5d3eab9a9 node=harv2 nodeID=harv2
time="2023-02-21T22:12:23Z" level=info msg="Event(v1.ObjectReference{Kind:\"BackingImageManager\", Namespace:\"longhorn-system\", Name:\"backing-image-manager-d7ad-08b1\", UID:\"aa52a406-7829-4c6b-97a6-13a91bf73059\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"530907\", FieldPath:\"\"}): type: 'Normal' reason: 'Fetching' Fetched the first file for backing image default-image-pdp4k in disk 08b16349-6d0e-46c9-b047-cbc5d3eab9a9 on node harv2"
time="2023-02-21T22:12:26Z" level=info msg="Default backing image manager successfully took over the file prepared by the backing image data source, will mark the data source as file transferred" backingImageName=default-image-pdp4k controller=longhorn-backing-image node=harv2
time="2023-02-21T22:12:26Z" level=info msg="Stopping monitoring" backingImageDataSource=default-image-pdp4k controller=longhorn-backing-image-data-source node=harv2
time="2023-02-21T22:12:26Z" level=info msg="Stopped monitoring" backingImageDataSource=default-image-pdp4k controller=longhorn-backing-image-data-source node=harv2
time="2023-02-21T22:12:26Z" level=info msg="Start to clean up pod for backing image data source" backingImageDataSource=default-image-pdp4k controller=longhorn-backing-image-data-source diskUUID=08b16349-6d0e-46c9-b047-cbc5d3eab9a9 node=harv2 nodeID=harv2 parameters="map[]" sourceType=upload
time="2023-02-21T22:12:26Z" level=warning msg="No monitor goroutine for stopping" backingImageDataSource=default-image-pdp4k controller=longhorn-backing-image-data-source node=harv2
harv2:~ # 


A short update: the root cause comes from the combination of NGINX and the datasource manager.

The Harvester cluster (with embedded Rancher) uses NGINX to forward/proxy user requests to the backend, e.g., the UPLOAD action. NGINX triggers a timeout when the interval between two consecutive READ/WRITE packets exceeds 60 seconds:

			proxy_send_timeout                      60s;
			proxy_read_timeout                      60s;

The backing-image-manager registers the handler below: router.HandleFunc("/v1/files", service.UploadFromRequest).Methods("POST").Queries("action", "upload")

In the following code, for a BIG image, GetFileChecksum runs for more than 60 seconds:

LH (sf *SyncingFile) finishProcessingNoLock (see screenshot)

That is why the WebUI shows "context canceled" while the LH backend is still working and finally reaches 100%: it is blocked waiting for the checksum computation, but NGINX thinks the connection is dead. The code comment even says // This operation may be time-consuming.
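For a sense of scale: the 128-hex-character checksum that shows up later in this thread looks like a SHA-512 digest, so the wrap-up step has to stream the entire uploaded file through the hash. A minimal sketch of such a streaming checksum (illustrative only, not Longhorn's actual GetFileChecksum; 70g.qcow2 is the test file from the verification steps):

```go
package main

import (
	"crypto/sha512"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// fileChecksum streams the whole file through SHA-512. For a 70 GB image this
// single call can run for minutes depending on disk and CPU speed, far longer
// than the 60 s proxy timeout.
func fileChecksum(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha512.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	sum, err := fileChecksum("70g.qcow2")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(sum)
}
```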


Also, the issue is always hit by BIG image uploads, failing at 99%.

The log here (https://github.com/harvester/harvester/issues/3450#issuecomment-1438256484) also tells us that the datasource did not print all the logs in the function finishProcessingNoLock.

Solution: simply speaking, GetFileChecksum should run in a goroutine, to avoid blocking the current goroutine and causing the NGINX proxy to time out.
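A minimal sketch of that idea: hand the expensive checksum call to its own goroutine and record the result when it finishes, so the request-handling goroutine returns immediately. The SyncingFile here is a heavily simplified stand-in (reusing the fileChecksum helper sketched above), not the actual Longhorn type; the real change landed via the LH PR referenced above and Harvester PR #3910:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// SyncingFile is an illustrative stand-in for the real Longhorn type.
type SyncingFile struct {
	mu       sync.Mutex
	filePath string
	state    string // "in-progress", "ready", or "failed"
	checksum string
}

// finishProcessing returns immediately and computes the checksum in a
// separate goroutine, so the HTTP handler (and the proxied connection in
// front of it) is not held open for the duration of the hash.
func (sf *SyncingFile) finishProcessing() {
	go func() {
		sum, err := fileChecksum(sf.filePath) // streaming SHA-512 helper sketched above
		sf.mu.Lock()
		defer sf.mu.Unlock()
		if err != nil {
			sf.state = "failed"
			return
		}
		sf.checksum = sum
		sf.state = "ready"
	}()
}

func main() {
	sf := &SyncingFile{filePath: "70g.qcow2", state: "in-progress"}
	sf.finishProcessing() // returns right away; the proxy never sees a 60 s stall here

	// A status endpoint or controller would poll the state instead of blocking.
	for {
		sf.mu.Lock()
		state := sf.state
		sf.mu.Unlock()
		fmt.Println("state:", state)
		if state != "in-progress" {
			return
		}
		time.Sleep(5 * time.Second)
	}
}
```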

Workaround: increase the timeout in the NGINX parameters.

more details will be added later

Hi @w13915984028, thanks for all the logs.

From the Longhorn perspective, according to the log, the backing image was uploaded successfully and became ready. As you can see from the manager, it fetched the file from the temp path to the final path:

...Start to fetch the data source file from the backing image data source work directory /tmp/" backingImage=default-image-pdp4k...

and it also cleaned up the data source pod after the transfer finished:

...time="2023-02-21T22:12:26Z" level=info msg="Start to clean up pod for backing image data source" backingImageDataSource=default-image-pdp4k controller=longhorn-backing-image-data-source diskUUID=08b16349-6d0e-46c9-b047-cbc5d3eab9a9 node=harv2 ...

The BackingImage CR status also showed 100% ready.

Can you check from the Longhorn UI whether the image is ready?


The longhorn-manager log line

http: proxy error: context canceled

means the client connecting to longhorn-manager was gone.

Could it be that the virtual machine image controller in Harvester tried to get the info from longhorn-manager and somehow broke the connection?

So the state in Harvester is not in sync with Longhorn?
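For illustration, http: proxy error: context canceled is the default message Go's net/http/httputil.ReverseProxy logs when the inbound client disappears while the backend is still busy, which matches the explanation above. A minimal self-contained reproduction (the real chain through NGINX, Rancher, and longhorn-manager is longer; this only demonstrates the mechanism):

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	// Backend that is "still working": it needs longer than the client waits,
	// like the checksum computation in the data source pod.
	backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(3 * time.Second)
		fmt.Fprintln(w, "done")
	}))
	defer backend.Close()

	// Reverse proxy in front of it.
	target, _ := url.Parse(backend.URL)
	frontend := httptest.NewServer(httputil.NewSingleHostReverseProxy(target))
	defer frontend.Close()

	// A client that gives up after 1 s plays the role of the upstream hop
	// hitting its 60 s timeout. When it disconnects, the proxy logs
	// "http: proxy error: context canceled" even though the backend finishes.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, frontend.URL, nil)
	if _, err := http.DefaultClient.Do(req); err != nil {
		fmt.Println("client side:", err)
	}

	time.Sleep(500 * time.Millisecond) // give the proxy time to emit its log line
}
```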

Tested in v1.2.0-rc1, verified as fixed.

Test Environment

  • Harvester v1.2.0-rc1
    • profile: KVM, single node
    • ui-source: Auto
    • Longhorn v1.4.2
  • Browser: Firefox 102.11.0esr

Verification Steps

Go to the Images tab and run steps 1-4 twice:

  1. Upload a 70G qcow2 via Create button

  2. Wait and confirm status becomes Active ✔️

  3. Upload a 75G qcow2 via Create button

  4. Wait and confirm status becomes Active ✔️

    1st Round (screenshot: hv120rc1-ok-1of2)

    2nd Round (screenshot: hv120rc1-ok-2of2)

    Longhorn's Backing Image (screenshot: hv120rc1-ok-2of2-lh)

Note

  1. We can reproduce the upload failure due to the gateway timeout mentioned in this ticket on Harvester v1.1.2.

    (screenshot: hv112-fail-504)

  2. Another known upload-failure case is ERR_NETWORK_CHANGED on Chromium-based browsers, but it is beyond the Harvester scope, so it is only noted here. Ref. https://stackoverflow.com/questions/44854296/neterr-network-changed-on-chrome-on-https-socket-connection

    (screenshot: chromium_based-err)

After a discussion with @FrankYang0529 to further understand the Harvester system, there are more findings about why the Harvester UI still showed Failed with "context canceled" even though the BackingImage became ready in the end.

It is because there is a defer function for error handling in the Harvester API code: https://github.com/harvester/harvester/blob/ce5aff93c04740723244f83c527d1297a97877e3/pkg/api/image/formatter.go#L143-L149. That code updates the virtualMachineImage Imported condition status to False with the message "context canceled".

And in the virtualMachineImage controller, where it syncs status from the BackingImage (https://github.com/harvester/harvester/blob/ce5aff93c04740723244f83c527d1297a97877e3/pkg/controller/master/image/backing_image_controller.go#L40), if the Imported condition status is False it will not sync with the BackingImage and treats that as the final status.

So even though the BackingImage kept calculating the checksum and updated its status to ready after the timeout broke the connection, Harvester had no chance to update the status back to ready, because it had already considered the upload failed when the API request broke.
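A simplified sketch of that interaction, using the condition names from the description above (the types and helpers here are illustrative stand-ins, not the actual code in formatter.go or backing_image_controller.go):

```go
package main

import "fmt"

// vmImage is an illustrative stand-in for the VirtualMachineImage "Imported" condition.
type vmImage struct {
	importedStatus string // "Unknown", "True", or "False"
	message        string
}

// uploadHandler mimics the API-side pattern: a deferred error handler marks
// the Imported condition False when the upload request dies, e.g. with
// "context canceled" after the proxy gives up.
func uploadHandler(img *vmImage, doUpload func() error) {
	var err error
	defer func() {
		if err != nil {
			img.importedStatus = "False"
			img.message = err.Error()
		}
	}()
	err = doUpload()
}

// syncFromBackingImage mimics the controller-side pattern: once Imported is
// False, the status is treated as final and the (now ready) BackingImage is
// no longer synced back onto the image.
func syncFromBackingImage(img *vmImage, backingImageReady bool) {
	if img.importedStatus == "False" {
		return // final state: the UI never recovers, which is the visible symptom
	}
	if backingImageReady {
		img.importedStatus = "True"
		img.message = ""
	}
}

func main() {
	img := &vmImage{importedStatus: "Unknown"}

	// The proxy times out while the checksum is still being computed.
	uploadHandler(img, func() error { return fmt.Errorf("context canceled") })

	// Later the BackingImage actually becomes ready, but the image stays Failed.
	syncFromBackingImage(img, true)
	fmt.Println(img.importedStatus, "-", img.message) // prints: False - context canceled
}
```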

cc @w13915984028 cc @innobead @shuo-wu

@mwilberg please make sure you change and reload the NGINX master process in all rke2-ingress-nginx-* pods with workaround step (4.2), notify the master to reload the workers: kill -HUP 32, which is executed inside the pod.

The following logs show that checksum computation started at 2023-02-26T18:54:12Z and finished at 2023-02-26T18:57:45Z, i.e., about 213 s in your environment, and the http: proxy error: context canceled still happened at 18:56:21, which is 129 s after the checksum computation started.

time="2023-02-26T18:54:12Z" level=debug msg="SyncingFile: failed to get the checksum from a valid config during processing wrap-up, will directly calculated it then"
2023/02/26 18:56:21 http: proxy error: context canceled
time="2023-02-26T18:57:45Z" level=debug msg="SyncingFile: writing config file when the file becomes ready" component=sync-file currentChecksum=29ad8d683f5467571598ed7699ce10608e2d056418b1697dbe4c48a1d8fda922c527fd66e42360d03ba613451503ebf0ec0b7b6d6b66faeaca5883652dcf2b3f filePath=/data/tmp/tanngarden-image-fttzh-365b0e4a size=75865718784 uuid=365b0e4a

Since several reverse proxies are used internally, I am not sure which one in the communication path timed out first after you set the proxy read/write timeout to 360 s in NGINX; this needs further investigation.

In my environment, a 120 s timeout works for the 45 GB image file checksum computation, which takes about 70 s.

@mwilberg Please do not try it at the moment; there is a long path between:

Web Browser ----------------Harvester/Rancher/NGINX ---- LH ---- datasource

I will check how and where to adjust the NGINX parameters.

Once it is tested, I will let you know how to apply the workaround.

@ChanYiLin I would suggest setting up a reproducible environment first; it will help with both debugging and coding the fix. Thanks.

Sorry, I didn't see this one before my previous post 😃

Hi @mwilberg, I am also investigating the issue. According to your backing-image-ds-* pod log, the last few lines are:

time="2023-02-15T17:16:52Z" level=info msg="SyncingFile: start to copy data to sync file" component=sync-file filePath=/data/tmp/tanngarden-image-8n59l-8f1402ce uuid=8f1402ce
time="2023-02-15T17:27:54Z" level=debug msg="SyncingFile: failed to get the checksum from a valid config during processing wrap-up, will directly calculated it then"
2023/02/15 17:28:54 http: proxy error: context canceled

If there is any error during the upload, the handler will mark the BackingImage as failed and clean up the temp files.

However, failed to get the checksum in this line means the file has already been uploaded to the tmp folder; the tasks left for backing-image-ds from this point are:

  1. calculate the image checksum
  2. write .cfg file to temp file
  3. succeeded processing file
  4. wait for cleanup request to clean up the tmp files
  5. exit

So the problem did not happen during the upload.

After backing-image-ds succeeds in processing the file, the backing-image-manager will:

  1. transfer the file from tmp to final path
  2. cleanup the tmp files

Lastly, 2023/02/15 17:28:54 http: proxy error: context canceled means the client went away during the upload request; maybe the browser broke the connection.

@mwilberg What is the size of the converted VM image, and what is the available disk size of the cluster (on each node)? It is viewable from the WebUI dashboard. Thanks.