longhorn: [BUG] Longhorn 1.3.2 fails to backup & restore volumes behind Internet proxy
Longhorn 1.3.2 fails to restore volume backups from an S3 object store (OTC) behind an Internet proxy, while Longhorn 1.2.x successfully restores the same volumes with the same backup settings.
It is independent of whether the backup was created with Longhorn 1.2.6 or 1.3.2. A restore with 1.3.0-1.3.2 always fails, whereas it seems to always succeed with Longhorn versions before 1.3.0 (tested with 1.2.4 and 1.2.6).
With Longhorn 1.3.2, the following error message appears in the longhorn-manager log about 2-10 seconds after “Prepare to restore backup”:
time="2022-12-13T09:50:41Z" level=warning msg="failed to restore backup backup-a53273fefda34bad in engine monitor, will retry the restore later: proxyServer=10.42.0.71:8501 destination=10.42.0.71:10007: failed to restore backup s3://some-bucket@obs.eu-de.otc.t-systems.com/?backup=backup-a53273fefda34bad&volume=pvc-9fea5bf4-7b49-4ac8-83b2-7a5732609a15 to volume pvc-9fea5bf4-7b49-4ac8-83b2-7a5732609a15: cannot unmarshal the restore error, maybe it's not caused by the replica restore failure: failed to get the current restoring backup info: failed to list objects with param: {\n Bucket: \"some-bucket\",\n Delimiter: \"/\",\n Prefix: \"/\"\n} error: AWS Error: RequestError send request failed Get \"https://obs.eu-de.otc.t-systems.com/some-bucket?delimiter=%!F(MISSING)&prefix=%!F(MISSING)\": dial tcp 80.158.25.140:443: i/o timeout\n: invalid character 'i' in literal false (expecting 'l')" controller=longhorn-engine engine=pvc-9fea5bf4-7b49-4ac8-83b2-7a5732609a15-e-5666489d node=some-node
Tue, Dec 13 2022 10:50:41 am | E1213 09:50:41.463561 1 engine_controller.go:743] failed to update status for engine pvc-9fea5bf4-7b49-4ac8-83b2-7a5732609a15-e-5666489d: failed to restore backup backup-a53273fefda34bad in engine monitor, will retry the restore later: proxyServer=10.42.0.71:8501 destination=10.42.0.71:10007: failed to restore backup s3://some-bucket@obs.eu-de.otc.t-systems.com/?backup=backup-a53273fefda34bad&volume=pvc-9fea5bf4-7b49-4ac8-83b2-7a5732609a15 to volume pvc-9fea5bf4-7b49-4ac8-83b2-7a5732609a15: cannot unmarshal the restore error, maybe it's not caused by the replica restore failure: failed to get the current restoring backup info: failed to list objects with param: {
Tue, Dec 13 2022 10:50:41 am | Bucket: "some-bucket",
Tue, Dec 13 2022 10:50:41 am | Delimiter: "/",
Tue, Dec 13 2022 10:50:41 am | Prefix: "/"
Tue, Dec 13 2022 10:50:41 am | } error: AWS Error: RequestError send request failed Get "https://obs.eu-de.otc.t-systems.com/some-bucket?delimiter=%!F(MISSING)&prefix=%!F(MISSING)": dial tcp 80.158.25.140:443: i/o timeout
Tue, Dec 13 2022 10:50:41 am | : invalid character 'i' in literal false (expecting 'l')
Tue, Dec 13 2022 10:50:41 am | time="2022-12-13T09:50:41Z" level=info msg="Prepare to restore backup" backupTarget="s3://some-bucket@obs.eu-de.otc.t-systems.com/" backupVolume=pvc-9fea5bf4-7b49-4ac8-83b2-7a5732609a15 controller=longhorn-engine engine=pvc-9fea5bf4-7b49-4ac8-83b2-7a5732609a15-e-5666489d lastRestoredBackupName= node=some-node requestedRestoredBackupName=backup-a53273fefda34bad
@PhanLe1010 Yes, it could be reproduced.

Reproduce Steps

Here are more detailed steps:

0. Get the scripts and binary files we need.
- Import the images with `k3s ctr images import [image.tarball]` (to get the images; not a private registry).
- Run the Squid proxy: `docker run -d -v /root/squid.conf:/etc/squid/squid.conf -p 3128:3128 wernight/squid` (Squid Proxy config).
- Set the proxy environment variables in `/etc/systemd/system/k3s[-agent].service.env` (see the sketch after this comment).

Tested with longhorn master-head images (longhorn-manager d20e1c, longhorn-engine ecdb9e). With all nodes using a proxy, I can not reproduce this problem.

Precondition
Result
`--from-literal=NO_PROXY=$no_proxy_params \` in the AWS secret: when doing a restore, the volume will be faulted. See the log in the proxy server below; at this moment the volume backup still succeeds, but after creating a volume, attaching it to a node will be faulted, and even after removing the backup target/secret from the UI, attaching the volume will still be faulted. But it’s not the problem this ticket mentioned; I need more time to test, thank you.
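For anyone retracing these steps, here is a minimal sketch of the proxy wiring from the list above. It assumes the Squid container is reachable at 10.0.0.5:3128 and uses placeholder CIDRs; none of these values come from this issue, and an open `http_access allow all` proxy is only suitable for a throwaway test lab.

```sh
# Minimal squid.conf mounted into the wernight/squid container
# (placeholder config, test lab only).
cat > /root/squid.conf <<'EOF'
http_port 3128
http_access allow all
EOF
docker run -d -v /root/squid.conf:/etc/squid/squid.conf -p 3128:3128 wernight/squid

# Point k3s (or the k3s agent) at the proxy via its systemd environment file,
# then restart the service so the cluster components pick up the variables.
cat >> /etc/systemd/system/k3s.service.env <<'EOF'
HTTP_PROXY=http://10.0.0.5:3128
HTTPS_PROXY=http://10.0.0.5:3128
NO_PROXY=localhost,127.0.0.1,10.42.0.0/16,10.43.0.0/16
EOF
systemctl daemon-reload && systemctl restart k3s
```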
The proxy-related configuration of the backup credential secret is:
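(The reporter's actual secret values are not reproduced here. As a minimal sketch, a Longhorn backup credential secret with proxy settings typically uses the keys below; the secret name `s3-backup-credentials`, the proxy host and the NO_PROXY CIDRs are placeholders, while AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_ENDPOINTS, HTTP_PROXY, HTTPS_PROXY and NO_PROXY are the key names Longhorn documents for this secret.)

```sh
# Placeholders, not the reporter's real values.
ACCESS_KEY="replace-with-access-key-id"
SECRET_KEY="replace-with-secret-access-key"
PROXY="http://proxy.example.internal:3128"

# Create the backup target credential secret with the proxy variables included.
kubectl -n longhorn-system create secret generic s3-backup-credentials \
  --from-literal=AWS_ACCESS_KEY_ID="$ACCESS_KEY" \
  --from-literal=AWS_SECRET_ACCESS_KEY="$SECRET_KEY" \
  --from-literal=AWS_ENDPOINTS="https://obs.eu-de.otc.t-systems.com" \
  --from-literal=HTTP_PROXY="$PROXY" \
  --from-literal=HTTPS_PROXY="$PROXY" \
  --from-literal=NO_PROXY="localhost,127.0.0.1,10.42.0.0/16,10.43.0.0/16"
```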
Sorry for the late reply. @FFock I think this only happens in a private network; I could reproduce this issue in an air-gapped environment. As you mentioned, it works with v1.2.6 and not with v1.3.2. I’m still investigating what happened and the difference between v1.2.6 and v1.3.2. I appreciate your reporting and testing.
Hi @mantissahz, I am not sure what you are looking for. The Rancher 2.6.9 clusters are running behind an Internet proxy in a private network, and the S3 object store is located on the Internet. The Internet proxies are (standard) Squid instances without caching, with access rules, sitting behind a TCP load-balancer. The proxies work well for all (http/https) traffic.
Only with Longhorn 1.3.x are my proxy settings completely ignored, both those in the Longhorn secret and those applied directly to the longhorn-manager (and -engine) container(s). After activating SNAT for Internet access towards the S3 object store for testing, Longhorn 1.3.2 is able to back up and restore as expected (with the proxy settings still in place).
That tells me that proxy settings are completely ignored by Longhorn 1.3.x for backup/restore operations, but they still work for the UI (i.e. listing backups).
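One way to narrow down where the settings get lost (a suggestion, not something from this thread) is to check whether the proxy variables are actually present in the pods that perform the S3 traffic, starting with the longhorn-manager pods. The `app=longhorn-manager` label is the one used by the longhorn-manager DaemonSet; label names for other components (e.g. the instance-manager pods) may differ between Longhorn versions.

```sh
# Check whether HTTP_PROXY / HTTPS_PROXY / NO_PROXY reach the longhorn-manager pods.
kubectl -n longhorn-system get pods -l app=longhorn-manager -o name | while read -r pod; do
  echo "== $pod =="
  kubectl -n longhorn-system exec "$pod" -- env | grep -iE 'http_proxy|https_proxy|no_proxy' \
    || echo "no proxy variables set"
done
```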