longhorn: [BUG] Longhorn 1.3.2 fails to backup & restore volumes behind Internet proxy

Longhorn 1.3.2 fails to restore volume backups from an S3 object store (OTC) behind an Internet proxy, while Longhorn 1.2.x successfully restores the same volume with the same backup settings.
It is independent of whether the backup was created with Longhorn 1.2.6 or 1.3.2. A restore with 1.3.0-1.3.2 always fails, whereas it seems to always succeed with Longhorn versions before 1.3.0 (tested with 1.2.4 and 1.2.6).

With Longhorn 1.3.2 the error message appears in the longhorn-manager log about 2-10 seconds after “Prepare to restore backup”:

time="2022-12-13T09:50:41Z" level=warning msg="failed to restore backup backup-a53273fefda34bad in engine monitor, will retry the restore later: proxyServer=10.42.0.71:8501 destination=10.42.0.71:10007: failed to restore backup s3://some-bucket@obs.eu-de.otc.t-systems.com/?backup=backup-a53273fefda34bad&volume=pvc-9fea5bf4-7b49-4ac8-83b2-7a5732609a15 to volume pvc-9fea5bf4-7b49-4ac8-83b2-7a5732609a15: cannot unmarshal the restore error, maybe it's not caused by the replica restore failure: failed to get the current restoring backup info: failed to list objects with param: {\n Bucket: \"some-bucket\",\n Delimiter: \"/\",\n Prefix: \"/\"\n} error: AWS Error: RequestError send request failed Get \"https://obs.eu-de.otc.t-systems.com/some-bucket?delimiter=%!F(MISSING)&prefix=%!F(MISSING)\": dial tcp 80.158.25.140:443: i/o timeout\n: invalid character 'i' in literal false (expecting 'l')" controller=longhorn-engine engine=pvc-9fea5bf4-7b49-4ac8-83b2-7a5732609a15-e-5666489d node=some-node
Tue, Dec 13 2022 10:50:41 am | E1213 09:50:41.463561 1 engine_controller.go:743] failed to update status for engine pvc-9fea5bf4-7b49-4ac8-83b2-7a5732609a15-e-5666489d: failed to restore backup backup-a53273fefda34bad in engine monitor, will retry the restore later: proxyServer=10.42.0.71:8501 destination=10.42.0.71:10007: failed to restore backup s3://some-bucket@obs.eu-de.otc.t-systems.com/?backup=backup-a53273fefda34bad&volume=pvc-9fea5bf4-7b49-4ac8-83b2-7a5732609a15 to volume pvc-9fea5bf4-7b49-4ac8-83b2-7a5732609a15: cannot unmarshal the restore error, maybe it's not caused by the replica restore failure: failed to get the current restoring backup info: failed to list objects with param: {
Tue, Dec 13 2022 10:50:41 am | Bucket: "some-bucket",
Tue, Dec 13 2022 10:50:41 am | Delimiter: "/",
Tue, Dec 13 2022 10:50:41 am | Prefix: "/"
Tue, Dec 13 2022 10:50:41 am | } error: AWS Error: RequestError send request failed Get "https://obs.eu-de.otc.t-systems.com/some-bucket?delimiter=%!F(MISSING)&prefix=%!F(MISSING)": dial tcp 80.158.25.140:443: i/o timeout
Tue, Dec 13 2022 10:50:41 am | : invalid character 'i' in literal false (expecting 'l')
Tue, Dec 13 2022 10:50:41 am | time="2022-12-13T09:50:41Z" level=info msg="Prepare to restore backup" backupTarget="s3://some-bucket@obs.eu-de.otc.t-systems.com/" backupVolume=pvc-9fea5bf4-7b49-4ac8-83b2-7a5732609a15 controller=longhorn-engine engine=pvc-9fea5bf4-7b49-4ac8-83b2-7a5732609a15-e-5666489d lastRestoredBackupName= node=some-node requestedRestoredBackupName=backup-a53273fefda34bad

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 23 (15 by maintainers)

Most upvoted comments

@PhanLe1010 Yes, it could be reproduced.

Reproduce Steps

Here are more detailed steps: 0. Grab the script and binary files we need

Download the `k3s` binary

# curl -o /usr/local/bin/k3s -L https://github.com/k3s-io/k3s/releases/download/v1.24.9%2Bk3s1/k3s

Download the airgap image tarball

# curl -O -L https://github.com/k3s-io/k3s/releases/download/v1.24.9%2Bk3s1/k3s-airgap-images-amd64.tar

Download the install script

# curl -o install.sh -L https://get.k3s.io/
  1. Create an air-gap k3s cluster/environment (https://docs.k3s.io/installation/airgap) with one master node (this node is set not to schedule workloads and can reach the public Internet) and 3 worker nodes. (I use the command `k3s ctr images import [image.tarball]` to load images instead of using a private registry.)
  2. Set up a squid proxy on the master node (I use Docker to create one with the command `docker run -d -v /root/squid.conf:/etc/squid/squid.conf -p 3128:3128 wernight/squid`)

Squid Proxy config

#
# Recommended minimum configuration:
#

# Example rule allowing access from your local networks.
# Adapt to list your (internal) IP networks from where browsing
# should be allowed
acl localnet src 10.0.0.0/8	# RFC1918 possible internal network
acl localnet src 172.16.0.0/12	# RFC1918 possible internal network
acl localnet src 192.168.0.0/16	# RFC1918 possible internal network
acl localnet src fc00::/7       # RFC 4193 local private network range
acl localnet src fe80::/10      # RFC 4291 link-local (directly plugged) machines

#acl longhorn src <LONGHORN-WORKER-NODE-1-IP>
#acl longhorn src <LONGHORN-WORKER-NODE-2-IP>
#acl longhorn src <LONGHORN-WORKER-NODE-3-IP>

acl SSL_ports port 443
acl SSL_ports port 6443
acl Safe_ports port 80		# http
acl Safe_ports port 21		# ftp
acl Safe_ports port 443		# https
acl Safe_ports port 70		# gopher
acl Safe_ports port 210		# wais
acl Safe_ports port 1025-65535	# unregistered ports
acl Safe_ports port 280		# http-mgmt
acl Safe_ports port 488		# gss-http
acl Safe_ports port 591		# filemaker
acl Safe_ports port 777		# multiling http
acl Safe_ports port 6443        # k8s
acl Safe_ports port 9000        # minio
acl Safe_ports port 39779       # minio
acl SSL_ports port 22
acl SSL_ports port 2376
acl Safe_ports port 22      # ssh
acl Safe_ports port 2376    # docker port
acl CONNECT method CONNECT

#
# Recommended minimum Access Permission configuration:
#
# Deny requests to certain unsafe ports
http_access deny !Safe_ports

# Deny CONNECT to other than secure SSL ports
#http_access deny CONNECT !SSL_ports

# Only allow cachemgr access from localhost
http_access allow localhost manager
http_access deny manager

# We strongly recommend the following be uncommented to protect innocent
# web applications running on the proxy server who think the only
# one who can access services on "localhost" is a local user
#http_access deny to_localhost

#
# INSERT YOUR OWN RULE(S) HERE TO ALLOW ACCESS FROM YOUR CLIENTS
#

# Example rule allowing access from your local networks.
# Adapt localnet in the ACL section to list your (internal) IP networks
# from where browsing should be allowed
http_access allow localnet
http_access allow localhost

# Control requests from longhorn to S3
#http_access deny longhorn

# And finally deny all other access to this proxy
http_access allow all

# Squid normally listens to port 3128
http_port 3128

# Uncomment and adjust the following to add a disk cache directory.
#cache_dir ufs /var/cache/squid 100 16 256

# Leave coredumps in the first cache dir
coredump_dir /var/cache/squid

#
# Add any of your own refresh_pattern entries above these.
#
refresh_pattern ^ftp:		1440	20%	10080
refresh_pattern ^gopher:	1440	0%	1440
refresh_pattern -i (/cgi-bin/|\?) 0	0%	0
refresh_pattern .		0	20%	4320
  3. Set environment variables on all nodes by editing /etc/systemd/system/k3s[-agent].service.env (the k3s service has to be restarted afterwards; see the restart commands after this list)
HTTP_PROXY=http://[proxy_ip]:3128/
HTTPS_PROXY=http://[proxy_ip]:3128/
NO_PROXY=localhost,0.0.0.0,127.0.0.0/8,10.0.0.0/8,cattle-system.svc,172.16.0.0/12,10.42.0.0/12

  4. Install the Longhorn system https://longhorn.io/docs/1.4.0/advanced-resources/deploy/airgap/
  5. Set up a Minio server https://longhorn.io/docs/1.4.0/snapshots-and-backups/backup-and-restore/set-backup-target/#set-up-a-local-testing-backupstore (I set up the server on another VM)
  6. Create an S3 secret
  7. Set up the backup remote target and credentials
  8. Create a backup; that should work
  9. Restore the backup; that should fail
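For the environment variables in step 3 to take effect, the k3s service has to be restarted on every node after editing the env file. A minimal sketch, assuming a default k3s install (unit k3s on the server, k3s-agent on the workers):

# on the master/server node
systemctl restart k3s
# on each worker/agent node
systemctl restart k3s-agent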

Tested with longhorn master-head images (longhorn-manager d20e1c, longhorn-engine ecdb9e). With all nodes using a proxy, I cannot reproduce this problem.

Precondition

  1. Deploy longhorn master-head
  2. Set up a proxy server (instructions)
  3. Set the proxy on each node with the commands below
export http_proxy=<proxy-server-ip>:<port>
export https_proxy=<proxy-server-ip>:<port>

Result

  1. If I set the AWS secret as below, I can create a backup and restore the volume successfully
secret_name="aws-secret-proxy"
proxy_ip=<proxy server IP>
no_proxy_params="localhost,127.0.0.1,0.0.0.0,10.0.0.0/8,192.168.0.0/16"
kubectl create secret generic $secret_name \
    --from-literal=AWS_ACCESS_KEY_ID=$AWS_ID \
    --from-literal=AWS_SECRET_ACCESS_KEY=$AWS_KEY \
    --from-literal=HTTP_PROXY=$proxy_ip:3128 \
    --from-literal=HTTPS_PROXY=$proxy_ip:3128 \
    --from-literal=NO_PROXY=$no_proxy_params \
    -n longhorn-system
  2. If I do not set --from-literal=NO_PROXY=$no_proxy_params \ in the AWS secret, the volume becomes faulted when doing a restore. See the log from the proxy server below: at this moment, creating a volume backup still succeeds, but after creating a volume, attaching it to a node faults it, and even after removing the backup target/secret from the UI, attaching the volume still leaves it faulted (see the NO_PROXY sketch after these log lines).
1672738529.636      0 54.211.126.162 TCP_DENIED/403 3863 CONNECT 10.42.1.33:10000 - HIER_NONE/- text/html
1672738529.638      0 54.211.126.162 TCP_DENIED/403 3863 CONNECT 10.42.3.30:10000 - HIER_NONE/- text/html
1672738529.641      0 54.211.126.162 TCP_DENIED/403 3863 CONNECT 10.42.0.45:10000 - HIER_NONE/- text/html
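The denied requests above are CONNECT attempts to in-cluster replica addresses (10.42.x.x:10000), which suggests that without NO_PROXY the in-cluster engine/replica traffic is also sent through the proxy and rejected there. A minimal sketch of a NO_PROXY value that explicitly lists the default k3s pod and service CIDRs (10.42.0.0/16 and 10.43.0.0/16 are k3s defaults and an assumption here; the 10.0.0.0/8 entry used above already covers them):

    --from-literal=NO_PROXY=localhost,127.0.0.1,10.42.0.0/16,10.43.0.0/16,.svc,.cluster.local \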

But this is not the problem this ticket mentions; I need more time to test. Thank you.

The proxy-related configuration of the backup credential secret is:

AWS_ENDPOINTS = https://obs.eu-de.otc.t-systems.com
HTTPS_PROXY = http://some-proxy.sdc.t-systems.com:80/
HTTP_PROXY = http://some-proxy.sdc.t-systems.com:80/
NO_PROXY = localhost,127.0.0.1,10.0.0.0/8,192.168.0.0/16,.sdc.t-systems.com,cluster.local,.svc
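For reference, these keys map one-to-one onto a Kubernetes secret in the longhorn-system namespace that the backup target credential setting points at. A minimal, illustrative sketch (the secret name otc-backup-credentials and the access-key placeholders are assumptions, not values from this report):

kubectl -n longhorn-system create secret generic otc-backup-credentials \
    --from-literal=AWS_ACCESS_KEY_ID=<access-key-id> \
    --from-literal=AWS_SECRET_ACCESS_KEY=<secret-access-key> \
    --from-literal=AWS_ENDPOINTS=https://obs.eu-de.otc.t-systems.com \
    --from-literal=HTTP_PROXY=http://some-proxy.sdc.t-systems.com:80/ \
    --from-literal=HTTPS_PROXY=http://some-proxy.sdc.t-systems.com:80/ \
    --from-literal=NO_PROXY=localhost,127.0.0.1,10.0.0.0/8,192.168.0.0/16,.sdc.t-systems.com,cluster.local,.svc

The secret name then goes into the Backup Target Credential Secret setting in the Longhorn UI.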

Sorry for the late reply. @FFock I think this only happens in a private network; I could reproduce this issue in an air-gap environment. As you mentioned, it works with v1.2.6 and not with v1.3.2. I'm still investigating what happened and the difference between v1.2.6 and v1.3.2. I appreciate your reporting and testing.

Hi @mantissahz, I am not sure what you are looking for. The Rancher 2.6.9 clusters are running behind an Internet proxy in a private network. The S3 object store is located on the Internet. The Internet proxies are (standard) squid instances without caching, with access rules, behind a TCP load-balancer. The proxies work well for all (https/http) traffic.

Only with Longhorn 1.3.x are my proxy settings completely ignored, whether they are set in the Longhorn secret or applied directly to the Longhorn-Manager and (-Engine) container(s). After activating SNAT for Internet access towards the S3 object store for testing, Longhorn 1.3.2 is able to back up and restore as expected (with the proxy settings still in place).

That tells me that proxy settings are completely ignored by Longhorn 1.3.x for backup/restore operations, but still work for the UI (i.e. listing backups).
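A quick way to check whether the proxy variables actually reach the Longhorn pods is to dump their environment. A hedged diagnostic sketch (it assumes the longhorn-manager pods carry the app=longhorn-manager label and that the image ships sh/env/grep, which may differ between versions):

kubectl -n longhorn-system get pods -l app=longhorn-manager -o name \
    | xargs -I{} kubectl -n longhorn-system exec {} -- sh -c 'env | grep -i _proxy'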