longhorn: [BUG] Instance Manager Memory Leak

Describe the bug

One of the instance managers is using 4 CPU cores and its memory is ballooning. It is currently at 19GB and rising fast.

We killed the instance-manager last night, but it picked up right where it left off.


Log or Support bundle

The affected process is in instance-manager-r-9c410d90:

root 763 18.2 44.4 26153392 18236828 ? DLl Jan11 1916:59 /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.2.3/longhorn sync-agent --listen 0.0.0.0:10152 --replica 0.0.0.0:10150 --listen-port-range 10153-10162

Is this specific to a volume? If so, how can we determine which one? The node will soon run out of memory.
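One rough way to map the runaway sync-agent back to a volume (a sketch only, not an official procedure; the pod name is taken from this report, and the exact InstanceManager CR fields can differ between Longhorn versions):

  # The sync-agent above serves the replica on port 10150; the replica process carries
  # the PV name (pvc-...) in its command line, so grepping for that port inside the
  # instance manager pod reveals the volume (assuming ps is available in the container).
  kubectl -n longhorn-system exec instance-manager-r-9c410d90 -- ps aux | grep 10150

  # Cross-check via the InstanceManager CR, which lists the replica instances it hosts.
  kubectl -n longhorn-system get instancemanagers.longhorn.io instance-manager-r-9c410d90 -o yaml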

Thanks for the support!

Support Bundle attached. longhorn-support-bundle_55903fa6-26ed-4355-a4bf-17d3650572cc_2022-01-18T08-39-31Z.zip

Environment

  • Longhorn version: 1.2.3
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher Catalog App
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of management node in the cluster: 3
    • Number of worker node in the cluster: 5
  • Node config
    • OS type and version: Ubuntu 20.04
    • CPU per node: 12
    • Memory per node: 32GB
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster: 30 active

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 16 (8 by maintainers)

Most upvoted comments

Reproduce steps:

  1. Install Longhorn 1.2.4 using quay.io/jenting/longhorn-engine:master-head-ten-mins-sleep as the default engine image
  2. Create a new volume with 1 replica, attach it, and write 100MB of data
  3. Take a backup using the Longhorn UI
  4. Monitor the RAM of the corresponding replica instance manager for 30 minutes (see the sketch below). It should increase over time and consume most of the available RAM on the node.
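A minimal way to watch the replica instance manager's memory during step 4 (a sketch; it assumes metrics-server is installed so that kubectl top works):

  # Show CPU/memory usage of the replica instance manager pods, refreshed every 30 seconds.
  watch -n 30 'kubectl -n longhorn-system top pod | grep instance-manager-r'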

Test steps:

  1. Repeat the above steps with Longhorn 1.2.x or Longhorn 1.2.5 and verify that the replica instance manager doesn’t eat all the RAM

Thanks @timmy59100 for reporting

We are looking into the support bundle. It seems that there is a problematic failed backup CR (backup-a2023cc97cd9492a) that triggers the volume to constantly initiate new backups. There are many log entries like this every 2-3 minutes:

2022-01-18T08:19:55.933792385Z [pvc-444c512d-ca4b-4eff-a991-ec29c3c201c5-r-28c54628] time="2022-01-18T08:19:55Z" level=info msg="Done initiating backup creation, received backupID: backup-a2023cc97cd9492a"

If you delete the backup CR by running kubectl delete backups.longhorn.io backup-a2023cc97cd9492a -n longhorn-system, the problem might go away.
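To spot which backup CR is stuck before deleting anything, listing the CRs with their states can help (a sketch; the .status.state column path is an assumption, verify with -o yaml if it differs in your version):

  # List backup CRs and their reported state to find the one looping with errors.
  kubectl -n longhorn-system get backups.longhorn.io -o custom-columns=NAME:.metadata.name,STATE:.status.state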


This problem looks like a bug to me, cc @jenting. We need to investigate the code path for this failure handling. Looking into the longhorn-manager logs, I see there are 800 messages like this:

2022-01-18T08:38:58.669297775Z time="2022-01-18T08:38:58Z" level=warning msg="Cannot take snapshot backup" controller=longhorn-backup error="failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.2.3/longhorn [--url 10.42.7.67:10004 backup create --dest nfs://rs3412rpxs-01-ge4.inacta.local:/volume2/backup_cust_pax/rancher-pax-cluster --backup-name backup-a2023cc97cd9492a --label interval=daily --label KubernetesStatus={\"pvName\":\"pvc-444c512d-ca4b-4eff-a991-ec29c3c201c5\",\"pvStatus\":\"Bound\",\"namespace\":\"postgres-prod\",\"pvcName\":\"data-postgres-0\",\"lastPVCRefAt\":\"\",\"workloadsStatus\":[{\"podName\":\"postgres-0\",\"podStatus\":\"Running\",\"workloadName\":\"postgres\",\"workloadType\":\"StatefulSet\"}],\"lastPodRefAt\":\"\"} --label RecurringJob=backup-7-cd9f4ece-d131a907 backup-7-cd9f4ece-d131a907-c-dafc4da4], output , stderr, time=\"2022-01-18T08:35:58Z\" level=info msg=\"Backing up backup-7-cd9f4ece-d131a907-c-dafc4da4 on tcp://10.42.7.66:10150, to nfs://rs3412rpxs-01-ge4.inacta.local:/volume2/backup_cust_pax/rancher-pax-cluster\"\ntime=\"2022-01-18T08:38:58Z\" level=fatal msg=\"Error running create backup command: failed to create backup to nfs://rs3412rpxs-01-ge4.inacta.local:/volume2/backup_cust_pax/rancher-pax-cluster for volume pvc-444c512d-ca4b-4eff-a991-ec29c3c201c5: rpc error: code = DeadlineExceeded desc = context deadline exceeded\"\n, error exit status 1" node=pax-p-97
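To quantify the retry loop on your side, something like the following counts that warning per longhorn-manager pod (a sketch; app=longhorn-manager is the default manager label and may differ in customized installs):

  for p in $(kubectl -n longhorn-system get pods -l app=longhorn-manager -o name); do
    # Count how often the backup controller logged the snapshot backup failure in each pod.
    kubectl -n longhorn-system logs "$p" | grep -c 'Cannot take snapshot backup'
  done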

It seems that the backup init command failed because of a timeout, and the backup controller keeps retrying to create a new backup monitor.

It seems we need to reconsider the timeout here and/or add a backoff limit in the backup controller.

Today we had the same issue again on a different cluster.

Proposal 3: address #2807, but we are still not sure whether there is a corner case where the 3-minute timeout is still not enough even after the #2807 enhancement.

@PhanLe1010 @shuo-wu @joshimoo @innobead WDYT?

I prefer 3, because I would like us to fix the real cause and make the operation nearly asynchronous, but the short-term workaround is 2.