longhorn: [BUG] RWX volume remains attached after workload deleted if it's upgraded from v1.4.2
Describe the bug (🐛 if you encounter this issue)
RWX volume remains attached and healthy after the workload is deleted if it's created in v1.4.2 and then upgraded to master-head or v1.5.x-head.
Directly creating/deleting a workload using an RWX volume in master-head or v1.5.x-head doesn't have this issue.
To Reproduce
Steps to reproduce the behavior:
- Install Longhorn v1.4.2
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.4.2/deploy/longhorn.yaml
- Create a statefulset using RWX volume
# rwx_statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: test-statefulset-rwx
  namespace: default
spec:
  selector:
    matchLabels:
      app: test-statefulset-rwx
  serviceName: test-statefulset-rwx
  replicas: 1
  template:
    metadata:
      labels:
        app: test-statefulset-rwx
    spec:
      terminationGracePeriodSeconds: 10
      containers:
        - image: busybox
          imagePullPolicy: IfNotPresent
          name: sleep
          args: ['/bin/sh', '-c', 'while true; do date; sleep 5; done']
          volumeMounts:
            - name: pod-data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: pod-data
      spec:
        accessModes: ['ReadWriteMany']
        storageClassName: 'longhorn'
        resources:
          requests:
            storage: 1Gi
# kubectl apply -f rwx_statefulset.yaml
- Upgrade Longhorn to master-head or v1.5.x-head, and upgrade the engine image of the RWX volume
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml
- Delete the statefulset
kubectl delete -f rwx_statefulset.yaml
- The volume remains attached and healthy after the workload is deleted (see the check commands below)
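For reference, a few read-only checks that can confirm the symptom. This is only a sketch: it assumes a default Longhorn installation in the longhorn-system namespace, and the exact printer columns may differ by version.
# Confirm the engine image upgrade finished
kubectl -n longhorn-system get engineimages.longhorn.io
# After deleting the StatefulSet, the Longhorn volume backing the RWX PVC should
# show state "detached"; with this bug it stays "attached" and "healthy"
kubectl -n longhorn-system get volumes.longhorn.io
# The share-manager pod for the RWX volume also keeps running while the volume stays attached
kubectl -n longhorn-system get pods | grep share-manager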
Expected behavior
The RWX volume should become detached after the workload is deleted.
Log or Support bundle
supportbundle_7c27605c-22df-493e-9e2a-b9135c68e20b_2023-06-16T02-29-21Z.zip
Environment
- Longhorn version: v1.4.2, upgraded to master-head / v1.5.x-head
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
- Number of management node in the cluster:
- Number of worker node in the cluster:
- Node config
- OS type and version:
- CPU per node:
- Memory per node:
- Disk type(e.g. SSD/NVMe):
- Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
- Number of Longhorn volumes in the cluster:
About this issue
- State: closed
- Created a year ago
- Comments: 24 (23 by maintainers)
Thanks @innobead and @shuo-wu for the great feedback!
I agree with points 1 and 2 that @shuo-wu mentioned above (which I understand are also the points that @innobead proposed; please correct me if I understand it wrong, @innobead).
For the point 2 that @shuo-wu mentioned:
Let me evaluate more to see which one is the better option: using the workload pod state vs. using the Kubernetes VolumeAttachment state (longhorn-upgrade / AttacherTypeLonghornAPI). But for the auto-attached volumes, it's too complicated to generate correct AD tickets, hence we can ignore them.
You are right @derekbit, it already exists. The behavior is just a little bit different, but there is still the same issue. Thank you for the clarification.
Clarify it a bit. It is not a side effect of the detaching fix. The hasActiveWorkload was already removed from the AD controller implementation before. I mistakenly introduced it into longhorn-manager before. It is not related to the detaching issue, so I removed it in the end. Hence, both the RWX volume attaching and detaching issues exist in the AD controller design and implementation.
Just for information: it still remains attached and healthy after waiting more than 3 hours.
Verified pass in ec130d and c8092b.
After upgrading from v1.4.2 to master-head and from v1.4.2 to v1.5.x-head, the test steps passed: after deleting the workload, the RWX volume becomes detached as well.
Point 3 I mentioned was similar to what David suggested but with extra concerns about the auto-attached volumes.
@PhanLe1010 This seems to be a blocker for 1.5.0. Let's make this the highest priority to tackle first. Thanks.
I see. This case can happen when there is no workload pod on the same node as the share manager pod. When the user upgrades to 1.5.x, we create an upgrade AD ticket for the share manager's node to keep the volume attached there. When the user scales down the workload, we don't clean up that upgrade AD ticket because the ticket is on a different node than the workload pod's node. As a result, no one cleans up the upgrade AD ticket and the volume stays attached there forever.
Still figuring out how to fix this issue.
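One way to observe the leftover ticket described above (a sketch: the volumeattachments.longhorn.io CRD and its attachment-ticket layout come from the v1.5.x AD controller design, and the ticket names/attacher types shown in the CR may differ by build):
# List Longhorn's own VolumeAttachment CRs (one per volume, named after the volume)
kubectl -n longhorn-system get volumeattachments.longhorn.io
# Dump the CR for the stuck volume and look for an attachment ticket that was
# created by the upgrade path and points at the share-manager's node instead of
# the (now deleted) workload pod's node
kubectl -n longhorn-system get volumeattachments.longhorn.io <volume-name> -o yaml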