longhorn: [BUG] Instance managers and Pods with attached volumes restart randomly
I am running a k3s cluster with 3 master nodes (EC2 medium) and 2 worker nodes (EC2 medium), with Longhorn deployed on the worker nodes and 4 instance-manager pods. Two of the instance-manager pods, together with my Postgres and Redis StatefulSet pods, get restarted randomly after some hours.
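For anyone checking whether they hit the same pattern, a rough way to confirm it (a sketch, assuming the default longhorn-system namespace and that the workloads run in the current namespace) is to compare restart counts and recent events:

    # restart counts of the instance-manager pods and the affected workloads
    kubectl -n longhorn-system get pods -o wide | grep instance-manager
    kubectl get pods -o wide | grep -E 'postgres|redis'
    # recent events usually show the unexpected detach/remount cycle
    kubectl -n longhorn-system get events --sort-by=.lastTimestamp | tail -n 50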
Logs
Here are the systemctl status iscsid logs from my 1st worker node:
Dec 04 09:05:46 ip-179-21-15-104 iscsid[5198]: Kernel reported iSCSI connection 11:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Dec 04 09:06:22 ip-179-21-15-104 iscsid[5198]: connect to 10.42.3.128:3260 failed (No route to host)
Dec 04 09:06:29 ip-179-21-15-104 iscsid[5198]: connect to 10.42.3.128:3260 failed (No route to host)
Dec 04 09:06:36 ip-179-21-15-104 iscsid[5198]: connect to 10.42.3.128:3260 failed (No route to host)
Dec 04 09:06:43 ip-179-21-15-104 iscsid[5198]: connect to 10.42.3.128:3260 failed (No route to host)
Dec 04 09:06:50 ip-179-21-15-104 iscsid[5198]: connect to 10.42.3.128:3260 failed (No route to host)
Dec 04 09:06:53 ip-179-21-15-104 iscsid[5198]: Connection11:0 to [target: iqn.2019-10.io.longhorn:pvc-e3025f3b-2cb3-434f-992d-1150ea527bca, portal: 10.42.3.128,326>
Dec 04 09:06:54 ip-179-21-15-104 iscsid[5198]: Connection14:0 to [target: iqn.2019-10.io.longhorn:pvc-e3025f3b-2cb3-434f-992d-1150ea527bca, portal: 10.42.3.142,326>
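The repeated "No route to host" errors suggest the node's iSCSI initiator still holds sessions to an engine instance-manager pod IP (10.42.3.128) that is no longer reachable. A rough cross-check from the worker node (a sketch; the IP is taken from the log lines above):

    # list the iSCSI sessions the node still holds
    iscsiadm -m session -P 1
    # check whether the engine pod IP from the log is still reachable
    ping -c 3 10.42.3.128
    # map the IP back to a pod to see if the instance manager was restarted or rescheduled
    kubectl -n longhorn-system get pods -o wide | grep 10.42.3.128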
Longhorn manager logs
time="2022-12-04T09:05:16Z" level=debug msg="Volume got new owner ip-179-21-15-104" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-179-21-15-104 owner=ip-172-31-44-126 state=attached volume=pvc-be17d93c-2730-482a-bd45-91a5671e06fb
time="2022-12-04T09:05:16Z" level=warning msg="Engine of volume dead unexpectedly, reattach the volume" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-179-21-15-104 owner=ip-179-21-15-104 state=attached volume=pvc-be17d93c-2730-482a-bd45-91a5671e06fb
time="2022-12-04T09:05:16Z" level=debug msg="Requeue orphan-10660b7f7d02b524acb4f4db5b5b154939f92bab48a9f58468fbec678fcdee13 due to conflict" controller=longhorn-orphan error="Operation cannot be fulfilled on orphans.longhorn.io \"orphan-10660b7f7d02b524acb4f4db5b5b154939f92bab48a9f58468fbec678fcdee13\": the object has been modified; please apply your changes to the latest version and try again" node=ip-179-21-15-104 orphan=orphan-10660b7f7d02b524acb4f4db5b5b154939f92bab48a9f58468fbec678fcdee13
time="2022-12-04T09:05:16Z" level=warning msg="Try to get requested log for pvc-be17d93c-2730-482a-bd45-91a5671e06fb-e-71dfca70 in instance manager instance-manager-e-f821a48a"
time="2022-12-04T09:05:16Z" level=warning msg="cannot get requested log for instance pvc-be17d93c-2730-482a-bd45-91a5671e06fb-e-71dfca70 on node ip-179-21-15-104, error invalid Instance Manager instance-manager-e-f821a48a, state: error, IP: 10.42.3.128"
time="2022-12-04T09:05:16Z" level=debug msg="Prepare to delete instance pvc-be17d93c-2730-482a-bd45-91a5671e06fb-e-71dfca70"
time="2022-12-04T09:05:16Z" level=debug msg="Instance handler updated instance pvc-be17d93c-2730-482a-bd45-91a5671e06fb-e-71dfca70 state, old state error, new state stopped"
time="2022-12-04T09:05:16Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-be17d93c-2730-482a-bd45-91a5671e06fb\", UID:\"8ae285c1-12c4-46fc-b904-8d8c5257a10e\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2115145\", FieldPath:\"\"}): type: 'Warning' reason: 'DetachedUnexpectly' Engine of volume pvc-be17d93c-2730-482a-bd45-91a5671e06fb dead unexpectedly, reattach the volume"
time="2022-12-04T09:05:16Z" level=debug msg="Replica controller picked up" controller=longhorn-replica controllerID=ip-179-21-15-104 dataPath= node=ip-179-21-15-104 nodeID=ip-179-21-15-104 ownerID=ip-172-31-44-126 replica=pvc-3b62ad3f-d823-450f-bebf-f94d2983d179-r-d6b9a84f
time="2022-12-04T09:05:16Z" level=debug msg="Prepare to delete instance pvc-3b62ad3f-d823-450f-bebf-f94d2983d179-r-d6b9a84f"
time="2022-12-04T09:05:16Z" level=info msg="Try best to clean up the process for replica pvc-3b62ad3f-d823-450f-bebf-f94d2983d179-r-d6b9a84f in instance manager instance-manager-r-64b8070c" controller=longhorn-replica dataPath= node=ip-179-21-15-104 nodeID=ip-179-21-15-104 ownerID=ip-179-21-15-104 replica=pvc-3b62ad3f-d823-450f-bebf-f94d2983d179-r-d6b9a84f
time="2022-12-04T09:05:16Z" level=info msg="Orphan Controller ip-179-21-15-104 picked up orphan-ddadf4e22c2902e2031a9e8e3b3c725507eb810365e81540122a7192bc909a8a" controller=longhorn-orphan node=ip-179-21-15-104 orphan=orphan-ddadf4e22c2902e2031a9e8e3b3c725507eb810365e81540122a7192bc909a8a
time="2022-12-04T09:05:16Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-be17d93c-2730-482a-bd45-91a5671e06fb-e-71dfca70\", UID:\"0aae58ce-1fe1-47c1-a10f-6b349de2f4e1\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2115164\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-be17d93c-2730-482a-bd45-91a5671e06fb-e-71dfca70"
time="2022-12-04T09:05:16Z" level=debug msg="Volume got new owner ip-179-21-15-104" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-179-21-15-104 owner=ip-172-31-44-126 state=attached volume=pvc-3b62ad3f-d823-450f-bebf-f94d2983d179
time="2022-12-04T09:05:16Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-3b62ad3f-d823-450f-bebf-f94d2983d179-r-d6b9a84f\", UID:\"dffea271-f186-48aa-abc1-4e424956e309\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2115147\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-3b62ad3f-d823-450f-bebf-f94d2983d179-r-d6b9a84f"
time="2022-12-04T09:05:16Z" level=warning msg="Engine of volume dead unexpectedly, reattach the volume" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-179-21-15-104 owner=ip-179-21-15-104 state=attached volume=pvc-3b62ad3f-d823-450f-bebf-f94d2983d179
time="2022-12-04T09:05:16Z" level=warning msg="Engine of volume dead unexpectedly, reattach the volume" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-179-21-15-104 owner=ip-179-21-15-104 state=attached volume=pvc-e3025f3b-2cb3-434f-992d-1150ea527bca
time="2022-12-04T09:05:16Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-3b62ad3f-d823-450f-bebf-f94d2983d179\", UID:\"9b5be04c-a220-40c3-b274-415b84ac4da9\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2115144\", FieldPath:\"\"}): type: 'Warning' reason: 'DetachedUnexpectly' Engine of volume pvc-3b62ad3f-d823-450f-bebf-f94d2983d179 dead unexpectedly, reattach the volume"
time="2022-12-04T09:05:16Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-e3025f3b-2cb3-434f-992d-1150ea527bca\", UID:\"468954c2-d353-46fe-abf6-f199a67b92d2\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2115168\", FieldPath:\"\"}): type: 'Warning' reason: 'DetachedUnexpectly' Engine of volume pvc-e3025f3b-2cb3-434f-992d-1150ea527bca dead unexpectedly, reattach the volume"
time="2022-12-04T09:05:16Z" level=warning msg="Instance pvc-e3025f3b-2cb3-434f-992d-1150ea527bca-e-3c5a0ec2 crashed on Instance Manager instance-manager-e-f821a48a at ip-179-21-15-104, try to get log"
time="2022-12-04T09:05:16Z" level=warning msg="cannot get crash log for instance pvc-e3025f3b-2cb3-434f-992d-1150ea527bca-e-3c5a0ec2 on Instance Manager instance-manager-e-f821a48a at ip-179-21-15-104, error invalid Instance Manager instance-manager-e-f821a48a, state: error, IP: 10.42.3.128"
time="2022-12-04T09:05:16Z" level=warning msg="Instance pvc-e3025f3b-2cb3-434f-992d-1150ea527bca-e-3c5a0ec2 crashed on Instance Manager instance-manager-e-f821a48a at ip-179-21-15-104, try to get log"
time="2022-12-04T09:05:16Z" level=warning msg="cannot get crash log for instance pvc-e3025f3b-2cb3-434f-992d-1150ea527bca-e-3c5a0ec2 on Instance Manager instance-manager-e-f821a48a at ip-179-21-15-104, error invalid Instance Manager instance-manager-e-f821a48a, state: error, IP: 10.42.3.128"
time="2022-12-04T09:05:16Z" level=debug msg="Prepare to delete instance pvc-be17d93c-2730-482a-bd45-91a5671e06fb-r-f70ca683"
time="2022-12-04T09:05:16Z" level=info msg="Try best to clean up the process for replica pvc-be17d93c-2730-482a-bd45-91a5671e06fb-r-f70ca683 in instance manager instance-manager-r-64b8070c" controller=longhorn-replica dataPath= node=ip-179-21-15-104 nodeID=ip-179-21-15-104 ownerID=ip-179-21-15-104 replica=pvc-be17d93c-2730-482a-bd45-91a5671e06fb-r-f70ca683
time="2022-12-04T09:05:16Z" level=debug msg="Prepare to delete instance pvc-e3025f3b-2cb3-434f-992d-1150ea527bca-r-19798a1e"
time="2022-12-04T09:05:16Z" level=info msg="Try best to clean up the process for replica pvc-e3025f3b-2cb3-434f-992d-1150ea527bca-r-19798a1e in instance manager instance-manager-r-64b8070c" controller=longhorn-replica dataPath= node=ip-179-21-15-104 nodeID=ip-179-21-15-104 ownerID=ip-179-21-15-104 replica=pvc-e3025f3b-2cb3-434f-992d-1150ea527bca-r-19798a1e
time="2022-12-04T09:05:16Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-be17d93c-2730-482a-bd45-91a5671e06fb-r-f70ca683\", UID:\"175a68b6-117d-4b72-b8a4-22b251934629\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2115165\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-be17d93c-2730-482a-bd45-91a5671e06fb-r-f70ca683"
time="2022-12-04T09:05:16Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-e3025f3b-2cb3-434f-992d-1150ea527bca-r-19798a1e\", UID:\"4d3cf89e-7440-497c-aa55-f535871e4d71\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2115115\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-e3025f3b-2cb3-434f-992d-1150ea527bca-r-19798a1e"
time="2022-12-04T09:05:16Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-be17d93c-2730-482a-bd45-91a5671e06fb\", UID:\"8ae285c1-12c4-46fc-b904-8d8c5257a10e\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2115175\", FieldPath:\"\"}): type: 'Normal' reason: 'Remount' Volume pvc-be17d93c-2730-482a-bd45-91a5671e06fb requested remount at 2022-12-04T09:05:16Z"
time="2022-12-04T09:05:16Z" level=debug msg="Requeue orphan-ddadf4e22c2902e2031a9e8e3b3c725507eb810365e81540122a7192bc909a8a due to conflict" controller=longhorn-orphan error="Operation cannot be fulfilled on orphans.longhorn.io \"orphan-ddadf4e22c2902e2031a9e8e3b3c725507eb810365e81540122a7192bc909a8a\": the object has been modified; please apply your changes to the latest version and try again" node=ip-179-21-15-104 orphan=orphan-ddadf4e22c2902e2031a9e8e3b3c725507eb810365e81540122a7192bc909a8a
time="2022-12-04T09:05:16Z" level=warning msg="Instance pvc-3b62ad3f-d823-450f-bebf-f94d2983d179-e-7c71be25 crashed on Instance Manager instance-manager-e-f821a48a at ip-179-21-15-104, try to get log"
time="2022-12-04T09:05:16Z" level=warning msg="cannot get crash log for instance pvc-3b62ad3f-d823-450f-bebf-f94d2983d179-e-7c71be25 on Instance Manager instance-manager-e-f821a48a at ip-179-21-15-104, error invalid Instance Manager instance-manager-e-f821a48a, state: error, IP: 10.42.3.128"
time="2022-12-04T09:05:16Z" level=debug msg="Instance Manager Controller ip-179-21-15-104 picked up instance-manager-r-64b8070c" controller=longhorn-instance-manager instanceManager=instance-manager-r-64b8070c node=ip-179-21-15-104 nodeID=ip-179-21-15-104
time="2022-12-04T09:05:16Z" level=info msg="Instance manager state is updated from unknown to error after syncing with the pod" controller=longhorn-instance-manager instanceManager=instance-manager-r-64b8070c node=ip-179-21-15-104 nodeID=ip-179-21-15-104
time="2022-12-04T09:05:16Z" level=info msg="Orphan Controller ip-179-21-15-104 picked up orphan-10660b7f7d02b524acb4f4db5b5b154939f92bab48a9f58468fbec678fcdee13" controller=longhorn-orphan node=ip-179-21-15-104 orphan=orphan-10660b7f7d02b524acb4f4db5b5b154939f92bab48a9f58468fbec678fcdee13
time="2022-12-04T09:05:16Z" level=info msg="Only delete orphan orphan-10660b7f7d02b524acb4f4db5b5b154939f92bab48a9f58468fbec678fcdee13 resource object and do not delete the orphaned data" controller=longhorn-orphan node=ip-179-21-15-104 orphan=orphan-10660b7f7d02b524acb4f4db5b5b154939f92bab48a9f58468fbec678fcdee13
time="2022-12-04T00:36:16Z" level=warning msg="Error syncing Longhorn setting longhorn-system/storage-network" controller=longhorn-setting error="failed to sync setting for longhorn-system/storage-network: cannot apply storage-network setting to Longhorn workloads when there are attached volumes" node=ip-179-21-15-104
time="2022-12-04T00:36:16Z" level=warning msg="Error syncing Longhorn setting longhorn-system/storage-network" controller=longhorn-setting error="failed to sync setting for longhorn-system/storage-network: cannot apply storage-network setting to Longhorn workloads when there are attached volumes" node=ip-179-21-15-104
time="2022-12-04T00:36:16Z" level=warning msg="Error syncing Longhorn setting longhorn-system/storage-network" controller=longhorn-setting error="failed to sync setting for longhorn-system/storage-network: cannot apply storage-network setting to Longhorn workloads when there are attached volumes" node=ip-179-21-15-104
E1204 00:36:16.670081 1 setting_controller.go:190] failed to sync setting for longhorn-system/storage-network: cannot apply storage-network setting to Longhorn workloads when there are attached volumes
time="2022-12-04T00:36:16Z" level=warning msg="Dropping Longhorn setting longhorn-system/storage-network out of the queue" controller=longhorn-setting error="failed to sync setting for longhorn-system/storage-network: cannot apply storage-network setting to Longhorn workloads when there are attached volumes" node=ip-179-21-15-104
W1204 00:37:57.595080 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W1204 00:43:49.497616 1 warnings.go:70] batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob
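When the manager reports "invalid Instance Manager ..., state: error", the corresponding Longhorn custom resources usually show the same transition. A quick way to correlate (a sketch; the object names are taken from the logs above and the resource names assume the longhorn.io/v1beta2 CRDs):

    kubectl -n longhorn-system get instancemanagers.longhorn.io
    kubectl -n longhorn-system get engines.longhorn.io,replicas.longhorn.io | grep pvc-e3025f3b
    kubectl -n longhorn-system describe volumes.longhorn.io pvc-e3025f3b-2cb3-434f-992d-1150ea527bca
    # logs from the previous (crashed) container, if the pod itself was restarted
    kubectl -n longhorn-system logs instance-manager-e-f821a48a --previous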
Environment
- Longhorn version: 1.3.2
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s (v1.23.6+k3s1)
- Number of management node in the cluster: 3
- Number of worker node in the cluster: 2
- Node config
- OS type and version: Ubuntu 20.04
- Number of Longhorn volumes in the cluster: 3
About this issue
- State: open
- Created 2 years ago
- Comments: 67 (25 by maintainers)
Oooh. This looks interesting:
Is nfs-ganesha not being given the permissions it needs re: DBUS?
Wow yeah, you had etcd blocked for more than two minutes on a fsync call (ext4_sync_file). Etcd calls fsync to commit EVERY write, so this will happen a lot. Note that if you have etcd and Longhorn on the same backing disks, the etcd fsync calls will also need to flush the LH writes as well, so I would recommend against having them on the same physical disks. I would also try to confirm that you don't have some problem with your disks or storage controller; even on a pathologically overloaded system I wouldn't expect a fsync call to hang for that long unless something else is broken.
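For reference, the usual way to sanity-check fsync latency on the disk backing etcd is the fio pattern from the etcd hardware guide (a sketch; point --directory at whatever actually backs etcd, e.g. /var/lib/rancher/k3s/server/db/etcd for k3s embedded etcd):

    # ~22 MiB of 2300-byte synchronous writes, each followed by fdatasync,
    # roughly mimicking the etcd WAL; watch the fsync/fdatasync latency percentiles
    fio --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300 \
        --directory=/var/lib/etcd --name=etcd-fsync-test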
I know nothing about either Antrea or RKE1, sorry.
Prometheus is pretty IO intensive; if the minimal reproducer for you is installing Prometheus with a LH PVC, that suggests your IO throughput on the LH backing disks may be insufficient.
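One way to put a number on that (a sketch, assuming the default /var/lib/longhorn data path; it writes a temporary test file on the backing disk, so only run it with enough free space):

    # sustained 4k random-write load on the Longhorn data directory
    fio --name=lh-randwrite --directory=/var/lib/longhorn --rw=randwrite \
        --bs=4k --size=1g --ioengine=libaio --iodepth=16 --direct=1 \
        --runtime=60 --time_based --group_reporting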
Am I seeing this right that docker is crashing and taking down all the pods? It feels like a stable container runtime is a requirement for a functional LH cluster; if the container runtime is crashing and terminating pods (including LH) that seems like the first thing to investigate and resolve.
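Capturing the runtime's own crash history is a reasonable first step (a sketch; the unit names depend on the node setup, and on k3s workers the embedded containerd logs go through the k3s-agent unit):

    systemctl status docker containerd --no-pager
    journalctl -u docker -u containerd --since "2022-12-04" --no-pager | grep -iE 'panic|oom|segfault'
    journalctl -u k3s-agent --since "2022-12-04" --no-pager | tail -n 200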
@derekbit
Spoke too fast; the problem happened again even when using Longhorn version 1.3.0.
I will try updating the system packages on all of our nodes
@derekbit By the way, here's the result of dmesg -T --since "2022-12-10": https://gist.github.com/iMikeG6/e1767a3cae2c4f9b2730b2c3126dd58a
dockerd.service doesn't have any interesting logs nor contain errors. Same for journalctl --since "2022-12-11". Nothing interesting this time.
@proteche I suggest you use https://gist.github.com/ to upload your logs; it will make this issue much more readable than it is now, with far too many lines.
Anyway, did you try rolling back the update or installing Longhorn version 1.3.0? For us, it seems like it helped.
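For reference, pinning the chart back to 1.3.0 with Helm would look roughly like this (a sketch, assuming the release is named longhorn in longhorn-system and comes from the upstream longhorn/longhorn repo; the Rancher-packaged chart uses versions like 100.2.0+up1.3.0 instead, and since Longhorn does not document a downgrade path, rolling back with live volumes should be done carefully):

    helm -n longhorn-system history longhorn
    # either roll back to an earlier release revision...
    helm -n longhorn-system rollback longhorn <REVISION>
    # ...or pin the chart version explicitly
    helm -n longhorn-system upgrade longhorn longhorn/longhorn --version 1.3.0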
@derekbit Two days without the errors reported below, with Longhorn version 1.3.0 (Helm chart version: 100.2.0+up1.3.0):
iscsid: connect to 10.42.0.142:3260 failed (No route to host)
iscsid: Kernel reported iSCSI connection 43:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
So there must be something in the image used for the new version. As we are using RedHat (9.1) and Rancher uses SUSE, it could be related to an incompatibility between the two systems. We saw something similar in our old cluster, which ran RedHat 8 with an Ubuntu-based container.
Now, the only error logs reported are these two lines:
I'm not even sure that those two errors are linked to any Rancher services.
I just sent you the new bundle.
The problem seems to happen every evening, between 8 pm and 2 am. This time it happened on all of our workers (3 physical machines).
New errors appear below; it's just one line, but we have several of them in our log file, along with the errors reported here:
iscsid: connect to 10.42.0.238:3260 failed (No route to host)
Kernel reported iSCSI connection 20:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
According to our monitoring system, I/O wait started to increase on one of our workers (I/O Wait: 48%) for a few minutes. Interesting logs were found on one of our 3 worker nodes:
From the node where iptables failed, I found this in dmesg -T. Here is the full gist with the dmesg logs: https://gist.github.com/iMikeG6/f9df95394426ad6a68e44263e3132d8a
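Since the problem clusters between 8 pm and 2 am, sampling per-device and per-process I/O during that window may show what is driving the 48% I/O wait (a sketch; both tools come from the sysstat package):

    # per-device utilization and wait times, sampled every 5 seconds
    iostat -x 5
    # per-process disk I/O, to see which workload drives the spike
    pidstat -d 5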
The current workload of this cluster is not really high. We were planning to deploy a new production cluster based on this one, but given this issue, we may have to rethink our storage solution and take a look at Rook Ceph if this problem keeps happening.
I hope to provide you with as much information as I can so this can be fixed soon.