longhorn: [BUG] Instance managers and Pods with attached volumes restart randomly

I am running a k3s cluster with 3 master nodes (EC2 medium) and 2 worker nodes (EC2 medium), with Longhorn on the worker nodes. There are 4 instance-manager pods; 2 of them, together with my Postgres and Redis StatefulSet pods, get restarted randomly after some hours.

Logs

Here are the systemctl status iscsid logs from my 1st worker node:


Dec 04 09:05:46 ip-179-21-15-104 iscsid[5198]: Kernel reported iSCSI connection 11:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Dec 04 09:06:22 ip-179-21-15-104 iscsid[5198]: connect to 10.42.3.128:3260 failed (No route to host)
Dec 04 09:06:29 ip-179-21-15-104 iscsid[5198]: connect to 10.42.3.128:3260 failed (No route to host)
Dec 04 09:06:36 ip-179-21-15-104 iscsid[5198]: connect to 10.42.3.128:3260 failed (No route to host)
Dec 04 09:06:43 ip-179-21-15-104 iscsid[5198]: connect to 10.42.3.128:3260 failed (No route to host)
Dec 04 09:06:50 ip-179-21-15-104 iscsid[5198]: connect to 10.42.3.128:3260 failed (No route to host)
Dec 04 09:06:53 ip-179-21-15-104 iscsid[5198]: Connection11:0 to [target: iqn.2019-10.io.longhorn:pvc-e3025f3b-2cb3-434f-992d-1150ea527bca, portal: 10.42.3.128,326>
Dec 04 09:06:54 ip-179-21-15-104 iscsid[5198]: Connection14:0 to [target: iqn.2019-10.io.longhorn:pvc-e3025f3b-2cb3-434f-992d-1150ea527bca, portal: 10.42.3.142,326>
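
For reference, these entries can be gathered on the worker node with something like the commands below (the time window is just an example):

systemctl status iscsid
journalctl -u iscsid --since "2022-12-04 09:00" --no-pager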

longhorn manager logs

time="2022-12-04T09:05:16Z" level=debug msg="Volume got new owner ip-179-21-15-104" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-179-21-15-104 owner=ip-172-31-44-126 state=attached volume=pvc-be17d93c-2730-482a-bd45-91a5671e06fb
time="2022-12-04T09:05:16Z" level=warning msg="Engine of volume dead unexpectedly, reattach the volume" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-179-21-15-104 owner=ip-179-21-15-104 state=attached volume=pvc-be17d93c-2730-482a-bd45-91a5671e06fb
time="2022-12-04T09:05:16Z" level=debug msg="Requeue orphan-10660b7f7d02b524acb4f4db5b5b154939f92bab48a9f58468fbec678fcdee13 due to conflict" controller=longhorn-orphan error="Operation cannot be fulfilled on orphans.longhorn.io \"orphan-10660b7f7d02b524acb4f4db5b5b154939f92bab48a9f58468fbec678fcdee13\": the object has been modified; please apply your changes to the latest version and try again" node=ip-179-21-15-104 orphan=orphan-10660b7f7d02b524acb4f4db5b5b154939f92bab48a9f58468fbec678fcdee13
time="2022-12-04T09:05:16Z" level=warning msg="Try to get requested log for pvc-be17d93c-2730-482a-bd45-91a5671e06fb-e-71dfca70 in instance manager instance-manager-e-f821a48a"
time="2022-12-04T09:05:16Z" level=warning msg="cannot get requested log for instance pvc-be17d93c-2730-482a-bd45-91a5671e06fb-e-71dfca70 on node ip-179-21-15-104, error invalid Instance Manager instance-manager-e-f821a48a, state: error, IP: 10.42.3.128"
time="2022-12-04T09:05:16Z" level=debug msg="Prepare to delete instance pvc-be17d93c-2730-482a-bd45-91a5671e06fb-e-71dfca70"
time="2022-12-04T09:05:16Z" level=debug msg="Instance handler updated instance pvc-be17d93c-2730-482a-bd45-91a5671e06fb-e-71dfca70 state, old state error, new state stopped"
time="2022-12-04T09:05:16Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-be17d93c-2730-482a-bd45-91a5671e06fb\", UID:\"8ae285c1-12c4-46fc-b904-8d8c5257a10e\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2115145\", FieldPath:\"\"}): type: 'Warning' reason: 'DetachedUnexpectly' Engine of volume pvc-be17d93c-2730-482a-bd45-91a5671e06fb dead unexpectedly, reattach the volume"
time="2022-12-04T09:05:16Z" level=debug msg="Replica controller picked up" controller=longhorn-replica controllerID=ip-179-21-15-104 dataPath= node=ip-179-21-15-104 nodeID=ip-179-21-15-104 ownerID=ip-172-31-44-126 replica=pvc-3b62ad3f-d823-450f-bebf-f94d2983d179-r-d6b9a84f
time="2022-12-04T09:05:16Z" level=debug msg="Prepare to delete instance pvc-3b62ad3f-d823-450f-bebf-f94d2983d179-r-d6b9a84f"
time="2022-12-04T09:05:16Z" level=info msg="Try best to clean up the process for replica pvc-3b62ad3f-d823-450f-bebf-f94d2983d179-r-d6b9a84f in instance manager instance-manager-r-64b8070c" controller=longhorn-replica dataPath= node=ip-179-21-15-104 nodeID=ip-179-21-15-104 ownerID=ip-179-21-15-104 replica=pvc-3b62ad3f-d823-450f-bebf-f94d2983d179-r-d6b9a84f
time="2022-12-04T09:05:16Z" level=info msg="Orphan Controller ip-179-21-15-104 picked up orphan-ddadf4e22c2902e2031a9e8e3b3c725507eb810365e81540122a7192bc909a8a" controller=longhorn-orphan node=ip-179-21-15-104 orphan=orphan-ddadf4e22c2902e2031a9e8e3b3c725507eb810365e81540122a7192bc909a8a
time="2022-12-04T09:05:16Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-be17d93c-2730-482a-bd45-91a5671e06fb-e-71dfca70\", UID:\"0aae58ce-1fe1-47c1-a10f-6b349de2f4e1\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2115164\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-be17d93c-2730-482a-bd45-91a5671e06fb-e-71dfca70"
time="2022-12-04T09:05:16Z" level=debug msg="Volume got new owner ip-179-21-15-104" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-179-21-15-104 owner=ip-172-31-44-126 state=attached volume=pvc-3b62ad3f-d823-450f-bebf-f94d2983d179
time="2022-12-04T09:05:16Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-3b62ad3f-d823-450f-bebf-f94d2983d179-r-d6b9a84f\", UID:\"dffea271-f186-48aa-abc1-4e424956e309\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2115147\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-3b62ad3f-d823-450f-bebf-f94d2983d179-r-d6b9a84f"
time="2022-12-04T09:05:16Z" level=warning msg="Engine of volume dead unexpectedly, reattach the volume" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-179-21-15-104 owner=ip-179-21-15-104 state=attached volume=pvc-3b62ad3f-d823-450f-bebf-f94d2983d179
time="2022-12-04T09:05:16Z" level=warning msg="Engine of volume dead unexpectedly, reattach the volume" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-179-21-15-104 owner=ip-179-21-15-104 state=attached volume=pvc-e3025f3b-2cb3-434f-992d-1150ea527bca
time="2022-12-04T09:05:16Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-3b62ad3f-d823-450f-bebf-f94d2983d179\", UID:\"9b5be04c-a220-40c3-b274-415b84ac4da9\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2115144\", FieldPath:\"\"}): type: 'Warning' reason: 'DetachedUnexpectly' Engine of volume pvc-3b62ad3f-d823-450f-bebf-f94d2983d179 dead unexpectedly, reattach the volume"
time="2022-12-04T09:05:16Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-e3025f3b-2cb3-434f-992d-1150ea527bca\", UID:\"468954c2-d353-46fe-abf6-f199a67b92d2\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2115168\", FieldPath:\"\"}): type: 'Warning' reason: 'DetachedUnexpectly' Engine of volume pvc-e3025f3b-2cb3-434f-992d-1150ea527bca dead unexpectedly, reattach the volume"
time="2022-12-04T09:05:16Z" level=warning msg="Instance pvc-e3025f3b-2cb3-434f-992d-1150ea527bca-e-3c5a0ec2 crashed on Instance Manager instance-manager-e-f821a48a at ip-179-21-15-104, try to get log"
time="2022-12-04T09:05:16Z" level=warning msg="cannot get crash log for instance pvc-e3025f3b-2cb3-434f-992d-1150ea527bca-e-3c5a0ec2 on Instance Manager instance-manager-e-f821a48a at ip-179-21-15-104, error invalid Instance Manager instance-manager-e-f821a48a, state: error, IP: 10.42.3.128"
time="2022-12-04T09:05:16Z" level=warning msg="Instance pvc-e3025f3b-2cb3-434f-992d-1150ea527bca-e-3c5a0ec2 crashed on Instance Manager instance-manager-e-f821a48a at ip-179-21-15-104, try to get log"
time="2022-12-04T09:05:16Z" level=warning msg="cannot get crash log for instance pvc-e3025f3b-2cb3-434f-992d-1150ea527bca-e-3c5a0ec2 on Instance Manager instance-manager-e-f821a48a at ip-179-21-15-104, error invalid Instance Manager instance-manager-e-f821a48a, state: error, IP: 10.42.3.128"
time="2022-12-04T09:05:16Z" level=debug msg="Prepare to delete instance pvc-be17d93c-2730-482a-bd45-91a5671e06fb-r-f70ca683"
time="2022-12-04T09:05:16Z" level=info msg="Try best to clean up the process for replica pvc-be17d93c-2730-482a-bd45-91a5671e06fb-r-f70ca683 in instance manager instance-manager-r-64b8070c" controller=longhorn-replica dataPath= node=ip-179-21-15-104 nodeID=ip-179-21-15-104 ownerID=ip-179-21-15-104 replica=pvc-be17d93c-2730-482a-bd45-91a5671e06fb-r-f70ca683
time="2022-12-04T09:05:16Z" level=debug msg="Prepare to delete instance pvc-e3025f3b-2cb3-434f-992d-1150ea527bca-r-19798a1e"
time="2022-12-04T09:05:16Z" level=info msg="Try best to clean up the process for replica pvc-e3025f3b-2cb3-434f-992d-1150ea527bca-r-19798a1e in instance manager instance-manager-r-64b8070c" controller=longhorn-replica dataPath= node=ip-179-21-15-104 nodeID=ip-179-21-15-104 ownerID=ip-179-21-15-104 replica=pvc-e3025f3b-2cb3-434f-992d-1150ea527bca-r-19798a1e
time="2022-12-04T09:05:16Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-be17d93c-2730-482a-bd45-91a5671e06fb-r-f70ca683\", UID:\"175a68b6-117d-4b72-b8a4-22b251934629\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2115165\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-be17d93c-2730-482a-bd45-91a5671e06fb-r-f70ca683"
time="2022-12-04T09:05:16Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-e3025f3b-2cb3-434f-992d-1150ea527bca-r-19798a1e\", UID:\"4d3cf89e-7440-497c-aa55-f535871e4d71\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2115115\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-e3025f3b-2cb3-434f-992d-1150ea527bca-r-19798a1e"
time="2022-12-04T09:05:16Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-be17d93c-2730-482a-bd45-91a5671e06fb\", UID:\"8ae285c1-12c4-46fc-b904-8d8c5257a10e\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2115175\", FieldPath:\"\"}): type: 'Normal' reason: 'Remount' Volume pvc-be17d93c-2730-482a-bd45-91a5671e06fb requested remount at 2022-12-04T09:05:16Z"
time="2022-12-04T09:05:16Z" level=debug msg="Requeue orphan-ddadf4e22c2902e2031a9e8e3b3c725507eb810365e81540122a7192bc909a8a due to conflict" controller=longhorn-orphan error="Operation cannot be fulfilled on orphans.longhorn.io \"orphan-ddadf4e22c2902e2031a9e8e3b3c725507eb810365e81540122a7192bc909a8a\": the object has been modified; please apply your changes to the latest version and try again" node=ip-179-21-15-104 orphan=orphan-ddadf4e22c2902e2031a9e8e3b3c725507eb810365e81540122a7192bc909a8a
time="2022-12-04T09:05:16Z" level=warning msg="Instance pvc-3b62ad3f-d823-450f-bebf-f94d2983d179-e-7c71be25 crashed on Instance Manager instance-manager-e-f821a48a at ip-179-21-15-104, try to get log"
time="2022-12-04T09:05:16Z" level=warning msg="cannot get crash log for instance pvc-3b62ad3f-d823-450f-bebf-f94d2983d179-e-7c71be25 on Instance Manager instance-manager-e-f821a48a at ip-179-21-15-104, error invalid Instance Manager instance-manager-e-f821a48a, state: error, IP: 10.42.3.128"
time="2022-12-04T09:05:16Z" level=debug msg="Instance Manager Controller ip-179-21-15-104 picked up instance-manager-r-64b8070c" controller=longhorn-instance-manager instanceManager=instance-manager-r-64b8070c node=ip-179-21-15-104 nodeID=ip-179-21-15-104
time="2022-12-04T09:05:16Z" level=info msg="Instance manager state is updated from unknown to error after syncing with the pod" controller=longhorn-instance-manager instanceManager=instance-manager-r-64b8070c node=ip-179-21-15-104 nodeID=ip-179-21-15-104
time="2022-12-04T09:05:16Z" level=info msg="Orphan Controller ip-179-21-15-104 picked up orphan-10660b7f7d02b524acb4f4db5b5b154939f92bab48a9f58468fbec678fcdee13" controller=longhorn-orphan node=ip-179-21-15-104 orphan=orphan-10660b7f7d02b524acb4f4db5b5b154939f92bab48a9f58468fbec678fcdee13
time="2022-12-04T09:05:16Z" level=info msg="Only delete orphan orphan-10660b7f7d02b524acb4f4db5b5b154939f92bab48a9f58468fbec678fcdee13 resource object and do not delete the orphaned data" controller=longhorn-orphan node=ip-179-21-15-104 orphan=orphan-10660b7f7d02b524acb4f4db5b5b154939f92bab48a9f58468fbec678fcdee13


time="2022-12-04T00:36:16Z" level=warning msg="Error syncing Longhorn setting longhorn-system/storage-network" controller=longhorn-setting error="failed to sync setting for longhorn-system/storage-network: cannot apply storage-network setting to Longhorn workloads when there are attached volumes" node=ip-179-21-15-104
time="2022-12-04T00:36:16Z" level=warning msg="Error syncing Longhorn setting longhorn-system/storage-network" controller=longhorn-setting error="failed to sync setting for longhorn-system/storage-network: cannot apply storage-network setting to Longhorn workloads when there are attached volumes" node=ip-179-21-15-104
time="2022-12-04T00:36:16Z" level=warning msg="Error syncing Longhorn setting longhorn-system/storage-network" controller=longhorn-setting error="failed to sync setting for longhorn-system/storage-network: cannot apply storage-network setting to Longhorn workloads when there are attached volumes" node=ip-179-21-15-104
E1204 00:36:16.670081       1 setting_controller.go:190] failed to sync setting for longhorn-system/storage-network: cannot apply storage-network setting to Longhorn workloads when there are attached volumes
time="2022-12-04T00:36:16Z" level=warning msg="Dropping Longhorn setting longhorn-system/storage-network out of the queue" controller=longhorn-setting error="failed to sync setting for longhorn-system/storage-network: cannot apply storage-network setting to Longhorn workloads when there are attached volumes" node=ip-179-21-15-104
W1204 00:37:57.595080       1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W1204 00:43:49.497616       1 warnings.go:70] batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob
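
To see which instance managers are failing and to capture their logs before they get cleaned up, something along these lines can help (the instance-manager name here is taken from the logs above and will differ per cluster):

kubectl -n longhorn-system get pods -o wide
kubectl -n longhorn-system get instancemanagers.longhorn.io
kubectl -n longhorn-system logs instance-manager-e-f821a48a --previous   # --previous only applies if the container restarted in place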

Environment

  • Longhorn version: 1.3.2
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s (v1.23.6+k3s1)
    • Number of management node in the cluster: 3
    • Number of worker node in the cluster: 2
  • Node config
    • OS type and version: Ubuntu 20.04
  • Number of Longhorn volumes in the cluster: 3

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 67 (25 by maintainers)

Most upvoted comments

Oooh. This looks interesting:

ganesha-2479[main] gsh_dbus_register_path :DBUS :CRIT :dbus_connection_register_object_path called with no DBUS connection
2023-04-05T15:46:59.273750831+01:00 05/04/2023 14:46:59 : epoch 642d89e3 : share-manager-pvc-e2a556a4-ce12-484d-a0b6-6232b17fe6c8 : nfs-ganesha-2479[main] gsh_dbus_register_path :DBUS :CRIT :dbus_connection_register_object_path called with no DBUS connection
2023-04-05T15:46:59.273757875+01:00 05/04/2023 14:46:59 : epoch 642d89e3 : share-manager-pvc-e2a556a4-ce12-484d-a0b6-6232b17fe6c8 : nfs-ganesha-2479[main] gsh_dbus_register_path :DBUS :CRIT :dbus_connection_register_object_path called with no DBUS connection

Is nfs-ganesha not being given the permissions it needs re: DBUS?
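
One rough way to check might be to look for a D-Bus daemon/socket inside the share-manager pod (pod name taken from the log above; ps may not exist in the image, in which case only the socket check applies):

kubectl -n longhorn-system exec share-manager-pvc-e2a556a4-ce12-484d-a0b6-6232b17fe6c8 -- ls -l /run/dbus
kubectl -n longhorn-system exec share-manager-pvc-e2a556a4-ce12-484d-a0b6-6232b17fe6c8 -- ps aux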

Wow yeah, you had etcd blocked for more than two minutes on a fsync call (ext4_sync_file).

[Thu Dec  8 20:36:41 2022] INFO: task etcd:2094053 blocked for more than 122 seconds.
[Thu Dec  8 20:36:41 2022]       Tainted: G          I    X --------- ---  5.14.0-162.6.1.el9_1.x86_64 #1
[Thu Dec  8 20:36:41 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Dec  8 20:36:41 2022] task:etcd            state:D stack:    0 pid:2094053 ppid:2094011 flags:0x00000004
[Thu Dec  8 20:36:41 2022] Call Trace:
[Thu Dec  8 20:36:41 2022]  __schedule+0x206/0x580
[Thu Dec  8 20:36:41 2022]  schedule+0x43/0xa0
[Thu Dec  8 20:36:41 2022]  io_schedule+0x42/0x70
[Thu Dec  8 20:36:41 2022]  folio_wait_bit_common+0x147/0x380
[Thu Dec  8 20:36:41 2022]  ? find_get_pages_range_tag+0x199/0x1e0
[Thu Dec  8 20:36:41 2022]  ? file_fdatawait_range+0x20/0x20
[Thu Dec  8 20:36:41 2022]  folio_wait_writeback+0x28/0x80
[Thu Dec  8 20:36:41 2022]  __filemap_fdatawait_range+0x81/0xf0
[Thu Dec  8 20:36:41 2022]  ? __filemap_fdatawrite_range+0xc0/0xe0
[Thu Dec  8 20:36:41 2022]  file_write_and_wait_range+0x72/0xa0
[Thu Dec  8 20:36:41 2022]  ext4_sync_file+0x73/0x370 [ext4]
[Thu Dec  8 20:36:41 2022]  __x64_sys_fdatasync+0x43/0x80
[Thu Dec  8 20:36:41 2022]  do_syscall_64+0x59/0x90
[Thu Dec  8 20:36:41 2022]  ? exc_page_fault+0x62/0x150
[Thu Dec  8 20:36:41 2022]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

Etcd calls fsync to commit EVERY write so this will happen a lot. Note that if you have etcd and longhorn on the same backing disks, the etcd fsync calls will also need to flush the LH writes as well, so I would recommend against having them on the same physical disks. I would also try to confirm that you don’t have some problem with your disks or storage controller; even on a pathologically overloaded system I wouldn’t expect a fsync call to hang for that long unless something else is broken.
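
If you want to quantify the fsync latency, the usual fio check for etcd disks looks roughly like this (run it against a scratch directory on the filesystem that backs etcd; the directory path here is only an example):

mkdir -p /var/lib/etcd/fio-test
fio --name=etcd-fsync --directory=/var/lib/etcd/fio-test --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300
rm -rf /var/lib/etcd/fio-test

The fdatasync 99th-percentile latency reported by fio should ideally stay well under 10ms for etcd to be happy.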

@brandond Do you know anything about Antrea CNI version 1.7 in combination with RKE1?

I know nothing about either Antrea or RKE1, sorry.

Prometheus is pretty IO intensive; if the minimal reproducer for you is installing Prometheus with a LH PVC, this suggests that perhaps your IO throughput on the LH backing disks is insufficient.
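
A quick way to check whether the Longhorn backing disk is the bottleneck while Prometheus is running is to watch the extended device stats, for example:

iostat -x 5
# look at %util, w_await and aqu-sz for the device that holds /var/lib/longhorn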

Am I seeing this right that docker is crashing and taking down all the pods? It feels like a stable container runtime is a requirement for a functional LH cluster; if the container runtime is crashing and terminating pods (including LH) that seems like the first thing to investigate and resolve.
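
To confirm whether dockerd itself is restarting (and when), checking the service status and unit logs on the affected node should show it, e.g.:

systemctl status docker
journalctl -u docker --since "2022-12-08" --no-pager | grep -iE "panic|fatal|exit-code|start|stop"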

@derekbit

Spoke too fast; the problem happened again even when using Longhorn version 1.3.0.

Dec 13 11:03:29 kube03 iscsid[1752]: iscsid: connect to 10.42.2.240:3260 failed (No route to host)
Dec 13 11:02:22 kube03 iscsid[1752]: iscsid: Kernel reported iSCSI connection 37:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)

I will try updating the system packages on all of our nodes

Upgrading:
 containerd.io                                            x86_64                                            1.6.12-3.1.el9                                            docker-ce-stable                                                          32 M
 tzdata                                                   noarch                                            2022g-1.el9_1                                             rhel-9-for-x86_64-baseos-rpms                                            839 k

@derekbit By the way, here’s the result of dmesg -T --since "2022-12-10": https://gist.github.com/iMikeG6/e1767a3cae2c4f9b2730b2c3126dd58a

dockerd.service doesn’t have any interesting logs or errors. Same for journalctl --since "2022-12-11": nothing interesting this time.

@proteche I suggest you use https://gist.github.com/ to upload your logs… It will make this issue much more readable than it is now; there are way too many lines…

Anyway, did you try to roll back the update or install Longhorn version 1.3.0? For us, it seems like it helped.
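
For reference, testing 1.3.0 from the upstream chart would look something like the commands below (note that, as far as I know, Longhorn does not support in-place downgrades, so this is more of a fresh-install/test option than a rollback):

helm repo add longhorn https://charts.longhorn.io
helm repo update
helm upgrade --install longhorn longhorn/longhorn --namespace longhorn-system --create-namespace --version 1.3.0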

@derekbit Two days without the errors below while on Longhorn version 1.3.0 (helm chart version: 100.2.0+up1.3.0): iscsid: connect to 10.42.0.142:3260 failed (No route to host) and iscsid: Kernel reported iSCSI connection 43:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)

So there must be something in the image used for the new version. As we are using RedHat (9.1) and Rancher uses SUSE, it could be related to an incompatibility between the two systems. We have seen that before in our old cluster, which ran RedHat 8 with a container based on Ubuntu.

Now, the only error logs reported are these two lines:

kube01 systemd[1]: kubepods-burstable-pod6c3a31d5_a6d4_4760_9548_1f6052cce934.slice: Failed to open /run/systemd/transient/kubepods-burstable-pod6c3a31d5_a6d4_4760_9548_1f6052cce934.slice: No such file or directory
kube01 kernel: iptables[3853612]: segfault at 80 ip 00007f1bfb5f2964 sp 00007fff90e741a8 error 4 in libnftnl.so.11.2.0[7f1bfb5ed000+19000]

I’m not even sure those two errors are linked to any Rancher services.
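
If it helps with triage, the iptables/nftables builds involved in that segfault can be listed on the node (RHEL 9 package names assumed):

rpm -q iptables-nft iptables-libs nftables libnftnl
iptables --version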


@iMikeG6 Can you send us the latest support bundle? I’d like to check it with the latest dmesg. Thank you.

@derekbit I just sent you the new bundle.

The problem seems to happen every evening, between 8 PM and 2 AM. This time it happened on all of our workers (3 physical machines).

New errors appear; below is just one line, but we have several of them in our log file.

systemd[1]: kubepods-burstable-pode7d56a41_9bc6_4ca5_b7ef_85ab4f5b8211.slice: Failed to open /run/systemd/transient/kubepods-burstable-pode7d56a41_9bc6_4ca5_b7ef_85ab4f5b8211.slice: No such file or directory

Along with the errors reported here: iscsid: connect to 10.42.0.238:3260 failed (No route to host) and Kernel reported iSCSI connection 20:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3).
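
Since 10.42.x.x addresses are pod IPs, it may be worth checking whether the iSCSI target address still belongs to a live instance-manager pod when the error fires, e.g.:

kubectl -n longhorn-system get pods -o wide | grep 10.42.0.238

If nothing matches, the engine instance-manager pod that served that target is gone, which would explain the "No route to host".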

According to our monitoring system, I/O wait on one of our workers climbed to 48% for a few minutes.

Interesting logs were found on one of our 3 worker nodes:

Dec  8 20:34:51 kube03 iscsid[1752]: iscsid: Kernel reported iSCSI connection 12:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Dec  8 20:34:38 kube03 systemd[1]: docker.service: Failed with result 'exit-code'.
Dec  8 20:34:51 kube03 iscsid[1752]: iscsid: IPC qtask write failed: Broken pipe
Dec  8 20:35:03 kube03 firewalld[1398]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t nat -D OUTPUT -m addrtype --dst-type LOCAL -j DOCKER' failed: iptables: Bad rule (does a matching rule exist in that chain?).
Dec  8 20:35:03 kube03 firewalld[1398]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t nat -D PREROUTING' failed: iptables: Bad rule (does a matching rule exist in that chain?).
Dec  8 20:35:03 kube03 firewalld[1398]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t nat -D OUTPUT' failed: iptables: Bad rule (does a matching rule exist in that chain?).
Dec  8 20:35:03 kube03 firewalld[1398]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -X DOCKER' failed: iptables v1.8.8 (nf_tables):  CHAIN_DEL failed (Device or resource busy): chain DOCKER
Dec  8 20:35:03 kube03 firewalld[1398]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -X DOCKER-ISOLATION-STAGE-1' failed: iptables v1.8.8 (nf_tables):  CHAIN_DEL failed (Device or resource busy): chain DOCKER-ISOLATION-STAGE-1
Dec  8 20:35:03 kube03 firewalld[1398]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -F DOCKER-ISOLATION' failed: iptables: No chain/target/match by that name.
Dec  8 20:35:03 kube03 firewalld[1398]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -X DOCKER-ISOLATION' failed: iptables: No chain/target/match by that name.
Dec  8 20:35:03 kube03 firewalld[1398]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -D FORWARD -i docker0 -o docker0 -j DROP' failed: iptables: Bad rule (does a matching rule exist in that chain?).
Dec  8 20:35:08 kube03 firewalld[1398]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -D FORWARD -i docker0 -o docker0 -j DROP' failed: iptables: Bad rule (does a matching rule exist in that chain?).

From the node where iptables failed, I found this in dmesg -T:

[Thu Dec  8 22:00:01 2022] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=docker-ea60336de4ed9e01facbdfd8b5f76c27dc65aa771b9f584477dcee05033f0056.scope,mems_allowed=0-1,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod87aa6d9d_7ebe_407b_9b67_b92d9063d1a5.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod87aa6d9d_7ebe_407b_9b67_b92d9063d1a5.slice/docker-ea60336de4ed9e01facbdfd8b5f76c27dc65aa771b9f584477dcee05033f0056.scope,task=prometheus,pid=2960544,uid=65534
[Thu Dec  8 22:00:01 2022] Memory cgroup out of memory: Killed process 2960544 (prometheus) total-vm:4101332kB, anon-rss:1041520kB, file-rss:71064kB, shmem-rss:0kB, UID:65534 pgtables:2772kB oom_score_adj:995

Here the full gist file with the dmesg logs: https://gist.github.com/iMikeG6/f9df95394426ad6a68e44263e3132d8a
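
Given the OOM kill above, raising the Prometheus memory limit (or cutting retention/scrape load) might be worth a try. With plain kubectl that could look roughly like the command below; the namespace and workload name are placeholders, and if Prometheus is managed by the prometheus-operator the resources should be set in the Prometheus custom resource instead, or the operator will revert the change:

kubectl -n <monitoring-namespace> set resources statefulset/<prometheus-workload> --limits=memory=4Gi --requests=memory=2Gi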

The current workload of this cluster is not really high. We were planning on deploying a new production cluster based on this one, but because of this issue we may have to rethink our storage solution and look at Rook Ceph if this problem keeps happening.

I hope I can provide as much information as possible so this can be fixed soon.