longhorn: [BUG] Delete the workload pod of a failed engine automatically to trigger recovery
Describe the bug This happens a lot with a Prometheus PVC after network issues. Longhorn fails to write to a replica due to timeout errors. When the network recovers, Longhorn salvages the PVC and it becomes healthy in the Longhorn UI, but the running workload can no longer use it. When you try to restart the workload, the pod gets stuck in the Terminating state. PVC detaching takes about 5 minutes, but even after that the pod remains stuck in Terminating.
To Reproduce I don’t really know how to reproduce it.
Expected behavior After the volume is salvaged, the workload should be able to use it again without manual intervention, e.g. by Longhorn automatically deleting the consuming pod to trigger recovery (as the title suggests).
Log
This appears in the PVC events with type `Warning` and reason `Remount`:
cannot proceed to remount pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on prod-cluster-node-1: Failed to execute: nsenter [--mount=/host/proc/1285/ns/mnt --net=/host/proc/1285/ns/net umount /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount], output , stderr, umount: /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount: target is busy. , error exit status 32
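A `target is busy` failure means something still references the mount point, typically a process with open files or a nested mount. A minimal diagnostic sketch for the node, assuming `findmnt` and `fuser` are available there (the PVC name and pod UID are copied from the event above):

```bash
# All mounts of the Longhorn block device, including bind-mount copies
findmnt --source /dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a

# Processes holding the kubelet mount point open
fuser -vm /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount

# Nested mounts (e.g. subPath bind mounts) under the mount point also
# make the parent "busy"; -R lists the mount tree rooted at the target
findmnt -R /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount
```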
kubelet logs about pvc
I0825 03:13:35.459247 2685 reconciler.go:183] operationExecutor.UnmountVolume started for volume "prometheus-prometheus-operator-prometheus-db" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a") pod "66e0b868-aad8-4da8-a6ad-0b329851500f" (UID: "66e0b868-aad8-4da8-a6ad-0b329851500f")
E0825 03:13:35.471022 2685 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/csi/driver.longhorn.io^pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\" (\"66e0b868-aad8-4da8-a6ad-0b329851500f\")" failed. No retries permitted until 2020-08-25 03:15:37.470937411 +0000 UTC m=+681459.965365562 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"prometheus-prometheus-operator-prometheus-db\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\") pod \"66e0b868-aad8-4da8-a6ad-0b329851500f\" (UID: \"66e0b868-aad8-4da8-a6ad-0b329851500f\") : kubernetes.io/csi: mounter.TearDownAt failed: rpc error: code = Internal desc = Unmount failed: exit status 32\nUnmounting arguments: /var/lib/kubelet/pods/66e0b868-aad8-4da8-a6ad-0b329851500f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount\nOutput: umount: /var/lib/kubelet/pods/66e0b868-aad8-4da8-a6ad-0b329851500f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount: target is busy.\n\n"
kubelet logs about pod
I0825 05:59:33.445991 2685 kubelet.go:1951] SyncLoop (PLEG): "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3)", event: &pleg.PodLifecycleEvent{ID:"fce7f478-38bd-419c-912d-f28ccf994fd3", Type:"ContainerDied", Data:"a997a4290e6e339935cd01443306c138f6eb1f870f6650d0a25044093be1aa51"}
I0825 05:59:35.472191 2685 kubelet.go:1984] SyncLoop (container unhealthy): "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3)"
I0825 05:59:35.634802 2685 kubelet.go:1951] SyncLoop (PLEG): "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3)", event: &pleg.PodLifecycleEvent{ID:"fce7f478-38bd-419c-912d-f28ccf994fd3", Type:"ContainerStarted", Data:"3e3f39ca36139ea1c8a0e865cf9106bfadac5d8fc4e4d9b8d1a8fffb5b8bab10"}
I0825 05:59:40.055585 2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): HTTP probe failed with statuscode: 503
I0825 05:59:45.055394 2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): HTTP probe failed with statuscode: 503
I0825 05:59:50.055010 2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): HTTP probe failed with statuscode: 503
I0825 06:01:52.704263 2685 kubelet.go:1922] SyncLoop (DELETE, "api"): "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3)"
I0825 06:01:53.660145 2685 kubelet.go:1951] SyncLoop (PLEG): "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3)", event: &pleg.PodLifecycleEvent{ID:"fce7f478-38bd-419c-912d-f28ccf994fd3", Type:"ContainerDied", Data:"83ad67b2fe8cf5fe79f9e7199eda6fb95c922d056e599a9c8e17f37d16eb2966"}
I0825 06:01:53.660387 2685 kubelet.go:1951] SyncLoop (PLEG): "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3)", event: &pleg.PodLifecycleEvent{ID:"fce7f478-38bd-419c-912d-f28ccf994fd3", Type:"ContainerDied", Data:"010a2be56e8ff7d5786204f6be7bf15ed0a28da611e7b242cbe5c4c8a59787e8"}
I0825 06:01:55.054558 2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:01:55.472496 2685 prober.go:116] Liveness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/healthy: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:00.054414 2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:00.472968 2685 prober.go:116] Liveness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/healthy: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:05.054707 2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:05.472702 2685 prober.go:116] Liveness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/healthy: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:10.054739 2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:10.472452 2685 prober.go:116] Liveness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/healthy: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:15.054603 2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:15.472592 2685 prober.go:116] Liveness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/healthy: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:20.054817 2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:20.472761 2685 prober.go:116] Liveness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/healthy: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:20.472921 2685 kubelet.go:1984] SyncLoop (container unhealthy): "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3)"
I0825 06:02:25.054670 2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:30.054602 2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:35.054744 2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:40.054919 2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:45.054694 2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:47.984359 2685 kubelet.go:1951] SyncLoop (PLEG): "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3)", event: &pleg.PodLifecycleEvent{ID:"fce7f478-38bd-419c-912d-f28ccf994fd3", Type:"ContainerDied", Data:"3e3f39ca36139ea1c8a0e865cf9106bfadac5d8fc4e4d9b8d1a8fffb5b8bab10"}
I0825 06:02:47.984547 2685 kubelet.go:1951] SyncLoop (PLEG): "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3)", event: &pleg.PodLifecycleEvent{ID:"fce7f478-38bd-419c-912d-f28ccf994fd3", Type:"ContainerDied", Data:"e6b5395cb43bdbbb5a68d704a1571cff82cf2a0be5114f488c8896f0ce6b3b3e"}
longhorn-manager logs (stripped for clarity)
time="2020-08-24T20:31:53Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9\", UID:\"840aaec6-6ebf-42ca-985b-86268ba6d890\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908727\", FieldPath:\"\"}): type: 'Warning' reason: 'Faulted' Detected replica pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-r-25a8ed78 (172.20.3.40:10015) in error"
time="2020-08-24T20:31:56Z" level=warning msg="Try to get requested log for pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9 on node prod-cluster-node-1"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:36Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31416 size= 16 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:36Z\" level=error msg=\"Retry 1 on replica 172.20.3.40:10016 seq= 31416 size= 16 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:36Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31417 size= 332 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:36Z\" level=error msg=\"Retry 1 on replica 172.20.3.40:10016 seq= 31417 size= 332 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:36Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31418 size= 1024 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:36Z\" level=error msg=\"Retry 1 on replica 172.20.3.40:10016 seq= 31418 size= 1024 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:36Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31419 size= 1280 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:36Z\" level=error msg=\"Retry 1 on replica 172.20.3.40:10016 seq= 31419 size= 1280 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:37Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31420 size= 16 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:37Z\" level=error msg=\"Retry 1 on replica 172.20.3.40:10016 seq= 31420 size= 16 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:40Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31423 size= 16 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:40Z\" level=error msg=\"Retry 2 on replica 172.20.3.40:10016 seq= 31423 size= 16 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:40Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31424 size= 332 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:40Z\" level=error msg=\"Retry 2 on replica 172.20.3.40:10016 seq= 31424 size= 332 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:40Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31425 size= 1024 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:40Z\" level=error msg=\"Retry 2 on replica 172.20.3.40:10016 seq= 31425 size= 1024 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:40Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31426 size= 1280 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:40Z\" level=error msg=\"Retry 2 on replica 172.20.3.40:10016 seq= 31426 size= 1280 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:41Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31427 size= 16 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:41Z\" level=error msg=\"Retry 2 on replica 172.20.3.40:10016 seq= 31427 size= 16 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:44Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31430 size= 16 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:44Z\" level=error msg=\"Setting replica tcp://172.20.3.40:10015 to ERR due to: r/w timeout\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:44Z\" level=info msg=\"Set replica tcp://172.20.3.40:10015 to mode ERR\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:44Z\" level=error msg=\"I/O error: tcp://172.20.3.40:10015: r/w timeout\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:44Z\" level=info msg=\"Monitoring stopped tcp://172.20.3.40:10015\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:44Z\" level=error msg=\"Setting replica tcp://172.20.3.40:10015 to ERR due to: r/w timeout\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:44Z\" level=info msg=\"Ignore set replica tcp://172.20.3.40:10015 to mode ERR due to it's ERR\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:44Z\" level=error msg=\"I/O error: tcp://172.20.3.40:10015: r/w timeout\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:37.131954: ->172.20.3.40:10016 W[ 16kB] 7705959us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:36.837311: ->172.20.3.40:10016 W[ 16kB] 8000610us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:40.849008: ->172.20.3.40:10016 W[1280kB] 3988915us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:41.132211: ->172.20.3.40:10016 W[ 16kB] 3705714us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:36.838137: ->172.20.3.40:10016 W[332kB] 7999791us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:36.846568: ->172.20.3.40:10016 W[1024kB] 7991362us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:40.837592: ->172.20.3.40:10016 W[ 16kB] 4000339us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:33.131624: ->172.20.3.40:10016 W[ 16kB] 11706309us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:40.838397: ->172.20.3.40:10016 W[332kB] 3999537us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:40.846841: ->172.20.3.40:10016 W[1024kB] 3991096us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:36.848572: ->172.20.3.40:10016 W[1280kB] 7989367us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:41.132211: ->172.20.3.40:10016 W[ 16kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:40.849008: ->172.20.3.40:10016 W[1280kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:33.131624: ->172.20.3.40:10016 W[ 16kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:36.838137: ->172.20.3.40:10016 W[332kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:36.848572: ->172.20.3.40:10016 W[1280kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:37.131954: ->172.20.3.40:10016 W[ 16kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:40.846841: ->172.20.3.40:10016 W[1024kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:40.837592: ->172.20.3.40:10016 W[ 16kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:40.838397: ->172.20.3.40:10016 W[332kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:36.846568: ->172.20.3.40:10016 W[1024kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:36.837311: ->172.20.3.40:10016 W[ 16kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:44Z\" level=error msg=\"I/O error: No backend available\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:46Z\" level=error msg=\"I/O error: No backend available\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:46Z\" level=error msg=\"I/O error: No backend available\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:46Z\" level=error msg=\"I/O error: No backend available\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:55Z\" level=error msg=\"Error reading from wire: EOF\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:57Z\" level=warning msg=\"Received signal interrupt to shutdown\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:57Z\" level=warning msg=\"Starting to execute registered shutdown func github.com/longhorn/longhorn-engine/app/cmd.startController.func1\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:57Z\" level=info msg=\"Shutdown SCSI device for 172.20.6.51:iqn.2019-10.io.longhorn:pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:57Z\" level=info msg=\"Shutdown SCSI target iqn.2019-10.io.longhorn:pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:57Z\" level=warning msg=\"Socket server connection closed\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:57Z\" level=info msg=\"device pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a: SCSI device /dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a shutdown\""
time="2020-08-24T20:31:57Z" level=debug msg="Prepare to delete instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9\", UID:\"840aaec6-6ebf-42ca-985b-86268ba6d890\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908818\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=debug msg="Prepare to delete instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9\", UID:\"840aaec6-6ebf-42ca-985b-86268ba6d890\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908829\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=debug msg="Prepare to delete instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9\", UID:\"840aaec6-6ebf-42ca-985b-86268ba6d890\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908831\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=debug msg="Prepare to delete instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9\", UID:\"840aaec6-6ebf-42ca-985b-86268ba6d890\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908833\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=debug msg="Requeue longhorn-system/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a due to error <nil> or Operation cannot be fulfilled on engines.longhorn.io \"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9\": the object has been modified; please apply your changes to the latest version and try again"
time="2020-08-24T20:31:57Z" level=warning msg="Try to get requested log for pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9 on node prod-cluster-node-1"
time="2020-08-24T20:31:57Z" level=error msg="failed to receive log for instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: rpc error: code = NotFound desc = cannot find process pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=warning msg="cannot get requested log for instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9 on node prod-cluster-node-1, error rpc error: code = NotFound desc = cannot find process pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=debug msg="Prepare to delete instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9\", UID:\"840aaec6-6ebf-42ca-985b-86268ba6d890\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908838\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
E0824 20:31:57.986382 1 engine_controller.go:667] fail to update status for engine pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: failed to list replicas from controller 'pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a': Failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.0.2/longhorn [--url 172.20.6.51:10000 ls], output , stderr, time="2020-08-24T20:31:57Z" level=fatal msg="Error running ls command: failed to list replicas for volume 172.20.6.51:10000: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.20.6.51:10000: connect: connection refused\""
time="2020-08-24T20:31:58Z" level=debug msg="Instance handler updated instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9 state, old state running, new state stopped"
time="2020-08-24T20:31:58Z" level=debug msg="Stop monitoring pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:58Z" level=debug msg="Requeue longhorn-system/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a due to error <nil> or Operation cannot be fulfilled on engines.longhorn.io \"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9\": the object has been modified; please apply your changes to the latest version and try again"
time="2020-08-24T20:31:58Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\", UID:\"3c722424-a9cb-4e34-868c-e548c78a5d46\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908803\", FieldPath:\"\"}): type: 'Normal' reason: 'Detached' volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a has been detached"
time="2020-08-24T20:31:59Z" level=debug msg="Requeue longhorn-system/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9 due to conflict: Operation cannot be fulfilled on engines.longhorn.io \"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9\": the object has been modified; please apply your changes to the latest version and try again"
time="2020-08-24T20:31:59Z" level=warning msg="Replica pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-r-25a8ed78 of volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a has been automatically salvaged"
time="2020-08-24T20:31:59Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\", UID:\"3c722424-a9cb-4e34-868c-e548c78a5d46\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908868\", FieldPath:\"\"}): type: 'Warning' reason: 'AutoSalvaged' Replica pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-r-25a8ed78 of volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a has been automatically salvaged"
time="2020-08-24T20:32:01Z" level=debug msg="Prepare to create instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:32:01Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9\", UID:\"840aaec6-6ebf-42ca-985b-86268ba6d890\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908902\", FieldPath:\"\"}): type: 'Normal' reason: 'Start' Starts pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:32:02Z" level=debug msg="Instance process pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9 had been created, need to wait for instance manager update"
time="2020-08-24T20:32:02Z" level=debug msg="Instance handler updated instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9 state, old state stopped, new state starting"
time="2020-08-24T20:32:04Z" level=debug msg="Instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9 starts running, IP 172.20.6.51"
time="2020-08-24T20:32:04Z" level=debug msg="Instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9 starts running, Port 10000"
time="2020-08-24T20:32:04Z" level=debug msg="Instance handler updated instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9 state, old state starting, new state running"
time="2020-08-24T20:32:04Z" level=debug msg="Start monitoring pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:32:17Z" level=debug msg="Succeed to handle the remount request for volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on host path /var/lib/kubelet/pods/b658164e-5e79-41af-9449-575f7c45181f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db"
time="2020-08-24T20:32:18Z" level=debug msg="Succeed to handle the remount request for volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on host path /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db"
time="2020-08-24T20:32:18Z" level=debug msg="Succeed to handle the remount request for volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on host path /var/lib/kubelet/pods/fce7f478-38bd-419c-912d-f28ccf994fd3/volume-subpaths/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/prometheus/2"
time="2020-08-24T20:32:18Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\", UID:\"3c722424-a9cb-4e34-868c-e548c78a5d46\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908891\", FieldPath:\"\"}): type: 'Warning' reason: 'Remount' cannot proceed to remount pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on prod-cluster-node-1: Failed to execute: nsenter [--mount=/host/proc/1285/ns/mnt --net=/host/proc/1285/ns/net umount /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount], output , stderr, umount: /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount: target is busy.\n, error exit status 32"
time="2020-08-24T20:32:18Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\", UID:\"3c722424-a9cb-4e34-868c-e548c78a5d46\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908891\", FieldPath:\"\"}): type: 'Normal' reason: 'Attached' volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a has been attached to prod-cluster-node-1"
time="2020-08-24T20:32:18Z" level=debug msg="Succeed to handle the remount request for volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on host path /var/lib/kubelet/pods/fce7f478-38bd-419c-912d-f28ccf994fd3/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db"
time="2020-08-24T20:32:18Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\", UID:\"3c722424-a9cb-4e34-868c-e548c78a5d46\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24909025\", FieldPath:\"\"}): type: 'Warning' reason: 'Remount' cannot proceed to remount pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on prod-cluster-node-1: Failed to execute: nsenter [--mount=/host/proc/1285/ns/mnt --net=/host/proc/1285/ns/net umount /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount], output , stderr, umount: /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount: target is busy.\n, error exit status 32"
time="2020-08-24T20:32:18Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\", UID:\"3c722424-a9cb-4e34-868c-e548c78a5d46\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24909029\", FieldPath:\"\"}): type: 'Warning' reason: 'Remount' cannot proceed to remount pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on prod-cluster-node-1: Failed to execute: nsenter [--mount=/host/proc/1285/ns/mnt --net=/host/proc/1285/ns/net umount /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount], output , stderr, umount: /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount: target is busy.\n, error exit status 32"
time="2020-08-24T20:32:33Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\", UID:\"3c722424-a9cb-4e34-868c-e548c78a5d46\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24909029\", FieldPath:\"\"}): type: 'Warning' reason: 'Remount' cannot proceed to remount pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on prod-cluster-node-1: Failed to execute: nsenter [--mount=/host/proc/1285/ns/mnt --net=/host/proc/1285/ns/net umount /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount], output , stderr, umount: /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount: target is busy.\n, error exit status 32"
time="2020-08-24T20:32:44Z" level=debug msg="Succeed to handle the remount request for volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on host path /var/lib/kubelet/pods/b658164e-5e79-41af-9449-575f7c45181f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db"
time="2020-08-24T20:32:45Z" level=debug msg="Succeed to handle the remount request for volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on host path /var/lib/kubelet/pods/66e0b868-aad8-4da8-a6ad-0b329851500f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db"
time="2020-08-24T20:32:45Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\", UID:\"3c722424-a9cb-4e34-868c-e548c78a5d46\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24909029\", FieldPath:\"\"}): type: 'Warning' reason: 'Remount' cannot proceed to remount pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on prod-cluster-node-1: Failed to execute: nsenter [--mount=/host/proc/1285/ns/mnt --net=/host/proc/1285/ns/net umount /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount], output , stderr, umount: /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount: target is busy.\n, error exit status 32"
time="2020-08-24T20:33:03Z" level=debug msg="Succeed to handle the remount request for volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on host path /var/lib/kubelet/pods/fce7f478-38bd-419c-912d-f28ccf994fd3/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db"
time="2020-08-24T20:33:03Z" level=debug msg="Succeed to handle the remount request for volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on host path /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db"
time="2020-08-24T20:33:03Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\", UID:\"3c722424-a9cb-4e34-868c-e548c78a5d46\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24909029\", FieldPath:\"\"}): type: 'Warning' reason: 'Remount' cannot proceed to remount pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on prod-cluster-node-1: Failed to execute: nsenter [--mount=/host/proc/1285/ns/mnt --net=/host/proc/1285/ns/net umount /var/lib/kubelet/pods/66e0b868-aad8-4da8-a6ad-0b329851500f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount], output , stderr, umount: /var/lib/kubelet/pods/66e0b868-aad8-4da8-a6ad-0b329851500f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount: target is busy.\n, error exit status 32"
time="2020-08-24T20:33:14Z" level=debug msg="Succeed to handle the remount request for volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on host path /var/lib/kubelet/pods/66e0b868-aad8-4da8-a6ad-0b329851500f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db"
time="2020-08-24T20:33:14Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\", UID:\"3c722424-a9cb-4e34-868c-e548c78a5d46\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24909029\", FieldPath:\"\"}): type: 'Warning' reason: 'Remount' cannot proceed to remount pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on prod-cluster-node-1: Failed to execute: nsenter [--mount=/host/proc/1285/ns/mnt --net=/host/proc/1285/ns/net umount /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount], output , stderr, umount: /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount: target is busy.\n, error exit status 32"
Also, there are duplicated mounts on the node from previous pods. `mount | grep pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a`:
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount type ext4 (ro,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/b658164e-5e79-41af-9449-575f7c45181f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount type ext4 (ro,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/b658164e-5e79-41af-9449-575f7c45181f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount type ext4 (ro,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/b658164e-5e79-41af-9449-575f7c45181f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/66e0b868-aad8-4da8-a6ad-0b329851500f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount type ext4 (ro,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/66e0b868-aad8-4da8-a6ad-0b329851500f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount type ext4 (ro,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/66e0b868-aad8-4da8-a6ad-0b329851500f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount type ext4 (ro,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/fce7f478-38bd-419c-912d-f28ccf994fd3/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount type ext4 (ro,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/fce7f478-38bd-419c-912d-f28ccf994fd3/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/fce7f478-38bd-419c-912d-f28ccf994fd3/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/b658164e-5e79-41af-9449-575f7c45181f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/66e0b868-aad8-4da8-a6ad-0b329851500f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db type ext4 (rw,relatime,data=ordered)
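All of these entries print the same `/dev/longhorn/...` path, but mounts created before and after the auto salvage can sit on different block devices, because reattaching creates a new device with a new major:minor number (see the analysis below). A hedged way to check this on the node, assuming `findmnt` and `lsblk` are installed:

```bash
# Differing MAJ:MIN values mean some mount points still reference the
# dead block device from before the salvage
findmnt -o TARGET,SOURCE,MAJ:MIN | grep pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a

# The device currently behind /dev/longhorn/<volume>
lsblk -o NAME,MAJ:MIN /dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a
```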
Environment:
- Longhorn version: 1.0.2
- Kubernetes version: v1.17.4
- Node OS type and version: Ubuntu 18.04
Additional context Sometimes Prometheus restarts with different data in the PVC; it looks like it somehow switches the PVC or its content. This happens when the Prometheus liveness probe fails during these remount errors and Kubernetes restarts the container. You can see it in all dashboards: when everything is OK, the metrics look fine; when this happens, all previously visible metrics disappear and new metrics are saved into this "strange" PVC. If you manually restart (delete) the Prometheus pod, everything becomes fine again: all the old metrics reappear as they should, but the metrics saved during the "PVC switching" disappear. I don't know how to describe this better.
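The manual workaround described above can be scripted. A minimal sketch, using the pod name from the kubelet logs (`--grace-period=0 --force` only removes the API object and is needed only when the pod is already stuck in Terminating):

```bash
# Normal delete: the StatefulSet controller recreates the pod, which
# remounts the volume cleanly
kubectl -n monitoring delete pod prometheus-prometheus-operator-prometheus-0

# If the pod hangs in Terminating, force-remove the API object
kubectl -n monitoring delete pod prometheus-prometheus-operator-prometheus-0 \
  --grace-period=0 --force
```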
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 23 (17 by maintainers)
Commits related to this issue
- add LEP for feature `handling volume remount by killing the consuming pod` Longhorn #1719 Signed-off-by: Phan Le <phan.le@rancher.com> — committed to PhanLe1010/longhorn by PhanLe1010 4 years ago
- add LEP for feature `handling volume remount by killing the consuming pod` Longhorn #1719 Signed-off-by: Phan Le <phan.le@rancher.com> — committed to longhorn/longhorn by PhanLe1010 4 years ago
Some of my thoughts about the above issue. Feel free to correct me though 😃
How does the `Longhorn auto remount` feature work incorrectly when the pod is using a subpath?

Let's follow the steps of case 1 in the reproduce steps below. Stop at step 1, when nothing bad has happened yet. If we ssh into `node-1` and run `findmnt`, we see that there are 2 mount points corresponding to the PVC: `.../volumes/.../mount` and `.../volume-subpaths/.../nginx/0`. The `.../volume-subpaths/.../nginx/0` is actually a bind mount point corresponding to the directory `/html` inside the filesystem at the mount point `.../volumes/.../mount`. This means that if we mount anything to `.../volume-subpaths/.../nginx/0`, it will propagate to `.../volumes/.../mount/html`.

At step 2, when we kill the replica instance manager pod, the volume becomes faulty. Longhorn will auto salvage the volume by detaching it, doing some work, attaching it, and auto remounting. When the volume is attached to `node-1`, it appears to have the same name under `/dev/longhorn/volume-name`, but it is actually a new block device with a new `major-minor` number. When Longhorn does the auto remounting, it loops over all existing mount points, `umount`s n-1 layers for a point that has n layers (n > 1), then `mount`s the new block device at that mount point.

If we run `findmnt` again, we see that there are not 4 but 5 mount points now, even though Longhorn only did 2 additional `mount` commands (one for `.../volumes/.../mount` and one for `.../volume-subpaths/.../nginx/0`). The reason is that `.../volume-subpaths/.../nginx/0` is a bind mount point corresponding to the directory `/html` inside the filesystem at the mount point `.../volumes/.../mount`, so mounting anything to `.../volume-subpaths/.../nginx/0` propagates to `.../volumes/.../mount/html`.

Now if we try to `umount .../volumes/.../mount`, we will get the `target busy` error because there is a nested mount point `.../volumes/.../mount/html`. Moreover, Longhorn auto remounted the root filesystem of `vol-1` at `.../volume-subpaths/.../nginx/0`, which is incorrect.
Reproduce steps:

Setup: a pod that uses a Longhorn PVC through a subpath, with the single-replica storageclass `longhorn-no-ha`.

Case 1: Disconnected Once

1. Deploy the pod. Say the volume is `vol-1`, it is attached to `node-1`, and the pod is `pod-1`. Exec into `pod-1` and create a file `test_data.txt` inside the folder `/usr/share/nginx/html`. (A sketch of manifests along these lines is shown after this list.)
2. Kill the replica instance manager pod on `node-1`. This action simulates a network disconnection (the engine process of the PVC cannot talk to the replica process on the killed replica instance manager pod).
3. The volume (`vol-1`) has only 1 replica, so the volume becomes faulty. Longhorn will auto salvage the volume. After that, Longhorn will auto remount the volume on the previous mount points. (The auto remount works incorrectly in this case because `pod-1` is using a subpath; for more explanation of how it works incorrectly, see above.)
4. Wait for `vol-1` to become healthy. Exec into `pod-1` and run `ls /usr/share/nginx/html`. Verify that you get an error: this is a signal that the underlying filesystem no longer exists, and again confirms that Longhorn auto remount worked incorrectly.
5. Kill `pod-1`. Verify the pod is killed successfully.
6. Wait for the new `pod-1` to be running. Exec into `pod-1` and run `ls /usr/share/nginx/html`. Verify that you don't get any error. Everything is back to normal.
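For reference, a hedged sketch of manifests along the lines case 1 describes; the names match the steps above, but the image, sizes, and `longhorn-no-ha` parameters are assumptions, not the author's exact files:

```bash
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-no-ha
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "1"   # single replica, so one lost replica faults the volume
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vol-1
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn-no-ha
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-1
spec:
  containers:
  - name: nginx
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /usr/share/nginx/html
      subPath: html   # the subPath is what creates the extra bind mount
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: vol-1
EOF
```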
Case 2: Disconnected Twice

1. Repeat steps 1-2 of case 1: deploy the pod and kill the replica instance manager pod on `node-1`.
2. Wait for `vol-1` to become healthy.
3. Kill the replica instance manager pod on `node-1` again.
4. Wait for `vol-1` to become healthy. Exec into `pod-1` and run `ls /usr/share/nginx/html`. Verify that you get the same error as in case 1.
5. Kill `pod-1`. Verify that the pod is stuck in the terminating state forever.
6. Checking the log of `longhorn-csi-plugin` on `node-1`, we can see the same `target is busy` unmount error quoted in the PVC events at the top of this issue.

@shuo-wu Yes, it might make sense to kill the pod instead. Ideally we should restart the pod, but there isn't a way to do that in Kubernetes as far as I know.
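Since pods cannot be restarted in place, "killing the pod" in practice means deleting it and relying on its owning controller to recreate it. A sketch of the manual equivalent (pod name from the logs above), which only recovers the workload if an owner exists:

```bash
# Check that a controller (StatefulSet/Deployment/DaemonSet) owns the pod;
# a standalone pod would simply disappear after deletion
kubectl -n monitoring get pod prometheus-prometheus-operator-prometheus-0 \
  -o jsonpath='{.metadata.ownerReferences[*].kind}'

# Deleting the pod triggers recreation and a clean volume remount
kubectl -n monitoring delete pod prometheus-prometheus-operator-prometheus-0
```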