longhorn: [BUG] Delete the workload pod of a failed engine automatically to trigger recovery

Describe the bug This happens a lot with the Prometheus PVC after network issues. Longhorn fails to write to a replica due to timeout errors. When the network recovers, Longhorn salvages the PVC and it becomes healthy in the Longhorn UI, but the currently running workload can no longer use it. When you try to restart the workload, the pod gets stuck in the Terminating state. PVC detaching takes about 5 minutes, but even after that the pod remains stuck in Terminating.

To Reproduce I don’t really know how to reproduce it.

Expected behavior After the volume is salvaged and becomes healthy again, the workload should be able to use it without manual intervention; ideally the workload pod of the failed engine is deleted automatically to trigger recovery.

Log The following shows up in the PVC events with Type: Warning and Reason: Remount:

cannot proceed to remount pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on prod-cluster-node-1: Failed to execute: nsenter [--mount=/host/proc/1285/ns/mnt --net=/host/proc/1285/ns/net umount /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount], output , stderr, umount: /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount: target is busy. , error exit status 32

kubelet logs about the PVC

I0825 03:13:35.459247    2685 reconciler.go:183] operationExecutor.UnmountVolume started for volume "prometheus-prometheus-operator-prometheus-db" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a") pod "66e0b868-aad8-4da8-a6ad-0b329851500f" (UID: "66e0b868-aad8-4da8-a6ad-0b329851500f")
E0825 03:13:35.471022    2685 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/csi/driver.longhorn.io^pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\" (\"66e0b868-aad8-4da8-a6ad-0b329851500f\")" failed. No retries permitted until 2020-08-25 03:15:37.470937411 +0000 UTC m=+681459.965365562 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"prometheus-prometheus-operator-prometheus-db\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\") pod \"66e0b868-aad8-4da8-a6ad-0b329851500f\" (UID: \"66e0b868-aad8-4da8-a6ad-0b329851500f\") : kubernetes.io/csi: mounter.TearDownAt failed: rpc error: code = Internal desc = Unmount failed: exit status 32\nUnmounting arguments: /var/lib/kubelet/pods/66e0b868-aad8-4da8-a6ad-0b329851500f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount\nOutput: umount: /var/lib/kubelet/pods/66e0b868-aad8-4da8-a6ad-0b329851500f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount: target is busy.\n\n"

kubelet logs about the pod

I0825 05:59:33.445991    2685 kubelet.go:1951] SyncLoop (PLEG): "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3)", event: &pleg.PodLifecycleEvent{ID:"fce7f478-38bd-419c-912d-f28ccf994fd3", Type:"ContainerDied", Data:"a997a4290e6e339935cd01443306c138f6eb1f870f6650d0a25044093be1aa51"}
I0825 05:59:35.472191    2685 kubelet.go:1984] SyncLoop (container unhealthy): "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3)"
I0825 05:59:35.634802    2685 kubelet.go:1951] SyncLoop (PLEG): "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3)", event: &pleg.PodLifecycleEvent{ID:"fce7f478-38bd-419c-912d-f28ccf994fd3", Type:"ContainerStarted", Data:"3e3f39ca36139ea1c8a0e865cf9106bfadac5d8fc4e4d9b8d1a8fffb5b8bab10"}
I0825 05:59:40.055585    2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): HTTP probe failed with statuscode: 503
I0825 05:59:45.055394    2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): HTTP probe failed with statuscode: 503
I0825 05:59:50.055010    2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): HTTP probe failed with statuscode: 503
I0825 06:01:52.704263    2685 kubelet.go:1922] SyncLoop (DELETE, "api"): "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3)"
I0825 06:01:53.660145    2685 kubelet.go:1951] SyncLoop (PLEG): "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3)", event: &pleg.PodLifecycleEvent{ID:"fce7f478-38bd-419c-912d-f28ccf994fd3", Type:"ContainerDied", Data:"83ad67b2fe8cf5fe79f9e7199eda6fb95c922d056e599a9c8e17f37d16eb2966"}
I0825 06:01:53.660387    2685 kubelet.go:1951] SyncLoop (PLEG): "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3)", event: &pleg.PodLifecycleEvent{ID:"fce7f478-38bd-419c-912d-f28ccf994fd3", Type:"ContainerDied", Data:"010a2be56e8ff7d5786204f6be7bf15ed0a28da611e7b242cbe5c4c8a59787e8"}
I0825 06:01:55.054558    2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:01:55.472496    2685 prober.go:116] Liveness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/healthy: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:00.054414    2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:00.472968    2685 prober.go:116] Liveness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/healthy: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:05.054707    2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:05.472702    2685 prober.go:116] Liveness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/healthy: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:10.054739    2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:10.472452    2685 prober.go:116] Liveness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/healthy: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:15.054603    2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:15.472592    2685 prober.go:116] Liveness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/healthy: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:20.054817    2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:20.472761    2685 prober.go:116] Liveness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/healthy: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:20.472921    2685 kubelet.go:1984] SyncLoop (container unhealthy): "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3)"
I0825 06:02:25.054670    2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:30.054602    2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:35.054744    2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:40.054919    2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:45.054694    2685 prober.go:116] Readiness probe for "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3):prometheus" failed (failure): Get http://172.20.6.126:9090/-/ready: dial tcp 172.20.6.126:9090: connect: connection refused
I0825 06:02:47.984359    2685 kubelet.go:1951] SyncLoop (PLEG): "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3)", event: &pleg.PodLifecycleEvent{ID:"fce7f478-38bd-419c-912d-f28ccf994fd3", Type:"ContainerDied", Data:"3e3f39ca36139ea1c8a0e865cf9106bfadac5d8fc4e4d9b8d1a8fffb5b8bab10"}
I0825 06:02:47.984547    2685 kubelet.go:1951] SyncLoop (PLEG): "prometheus-prometheus-operator-prometheus-0_monitoring(fce7f478-38bd-419c-912d-f28ccf994fd3)", event: &pleg.PodLifecycleEvent{ID:"fce7f478-38bd-419c-912d-f28ccf994fd3", Type:"ContainerDied", Data:"e6b5395cb43bdbbb5a68d704a1571cff82cf2a0be5114f488c8896f0ce6b3b3e"}

longhorn-manager logs (stripped for clarity)

time="2020-08-24T20:31:53Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9\", UID:\"840aaec6-6ebf-42ca-985b-86268ba6d890\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908727\", FieldPath:\"\"}): type: 'Warning' reason: 'Faulted' Detected replica pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-r-25a8ed78 (172.20.3.40:10015) in error"
time="2020-08-24T20:31:56Z" level=warning msg="Try to get requested log for pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9 on node prod-cluster-node-1"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:36Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31416 size= 16 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:36Z\" level=error msg=\"Retry  1 on replica 172.20.3.40:10016 seq= 31416 size= 16 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:36Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31417 size= 332 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:36Z\" level=error msg=\"Retry  1 on replica 172.20.3.40:10016 seq= 31417 size= 332 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:36Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31418 size= 1024 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:36Z\" level=error msg=\"Retry  1 on replica 172.20.3.40:10016 seq= 31418 size= 1024 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:36Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31419 size= 1280 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:36Z\" level=error msg=\"Retry  1 on replica 172.20.3.40:10016 seq= 31419 size= 1280 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:37Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31420 size= 16 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:37Z\" level=error msg=\"Retry  1 on replica 172.20.3.40:10016 seq= 31420 size= 16 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:40Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31423 size= 16 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:40Z\" level=error msg=\"Retry  2 on replica 172.20.3.40:10016 seq= 31423 size= 16 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:40Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31424 size= 332 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:40Z\" level=error msg=\"Retry  2 on replica 172.20.3.40:10016 seq= 31424 size= 332 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:40Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31425 size= 1024 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:40Z\" level=error msg=\"Retry  2 on replica 172.20.3.40:10016 seq= 31425 size= 1024 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:40Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31426 size= 1280 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:40Z\" level=error msg=\"Retry  2 on replica 172.20.3.40:10016 seq= 31426 size= 1280 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:41Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31427 size= 16 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:41Z\" level=error msg=\"Retry  2 on replica 172.20.3.40:10016 seq= 31427 size= 16 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:44Z\" level=error msg=\"Write timeout on replica 172.20.3.40:10016 seq= 31430 size= 16 (kB)\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:44Z\" level=error msg=\"Setting replica tcp://172.20.3.40:10015 to ERR due to: r/w timeout\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:44Z\" level=info msg=\"Set replica tcp://172.20.3.40:10015 to mode ERR\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:44Z\" level=error msg=\"I/O error: tcp://172.20.3.40:10015: r/w timeout\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:44Z\" level=info msg=\"Monitoring stopped tcp://172.20.3.40:10015\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:44Z\" level=error msg=\"Setting replica tcp://172.20.3.40:10015 to ERR due to: r/w timeout\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:44Z\" level=info msg=\"Ignore set replica tcp://172.20.3.40:10015 to mode ERR due to it's ERR\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:44Z\" level=error msg=\"I/O error: tcp://172.20.3.40:10015: r/w timeout\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:37.131954: ->172.20.3.40:10016 W[ 16kB]  7705959us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:36.837311: ->172.20.3.40:10016 W[ 16kB]  8000610us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:40.849008: ->172.20.3.40:10016 W[1280kB]  3988915us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:41.132211: ->172.20.3.40:10016 W[ 16kB]  3705714us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:36.838137: ->172.20.3.40:10016 W[332kB]  7999791us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:36.846568: ->172.20.3.40:10016 W[1024kB]  7991362us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:40.837592: ->172.20.3.40:10016 W[ 16kB]  4000339us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:33.131624: ->172.20.3.40:10016 W[ 16kB] 11706309us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:40.838397: ->172.20.3.40:10016 W[332kB]  3999537us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:40.846841: ->172.20.3.40:10016 W[1024kB]  3991096us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:36.848572: ->172.20.3.40:10016 W[1280kB]  7989367us failed"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:41.132211: ->172.20.3.40:10016 W[ 16kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:40.849008: ->172.20.3.40:10016 W[1280kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:33.131624: ->172.20.3.40:10016 W[ 16kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:36.838137: ->172.20.3.40:10016 W[332kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:36.848572: ->172.20.3.40:10016 W[1280kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:37.131954: ->172.20.3.40:10016 W[ 16kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:40.846841: ->172.20.3.40:10016 W[1024kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:40.837592: ->172.20.3.40:10016 W[ 16kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:40.838397: ->172.20.3.40:10016 W[332kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:36.846568: ->172.20.3.40:10016 W[1024kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: Aug 24 20:31:36.837311: ->172.20.3.40:10016 W[ 16kB] pending"
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:44Z\" level=error msg=\"I/O error: No backend available\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:46Z\" level=error msg=\"I/O error: No backend available\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:46Z\" level=error msg=\"I/O error: No backend available\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:46Z\" level=error msg=\"I/O error: No backend available\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:55Z\" level=error msg=\"Error reading from wire: EOF\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:57Z\" level=warning msg=\"Received signal interrupt to shutdown\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:57Z\" level=warning msg=\"Starting to execute registered shutdown func github.com/longhorn/longhorn-engine/app/cmd.startController.func1\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:57Z\" level=info msg=\"Shutdown SCSI device for 172.20.6.51:iqn.2019-10.io.longhorn:pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:57Z\" level=info msg=\"Shutdown SCSI target iqn.2019-10.io.longhorn:pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:57Z\" level=warning msg=\"Socket server connection closed\""
time="2020-08-24T20:31:57Z" level=warning msg="pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: time=\"2020-08-24T20:31:57Z\" level=info msg=\"device pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a: SCSI device /dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a shutdown\""
time="2020-08-24T20:31:57Z" level=debug msg="Prepare to delete instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9\", UID:\"840aaec6-6ebf-42ca-985b-86268ba6d890\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908818\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=debug msg="Prepare to delete instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9\", UID:\"840aaec6-6ebf-42ca-985b-86268ba6d890\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908829\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=debug msg="Prepare to delete instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9\", UID:\"840aaec6-6ebf-42ca-985b-86268ba6d890\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908831\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=debug msg="Prepare to delete instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9\", UID:\"840aaec6-6ebf-42ca-985b-86268ba6d890\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908833\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=debug msg="Requeue longhorn-system/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a due to error <nil> or Operation cannot be fulfilled on engines.longhorn.io \"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9\": the object has been modified; please apply your changes to the latest version and try again"
time="2020-08-24T20:31:57Z" level=warning msg="Try to get requested log for pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9 on node prod-cluster-node-1"
time="2020-08-24T20:31:57Z" level=error msg="failed to receive log for instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: rpc error: code = NotFound desc = cannot find process pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=warning msg="cannot get requested log for instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9 on node prod-cluster-node-1, error rpc error: code = NotFound desc = cannot find process pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=debug msg="Prepare to delete instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:57Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9\", UID:\"840aaec6-6ebf-42ca-985b-86268ba6d890\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908838\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
E0824 20:31:57.986382       1 engine_controller.go:667] fail to update status for engine pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9: failed to list replicas from controller 'pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a': Failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.0.2/longhorn [--url 172.20.6.51:10000 ls], output , stderr, time="2020-08-24T20:31:57Z" level=fatal msg="Error running ls command: failed to list replicas for volume 172.20.6.51:10000: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.20.6.51:10000: connect: connection refused\""
time="2020-08-24T20:31:58Z" level=debug msg="Instance handler updated instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9 state, old state running, new state stopped"
time="2020-08-24T20:31:58Z" level=debug msg="Stop monitoring pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:31:58Z" level=debug msg="Requeue longhorn-system/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a due to error <nil> or Operation cannot be fulfilled on engines.longhorn.io \"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9\": the object has been modified; please apply your changes to the latest version and try again"
time="2020-08-24T20:31:58Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\", UID:\"3c722424-a9cb-4e34-868c-e548c78a5d46\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908803\", FieldPath:\"\"}): type: 'Normal' reason: 'Detached' volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a has been detached"
time="2020-08-24T20:31:59Z" level=debug msg="Requeue longhorn-system/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9 due to conflict: Operation cannot be fulfilled on engines.longhorn.io \"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9\": the object has been modified; please apply your changes to the latest version and try again"
time="2020-08-24T20:31:59Z" level=warning msg="Replica pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-r-25a8ed78 of volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a has been automatically salvaged"
time="2020-08-24T20:31:59Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\", UID:\"3c722424-a9cb-4e34-868c-e548c78a5d46\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908868\", FieldPath:\"\"}): type: 'Warning' reason: 'AutoSalvaged' Replica pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-r-25a8ed78 of volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a has been automatically salvaged"
time="2020-08-24T20:32:01Z" level=debug msg="Prepare to create instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:32:01Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9\", UID:\"840aaec6-6ebf-42ca-985b-86268ba6d890\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908902\", FieldPath:\"\"}): type: 'Normal' reason: 'Start' Starts pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:32:02Z" level=debug msg="Instance process pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9 had been created, need to wait for instance manager update"
time="2020-08-24T20:32:02Z" level=debug msg="Instance handler updated instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9 state, old state stopped, new state starting"
time="2020-08-24T20:32:04Z" level=debug msg="Instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9 starts running, IP 172.20.6.51"
time="2020-08-24T20:32:04Z" level=debug msg="Instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9 starts running, Port 10000"
time="2020-08-24T20:32:04Z" level=debug msg="Instance handler updated instance pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9 state, old state starting, new state running"
time="2020-08-24T20:32:04Z" level=debug msg="Start monitoring pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a-e-3c597aa9"
time="2020-08-24T20:32:17Z" level=debug msg="Succeed to handle the remount request for volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on host path /var/lib/kubelet/pods/b658164e-5e79-41af-9449-575f7c45181f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db"
time="2020-08-24T20:32:18Z" level=debug msg="Succeed to handle the remount request for volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on host path /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db"
time="2020-08-24T20:32:18Z" level=debug msg="Succeed to handle the remount request for volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on host path /var/lib/kubelet/pods/fce7f478-38bd-419c-912d-f28ccf994fd3/volume-subpaths/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/prometheus/2"
time="2020-08-24T20:32:18Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\", UID:\"3c722424-a9cb-4e34-868c-e548c78a5d46\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908891\", FieldPath:\"\"}): type: 'Warning' reason: 'Remount' cannot proceed to remount pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on prod-cluster-node-1: Failed to execute: nsenter [--mount=/host/proc/1285/ns/mnt --net=/host/proc/1285/ns/net umount /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount], output , stderr, umount: /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount: target is busy.\n, error exit status 32"
time="2020-08-24T20:32:18Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\", UID:\"3c722424-a9cb-4e34-868c-e548c78a5d46\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24908891\", FieldPath:\"\"}): type: 'Normal' reason: 'Attached' volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a has been attached to prod-cluster-node-1"
time="2020-08-24T20:32:18Z" level=debug msg="Succeed to handle the remount request for volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on host path /var/lib/kubelet/pods/fce7f478-38bd-419c-912d-f28ccf994fd3/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db"
time="2020-08-24T20:32:18Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\", UID:\"3c722424-a9cb-4e34-868c-e548c78a5d46\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24909025\", FieldPath:\"\"}): type: 'Warning' reason: 'Remount' cannot proceed to remount pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on prod-cluster-node-1: Failed to execute: nsenter [--mount=/host/proc/1285/ns/mnt --net=/host/proc/1285/ns/net umount /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount], output , stderr, umount: /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount: target is busy.\n, error exit status 32"
time="2020-08-24T20:32:18Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\", UID:\"3c722424-a9cb-4e34-868c-e548c78a5d46\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24909029\", FieldPath:\"\"}): type: 'Warning' reason: 'Remount' cannot proceed to remount pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on prod-cluster-node-1: Failed to execute: nsenter [--mount=/host/proc/1285/ns/mnt --net=/host/proc/1285/ns/net umount /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount], output , stderr, umount: /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount: target is busy.\n, error exit status 32"
time="2020-08-24T20:32:33Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\", UID:\"3c722424-a9cb-4e34-868c-e548c78a5d46\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24909029\", FieldPath:\"\"}): type: 'Warning' reason: 'Remount' cannot proceed to remount pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on prod-cluster-node-1: Failed to execute: nsenter [--mount=/host/proc/1285/ns/mnt --net=/host/proc/1285/ns/net umount /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount], output , stderr, umount: /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount: target is busy.\n, error exit status 32"
time="2020-08-24T20:32:44Z" level=debug msg="Succeed to handle the remount request for volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on host path /var/lib/kubelet/pods/b658164e-5e79-41af-9449-575f7c45181f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db"
time="2020-08-24T20:32:45Z" level=debug msg="Succeed to handle the remount request for volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on host path /var/lib/kubelet/pods/66e0b868-aad8-4da8-a6ad-0b329851500f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db"
time="2020-08-24T20:32:45Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\", UID:\"3c722424-a9cb-4e34-868c-e548c78a5d46\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24909029\", FieldPath:\"\"}): type: 'Warning' reason: 'Remount' cannot proceed to remount pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on prod-cluster-node-1: Failed to execute: nsenter [--mount=/host/proc/1285/ns/mnt --net=/host/proc/1285/ns/net umount /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount], output , stderr, umount: /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount: target is busy.\n, error exit status 32"
time="2020-08-24T20:33:03Z" level=debug msg="Succeed to handle the remount request for volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on host path /var/lib/kubelet/pods/fce7f478-38bd-419c-912d-f28ccf994fd3/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db"
time="2020-08-24T20:33:03Z" level=debug msg="Succeed to handle the remount request for volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on host path /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db"
time="2020-08-24T20:33:03Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\", UID:\"3c722424-a9cb-4e34-868c-e548c78a5d46\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24909029\", FieldPath:\"\"}): type: 'Warning' reason: 'Remount' cannot proceed to remount pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on prod-cluster-node-1: Failed to execute: nsenter [--mount=/host/proc/1285/ns/mnt --net=/host/proc/1285/ns/net umount /var/lib/kubelet/pods/66e0b868-aad8-4da8-a6ad-0b329851500f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount], output , stderr, umount: /var/lib/kubelet/pods/66e0b868-aad8-4da8-a6ad-0b329851500f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount: target is busy.\n, error exit status 32"
time="2020-08-24T20:33:14Z" level=debug msg="Succeed to handle the remount request for volume pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on host path /var/lib/kubelet/pods/66e0b868-aad8-4da8-a6ad-0b329851500f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db"
time="2020-08-24T20:33:14Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a\", UID:\"3c722424-a9cb-4e34-868c-e548c78a5d46\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"24909029\", FieldPath:\"\"}): type: 'Warning' reason: 'Remount' cannot proceed to remount pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on prod-cluster-node-1: Failed to execute: nsenter [--mount=/host/proc/1285/ns/mnt --net=/host/proc/1285/ns/net umount /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount], output , stderr, umount: /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount: target is busy.\n, error exit status 32"

Also, there are duplicated mounts on the node left over from previous pods. Output of mount | grep pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a:

/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount type ext4 (ro,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/b658164e-5e79-41af-9449-575f7c45181f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount type ext4 (ro,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/b658164e-5e79-41af-9449-575f7c45181f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount type ext4 (ro,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/b658164e-5e79-41af-9449-575f7c45181f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/66e0b868-aad8-4da8-a6ad-0b329851500f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount type ext4 (ro,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/66e0b868-aad8-4da8-a6ad-0b329851500f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount type ext4 (ro,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/66e0b868-aad8-4da8-a6ad-0b329851500f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount type ext4 (ro,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/fce7f478-38bd-419c-912d-f28ccf994fd3/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount type ext4 (ro,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/fce7f478-38bd-419c-912d-f28ccf994fd3/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/fce7f478-38bd-419c-912d-f28ccf994fd3/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/d46cafb1-d33a-47a2-8ce2-856c03759fb9/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/eba6cdd6-20f6-43ab-a2bd-b3aba18b4871/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/b658164e-5e79-41af-9449-575f7c45181f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db type ext4 (rw,relatime,data=ordered)
/dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a on /var/lib/kubelet/pods/66e0b868-aad8-4da8-a6ad-0b329851500f/volumes/kubernetes.io~csi/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a/mount/prometheus-db type ext4 (rw,relatime,data=ordered)
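
For reference, a hedged way to inspect these stacked mounts on the node and see which nested mount points keep umount failing with "target is busy" (read-only commands; the device path is taken from the listing above):

# Show the full mount tree for the Longhorn device, including the nested
# .../mount/prometheus-db subPath bind mounts that keep the parent mounts busy
findmnt /dev/longhorn/pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a

# Count how many times each target is stacked
mount | grep pvc-c1431f9c-eb79-4e07-a6fe-5a97b9cad99a | awk '{print $3}' | sort | uniq -c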

Environment:

  • Longhorn version: 1.0.2
  • Kubernetes version: v1.17.4
  • Node OS type and version: Ubuntu 18.04

Additional context Sometimes Prometheus restarts with different data in the PVC; it looks like it somehow switches the PVC or its content. This happens when the Prometheus liveness probe fails during these remount errors and Kubernetes restarts the container. You can see it in all dashboards: when everything is OK the metrics look fine, but when this happens all the metrics you saw before disappear and new metrics are saved into this "strange" PVC. If you manually restart (delete) the Prometheus pod, everything becomes fine again: all old metrics reappear as they should, but the metrics saved during the "PVC switching" disappear. I don't know how to describe this better.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 23 (17 by maintainers)

Most upvoted comments

Some of my thoughts about the above issue. Feel free to correct me though 😃

How does the Longhorn auto remount feature work incorrectly when a pod uses subPath?

Let's follow the steps of Case 1 in the setup below. Stop at step 1, when nothing bad has happened yet. If we SSH into node-1 and run findmnt, we get something like this:

root@phan-cluster-v25-worker-1:~# findmnt /dev/longhorn/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9 
TARGET                                                                                                                         SOURCE                                                   FSTYPE OPTIONS
/var/lib/kubelet/pods/c171f61e-a0ad-4209-8fb1-32b1e40b3812/volumes/kubernetes.io~csi/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9/mount
                                                                                                                               /dev/longhorn/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9   ext4   rw,rela
/var/lib/kubelet/pods/c171f61e-a0ad-4209-8fb1-32b1e40b3812/volume-subpaths/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9/nginx/0    /dev/longhorn/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9[/html]
                                                                                                                                                                                        ext4   rw,rela

At this moment, there are 2 mount points corresponding to the PVC: .../volumes/.../mount and .../volume-subpaths/.../nginx/0. The .../volume-subpaths/.../nginx/0 mount point is actually a bind mount of the directory /html inside the filesystem mounted at .../volumes/.../mount. This means that if we mount anything at .../volume-subpaths/.../nginx/0, it will propagate to .../volumes/.../mount/html.
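
A minimal sketch of that propagation behaviour, assuming shared mount propagation as kubelet uses; the /tmp and /mnt paths are made up for illustration and the commands need root on a scratch machine:

# Create a small ext4 "volume" and mount it (stands in for .../volumes/.../mount)
truncate -s 100M /tmp/vol.img && mkfs.ext4 -q /tmp/vol.img
mkdir -p /mnt/volume /mnt/subpath
mount -o loop /tmp/vol.img /mnt/volume
mkdir -p /mnt/volume/html
mount --make-shared /mnt/volume

# Bind mount the subdirectory (stands in for .../volume-subpaths/.../nginx/0)
mount --bind /mnt/volume/html /mnt/subpath

# Mounting something new on the bind point propagates back into the volume:
mount -t tmpfs tmpfs /mnt/subpath
findmnt /mnt/volume/html    # the tmpfs mount also shows up here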

At step 3, when we kill the replica instance manager pod, the volume becomes faulty. Longhorn auto salvages the volume by detaching it, doing some work, reattaching it, and auto remounting. When the volume is attached to node-1 again, it appears under the same name /dev/longhorn/volume-name, but it is actually a new block device with a new major-minor number. When Longhorn does the auto remounting, it loops over all existing mount points, unmounts n-1 layers for any mount point that has n stacked layers (n > 1), and then mounts the new block device at that mount point.
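
A rough shell sketch of that loop as described above; this is not Longhorn's actual code, the device path is illustrative, and it is only meant to make the n-layer unmount/remount sequence concrete:

# Same /dev/longhorn path as before the salvage, but now a new block device
DEV=/dev/longhorn/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9

# For every mount point still referencing the device, with its stack depth n:
awk -v dev="$DEV" '$1 == dev {print $2}' /proc/mounts | sort | uniq -c |
while read -r n target; do
    # peel back n-1 of the n stacked layers at this mount point (n > 1) ...
    for _ in $(seq 2 "$n"); do umount "$target"; done
    # ... then mount the new block device at the same mount point
    mount "$DEV" "$target"
done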

If we run findmnt again, we will get this output:

root@phan-cluster-v25-worker-1:~# findmnt /dev/longhorn/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9 
TARGET                                                                                                                           SOURCE                                                 FSTYPE OPTIONS
/var/lib/kubelet/pods/c171f61e-a0ad-4209-8fb1-32b1e40b3812/volumes/kubernetes.io~csi/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9/mount
                                                                                                                                 /dev/longhorn/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9 ext4   rw,rela
/var/lib/kubelet/pods/c171f61e-a0ad-4209-8fb1-32b1e40b3812/volume-subpaths/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9/nginx/0      /dev/longhorn/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9[/html]
                                                                                                                                                                                        ext4   rw,rela
/var/lib/kubelet/pods/c171f61e-a0ad-4209-8fb1-32b1e40b3812/volumes/kubernetes.io~csi/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9/mount
                                                                                                                                 /dev/longhorn/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9 ext4   rw,rela
/var/lib/kubelet/pods/c171f61e-a0ad-4209-8fb1-32b1e40b3812/volume-subpaths/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9/nginx/0      /dev/longhorn/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9 ext4   rw,rela
/var/lib/kubelet/pods/c171f61e-a0ad-4209-8fb1-32b1e40b3812/volumes/kubernetes.io~csi/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9/mount/html
                                                                                                                                 /dev/longhorn/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9 ext4   rw,rela

Notice that there are not 4 but 5 mount points now, even though Longhorn only ran 2 additional mount commands (one for .../volumes/.../mount and one for .../volume-subpaths/.../nginx/0). The reason is that .../volume-subpaths/.../nginx/0 is a bind mount of the directory /html inside the filesystem at the mount point .../volumes/.../mount, so mounting anything at .../volume-subpaths/.../nginx/0 propagates to .../volumes/.../mount/html.

Now if we try to umount .../volumes/.../mount, we get the "target is busy" error because there is a nested mount point at .../volumes/.../mount/html.

Moreover, Longhorn auto remounted the root filesystem of vol-1 at .../volume-subpaths/.../nginx/0, which is incorrect: that mount point should expose the /html subdirectory, not the volume root.
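
A minimal illustration of that "target is busy" failure (hypothetical /mnt paths, run as root on a scratch machine): a nested mount keeps the parent mount busy, which is the exit status 32 seen in the kubelet and longhorn-manager logs above.

mkdir -p /mnt/parent
mount -t tmpfs tmpfs /mnt/parent
mkdir -p /mnt/parent/child
mount -t tmpfs tmpfs /mnt/parent/child   # nested mount, like .../mount/html

umount /mnt/parent
# umount: /mnt/parent: target is busy.   (umount exits with status 32)

# Unmounting the nested mount first (or using "umount -R /mnt/parent") succeeds:
umount /mnt/parent/child && umount /mnt/parent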

Reproduce steps:

  1. Create a cluster of 3 nodes. Install Longhorn
  2. Create storageclass longhorn-no-ha:
    kind: StorageClass
    apiVersion: storage.k8s.io/v1
    metadata:
        name: longhorn-no-ha
    provisioner: driver.longhorn.io
    allowVolumeExpansion: true
    parameters:
        numberOfReplicas: "1"
        dataLocality: "best-effort"
        staleReplicaTimeout: "2880" # 48 hours in minutes
        fromBackup: ""
    
  3. Deploy this statefulset:
apiVersion: v1
kind: Service
metadata:
 name: nginx
 labels:
   app: nginx
spec:
 ports:
 - port: 80
   name: web
 selector:
   app: nginx
 type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
 name: web
spec:
 selector:
   matchLabels:
     app: nginx # has to match .spec.template.metadata.labels
 serviceName: "nginx"
 replicas: 1 # by default is 1
 template:
   metadata:
     labels:
       app: nginx # has to match .spec.selector.matchLabels
   spec:
     restartPolicy: Always
     terminationGracePeriodSeconds: 10
     containers:
     - name: nginx
       image: k8s.gcr.io/nginx-slim:0.8
       ports:
       - containerPort: 80
         name: web
       volumeMounts:
       - name: www
         mountPath: /usr/share/nginx/html
         subPath: html
 volumeClaimTemplates:
 - metadata:
     name: www
   spec:
     accessModes: [ "ReadWriteOnce" ]
     storageClassName: "longhorn-no-ha"
     resources:
       requests:
         storage: 1Gi

Case 1: Disconnected Once

  1. Find the node where the statefulset pod is running. Let’s say the node is node-1, and the pod is pod-1
  2. Exec into pod-1 and create a file test_data.txt inside the folder /usr/share/nginx/html.
  3. Kill the replica instance manager pod on node-1. This action simulates a network disconnection (the engine process of the PVC cannot talk to the replica process on the killed replica instance manager pod). Helper commands for this step and step 6 are sketched after this list.
  4. Since the corresponding volume (say vol-1) has only 1 replica, the volume becomes faulty. Longhorn will auto salvage the volume. After that, Longhorn will auto remount the volume on the previous mount points. (The auto remount works incorrectly in this case because pod-1 is using subPath; see the explanation above.)
  5. Wait for vol-1 to become healthy. Exec into pod-1. Run ls /usr/share/nginx/html. Verify that you get this error:
    ls: reading directory 'usr/share/nginx/html': Input/output error
    
    This is a signal that the underlying filesystem no longer exists. This error again confirms that the Longhorn auto remount worked incorrectly.
  6. Manually kill pod-1. Verify that the pod is killed successfully.
  7. Wait until pod-1 is running again. Exec into pod-1. Run ls usr/share/nginx/html. Verify that you don't get any error. Everything is back to normal.
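
For convenience, hedged helper commands for steps 3 and 6 above (pod names are illustrative; check what kubectl get pods actually reports in your cluster):

# Step 3: find and kill the replica instance manager pod running on node-1
kubectl -n longhorn-system get pods -o wide | grep instance-manager-r | grep node-1
kubectl -n longhorn-system delete pod <instance-manager-r-pod-on-node-1>

# Step 6: manually kill the workload pod created by the statefulset above
kubectl delete pod web-0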

Case 2: Disconnected Twice

  1. Kill the replica instance manager pod on node-1.
  2. Wait for vol-1 to become healthy.
  3. Kill the replica instance manager pod on node-1 again.
  4. Wait for vol-1 to become healthy. Exec into pod-1. Run ls /usr/share/nginx/html. Verify that you get this error:
    ls: reading directory 'usr/share/nginx/html': Input/output error
    
  5. Manually kill pod-1. Verify that the pod is stuck in the Terminating state forever.
  6. Looking into the log of longhorn-csi-plugin on node-1 (a hedged command to fetch this log is sketched after the list), we can see this error:
    [longhorn-csi-plugin-c6tdn longhorn-csi-plugin] time="2020-10-20T06:14:49Z" level=error msg="GRPC error: rpc error: code = Internal desc = Unmount failed: exit status 32\nUnmounting arguments: /var/lib/kubelet/pods/c171f61e-a0ad-4209-8fb1-32b1e40b3812/volumes/kubernetes.io~csi/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9/mount\nOutput: umount: /var/lib/kubelet/pods/c171f61e-a0ad-4209-8fb1-32b1e40b3812/volumes/kubernetes.io~csi/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9/mount: target is busy.\n\n" 
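
A hedged way to fetch that log; the pod name below is the one that appears in the output above, but it is node-specific, so list the daemonset pods first:

# Find the longhorn-csi-plugin pod running on node-1, then grep its log
kubectl -n longhorn-system get pods -o wide | grep longhorn-csi-plugin
kubectl -n longhorn-system logs longhorn-csi-plugin-c6tdn -c longhorn-csi-plugin | grep "Unmount failed"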
    

@shuo-wu Yes, it might make sense to kill the pod instead. Ideally we would restart the pod, but as far as I know there isn't a way to do that in Kubernetes.
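
For reference, the manual workaround mentioned in the report (delete the Prometheus pod so its controller recreates it), plus the force-delete variant for when the pod is stuck in Terminating; the pod name and namespace are taken from the kubelet logs above:

# Delete the Prometheus pod so the statefulset controller recreates it
kubectl -n monitoring delete pod prometheus-prometheus-operator-prometheus-0

# If the pod stays stuck in Terminating, force-delete it
kubectl -n monitoring delete pod prometheus-prometheus-operator-prometheus-0 --grace-period=0 --force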