rook: On NodeLost, the new pod can't mount the same volume.

Is this a bug report or feature request?

  • Bug Report

Bug Report

What happened: A PV was attached to a pod. The node running the pod lost power, and the node’s status is currently NotReady.

The Deployment is trying to restart the pod on another node, but Rook doesn’t let the volume remount:

  Warning  FailedMount  16m (x82 over 2h)  kubelet, fiona-dtn-1.ucsc.edu  MountVolume.SetUp failed for volume "pvc-31b7555e-122d-11e8-9422-0cc47a6be994" : mount command failed, status: Failure, reason: Rook: Mount volume failed: failed to attach volume pvc-31b7555e-122d-11e8-9422-0cc47a6be994 for pod perfsonar/perfsonar-toolkit-bdc95f48f-xhzt7. Volume is already attached by pod perfsonar/perfsonar-toolkit-bdc95f48f-zlhld. Status Running
  Warning  FailedMount  2m (x73 over 2h)   kubelet, fiona-dtn-1.ucsc.edu  Unable to mount volumes for pod "perfsonar-toolkit-bdc95f48f-xhzt7_perfsonar(2394445f-138c-11e8-9422-0cc47a6be994)": timeout expired waiting for volumes to attach/mount for pod "perfsonar"/"perfsonar-toolkit-bdc95f48f-xhzt7". list of unattached/unmounted volumes=[pgsql-persistent-storage]

The old pod’s status is not actually Running, as the error claims:

Dmitrys-MBP-2:~ dimm$ kubectl describe pod perfsonar-toolkit-bdc95f48f-zlhld -n perfsonar
Name:                      perfsonar-toolkit-bdc95f48f-zlhld
Namespace:                 perfsonar
Node:                      k8s-chase-ci-06.calit2.optiputer.net/67.58.53.163
Start Time:                Thu, 15 Feb 2018 00:49:55 -0800
Labels:                    k8s-app=perfsonar-toolkit
                           pod-template-hash=687519049
Annotations:               kubernetes.io/limit-ranger=LimitRanger plugin set: memory request for container perfsonar-toolkit; memory limit for container perfsonar-toolkit; memory request for init container volume-mount-chown; m...
Status:                    Terminating (expires Fri, 16 Feb 2018 18:42:34 -0800)
Termination Grace Period:  30s
Reason:                    NodeLost
Message:                   Node k8s-chase-ci-06.calit2.optiputer.net which was running pod perfsonar-toolkit-bdc95f48f-zlhld is unresponsive
IP:                        10.244.5.6
Controlled By:             ReplicaSet/perfsonar-toolkit-bdc95f48f

What you expected to happen: When a node is lost, a pod with a PV mounted should restart somewhere else without human intervention.

How to reproduce it (minimal and precise): Start a deployment with a pod mounting a Rook PV. Kill the physical node running the pod.

Environment:

  • OS (e.g. from /etc/os-release): CentOS Linux 7 (Core)
  • Kernel (e.g. uname -a): Linux 4.14.15-1.el7.elrepo.x86_64 #1 SMP Tue Jan 23 20:28:26 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Cloud provider or hardware configuration: Baremetal
  • Rook version (use rook version inside of a Rook Pod): rook: v0.6.0-219.g09dc9b8
  • Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T10:09:24Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T11:55:20Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): kubeadm

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 42
  • Comments: 110 (63 by maintainers)

Most upvoted comments

@davidhiendl I totally agree with you that this is a fairly substantial issue (for the same reasons you mentioned). I’m going to write my thoughts down on how I think this could work (take this with a grain of salt, I have a fair bit of experience with distributed systems but I’m not part of Rook or Ceph).

The following cases need to be considered:

  1. Node blows up (Triple-fault, Kernel Panic, CPU goes up in smoke, Power goes off, CPU hangs itself indefinitely, …)
  2. Node hangs for a long but definite amount of time
  3. Node gets partitioned (routing issue, cable got pulled, …)

In all cases we want to reattach the RBD volume to another node while retaining all data, while simultaneously preventing the previous node from reading or writing the dataset. Case 1 is fairly trivial to handle: just reattach and remount somewhere else. Cases 2 and 3 require a lock on the RBD so that only one node can have it attached at a time, preventing writes from stale nodes (this is called fencing); otherwise the process on the faulty node can queue up writes and will eventually issue all of them as soon as it starts working again (which will corrupt data). Ceph has two locking systems, the old one and the new one (Exclusive Locking). The new one should be able to support this use case, but there is conflicting information about whether its lock transfers are cooperative or not (if they are, that locking mechanism would be unsuitable for HA purposes).

The problem here is that as soon as you use RBD fencing, the machine being fenced cannot issue any IO requests to the cluster and all of them remain pending. There is currently no way to unmap an RBD device with pending IO requests (https://github.com/ceph/ceph-client/blob/8b8f53af1ed9df88a4c0fbfdf3db58f62060edf3/drivers/block/rbd.c#L5857), so you are left with unkillable processes and a dead device, which inflates your system load and is a possible deathtrap for any management process that touches the device, because it will immediately go into unkillable IO sleep state.

Basically three things need to happen to make Rook work for your usecase:

  1. Rook needs to use Ceph’s exclusive locking and not its own since it cannot implement suitable fencing due to atomicity constraints.
  2. Ceph RBD needs to implement a force-unmap option to discard all pending IO so that a machine which has been fenced is not a disaster waiting to happen
  3. Rook needs to watch the RBD locks and kill all associated resources (containers, mounts, maps) if it detects that the node lost its exclusive RBD lock for a volume
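
For context, the state of both locking mechanisms on an image can be inspected from the Rook toolbox. A minimal sketch, assuming the default replicapool pool name and an RBD image named after the PV from this report (adjust both for your cluster):

# Check whether the exclusive-lock feature is enabled on the image:
rbd info replicapool/pvc-31b7555e-122d-11e8-9422-0cc47a6be994 | grep features
# List advisory locks (the "old" locking system):
rbd lock ls replicapool/pvc-31b7555e-122d-11e8-9422-0cc47a6be994
# Show which client(s) still have the image open (watchers):
rbd status replicapool/pvc-31b7555e-122d-11e8-9422-0cc47a6be994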

We believe this issue will be fixed with CSI, which we intend to be a total replacement for FlexVolume as soon as CSI is ready. CSI will be in beta status for the v1.0 release targeted in the next few weeks, and CSI features are still being finalized.

This issue was just moved to the v1.1 release milestone, as we hope CSI will be at a stable release by then. As part of the CSI stable release, we intend to follow the original issue’s repro steps to ensure that the issue does not exist in the CSI release implementation.

Thanks all for the feedback, what I’m gathering now is that the process would be more like the following:

  1. A node goes offline
  2. Rook detects the unresponsive node
  3. Wait for some configurable timeout, perhaps with a default of 15 minutes
  4. Rook creates a NetworkFence CR that fences the unresponsive node (there is no longer a need for the ceph osd blocklist command, right?)
  5. Rook deletes the volumeattachments for the rbd PVs on that node
  6. Rook force deletes the application pod(s) on that node with rbd volumes, allowing the pod(s) to resume elsewhere

From all the discussion, this will at least allow the application to attempt starting elsewhere to recover automatically in many scenarios. Corruption at the application level will not always be avoided (similar to a power outage), but data will still maintain consistency at the ceph data layer.
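
For step 4 above, a rough sketch of what such a NetworkFence CR could look like, based on the CSI-Addons v1alpha1 CRD; the driver name, secret, and CIDR below are placeholders/assumptions rather than values taken from this issue:

apiVersion: csiaddons.openshift.io/v1alpha1
kind: NetworkFence
metadata:
  name: fence-unresponsive-node
spec:
  driver: rook-ceph.rbd.csi.ceph.com   # CSI driver name; prefix depends on the operator namespace
  fenceState: Fenced                   # set to Unfenced later to lift the fence
  cidrs:
  - 203.0.113.10/32                    # example; replace with the unresponsive node's IP
  secret:
    name: rook-csi-rbd-provisioner     # assumed Ceph admin credentials secret
    namespace: rook-ceph
  parameters:
    clusterID: rook-ceph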

As various updates to this issue poured in today, I was also working to establish what problems in ceph-csi disallowed the mount on a second node. So here is my update, which also closely mirrors other comments made above.

NOTE: All testing that I did was using kubernetes v1.17

There are 2 categories of problems being discussed here,

  1. Recovering from a node failure: allowing rescheduled pods, from the failed node to other nodes, access to their PVs

  2. Fencing/blacklisting stale clients: to prevent inadvertent volume corruption

Recovering from a node failure: allowing rescheduled pods, from the failed node to other nodes, access to their PVs

The problems reported in this issue are mostly around a pod never recovering if the node hosting it loses communication with the kubernetes API server (or masters). Also, these apply to RWO PVs only, and NOT to RWX/ROX access mode PVs (as those can be attached to multiple nodes at the same time).

For pods that are NOT governed by the StatefulSets controller, the pods recover in ~11+ minutes (based on default settings) due to the following kubernetes events:

  • By default all pods are granted a toleration of 300 seconds before termination when the node they are running on gets the “not-ready” or “unreachable” taints
  • a. When a node loses connectivity to the kubernetes API server, at the end of those 300 seconds (5 minutes) the taint_manager marks the pods for deletion due to the above toleration
  • b. Based on the controller in use (ReplicaSet or Deployment, but NOT StatefulSets), a new instance of the pod is scheduled to run on a different node
  • At this time the kubernetes attach-detach controller kicks in, but will not force detach the PV already attached to the unreachable node
  • c. The force detach timeout is 6 minutes (governed by maxWaitForUnmountDuration), at which time the PV is force detached and attached to the new node
  • The pod that was created in (b) now gets that PV attached due to (c) and proceeds to move into the Running state

During (c) is when CSI plugins even come into play, i.e. are sent the required ControllerPublish/NodeStage calls; before this duration elapses (IOW, 11+ minutes) there is no way for CSI plugins to react to the environment.

Workarounds:

  • Reducing the NoExecute taint tolerations for the pods to lower values can shorten the 5-minute wait to start the pod on a new node, and thus reduce the overall 11+ minute wait for the pod to restart.
  • There is no current way to configure a lower value for the maxWaitForUnmountDuration, or avoid the forced detach wait (IOW, the latter 6 minutes)
    • Also, this is a force detach, where the status of the failed node is unknown, and it could still be running! This leads to problem (2)

NOTE: For StatefulSets, kubernetes waits for termination of the pod on the unreachable node, and only then restarts a new instance on one of the surviving nodes. This results in step (c) above never getting triggered. To overcome this, these StatefulSet pods need to be force deleted or the node needs to be removed from the kubernetes cluster. Post this force-deletion/node-removal, there would still be a 6 minute wait for the force detach to trigger and the new instance to start running.
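
For illustration, the force deletion itself is plain kubectl (pod, namespace, and node names are placeholders; only do this once the node is known to be down):

# Force-delete the stuck StatefulSet pod so a replacement can be scheduled:
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force
# Or remove the dead node from the cluster entirely:
kubectl delete node <node-name>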

Fencing/blacklisting stale clients: to prevent inadvertent volume corruption

Even with the above 11+ minute wait, if the node that was unreachable comes back online, there is a potential for data corruption if the volume is written to before the pod/mount/image-mapping on that node is garbage collected.

This situation needs fencing, or ceph blacklisting, for RWO mode PVs.

How best to overcome this latter part and provide the required blacklisting is being discussed in ceph/ceph-csi#578.

I experienced this issue as well. I am using a Deployment with a strategy of type Recreate and there’s a ceph-block volume attached to the workload. My node died overnight and there were pods stuck in Terminating and new ones stuck in ContainerCreating.

I was recalling my time using Longhorn and in their documentation they have a specific section on how to handle this issue with configuration here, would it be possible for rook-ceph to provide similar functionality?

This would be really cool to support since I have 3 OSDs being replicated and technically the pod could live anywhere in my cluster.

Edit: According to https://github.com/ceph/ceph-csi/issues/740#issuecomment-770750182, could there be some simple logic built in that could check if a node is NotReady and then delete the volumeattachments that are associated with it? If not, I guess it could be scripted somehow.
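
A rough shell sketch of how that could be scripted (assumes jq is available; the Node and VolumeAttachment fields are standard Kubernetes API fields, but review before running, since detaching volumes from a node that is still alive risks corruption):

#!/usr/bin/env bash
set -euo pipefail

# Find nodes whose Ready condition is not "True"
not_ready_nodes=$(kubectl get nodes -o json \
  | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready" and .status!="True")) | .metadata.name')

for node in $not_ready_nodes; do
  # Find VolumeAttachments still pinned to that node
  attachments=$(kubectl get volumeattachments -o json \
    | jq -r --arg n "$node" '.items[] | select(.spec.nodeName==$n) | .metadata.name')
  for va in $attachments; do
    echo "Deleting VolumeAttachment $va (node $node)"
    kubectl delete volumeattachment "$va"
  done
done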

There is a proposal to include node/volume fencing to the CSI specification. Kubernetes (and Ceph-CSI) would be able to initiate a more reliable fail-over when a node becomes unresponsive.

container-storage-interface/spec#477

@abh Workaround to find it: kubectl -n rook-ceph-system get volumes.rook.io <pvc/pv-name> -o yaml

...
apiVersion: rook.io/v1alpha2
attachments:
- clusterName: rook-ceph
  mountDir: /var/lib/kubelet/pods/b019b4d9-f4a3-41a7-a191-2a06c3f6bcf3/volumes/ceph.rook.io~rook-ceph-system/pvc-650bfc29-36a5-11e9-a06c-00505689b514
  node: node01
  podName: gitaly-0
  podNamespace: gitlab
  readOnly: false
kind: Volume
...

In mountDir: /var/lib/kubelet/pods/<pod-uid>/... you can see the old pod UID, and node shows the node it is attached to.

What I did (I’m not sure what happens in the background, so be careful, but I prayed 😃 and it worked with gitaly and a postgres cluster): I changed the old pod UID to the new pending pod’s UID, and node to the pending pod’s node name. I then needed to delete the pod, and it got scheduled to another node LoL. So I went back, modified the volumes.rook.io resource again, and deleted the pod again, repeating this until the new pod was rescheduled onto the same node I had written in node.
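
For reference, the edit described above is a plain kubectl edit of the resource shown earlier (the resource name comes from that example output; yours will differ):

kubectl -n rook-ceph-system edit volumes.rook.io pvc-650bfc29-36a5-11e9-a06c-00505689b514
# then change the pod UID inside mountDir and the node field
# to match the new pending pod, as described above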

Adding to 0.9 since this is hurting production deployments such as @dimm0

Indeed, by blocklisting the node/vm that does not respond anymore, it should be possible to recover from a certain set of failures. It should just be clear that blocklisting/fencing is not a bulletproof solution. Even when Rook and Ceph-CSI do support the functionality, there will be failures that can not (easily) be recovered from.

I hope Rook with Ceph-CSI can consume the CSI-Addons NetworkFence CRD. Ceph-CSI offers this option at the moment, and more consumers of the API would be most welcome.

I’m not in the list @travisn asked for feedback, but I still have something to say! 😃

What if step 4 (blocklisting) was implemented through another CRD (e.g. RookCephBlockedNode) with a name matching the blocklisted node? (A sketch follows the list below.)

This would have at least these 2 huge benefits:

  1. No need for additional tools for monitoring (a “normally” configured prometheus should already be able to do it)
  2. Unblocking (a manual step by an admin/script) would simply be a kubectl delete.
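
A purely hypothetical sketch of what such a CR could look like; the CRD, kind, and API group below do not exist in Rook and are only illustrative:

apiVersion: ceph.rook.io/v1                     # assumption: reuse Rook's API group
kind: RookCephBlockedNode                       # hypothetical kind suggested above
metadata:
  name: k8s-chase-ci-06.calit2.optiputer.net    # name matches the blocklisted node
# An admin (or script) would unblock the node simply with:
#   kubectl delete rookcephblockednode k8s-chase-ci-06.calit2.optiputer.net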

By itself the policy of force deleting pods is dangerous, which is why we have not yet implemented it. The volume could become corrupted in the following scenario:

  1. A node goes offline (e.g. power or partial network outage)
  2. Rook force deletes the app pod(s) with rbd volumes on the unresponsive node
  3. The app pods start on another node and start writing to the volume
  4. The original dead node comes back online, the app pod on that node is still running and writing data to the volume, and now suddenly there are two writers to the volume, which will corrupt the volume.

I don’t see any solution yet for the proposed csi spec update that would help here. Thus, we have documented how to blocklist a dead node so the pod can then be force deleted. Of course, manual intervention is a big pain point.
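
Roughly, those documented manual steps look like the following sketch (run from the Rook toolbox; node IP and pod names are placeholders, and older Ceph releases spell the command ceph osd blacklist):

# Block all future Ceph client traffic from the dead node's IP:
ceph osd blocklist add <node-ip>
# Force-delete the stuck application pod so it can be rescheduled:
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force
# After confirming the node is permanently gone, lift the blocklist entry:
ceph osd blocklist rm <node-ip>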

In the meantime, what about the following automation for Rook to help in this scenario?

  1. A node goes offline
  2. Rook detects the unresponsive node
  3. Wait for some configurable timeout, perhaps with a default of 15 minutes
  4. Rook blocklists the node IP so Ceph will block all future writes from that node’s IP
  5. Rook force deletes the application pod(s) on that node, allowing the pod to resume elsewhere

The admin would need to manually unblocklist a node after it is confirmed permanently offline and no longer a threat to data corruption. Rook simply cannot know when a node is safe to remove from the list.

The policy could be controlled by several policy settings such as:

nodePolicy:
  offlineWaitTimeBeforeForceDelete: 15m # time to wait before triggering the policy
  maxNodesToBlock: 3 # Stop blocking nodes after this count, to avoid blocking too many nodes. At some point, admin intervention is needed.
  forceDeleteDeployments: true
  forceDeleteStatefulSets: true
  includedNamespaces: # If empty, process app pods in all namespaces, otherwise only this list of namespaces
  - app1
  - app2
  skippedNamespaces: # if specified, process app pods in all namespaces except these
  - app3
  - app4

@leseb @BlaineEXE @Madhu-1 @humblec @nixpanic Is this feasible?

This isn’t actively being worked on for the flex driver since the focus has turned to the csi driver. Can you confirm if you’ve seen the same behavior with the csi driver?

@humblec @ShyamsundarR @dillaman Can you take a look at this issue? To summarize, how does the Ceph-CSI driver handle fencing? Does it use its own mechanism or query rbd directly if it is safe to attach a read-write-once volume?

Ceph-CSI also has this issue, as is discussed further in the comments above. There are various issues and active work/thoughts on getting this fixed, linked below, but until a variety of issues are resolved this needs admin intervention.

[1] ceph-CSI tracker for the issue: https://github.com/ceph/ceph-csi/issues/740
[2] Kubernetes storage SIG thread: https://groups.google.com/forum/#!topic/kubernetes-sig-storage/CRTVASR1atk
[3] Storage SIG presentation as of KubeCon, 3rd week of Nov 2019: https://docs.google.com/presentation/d/1UmZA37nFnp5HxTDtsDgRh0TRbcwtUMzc1XScf5C9Tqc/edit?usp=sharing
[4] Further fencing discussion in ceph-CSI: https://github.com/ceph/ceph-csi/issues/578

Rook creates a NetworkFence CR that fences the unresponsive node (there is no longer a need for the ceph osd blocklist command, right?)

Indeed, the NetworkFence CR will go through the csi-addons controller which passes the request on to the CSI driver. In case of Ceph-CSI, it will execute ceph osd blocklist (or similar through go-ceph). @Yuggupta27 is the authority on it, so he may want to correct/confirm it.

While the node loss documentation helps with visibility and attempting recovery, reopening since this is still fundamentally an issue.

@ShyamsundarR do you mind sending a doc PR with this write-up? Thanks.

Will do. I’m backed up a bit with other activities, so I may get the initial PR in late next week or early the week after.

Solved this by using the following config:

     terminationGracePeriodSeconds: 0
     tolerations:
     - effect: NoExecute
       key: node.kubernetes.io/unreachable
       operator: Exists
       tolerationSeconds: 300

Tested with kubernetes v1.12 with rook-ceph v14.2. Note that FlexVolume is used.

Hi,

I’m still a novice Rook and Kubernetes user, and while assessing k8s+rook as a platform for my new infrastructure I also ran into this problem while trying various scenarios.

From what @lorenz suggested, it looks like Ceph also needs to introduce some changes to RBD, which may take a long time or never happen.

I understand that what I’m about to suggest is naive, but I believe that, being optional, it may save a lot of problems while waiting for a proper solution. This is not a well-thought-out idea yet, but in my humble opinion it looks like it should work (to some degree 😃).

Well, what if we had another concept, called rook-fencer: an HTTP service (for simplicity, so that everybody could implement it any way they want) that is delivered/configured by an administrator specifically for their infrastructure, be it AWS, GCP, bare metal, or whatever else.

Its (the rook-fencer’s) purpose is to fence the node on demand.

Typical scenario: let’s say we run in AWS and one of the worker nodes kernel-panics. After some time kubernetes marks that node as unavailable and starts un-scheduling payloads from it. Stateless pods are evicted easily, but the ones with rook-ceph (or other rook plugins) are blocked, since the rook agent may not unmount the volume.

Here is where rook-fencer comes into play: after a (configurable) timeout a rook agent (the one that serves the node where the new pod is scheduled) performs an http call to the fencer service.

Given that a fencer service is implemented by the administrator - it may do whatever it needs to ensure that the node is properly fenced. In one case it might be a firewall, that blocks all ceph traffic to and from the node. In other cases - it might be node shutdown or restart.

Let’s say in this case the administrator implemented a fencer service that shuts down the node. So if the node is in a running state, it sends a shutdown request and polls until the node has shut down. Once it is shut down, it returns HTTP 200 or any other response that means “the fencing operation completed successfully”.

After this point the agent knows that the node is properly fenced and no traffic will be sent from it to ceph, so it’s safe to remove the volume mount from the kubernetes API.

This solution allows the implementation to be as simple or as robust as a particular infrastructure requires. For bare metal infrastructures it may be interaction with idrac’s, hardware routers or firewalls, etc. For cloud infrastructures - it’s even simpler, given they all provide APIs.

Technically the rook project may provide simple implementations for common cloud providers, but the idea is to design an API that the rook agents may consume.
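
To make that concrete, the call a rook agent makes could look something like the following; the endpoint, payload, and service name are entirely made up for illustration and do not exist in Rook:

# Hypothetical fencing request from a rook agent to the admin-provided fencer service:
curl -X POST http://rook-fencer.rook-ceph-system.svc/fence \
  -H 'Content-Type: application/json' \
  -d '{"node": "k8s-chase-ci-06.calit2.optiputer.net"}'
# A 200 response would mean "the fencing operation completed successfully",
# after which the agent could safely remove the stale volume attachment.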

@galexrt This definitely happens in 0.7.1. Spin up 3 nodes and blow one up by either purposely crashing it (echo c > /proc/sysrq-trigger) or fully disconnecting all networking. Then a new pod will be spawned which will stay in Creating until the original node is brought back up again.

it can be used. Read the sources for more explanations. https://github.com/kvaps/kube-fencing

This is not a rook or flex volume problem.

Fencing is the only way to guarantee that the volume will not be damaged.

CSI is now doing rbd status ... and checking if the volume is already mounted (which is great). The only part left is to remove the check on the kubernetes side, I guess…
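
For reference, that check can be reproduced manually from the toolbox; the pool/image names are assumed as earlier in this thread, and the output shown is only illustrative:

rbd status replicapool/pvc-31b7555e-122d-11e8-9422-0cc47a6be994
# Watchers:
#         watcher=<node-ip>:0/123456789 client.14527 cookie=18446462598732840961
# A remaining watcher indicates some node still has the image mapped/open.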

What’s the latest on this? Is any solution actively being worked on? I see it removed from any milestones.

Honestly this really needs to be done by Ceph. I’ve seen many distributed systems go down or destroy data because of separation of logic and functionality critical to their correct operation (like fencing or locking). And depending on how you attach the RBD device Ceph will absolutely allow multiple R/W attachments.

I have the same issue during updates of helm charts. Due to the kubernetes rolling update feature, it happens that if the updated version of a pod gets deployed on another node, then it cannot mount the volume, because it’s still mounted by the running pod.

FWIW I seem to be running into this issue as well. Even after restarting the previously-powered-off node the logs still show Volume is already attached by pod, and this prevents a new pod from attaching (e.g., having followed Rook’s Wordpress tutorial, the wordpress-mysql pod ends up stuck in a ContainerCreating status).

We can open a new issue for automating this behaviour and continue discussion on conditions for node loss and its detection over there if that works with everyone.

If we plan to automate this one, we need to put this behind a flag, as Rook will call the NetworkFence CR, which is neither native kubernetes nor owned by Rook.

Agreed that it is best to build on this KEP for detecting when nodes are offline and allowing K8s to take care of force deleting the app pods on the failed node. Then the remaining work is simpler for Rook to only worry about creating the NetworkFence CR.

👍 Sounds good to me.

In reality it will take quite some time to get the changes from the KEP into K8s. There is also not yet a proposal to automatically mark a node offline. Thus, I would still like to explore an option for Rook to take this step of automatically fencing nodes when they appear to be offline. I don’t see it ever being enabled by default, but will be a very useful option for admins to have. Somehow I don’t anticipate we will ever have a full solution from the K8s for this issue since it’s more of a storage problem for rbd to deal with to avoid corruption.

We can open a new issue for automating this behaviour and continue discussion on conditions for node loss and its detection over there if that works with everyone.

integrating blocklisting with kube-fencing, or https://github.com/medik8s would surely be a nice approach!

@nixpanic

Blocklisting the node will then cause the successfully running Pod to get I/O errors, and potentially corrupt data

Shouldn’t ceph tolerate network or hardware failures though?

Yes, Ceph will take care of OSD issues; the storage managed by Ceph will not have any problems. However, there is nothing that Ceph can do about partial data stored by applications (or a filesystem) running on a node that suddenly goes offline. When the application starts (or the filesystem gets mounted) on another node, the application (or filesystem) might need some recovery (rollback to some checkpoint, or fsck).

Imagine a failure that happens while uploading a large file to some application Pod using ext4 on RBD. If the application gets interrupted (node offline) while the upload is in progress, the large file will be partially stored on the ext4 filesystem backed by Ceph RBD. From a Ceph point of view, a partial file is perfectly acceptable. From the point of view of the application that needs to consume the file, a partial file will not be very useful. Some form of recovery is needed by the application (delete the partial file and have it re-uploaded). Similar problems can happen when the ext4 filesystem did not sync the latest data to the RBD image; in that case an fsck is required. Both types of recovery potentially require some user interaction (requesting a re-upload, handling fsck prompting). Asking users to do something (or informing an application about data issues) is not possible at this time.
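
For the filesystem case, the manual repair would look roughly like this sketch (assumes the replicapool pool name and an ext4 filesystem; rbd map prints the actual device path, and the volume must not be mounted anywhere while this runs):

# Map the image on a healthy node and repair the filesystem manually:
rbd map replicapool/pvc-31b7555e-122d-11e8-9422-0cc47a6be994   # prints e.g. /dev/rbd0
fsck.ext4 -y /dev/rbd0
rbd unmap /dev/rbd0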

integrating blocklisting with kube-fencing, or https://github.com/medik8s would surely be a nice approach!

While the proposal seems reasonable, I’m afraid this might go beyond what Rook is capable of doing. This looks to me like a machine-level type of problem and Rook is really high in the stack when trying to find out about the issue. Relying on the control plane to determine whether a node is alive or not might be sufficient. As far as I can tell (quick glance) this is how https://github.com/kvaps/kube-fencing seems to be implemented. It’s probably better than nothing to rely on the control plane but it’s not a 100% source of truth. We need something like Pacemaker along with STONITH (Fencing) agent to handle these cases properly.

  1. Rook force deletes the application pod(s) on that node, allowing the pod to resume elsewhere

There is still an issue with this step, I think:

Kubernetes might still assume the volume is mounted on the non-responsive node, and will likely not mount it on another node (RWO and RWOP volumes). Deleting or updating the VolumeAttachments for the unresponsive node might be required?
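
Checking and cleaning up those attachments is standard kubectl (the attachment name is a placeholder):

# List VolumeAttachments still pinned to the unresponsive node:
kubectl get volumeattachments -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName
# Delete the stale attachment so the attach/detach controller can re-attach the PV elsewhere:
kubectl delete volumeattachment <attachment-name>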

In my opinion, blocklisting nodes is a cluster operation, so it is something that Rook should manage. Ceph-CSI only provides a (simple) storage interface for creating/deleting/mounting/unmounting volumes, and is not aware of any state of the Ceph cluster. So a RookCephBlockedNode CRD makes sense to me.

Any automation around this is dangerous. It is possible a Pod is running successfully even when the Kubernetes management plane is non-responsive on a node. Blocklisting the node will then cause the successfully running Pod to get I/O errors, and potentially corrupt data… Restarting the Pod on another node might not be able to recover the corrupted data (a filesystem could require fsck, possibly needing user input, which is not possible, causing a full outage of the Pod; similar for partially written application data).

Would it be possible/sensible to also issue a ceph osd blacklist add <ip> in that scenario?

What about writes inside a network partition that occurred in the timeframe where the node was partitioned and not yet evicted? If the partition reconnects and the main group has had writes since then there is no way to resolve that. Sounds risky.

I’m talking about this specific issue. Applications using the block device in single-access mode. Those need to failover to a new node when node fails.

@lorenz Thanks for the info.

So basically there is currently no way with a Ceph cluster managed by Rook to make a single-instance application fail over to another node when using rbd, right? This is a huge problem for us and for many other people, I think. Unfortunately some legacy applications simply don’t scale horizontally or support failover…

As far as I understand it, this problem does not exist when deploying Ceph directly on a host (we were hoping to avoid that, but if it is not possible then we might have to)?

@guilhermeblanco This issue is about mounting volumes previously attached to pods without clean detachment (because the node hung/disconnected/crashed). Your issue is about your underlying Ceph cluster failing. Just a small comment about that: Ceph will not gracefully handle trashing its data. The cluster will fail and usually take your data with it. If you need to migrate the data, you need to explicitly tell Ceph to migrate the data (and the mons) to the new nodes and wait for that process to complete before nuking the old ones.