kubernetes: Slow volume attach for single ReadOnlyMany volume across multiple nodes
What happened: A GCE PD volume is used in read-only mode (through a PVC & PV pair with the ReadOnlyMany access mode) by over 100 pods running across tens of nodes. Volume attachment sometimes takes over 30 minutes for some of these pods.
What you expected to happen: Attaching should take seconds instead of minutes.
How to reproduce it (as minimally and precisely as possible):
- Provision a read-only GCE PD, and the associated PVC & PV.
- Run a Job with a parallelism of 100+ pods in a cluster with tens of nodes (we tested with 50+), and use the PD inside all pods (a minimal sketch of such a Job follows below).
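For illustration, a minimal client-go sketch of such a Job. The PVC name, image, namespace, parallelism, and the context-aware Create signature of recent client-go releases are assumptions, not taken from the original report:

```go
package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	parallelism := int32(100)
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "pd-readers"},
		Spec: batchv1.JobSpec{
			Parallelism: &parallelism,
			Completions: &parallelism,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "reader",
						Image:   "busybox",
						Command: []string{"sh", "-c", "ls /data && sleep 300"},
						VolumeMounts: []corev1.VolumeMount{{
							Name:      "shared-pd",
							MountPath: "/data",
							ReadOnly:  true,
						}},
					}},
					Volumes: []corev1.Volume{{
						Name: "shared-pd",
						VolumeSource: corev1.VolumeSource{
							PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
								ClaimName: "shared-pd-claim", // hypothetical PVC bound to the ReadOnlyMany PV
								ReadOnly:  true,
							},
						},
					}},
				},
			},
		},
	}

	// Every pod mounts the same GCE PD read-only, so the attach/detach
	// controller must attach the volume to every node the pods land on.
	if _, err := client.BatchV1().Jobs("default").Create(context.TODO(), job, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```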
Anything else we need to know?: Generally attach/detach calls are parallelized (through the OperationExecutor), but calls for the same volume name are serialized. This was done to prevent races between attach and detach calls for the same volume. However, it’s sufficient to provide this guarantee per node, as long as the volume is Read[Write|Only]Many.
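For illustration, a stripped-down sketch (not the actual OperationExecutor/NestedPendingOperations code; identifiers are illustrative) of the difference between keying in-flight operations on the volume name alone and keying them on volume+node:

```go
package serialize

import (
	"fmt"
	"sync"
)

// pendingOps is a simplified model of "operations that share a key are
// serialized"; the real operation executor is more involved.
type pendingOps struct {
	mu       sync.Mutex
	inFlight map[string]bool
}

// Run starts op in the background unless another operation with the same
// key is already in flight, in which case it is rejected and left for the
// caller to retry later.
func (p *pendingOps) Run(key string, op func()) error {
	p.mu.Lock()
	if p.inFlight == nil {
		p.inFlight = map[string]bool{}
	}
	if p.inFlight[key] {
		p.mu.Unlock()
		return fmt.Errorf("operation for %q is already pending", key)
	}
	p.inFlight[key] = true
	p.mu.Unlock()

	go func() {
		defer func() {
			p.mu.Lock()
			delete(p.inFlight, key)
			p.mu.Unlock()
		}()
		op()
	}()
	return nil
}

// attachKey is where the behaviour described above changes: keying on the
// volume name alone serializes attaches of the same volume to different
// nodes, while for Read[Write|Only]Many volumes it is enough to key on
// volume+node so attaches to different nodes can proceed in parallel.
func attachKey(volumeName, nodeName string, multiAttachAllowed bool) string {
	if multiAttachAllowed {
		return volumeName + "^" + nodeName
	}
	return volumeName
}
```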
Environment:
- Kubernetes version (use `kubectl version`): Tested with multiple 1.10 patch releases (including 1.10.6 and 1.10.9)
- Cloud provider or hardware configuration: GCP
/sig storage
/priority critical-urgent
/assign
/cc @davidz627 @jingxu97 @msau42 @saad-ali
I think @verult is right. Created #85166
Answering my own questions above:

For (3), `IsMultiAttachForbidden()` probably wasn’t designed with parallelism in mind, i.e. its original intention was probably not to guard against races. I confirmed by inspection that the race @gnufied mentioned in his #2 point does exist: https://github.com/kubernetes/kubernetes/blob/7bac6ca73ad6fa1b0b637e801a524d6a1df663ea/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L272
`GetNodesForAttachedVolume()` returns the list of nodes the volume is attached to according to the ASW, and that information is only updated at the end of `Attach()`, which runs in parallel.

For (2), putting multi-attach logic in `IsOperationPending()` or anywhere else in the operation executor could get messy. I’d like to avoid that if possible, and instead fix the race with the logic that already exists in the reconciler. IMO the multi-attach check should look at the list of volumes that are intended to be attached (i.e. read from the DSW), rather than those that are actually attached.
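To make that concrete, a simplified sketch of a DSW-based check; the types and accessors here are illustrative stand-ins, not the real attach/detach controller caches:

```go
package multiattach

// forbidden reports whether attaching the volume to targetNode must be
// blocked. For Read[Write|Only]Many volumes nothing is forbidden. For
// ReadWriteOnce volumes the check walks the desired state of world (the
// nodes we intend to attach the volume to), so a racing Attach() that has
// not yet updated the actual state of world is still accounted for.
func forbidden(readWriteOnce bool, dswNodesForVolume []string, targetNode string) bool {
	if !readWriteOnce {
		return false
	}
	for _, node := range dswNodesForVolume {
		if node != targetNode {
			// Some other node already wants this single-attach volume.
			return true
		}
	}
	return false
}
```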
Sorry for not being clear in my earlier comment. My comment was in response to @davidz627's observation that `DiskIsAttached` is the heaviest part of re-attaching a disk that is already attached.

My point was that if you have one disk and many GCE instances, it is much faster to use `disks.get` to fetch the disk and check all instances against its `users` array than to make a separate API call for each instance. It will also keep you under the GCE read API quota more effectively. It is a good optimization when the number of pods using the volume is larger than the number of nodes in the cluster.

It does not improve the flow where the disk needs to be attached to a node; the parallelization efforts being discussed now will help with that.
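For illustration, a sketch of that single-call check against the GCE compute API (google.golang.org/api/compute/v1); the function name is made up, error handling is minimal, and project/zone/disk are placeholders:

```go
package gcecheck

import (
	"context"
	"strings"

	"google.golang.org/api/compute/v1"
)

// attachedInstances returns the instances the disk is attached to, read
// from the disk's `users` field with a single disks.get call, instead of
// issuing one instances.get per node.
func attachedInstances(ctx context.Context, svc *compute.Service, project, zone, disk string) (map[string]bool, error) {
	d, err := svc.Disks.Get(project, zone, disk).Context(ctx).Do()
	if err != nil {
		return nil, err
	}
	users := make(map[string]bool, len(d.Users))
	for _, u := range d.Users {
		// Each entry is a full resource URL ending in the instance name,
		// e.g. .../zones/<zone>/instances/<name>; keep the last segment.
		parts := strings.Split(u, "/")
		users[parts[len(parts)-1]] = true
	}
	return users, nil
}
```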
If we do plan to serialize operations on the node+volume combination, we also have to fix https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L264, which will again try to serialize operations based on the volume name alone.
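A hypothetical illustration of that point: if in-flight operations are keyed on volume+node, the reconciler's pending-operation guard has to use the same key, otherwise an in-flight attach of the volume on one node would still block attaching it to another. The interface below is narrowed for illustration and does not match the real operation executor's signature:

```go
package reconcilercheck

// operationExecutor is a narrowed, illustrative interface; the real
// operation executor's IsOperationPending takes different arguments.
type operationExecutor interface {
	IsOperationPending(volumeName, nodeName string) bool
}

// shouldStartAttach only skips the attach when an operation for the same
// volume on the same node is in flight, matching the per-volume+node
// serialization discussed above.
func shouldStartAttach(exec operationExecutor, volumeName, nodeName string) bool {
	return !exec.IsOperationPending(volumeName, nodeName)
}
```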
I think @droyo suggests changing the implementation of https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/gce/gce_disks.go#L580. Currently it gets the instances and checks the disk against each instance. The other option is to use the get-disk API to verify whether the disk is attached to the node. I am not sure how much of a performance difference there is between the two.
@davidz627 in the case we are seeing, because there's a large number of nodes, the number of replicas scheduled per node is likely pretty low, so re-attach is not the common case. I agree, though, that an average of 18s per attach call still doesn't feel quite right and there might be another error at play here. Per @gnufied's suggestion, I'll take a look at operation latency.
Regardless of whether it's the sole cause, the fact that attach calls are serialized even when they are for different nodes seems like a scalability bug to me, especially for Read[Write|Only]Many volumes.