kubernetes: Slow volume attach for single ReadOnlyMany volume across multiple nodes
What happened: A GCE PD volume is used in read-only mode (through a PVC & PV pair with the ReadOnlyMany access mode) by over 100 pods running across tens of nodes. Volume attachment sometimes takes over 30 minutes for some of these pods.
What you expected to happen: Attaching should take seconds instead of minutes.
How to reproduce it (as minimally and precisely as possible):
- Provision a read-only GCE PD, and the associated PVC & PV.
- Run a Job with a parallelism of 100+ pods in a cluster with tens of nodes (we tested with 50+), and use the PD inside all pods (a minimal sketch of such a Job follows below).
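For illustration, a minimal client-go sketch of such a Job. The PVC name, image, namespace, parallelism, and the context-aware Create signature of recent client-go releases are assumptions, not taken from the original report:

```go
package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	parallelism := int32(100)
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "pd-readers"},
		Spec: batchv1.JobSpec{
			Parallelism: &parallelism,
			Completions: &parallelism,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "reader",
						Image:   "busybox",
						Command: []string{"sh", "-c", "ls /data && sleep 300"},
						VolumeMounts: []corev1.VolumeMount{{
							Name:      "shared-pd",
							MountPath: "/data",
							ReadOnly:  true,
						}},
					}},
					Volumes: []corev1.Volume{{
						Name: "shared-pd",
						VolumeSource: corev1.VolumeSource{
							PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
								ClaimName: "shared-pd-claim", // hypothetical PVC bound to the ReadOnlyMany PV
								ReadOnly:  true,
							},
						},
					}},
				},
			},
		},
	}

	// Every pod mounts the same GCE PD read-only, so the attach/detach
	// controller must attach the volume to every node the pods land on.
	if _, err := client.BatchV1().Jobs("default").Create(context.TODO(), job, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```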
Anything else we need to know?: Generally attach/detach calls are parallelized (through the OperationExecutor), but calls for the same volume name are serialized. This was done to prevent races between attach and detach calls for the same volume. However, it’s sufficient to provide this guarantee per node, as long as the volume is Read[Write|Only]Many.
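For illustration, a stripped-down sketch (not the actual OperationExecutor/NestedPendingOperations code; identifiers are illustrative) of the difference between keying in-flight operations on the volume name alone and keying them on volume+node:

```go
package serialize

import (
	"fmt"
	"sync"
)

// pendingOps is a simplified model of "operations that share a key are
// serialized"; the real operation executor is more involved.
type pendingOps struct {
	mu       sync.Mutex
	inFlight map[string]bool
}

// Run starts op in the background unless another operation with the same
// key is already in flight, in which case it is rejected and left for the
// caller to retry later.
func (p *pendingOps) Run(key string, op func()) error {
	p.mu.Lock()
	if p.inFlight == nil {
		p.inFlight = map[string]bool{}
	}
	if p.inFlight[key] {
		p.mu.Unlock()
		return fmt.Errorf("operation for %q is already pending", key)
	}
	p.inFlight[key] = true
	p.mu.Unlock()

	go func() {
		defer func() {
			p.mu.Lock()
			delete(p.inFlight, key)
			p.mu.Unlock()
		}()
		op()
	}()
	return nil
}

// attachKey is where the behaviour described above changes: keying on the
// volume name alone serializes attaches of the same volume to different
// nodes, while for Read[Write|Only]Many volumes it is enough to key on
// volume+node so attaches to different nodes can proceed in parallel.
func attachKey(volumeName, nodeName string, multiAttachAllowed bool) string {
	if multiAttachAllowed {
		return volumeName + "^" + nodeName
	}
	return volumeName
}
```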
Environment:
- Kubernetes version (use `kubectl version`): Tested with multiple 1.10 patch releases (including 1.10.6 and 1.10.9)
- Cloud provider or hardware configuration: GCP
/sig storage
/priority critical-urgent
/assign
/cc @davidz627 @jingxu97 @msau42 @saad-ali
I think @verult is right. Created #85166
Answering my own questions above:

For (3), `IsMultiAttachForbidden()` probably wasn’t designed with parallelism in mind, i.e. its original intention was probably not to guard against races. I confirmed by inspection that the race @gnufied mentioned in his #2 point does exist: https://github.com/kubernetes/kubernetes/blob/7bac6ca73ad6fa1b0b637e801a524d6a1df663ea/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L272
`GetNodesForAttachedVolume()` returns the list of nodes the volume is attached to according to the ASW, and that information is only updated at the end of `Attach()`, which runs in parallel.

For (2), putting multi-attach logic in `IsOperationPending()` or anywhere else in the operation executor could get messy. I’d like to avoid that if possible, and instead fix the race with the logic that already exists in the reconciler. IMO the multi-attach check should look at the list of volumes that are intended to be attached (i.e. read from the DSW), rather than those that are actually attached.
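To make that concrete, a simplified sketch of a DSW-based check; the types and accessors here are illustrative stand-ins, not the real attach/detach controller caches:

```go
package multiattach

// forbidden reports whether attaching the volume to targetNode must be
// blocked. For Read[Write|Only]Many volumes nothing is forbidden. For
// ReadWriteOnce volumes the check walks the desired state of world (the
// nodes we intend to attach the volume to), so a racing Attach() that has
// not yet updated the actual state of world is still accounted for.
func forbidden(readWriteOnce bool, dswNodesForVolume []string, targetNode string) bool {
	if !readWriteOnce {
		return false
	}
	for _, node := range dswNodesForVolume {
		if node != targetNode {
			// Some other node already wants this single-attach volume.
			return true
		}
	}
	return false
}
```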
Sorry for not being clear in my earlier comment. My comment was in response to @davidz627's observation that `DiskIsAttached` is the heaviest part of re-attaching a disk that is already attached.

My point was that if you have one disk and many GCE instances, it is much faster to use `disks.get` to fetch the disk and check all instances against its `users` array than to make a separate API call for each instance. It will also keep you under the GCE read API quota more effectively. It is a good optimization when the number of pods using the volume is larger than the number of nodes in the cluster.

It does not improve the flow where the disk needs to be attached to a node; the parallelization efforts being discussed now will help with that.
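For illustration, a sketch of that single-call check against the GCE compute API (google.golang.org/api/compute/v1); the function name is made up, error handling is minimal, and project/zone/disk are placeholders:

```go
package gcecheck

import (
	"context"
	"strings"

	"google.golang.org/api/compute/v1"
)

// attachedInstances returns the instances the disk is attached to, read
// from the disk's `users` field with a single disks.get call, instead of
// issuing one instances.get per node.
func attachedInstances(ctx context.Context, svc *compute.Service, project, zone, disk string) (map[string]bool, error) {
	d, err := svc.Disks.Get(project, zone, disk).Context(ctx).Do()
	if err != nil {
		return nil, err
	}
	users := make(map[string]bool, len(d.Users))
	for _, u := range d.Users {
		// Each entry is a full resource URL ending in the instance name,
		// e.g. .../zones/<zone>/instances/<name>; keep the last segment.
		parts := strings.Split(u, "/")
		users[parts[len(parts)-1]] = true
	}
	return users, nil
}
```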
If we do plan to serialize operations on the node+volume combination, we also have to fix https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L264, which will again try to serialize operations based on the volume name alone.
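A hypothetical illustration of that point: if in-flight operations are keyed on volume+node, the reconciler's pending-operation guard has to use the same key, otherwise an in-flight attach of the volume on one node would still block attaching it to another. The interface below is narrowed for illustration and does not match the real operation executor's signature:

```go
package reconcilercheck

// operationExecutor is a narrowed, illustrative interface; the real
// operation executor's IsOperationPending takes different arguments.
type operationExecutor interface {
	IsOperationPending(volumeName, nodeName string) bool
}

// shouldStartAttach only skips the attach when an operation for the same
// volume on the same node is in flight, matching the per-volume+node
// serialization discussed above.
func shouldStartAttach(exec operationExecutor, volumeName, nodeName string) bool {
	return !exec.IsOperationPending(volumeName, nodeName)
}
```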
I think @droyo suggests changing the implementation of https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/gce/gce_disks.go#L580. Currently it gets the instances and checks the disk against each instance. The other option is to use the get-disk API to verify whether the disk is attached to the node. I am not sure how much of a performance difference there is between the two.
@davidz627 in the case we are seeing, because there's a large number of nodes, the number of replicas scheduled per node is likely pretty low, so re-attach is not the common case. I agree, though, that an average of 18s per attach call still doesn't feel quite right and there might be another error at play here. Per @gnufied's suggestion, I'll take a look at operation latency.
Regardless of whether it's the sole cause, the fact that attach calls are serialized even when they are for different nodes seems like a scalability bug to me, especially for Read[Write|Only]Many volumes.