kubernetes: CSI VolumeAttachment slows pod startup time as # concurrent attaches increases

What happened:

A reporter testing their CSI driver at scale noticed issues:

Specifically, they reported that pod startup time is impacted by the number of concurrent CSI volume attachments in progress across a cluster.

According to the reporter, pod startup time jumps from on the order of seconds (when only a few volume attachments are happening concurrently) to 1-2 minutes once there are >1300 concurrent volume attachments for pods using those attachments.

Similarly, the reporter indicates that volume detach operations jump from on the order of seconds (when only a few volume attach/detach operations are happening concurrently) to 3-4 minutes once there are >1300 concurrent volume attachments.

The reporter mentioned that the slowness, once encountered, does not go away until the number of concurrent volume attaches is reduced below 500.

What you expected to happen:

Ideally, as the number of CSI volume attachments started in parallel increases, the time it takes to start pods should remain constant. We should identify the bottlenecks and remove as many as possible.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

/sig storage
/sig scalability
/priority important-soon
CC @msau42

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 27 (15 by maintainers)

Most upvoted comments

@cduchesne did some great debugging on this. Here is what he found:

  • The Kubernetes volume layer has code to periodically verify that volumes are attached.
  • It triggers every minute by default.
  • For CSI, the Kubernetes volume code has NOT implemented “bulk verify volume attach”, so Kubernetes falls back to calling “verify volumes attached” PER NODE.
  • The Kubernetes CSI code fetches a VolumeAttachment object FOR EVERY ATTACHED VOLUME on the specified node as part of VolumesAreAttached(...) to verify that it is still attached.

https://github.com/kubernetes/kubernetes/blob/037751e7ad2cd18db5b4e2a20ba894314c522b15/pkg/volume/csi/csi_attacher.go#L199

For clusters with many nodes and attached volumes, this results in so many calls to fetch VolumeAttachment objects from the Kubernetes API server that the kube-controller-manager starts to get throttled by its client-side rate limiter (as reported).
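
A minimal Go sketch of that per-volume pattern, written against current client-go; the names and layout here are illustrative, the real logic lives in pkg/volume/csi/csi_attacher.go. At the kube-controller-manager's default client QPS of 20, 1300 such GETs alone consume over a minute of API budget per sync, which lines up with the reported slowdown.

```go
package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// verifyAttachedPerVolume mirrors the pattern described above: one API GET
// per VolumeAttachment on the node being verified, issued every sync period.
// Illustrative sketch only; not the actual upstream code.
func verifyAttachedPerVolume(ctx context.Context, client kubernetes.Interface, attachmentNames []string) (map[string]bool, error) {
	attached := make(map[string]bool, len(attachmentNames))
	for _, name := range attachmentNames {
		// One round trip to the API server per attached volume.
		va, err := client.StorageV1().VolumeAttachments().Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return nil, fmt.Errorf("getting VolumeAttachment %s: %w", name, err)
		}
		attached[name] = va.Status.Attached
	}
	return attached, nil
}
```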

If you encounter this, as a short-term workaround, try one of the following (example invocations after the list):

  • Set the disable-attach-detach-reconcile-sync flag on the kube-controller-manager to true.
    • This completely disables the feature where k8s periodically verifies that the volumes it thinks are attached are actually attached.
    • For now this should be safe because, although the k8s side checks the VolumeAttachment, the CSI external-attacher doesn’t periodically update the VolumeAttachment object anyway (https://github.com/kubernetes/kubernetes/issues/79743 tracks fixing that).
  • Set the attach-detach-reconcile-sync-period value to a longer period (longer than 1 minute).
    • This flag controls the frequency of the verification; the default value is 1 minute. If you find that you really need the feature, you can keep it on and increase the interval to mitigate this issue.
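
As a rough example of those flags (other flags omitted; adapt this to however the kube-controller-manager is actually launched in your cluster, e.g. a static pod manifest or systemd unit):

```sh
# Option 1: disable the periodic "verify volumes are attached" sync entirely.
kube-controller-manager --disable-attach-detach-reconcile-sync=true ...

# Option 2: keep the sync, but run it much less often than the 1m default.
kube-controller-manager --attach-detach-reconcile-sync-period=10m ...
```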

Longer term, the plan is:

  • Fix https://github.com/kubernetes/kubernetes/issues/79743
    • @davidz627 extended the CSI spec in v1.2.0 (see PR #374) so that the external-attacher can periodically update VolumeAttachment objects in an efficient manner.
  • Fix this issue by, ideally, finding a way to implement BulkVerifyVolumes for CSI or, at least, fixing the existing VolumesAreAttached(...) CSI implementation to “list” VolumeAttachment objects instead of doing a “get” for every one (see the sketch after this list).
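
To make the “list” direction concrete, here is a minimal sketch of that idea under the same assumptions as the earlier snippet (illustrative only, not the actual upstream change): verification pays for one LIST per sync instead of one GET per attached volume.

```go
package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// verifyAttachedByList fetches all VolumeAttachment objects in a single LIST
// call and answers every per-volume "is it still attached?" question from
// that one response, so the API cost no longer grows with the number of
// attached volumes being verified.
func verifyAttachedByList(ctx context.Context, client kubernetes.Interface, attachmentNames []string) (map[string]bool, error) {
	vaList, err := client.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, fmt.Errorf("listing VolumeAttachments: %w", err)
	}

	// Index the listed objects by name once.
	byName := make(map[string]bool, len(vaList.Items))
	for _, va := range vaList.Items {
		byName[va.Name] = va.Status.Attached
	}

	attached := make(map[string]bool, len(attachmentNames))
	for _, name := range attachmentNames {
		attached[name] = byName[name] // a missing VolumeAttachment reads as "not attached"
	}
	return attached, nil
}
```

Serving these reads out of a shared informer cache in the attach-detach controller would cut the API traffic even further; the sketch just shows the smallest possible change.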

We are also facing this issue. It is taking almost 3 hours to create a VolumeAttachment object. Is there any way to solve this?

@davidz627, how will LIST_VOLUMES_PUBLISHED_NODES help here? Could you please help me understand the whole workflow?

I tried the recent 2.1.1 external-attacher, and the problem still exists… I also increased the reconcile sync period to 10 minutes. Is there any solution now?