kubernetes: Dynamic volume provisioning creates EBS volume in the wrong availability zone
What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): dynamic volume provisioning
Is this a BUG REPORT or FEATURE REQUEST? (choose one): bug report
Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.1", GitCommit:"82450d03cb057bab0950214ef122b67c83fb11df", GitTreeState:"clean", BuildDate:"2016-12-22T13:59:22Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.1", GitCommit:"82450d03cb057bab0950214ef122b67c83fb11df", GitTreeState:"clean", BuildDate:"2016-12-14T00:52:01Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
Environment:
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release): CoreOS 1185.5.0
- Kernel (e.g.
uname -a): Linux ip-10-0-1-121.ec2.internal 4.7.3-coreos-r3 #1 SMP Wed Dec 7 09:29:55 UTC 2016 x86_64 Intel® Xeon® CPU E5-2676 v3 @ 2.40GHz GenuineIntel GNU/Linux
- Install tools: kaws
- Others:
What happened:
Created a stateful set with a persistent volume claim. Dynamic volume provisioning created an EBS volume in the us-east-1a availability zone, despite all the masters and nodes in the cluster being in us-east-1e. I tried it twice with the same results both times.
The PVC:
$ kubectl describe pvc -n errbit
Name: mongodb-mongodb-0
Namespace: errbit
StorageClass: standard
Status: Bound
Volume: pvc-5fc8e90b-c8aa-11e6-9924-069508572ed2
Labels: app=errbit
component=mongodb
Capacity: 1Gi
Access Modes: RWO
No events.
The PV created by dynamic provisioning:
$ kubectl describe pv
Name: pvc-5fc8e90b-c8aa-11e6-9924-069508572ed2
Labels: failure-domain.beta.kubernetes.io/region=us-east-1
failure-domain.beta.kubernetes.io/zone=us-east-1a
StorageClass: standard
Status: Bound
Claim: errbit/mongodb-mongodb-0
Reclaim Policy: Delete
Access Modes: RWO
Capacity: 1Gi
Message:
Source:
Type: AWSElasticBlockStore (a Persistent Disk resource in AWS)
VolumeID: aws://us-east-1a/vol-0d32c25dc73f029af
FSType: ext4
Partition: 0
ReadOnly: false
No events.
The stateful set:
$ kubectl describe statefulset mongodb -n errbit
Name: mongodb
Namespace: errbit
Image(s): mongo:3.4.0
Selector: app=errbit,component=mongodb
Labels: app=errbit,component=mongodb
Replicas: 1 current / 1 desired
Annotations: kubectl.kubernetes.io/last-applied-configuration={"kind":"StatefulSet","apiVersion":"apps/v1beta1","metadata":{"name":"mongodb","namespace":"errbit","creationTimestamp":null,"labels":{"app":"errbit","component":"mongodb"}},"spec":{"replicas":1,"template":{"metadata":{"creationTimestamp":null,"labels":{"app":"errbit","component":"mongodb"}},"spec":{"containers":[{"name":"mongodb","image":"mongo:3.4.0","args":["--auth"],"ports":[{"name":"mongodb","containerPort":27017}],"resources":{},"volumeMounts":[{"name":"mongodb","mountPath":"/data/db"}]}]}},"volumeClaimTemplates":[{"metadata":{"name":"mongodb","creationTimestamp":null},"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"1Gi"}}},"status":{}}],"serviceName":"mongodb"},"status":{"replicas":0}}
CreationTimestamp: Thu, 22 Dec 2016 16:53:24 -0800
Pods Status: 0 Running / 1 Waiting / 0 Succeeded / 0 Failed
No volumes.
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
12m 12m 1 {statefulset } Normal SuccessfulCreate pet: mongodb-0
11m 11m 1 {statefulset } Normal SuccessfulCreate pvc: mongodb-mongodb-0
The stateful set’s pod, pending due to the volume being in the wrong zone:
$ kubectl describe pod mongodb-0 -n errbit
Name: mongodb-0
Namespace: errbit
Node: /
Labels: app=errbit
component=mongodb
Status: Pending
IP:
Controllers: StatefulSet/mongodb
Containers:
mongodb:
Image: mongo:3.4.0
Port: 27017/TCP
Args:
--auth
Volume Mounts:
/data/db from mongodb (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-mdqdk (ro)
Environment Variables: <none>
Conditions:
Type Status
PodScheduled False
Volumes:
mongodb:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: mongodb-mongodb-0
ReadOnly: false
default-token-mdqdk:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-mdqdk
QoS Class: BestEffort
Tolerations: <none>
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
15m 15m 3 {default-scheduler } Warning FailedScheduling [SchedulerPredicates failed due to PersistentVolume 'pvc-a5f5e714-c8a8-11e6-9924-069508572ed2' not found, which is unexpected., SchedulerPredicates failed due to PersistentVolume 'pvc-a5f5e714-c8a8-11e6-9924-069508572ed2' not found, which is unexpected.]
14m 14m 1 {default-scheduler } Warning FailedScheduling [SchedulerPredicates failed due to PersistentVolumeClaim is not bound: "mongodb-mongodb-0", which is unexpected., SchedulerPredicates failed due to PersistentVolumeClaim is not bound: "mongodb-mongodb-0", which is unexpected.]
14m 2s 52 {default-scheduler } Warning FailedScheduling pod (mongodb-0) failed to fit in any node
fit failure summary on nodes : NoVolumeZoneConflict (2)
The nodes, all in us-east-1e:
$ kubectl describe nodes | grep zone
failure-domain.beta.kubernetes.io/zone=us-east-1e
failure-domain.beta.kubernetes.io/zone=us-east-1e
failure-domain.beta.kubernetes.io/zone=us-east-1e
failure-domain.beta.kubernetes.io/zone=us-east-1e
What you expected to happen:
Dynamic volume provisioning should have created the required volume in the us-east-1e availability zone.
How to reproduce it (as minimally and precisely as possible):
Add the following storage class to the cluster:
---
kind: "StorageClass"
apiVersion: "storage.k8s.io/v1beta1"
metadata:
  name: "standard"
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: "kubernetes.io/aws-ebs"
parameters:
  type: "gp2"
  encrypted: "true"
Create the following stateful set and service:
---
kind: "Namespace"
apiVersion: "v1"
metadata:
  name: "errbit"
---
kind: "Service"
apiVersion: "v1"
metadata:
  name: "mongodb"
  namespace: "errbit"
  labels:
    app: "errbit"
    component: "mongodb"
spec:
  ports:
  - name: "mongodb"
    port: 27017
  clusterIP: "None"
  selector:
    app: "errbit"
    component: "mongodb"
---
kind: "StatefulSet"
apiVersion: "apps/v1beta1"
metadata:
  name: "mongodb"
  namespace: "errbit"
  labels:
    app: "errbit"
    component: "mongodb"
spec:
  serviceName: "mongodb"
  replicas: 1
  template:
    metadata:
      labels:
        app: "errbit"
        component: "mongodb"
    spec:
      containers:
      - name: "mongodb"
        image: "mongo:3.4.0"
        args:
        - "--auth"
        ports:
        - containerPort: 27017
          name: "mongodb"
        volumeMounts:
        - name: "mongodb"
          mountPath: "/data/db"
  volumeClaimTemplates:
  - metadata:
      name: "mongodb"
    spec:
      accessModes:
      - "ReadWriteOnce"
      resources:
        requests:
          storage: "1Gi"
To add a twist to this… I’m provisioning the Gitlab helm chart (from the Gitlab repository) which provisions 3 PVCs which are used by a single pod. My nodes are in AZs us-east-1a and us-east-1c. The dynamically created PVs are each created in a random AZ (I assume either us-east-1a or us-east-1c) so the pod can only be created if all the PVs by chance happen to be created in the same AZ. Most of the time they’re created in different AZs so there’s no one AZ the pod can be created in which satisfies the NoVolumeZoneConflict predicate. Seems to me the scheduler should keep all volumes for a single deployment within the same AZ.
I think I see where the issue is. The docs for EBS provisioning say:
However, I have not found any logic that chooses the zone that way.
aws.Cloud.CreateDisk calls its own aws.Cloud.getAllZones method to populate the list of zones to choose from when creating a disk when the storage class/PVC doesn’t request a specific zone. But getAllZones gets the zones of all EC2 instances, not filtered to the Kubernetes cluster by any means at all. This list of zones is passed to volume.util.ChooseZoneForVolume to pick a zone from the collection, but that function only attempts to distribute PVs across the provided zones. As such, if you have any EC2 instances running in a zone other than where your Kubernetes nodes are running, Kubernetes may pick the wrong zone.
Any updates from the storage and/or AWS teams on this? We’re currently unable to use dynamic volume provisioning because of this problem.
Feature issue is here: https://github.com/kubernetes/enhancements/issues/490
1.12 blog post with examples is here: https://kubernetes.io/blog/2018/10/11/topology-aware-volume-provisioning-in-kubernetes/
Official documentation is here: https://kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode
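For anyone landing here, a minimal sketch of a StorageClass using the delayed binding mode those links describe, so the volume is only provisioned once a pod is scheduled and therefore lands in a zone that has a node (the class name is illustrative, and this needs a cluster version that supports volumeBindingMode):
---
kind: "StorageClass"
apiVersion: "storage.k8s.io/v1"
metadata:
  name: "standard-delayed"   # illustrative name
provisioner: "kubernetes.io/aws-ebs"
parameters:
  type: "gp2"
  encrypted: "true"
volumeBindingMode: "WaitForFirstConsumer"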
It looks like PVC logic will also create the volume in the wrong availability zone if you specify a nodeSelector on the pod to attach it to.
There must be a better way to select the availability zone than querying the AWS API for the KubernetesCluster tag, so that Kubernetes pod placement logic is actually considered in the process.
@jimmycuadra, you correctly found getAllZones, however you missed the part where it filters out all instances that are not tagged with the “KubernetesCluster” tag with a specific value; it’s well hidden 😃.
So, tag all the AWS instances that are part of your cluster with “KubernetesCluster=jimmy” (incl. masters!) and restart Kubernetes. It should create volumes only in zones where there is an instance with the tag. You can run multiple clusters under one AWS project, as long as they have different values of the KubernetesCluster tag.
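For example, something like this applies the tag (the instance ID and tag value below are only placeholders):
$ aws ec2 create-tags --resources i-0123456789abcdef0 --tags Key=KubernetesCluster,Value=jimmy   # placeholder instance ID and cluster value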
@justinsb, btw, is it documented anywhere?
+1, happened to me too, trying to set up mongodb from a helm chart. I have a few test nodes of K8s, all in the same zone, but the volumes it provisioned are in another zone, so creation of pods got stuck on:
fit failure summary on nodes : NoVolumeZoneConflict (2), PodToleratesNodeTaints (1)
So the only way to overcome this for now is to have minions in all zones, so one of them can accept your dynamic volume with the pod? It’s still a problem on AWS… if I store a few TB of data on a volume, and trust it to be migrated during failover to another node in the cluster (where the failed pod will be re-created), I will be surprised to see it stuck, because K8s will try to launch the pod on any other node, with no regard to its AZ.
But this issue is related more to the cloud provider than K8s itself… maybe on GCE it will not happen.
Any updates on this issue?
I’m dealing with a similar situation… I’ve got 3 nodes, spread across 2 zones. I’ve then got a StatefulSet which deploys 3 pods that have an antiAffinity with each other (so 1 per node). The StatefulSet also provisions one persistent volume per pod. So far every attempt to deploy has resulted in the following scenario (or its reverse):
Zone A: Node1, Node2, Pod1, Pod2, Volume1
Zone B: Node3, Pod3, Volume2, Volume3
The ratio of Volumes to Pods has been wrong every time, which results in a failure (NoVolumeZoneConflict).
Aside from making all the nodes and volumes spin up in just one zone (which defeats the purpose), I’ve yet to think of or find a solution.
@msau42 In my experience, it isn’t simply that volumes are provisioned in availability zones where nodes do not exist, but also that pods are scheduled independently from PVs (on statefulset creation), then PVs are created without regard for scheduled pod locations. Later, if a pod is reaped, there’s no guarantee that it will be rescheduled in an AZ that matches the existing PV.
My best-effort workaround has been to create custom storage classes to pin the statefulset to a zone, which negates the benefits of a cluster that spans multiple AZs.
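For illustration, a sketch of one such zone-pinned class (the kubernetes.io/aws-ebs provisioner accepts a zone parameter; the class name and zone value below are just examples):
---
kind: "StorageClass"
apiVersion: "storage.k8s.io/v1beta1"
metadata:
  name: "standard-us-east-1e"   # example name
provisioner: "kubernetes.io/aws-ebs"
parameters:
  type: "gp2"
  encrypted: "true"
  zone: "us-east-1e"   # pins provisioned volumes to this zone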
Would it be possible for the cloud provider to just use the Kubernetes labels on nodes to determine a zone, rather than using AWS API calls to try to determine which nodes should be used? It would need to look for any schedulable nodes (i.e. not --register-schedulable=false) and look at the failure-domain.beta.kubernetes.io/region and failure-domain.beta.kubernetes.io/zone labels.
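Those labels can already be listed per node with kubectl’s label-columns flag, for example:
$ kubectl get nodes -L failure-domain.beta.kubernetes.io/region,failure-domain.beta.kubernetes.io/zone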
The only workarounds I can think of are:
Multi-zonal dynamic provisioning + pod constraints on zones do not work at all right now.
You don’t need to explicitly constrain your pods to east-1a. The pods will be constrained automatically by the zone where the PV is. If you use the same spec for multiple deployments, then yes, you will need separate ones with different storage classes per zone.
For spreading your replicas across multiple zones, you should use the zones parameter in the StorageClass. That will restrict the zone spreading to only those zones you specify, instead of the default of all zones in the region.
Just like @andrewmyhre, I am provisioning jenkins, which provisions two PVCs that are used in a single pod. I have two nodes, both in different AZs. The PVs are being created randomly in different AZs, and since they are never created in the same AZ, my pod fails to start.
I’m afraid I can’t confirm that adding the KubernetesCluster tag with a unique value per cluster results in the behavior I’d expect. I’ve tagged our clusters accordingly, restarted the Kubernetes components (apiservers, controller managers, and schedulers), and created a new stateful set with the same configuration from the issue description, but Kubernetes still creates the PV in us-east-1a despite all nodes being in us-east-1e.
@msau42 Does the new feature handle the case where a node with an existing pod/volume goes down and the pod gets moved to a node in a different AZ? Specifically, does Kubernetes handle copying the existing volume to a new volume in the new AZ? If not, when the pod gets redeployed in a new AZ, it would no longer have access to any data from the old volume.
I reviewed both the blog post and the official documentation, but I didn’t see anything that addressed this specific case. Thanks!
@msau42 this one looks resolved
@StephanX agree, but because the solutions for the two are completely different, I want to split them out into separate issues and track them separately.
Hm, I don’t think it is. I’ll see where the best place to add it may be. Maybe in the multizone page.
@msau42 is that documented somewhere?
The current workaround for the multiple PVC zone spreading issue is to use a Statefulset, which has special zone spreading logic.
FYI, the issue of integrating pod scheduling with PV binding and dynamic provisioning is being tracked in #43504. This will hopefully solve the multi-PVC-in-a-pod issue.
However, here, there still is an issue that the AWS cloud provider library is not returning the correct zones for the cluster.
kaws is our own installation system we’ve been using from the start. I don’t think this bug has anything to do with the cluster creation tool. The correct cloud provider flags are passed to each Kubernetes component and other AWS-specific cloud provider functionality works.