kubernetes: Dynamic volume provisioning creates EBS volume in the wrong availability zone
What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): dynamic volume provisioning
Is this a BUG REPORT or FEATURE REQUEST? (choose one): bug report
Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.1", GitCommit:"82450d03cb057bab0950214ef122b67c83fb11df", GitTreeState:"clean", BuildDate:"2016-12-22T13:59:22Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.1", GitCommit:"82450d03cb057bab0950214ef122b67c83fb11df", GitTreeState:"clean", BuildDate:"2016-12-14T00:52:01Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
Environment:
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release): CoreOS 1185.5.0
- Kernel (e.g.
uname -a): Linux ip-10-0-1-121.ec2.internal 4.7.3-coreos-r3 #1 SMP Wed Dec 7 09:29:55 UTC 2016 x86_64 Intel® Xeon® CPU E5-2676 v3 @ 2.40GHz GenuineIntel GNU/Linux
- Install tools: kaws
- Others:
What happened:
Created a stateful set with a persistent volume claim. Dynamic volume provisioning created an EBS volume in the us-east-1a availability zone, despite all the masters and nodes in the cluster being in us-east-1e. I tried it twice with the same results both times.
The PVC:
$ kubectl describe pvc -n errbit
Name: mongodb-mongodb-0
Namespace: errbit
StorageClass: standard
Status: Bound
Volume: pvc-5fc8e90b-c8aa-11e6-9924-069508572ed2
Labels: app=errbit
component=mongodb
Capacity: 1Gi
Access Modes: RWO
No events.
The PV created by dynamic provisioning:
$ kubectl describe pv
Name: pvc-5fc8e90b-c8aa-11e6-9924-069508572ed2
Labels: failure-domain.beta.kubernetes.io/region=us-east-1
failure-domain.beta.kubernetes.io/zone=us-east-1a
StorageClass: standard
Status: Bound
Claim: errbit/mongodb-mongodb-0
Reclaim Policy: Delete
Access Modes: RWO
Capacity: 1Gi
Message:
Source:
Type: AWSElasticBlockStore (a Persistent Disk resource in AWS)
VolumeID: aws://us-east-1a/vol-0d32c25dc73f029af
FSType: ext4
Partition: 0
ReadOnly: false
No events.
The stateful set:
$ kubectl describe statefulset mongodb -n errbit
Name: mongodb
Namespace: errbit
Image(s): mongo:3.4.0
Selector: app=errbit,component=mongodb
Labels: app=errbit,component=mongodb
Replicas: 1 current / 1 desired
Annotations: kubectl.kubernetes.io/last-applied-configuration={"kind":"StatefulSet","apiVersion":"apps/v1beta1","metadata":{"name":"mongodb","namespace":"errbit","creationTimestamp":null,"labels":{"app":"errbit","component":"mongodb"}},"spec":{"replicas":1,"template":{"metadata":{"creationTimestamp":null,"labels":{"app":"errbit","component":"mongodb"}},"spec":{"containers":[{"name":"mongodb","image":"mongo:3.4.0","args":["--auth"],"ports":[{"name":"mongodb","containerPort":27017}],"resources":{},"volumeMounts":[{"name":"mongodb","mountPath":"/data/db"}]}]}},"volumeClaimTemplates":[{"metadata":{"name":"mongodb","creationTimestamp":null},"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"1Gi"}}},"status":{}}],"serviceName":"mongodb"},"status":{"replicas":0}}
CreationTimestamp: Thu, 22 Dec 2016 16:53:24 -0800
Pods Status: 0 Running / 1 Waiting / 0 Succeeded / 0 Failed
No volumes.
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
12m 12m 1 {statefulset } Normal SuccessfulCreate pet: mongodb-0
11m 11m 1 {statefulset } Normal SuccessfulCreate pvc: mongodb-mongodb-0
The stateful set’s pod, pending due to the volume being in the wrong zone:
$ kubectl describe pod mongodb-0 -n errbit
Name: mongodb-0
Namespace: errbit
Node: /
Labels: app=errbit
component=mongodb
Status: Pending
IP:
Controllers: StatefulSet/mongodb
Containers:
mongodb:
Image: mongo:3.4.0
Port: 27017/TCP
Args:
--auth
Volume Mounts:
/data/db from mongodb (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-mdqdk (ro)
Environment Variables: <none>
Conditions:
Type Status
PodScheduled False
Volumes:
mongodb:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: mongodb-mongodb-0
ReadOnly: false
default-token-mdqdk:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-mdqdk
QoS Class: BestEffort
Tolerations: <none>
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
15m 15m 3 {default-scheduler } Warning FailedScheduling [SchedulerPredicates failed due to PersistentVolume 'pvc-a5f5e714-c8a8-11e6-9924-069508572ed2' not found, which is unexpected., SchedulerPredicates failed due to PersistentVolume 'pvc-a5f5e714-c8a8-11e6-9924-069508572ed2' not found, which is unexpected.]
14m 14m 1 {default-scheduler } Warning FailedScheduling [SchedulerPredicates failed due to PersistentVolumeClaim is not bound: "mongodb-mongodb-0", which is unexpected., SchedulerPredicates failed due to PersistentVolumeClaim is not bound: "mongodb-mongodb-0", which is unexpected.]
14m 2s 52 {default-scheduler } Warning FailedScheduling pod (mongodb-0) failed to fit in any node
fit failure summary on nodes : NoVolumeZoneConflict (2)
The nodes, all in us-east-1e:
$ kubectl describe nodes | grep zone
failure-domain.beta.kubernetes.io/zone=us-east-1e
failure-domain.beta.kubernetes.io/zone=us-east-1e
failure-domain.beta.kubernetes.io/zone=us-east-1e
failure-domain.beta.kubernetes.io/zone=us-east-1e
What you expected to happen:
Dynamic volume provisioning should have created the required volume in the us-east-1e availability zone.
How to reproduce it (as minimally and precisely as possible):
Add the following storage class to the cluster:
---
kind: "StorageClass"
apiVersion: "storage.k8s.io/v1beta1"
metadata:
  name: "standard"
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: "kubernetes.io/aws-ebs"
parameters:
  type: "gp2"
  encrypted: "true"
Create the following stateful set and service:
---
kind: "Namespace"
apiVersion: "v1"
metadata:
  name: "errbit"
---
kind: "Service"
apiVersion: "v1"
metadata:
  name: "mongodb"
  namespace: "errbit"
  labels:
    app: "errbit"
    component: "mongodb"
spec:
  ports:
  - name: "mongodb"
    port: 27017
  clusterIP: "None"
  selector:
    app: "errbit"
    component: "mongodb"
---
kind: "StatefulSet"
apiVersion: "apps/v1beta1"
metadata:
  name: "mongodb"
  namespace: "errbit"
  labels:
    app: "errbit"
    component: "mongodb"
spec:
  serviceName: "mongodb"
  replicas: 1
  template:
    metadata:
      labels:
        app: "errbit"
        component: "mongodb"
    spec:
      containers:
      - name: "mongodb"
        image: "mongo:3.4.0"
        args:
        - "--auth"
        ports:
        - containerPort: 27017
          name: "mongodb"
        volumeMounts:
        - name: "mongodb"
          mountPath: "/data/db"
  volumeClaimTemplates:
  - metadata:
      name: "mongodb"
    spec:
      accessModes:
      - "ReadWriteOnce"
      resources:
        requests:
          storage: "1Gi"
To add a twist to this… I’m provisioning the Gitlab helm chart (from the Gitlab repository) which provisions 3 PVCs which are used by a single pod. My nodes are in AZs us-east-1a and us-east-1c. The dynamically created PVs are each created in a random AZ (I assume either us-east-1a or us-east-1c) so the pod can only be created if all the PVs by chance happen to be created in the same AZ. Most of the time they’re created in different AZs so there’s no one AZ the pod can be created in which satisfies the NoVolumeZoneConflict predicate. Seems to me the scheduler should keep all volumes for a single deployment within the same AZ.
I think I see where the issue is. The docs for EBS provisioning say:
However, I have not found any logic that chooses the zone that way.
aws.Cloud.CreateDisk calls its own aws.Cloud.getAllZones method to populate the list of zones to choose from when creating a disk when the storage class/PVC doesn’t request a specific zone. But getAllZones gets the zones of all EC2 instances, not filtered to the Kubernetes cluster by any means at all. This list of zones is passed to volume.util.ChooseZoneForVolume to pick a zone from the collection, but that function only attempts to distribute PVs across the provided zones. As such, if you have any EC2 instances running in a zone other than where your Kubernetes nodes are running, Kubernetes may pick the wrong zone.
Any updates from the storage and/or AWS teams on this? We’re currently unable to use dynamic volume provisioning because of this problem.
Feature issue is here: https://github.com/kubernetes/enhancements/issues/490
1.12 blog post with examples is here: https://kubernetes.io/blog/2018/10/11/topology-aware-volume-provisioning-in-kubernetes/
Official documentation is here: https://kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode
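For anyone landing here, a minimal sketch of a StorageClass using the delayed binding mode those links describe, so the volume is only provisioned once a pod is scheduled and therefore lands in a zone that has a node (the class name is illustrative, and this needs a cluster version that supports volumeBindingMode):
---
kind: "StorageClass"
apiVersion: "storage.k8s.io/v1"
metadata:
  name: "standard-delayed"   # illustrative name
provisioner: "kubernetes.io/aws-ebs"
parameters:
  type: "gp2"
  encrypted: "true"
volumeBindingMode: "WaitForFirstConsumer"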
It looks like PVC logic will also create the volume in the wrong availability zone if you specify a nodeSelector on the pod to attach it to.
There must be a better way to select the availability zone than querying the AWS API for the KubernetesCluster tag, so that Kubernetes pod placement logic is actually considered in the process.
@jimmycuadra, you correctly found getAllZones, however you missed the part where it filters out all instances that are not tagged with the “KubernetesCluster” tag with a specific value; it’s well hidden 😃.
So, tag all the AWS instances that are part of your cluster with “KubernetesCluster=jimmy” (incl. masters!) and restart Kubernetes. It should create volumes only in zones where there is an instance with the tag. You can run multiple clusters under one AWS project, as long as they have different values of the KubernetesCluster tag.
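For example, something like this applies the tag (the instance ID and tag value below are only placeholders):
$ aws ec2 create-tags --resources i-0123456789abcdef0 --tags Key=KubernetesCluster,Value=jimmy   # placeholder instance ID and cluster value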
@justinsb, btw, is it documented anywhere?
+1, happened to me too, trying to set up mongodb from a helm chart. I have a few test nodes of K8s, all in the same zone, but the volumes it provisioned are in another zone, so creation of pods got stuck on:
fit failure summary on nodes : NoVolumeZoneConflict (2), PodToleratesNodeTaints (1)
So the only way to overcome this for now is to have minions in all zones, so one of them can accept your dynamic volume with the pod? It’s still a problem on AWS… if I store a few TB of data on a volume, and trust it to be migrated during failover to another node in the cluster (where the failed pod will be re-created), I will be surprised to see it stuck, because K8s will try to launch the pod on any other node, with no regard to its AZ.
But this issue is related more to the cloud provider than K8s itself… maybe on GCE it will not happen.
Any updates on this issue?
I’m dealing with a similar situation… I’ve got 3 nodes, spread across 2 zones. I’ve then got a StatefulSet which deploys 3 pods that have an antiAffinity with each other (so 1 per node). The StatefulSet also provisions one persistent volume per pod. So far every attempt to deploy has resulted in the following scenario (or its reverse):
Zone A: Node1, Node2, Pod1, Pod2, Volume1
Zone B: Node3, Pod3, Volume2, Volume3
The ratio of Volumes to Pods has been wrong every time, which results in a failure (NoVolumeZoneConflict).
Aside from making all the nodes and volumes spin up in just one zone (which defeats the purpose), I’ve yet to think of or find a solution.
@msau42 In my experience, it isn’t simply that volumes are provisioned in availability zones where nodes do not exist, but also that pods are scheduled independently from PVs (on statefulset creation), then PVs are created without regard for scheduled pod locations. Later, if a pod is reaped, there’s no guarantee that it will be rescheduled in an AZ that matches the existing PV.
My best-effort workaround has been to create custom storage classes to pin the statefulset to a zone, which negates the benefits of a cluster that spans multiple AZs.
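For illustration, a sketch of one such zone-pinned class (the kubernetes.io/aws-ebs provisioner accepts a zone parameter; the class name and zone value below are just examples):
---
kind: "StorageClass"
apiVersion: "storage.k8s.io/v1beta1"
metadata:
  name: "standard-us-east-1e"   # example name
provisioner: "kubernetes.io/aws-ebs"
parameters:
  type: "gp2"
  encrypted: "true"
  zone: "us-east-1e"   # pins provisioned volumes to this zone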
Would it be possible for the cloud provider to just use the Kubernetes labels on nodes to determine a zone, rather than using AWS API calls to try to determine which nodes should be used? It would need to look for any schedulable nodes (i.e. not --register-schedulable=false) and look at the failure-domain.beta.kubernetes.io/region and failure-domain.beta.kubernetes.io/zone labels.
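Those labels can already be listed per node with kubectl’s label-columns flag, for example:
$ kubectl get nodes -L failure-domain.beta.kubernetes.io/region,failure-domain.beta.kubernetes.io/zone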
The only workarounds I can think of are:
Multi-zonal dynamic provisioning + pod constraints on zones do not work at all right now.
You don’t need to explicitly constrain your pods to east-1a. The pods will be constrained automatically by the zone where the PV is. If you use the same spec for multiple deployments, then yes, you will need separate ones with different storage classes per zone.
For spreading your replicas across multiple zones, you should use the zones parameter in the StorageClass. That will restrict the zone spreading to only those zones you specify, instead of the default of all zones in the region.
Just like @andrewmyhre, I am provisioning jenkins, which provisions two PVCs that are used in a single pod. I have two nodes, both in different AZs. The PVs are being created randomly in different AZs, and since they are never created in the same AZ, my pod fails to start.
I’m afraid I can’t confirm that adding the KubernetesCluster tag with a unique value per cluster results in the behavior I’d expect. I’ve tagged our clusters accordingly, restarted the Kubernetes components (apiservers, controller managers, and schedulers), and created a new stateful set with the same configuration from the issue description, but Kubernetes still creates the PV in us-east-1a despite all nodes being in us-east-1e.
@msau42 Does the new feature handle the case where a node with an existing pod/volume goes down and the pod gets moved to a node in a different AZ? Specifically, does Kubernetes handle copying the existing volume to a new volume in the new AZ? If not, when the pod gets redeployed in a new AZ, it would no longer have access to any data from the old volume.
I reviewed both the blog post and the official documentation, but I didn’t see anything that addressed this specific case. Thanks!
@msau42 this one looks resolved
@StephanX agree, but because the solutions for the two are completely different, I want to split them out into separate issues and track them separately.
Hm, I don’t think it is. I’ll see where the best place to add it may be. Maybe in the multizone page.
@msau42 is that documented somewhere?
The current workaround for the multiple PVC zone spreading issue is to use a Statefulset, which has special zone spreading logic.
FYI, the issue of integrating pod scheduling with PV binding and dynamic provisioning is being tracked in #43504. This will hopefully solve the multi-PVC-in-a-pod issue.
However, here, there still is an issue that the AWS cloud provider library is not returning the correct zones for the cluster.
kaws is our own installation system we’ve been using from the start. I don’t think this bug has anything to do with the cluster creation tool. The correct cloud provider flags are passed to each Kubernetes component and other AWS-specific cloud provider functionality works.