kubernetes: Timeout expired waiting for volumes to attach
/kind bug
/sig azure
/sig storage
What happened:
I’m triggering Docker slaves from my Jenkins master. Jenkins connects to the Kubernetes service and starts Kubernetes pods. These pods require disks for persistent data (compile results, actually). I attach an Azure premium managed disk to these slaves (pods). Attaching the disk takes at least 10 minutes. I’m using D4s_v3 (4 vCPU, 16 GB RAM) virtual machines.
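For reference, a minimal sketch of the claim backing these agents, reconstructed from the PVC description further below (name, namespace, storage class, size, and access mode are taken from that output; the rest is standard boilerplate):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: yocto-premium-output         # claim referenced by the agent pods
  namespace: jenkins
spec:
  storageClassName: managed-premium  # AKS class backed by Azure premium managed disks
  accessModes:
    - ReadWriteOnce                  # an Azure disk can only be attached to one node at a time
  resources:
    requests:
      storage: 100Gi
```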
Here are my kubelet logs:
Aug 06 12:09:35 aks-agentpool-12873730-1 kubelet[1756]: I0806 12:09:35.906299 1756 kubelet_node_status.go:491] Using Node Hostname from cloudprovider: "aks-agentpool-12873730-1"
Aug 06 12:09:29 aks-agentpool-12873730-1 kubelet[1756]: E0806 12:09:29.965335 1756 file.go:76] Unable to read manifest path "/etc/kubernetes/manifests": path does not exist, ignoring
Aug 06 12:09:26 aks-agentpool-12873730-1 kubelet[1756]: I0806 12:09:26.613745 1756 server.go:796] GET /stats/summary: (7.883039ms) 200 [[Ruby] 10.244.5.60:44376]
Aug 06 12:09:25 aks-agentpool-12873730-1 kubelet[1756]: I0806 12:09:25.781823 1756 kubelet_node_status.go:491] Using Node Hostname from cloudprovider: "aks-agentpool-12873730-1"
Aug 06 12:09:21 aks-agentpool-12873730-1 kubelet[1756]: E0806 12:09:21.092330 1756 pod_workers.go:186] Error syncing pod f13eb1ad-9970-11e8-8f58-0a58ac1f107d ("yocto-agent-10ssr-3x38j_jen
kins(f13eb1ad-9970-11e8-8f58-0a58ac1f107d)"), skipping: timeout expired waiting for volumes to attach or mount for pod "jenkins"/"yocto-agent-10ssr-3x38j". list of unmounted volumes=[output]
. list of unattached volumes=[output workspace-volume default-token-vx9mp]
Aug 06 12:09:21 aks-agentpool-12873730-1 kubelet[1756]: E0806 12:09:21.092261 1756 kubelet.go:1640] Unable to mount volumes for pod "yocto-agent-10ssr-3x38j_jenkins(f13eb1ad-9970-11e8-8f5
8-0a58ac1f107d)": timeout expired waiting for volumes to attach or mount for pod "jenkins"/"yocto-agent-10ssr-3x38j". list of unmounted volumes=[output]. list of unattached volumes=[output w
orkspace-volume default-token-vx9mp]; skipping pod
Aug 06 12:09:15 aks-agentpool-12873730-1 kubelet[1756]: I0806 12:09:15.747747 1756 kubelet_node_status.go:491] Using Node Hostname from cloudprovider: "aks-agentpool-12873730-1"
Aug 06 12:09:09 aks-agentpool-12873730-1 kubelet[1756]: E0806 12:09:09.965090 1756 file.go:76] Unable to read manifest path "/etc/kubernetes/manifests": path does not exist, ignoring
Aug 06 12:09:05 aks-agentpool-12873730-1 kubelet[1756]: I0806 12:09:05.710028 1756 kubelet_node_status.go:491] Using Node Hostname from cloudprovider: "aks-agentpool-12873730-1"
Pod
Name: yocto-agent-10ssr-sjcxn
Namespace: jenkins
Node: aks-agentpool-12873730-1/10.240.0.5
Start Time: Mon, 06 Aug 2018 14:58:09 +0300
Labels: jenkins=slave
jenkins/yocto-agent=true
Annotations: <none>
Status: Terminating (lasts 14m)
Termination Grace Period: 30s
IP:
Containers:
slave:
Container ID:
Image: *********/builder-yocto-slave:3
Image ID:
Port: <none>
Host Port: <none>
Command:
cat
State: Terminated
Exit Code: 0
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/home/jenkins from workspace-volume (rw)
/output from output (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-vx9mp (ro)
jnlp:
Container ID:
Image: jenkins/jnlp-slave:alpine
Image ID:
Port: <none>
Host Port: <none>
State: Terminated
Exit Code: 0
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Ready: False
Restart Count: 0
Environment:
JENKINS_SECRET: **************
JENKINS_NAME: yocto-agent-10ssr-sjcxn
JENKINS_URL: ***************
HOME: /home/jenkins
Mounts:
/home/jenkins from workspace-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-vx9mp (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
output:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: yocto-premium-output
ReadOnly: false
workspace-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
default-token-vx9mp:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-vx9mp
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 22m default-scheduler Successfully assigned yocto-agent-10ssr-sjcxn to aks-agentpool-12873730-1
Normal SuccessfulMountVolume 22m kubelet, aks-agentpool-12873730-1 MountVolume.SetUp succeeded for volume "workspace-volume"
Normal SuccessfulMountVolume 22m kubelet, aks-agentpool-12873730-1 MountVolume.SetUp succeeded for volume "default-token-vx9mp"
Normal SuccessfulAttachVolume 21m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-f2ad9da3-90a6-11e8-a8f5-0a58ac1f149a"
Warning FailedMount 15m (x3 over 20m) kubelet, aks-agentpool-12873730-1 Unable to mount volumes for pod "yocto-agent-10ssr-sjcxn_jenkins(fce1a7aa-996f-11e8-8f58-0a58ac1f107d)": timeout expired waiting for volumes to attach or mount for pod "jenkins"/"yocto-agent-10ssr-sjcxn". list of unmounted volumes=[output]. list of unattached volumes=[output workspace-volume default-token-vx9mp]
PV / PVC Log
Name: yocto-premium-output
Namespace: jenkins
StorageClass: managed-premium
Status: Bound
Volume: pvc-f2ad9da3-90a6-11e8-a8f5-0a58ac1f149a
Labels: <none>
Annotations: pv.kubernetes.io/bind-completed=yes
pv.kubernetes.io/bound-by-controller=yes
volume.beta.kubernetes.io/storage-provisioner=kubernetes.io/azure-disk
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 100Gi
Access Modes: RWO
Events: <none>
Environment:
- Kubernetes version (use `kubectl version`):
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.6", GitCommit:"a21fdbd78dde8f5447f5f6c331f7eb6f80bd684e", GitTreeState:"clean", BuildDate:"2018-07-26T10:04:08Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration:
Azure Kubernetes Service
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 2
- Comments: 47 (17 by maintainers)
We are experiencing a similar issue. We are running Jenkins on GKE 1.16 and using the official Helm chart to deploy it. Very often, after a restart, the master pod fails to go into the `Running` state because of a failed mount error. The problem usually resolves itself after a long time, usually (a lot more than) 10 minutes. I am posting in this thread although, if I am not mistaken, the discussion here seems to relate more to the slave pods?
@v-braun for the chown slowness issue you could use one of the following workarounds: either run the pod as root, which makes any file accessible to it, or use `supplementalGroups` (do not set `fsGroup` in your Pod spec and instead set `supplementalGroups` to match the gid of the volume); a sketch of the second option is shown below.
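A minimal sketch of the `supplementalGroups` variant, using the claim and container details from this issue; the gid value 1000 is an assumption and has to match the owner group of the files already on the disk:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: yocto-agent-example        # illustrative name
  namespace: jenkins
spec:
  securityContext:
    # No fsGroup: fsGroup makes the kubelet recursively chown/chmod the whole volume on mount.
    # supplementalGroups only adds the gid to the container processes, so nothing is rewritten.
    supplementalGroups: [1000]     # assumed gid of the files on the disk
  containers:
    - name: slave
      image: builder-yocto-slave:3 # registry prefix redacted in the original issue
      command: ["cat"]
      volumeMounts:
        - name: output
          mountPath: /output
  volumes:
    - name: output
      persistentVolumeClaim:
        claimName: yocto-premium-output
```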
OK… I did some digging around to get a better understanding of the problem and figured I would post it here for others that run into similar issues with long volume mount times. Apologies in advance, I know little to no GoLang.
If I’m reading the Kubernetes code correctly, the part of the Azure Managed Disk code causing the delay appears to be a call to `SetVolumeOwnership()` (not sure if `azuredd` is the right place for Azure Managed Disk). `SetVolumeOwnership()` traverses the entire file system when `fsGroup` is set. This can be avoided if `fsGroupChangePolicy` is set to `OnRootMismatch` in the pod security context, assuming the volume’s root permissions already match. So the more files on the filesystem, the longer it is going to take. In our case we have over 2,000,000 files on the volume on a light day, which is why it would take anywhere from 25 to 40 minutes for the pod to gain access to the volume.

To fix this, the choices are:
- using `supplementalGroups` instead of `fsGroup` in the pod security context;
- running the pod as root instead of using `fsGroup`, which is not recommended for security reasons;
- keeping `fsGroup` and setting `fsGroupChangePolicy` to `OnRootMismatch` (sketched below); I have not tested this out, but if I’m reading the code correctly it should work as well.

Some quick looking around in `kubernetes/kubernetes/pkg/volume` suggests that `SetVolumeOwnership()` is used in all the volume types except for: azure_file, cephfs, csimigration, glusterfs, hostpath, nfs, quobyte. That list should not be considered definitive though.
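A minimal sketch of the `fsGroupChangePolicy: OnRootMismatch` option described above (the `fsGroup` value 1000 is an assumption; as the next comment notes, the field requires Kubernetes 1.18+):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: yocto-agent-example        # illustrative name
  namespace: jenkins
spec:
  securityContext:
    fsGroup: 1000                  # assumed group that should own the volume's files
    # Only chown/chmod recursively when the volume's root directory does not
    # already have the expected ownership and permissions; otherwise skip the walk.
    fsGroupChangePolicy: OnRootMismatch
  containers:
    - name: slave
      image: builder-yocto-slave:3 # registry prefix redacted in the original issue
      command: ["cat"]
      volumeMounts:
        - name: output
          mountPath: /output
  volumes:
    - name: output
      persistentVolumeClaim:
        claimName: yocto-premium-output
```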
@pedrohdz: You are right. They introduced `fsGroupChangePolicy` in Kubernetes 1.18 as an alpha feature, and with 1.20 it became beta.
Fixed by updating `securityContext` on Alibaba Cloud, Kubernetes version 1.9.3.

@hmenn It’s not necessary, since it’s a common issue when mounting an ext4 file system, not related to k8s. I have recorded your issue in my doc. Would you close this issue? Thanks.
Seems like the issue was introduced into Google Cloud Platform on 2020-04-20 (🌈 420). Source.
It’s difficult to establish causality here, but I started facing this issue after updating from `1.17.17-gke.3700` to `1.18.16-gke.502`, and it got fixed when updating from `1.18.16-gke.502` to `1.19.8-gke.1600`.

It looks like your two pods yocto-agent-z0xws-cjl99 and yocto-agent-z0xws-lhn85 are using the same PVC disk (yocto-premium-output); that is not allowed, since an Azure disk is RWO.
Hi @hmenn, has it recovered now? It looks like the mount timeout is due to a long disk attachment time (the expected time is less than 1 min). And what’s your k8s region, then?