kubernetes: Timeout expired waiting for volumes to attach

/kind bug
/sig azure
/sig storage

What happened:

I’m triggering Docker slaves from my Jenkins master. Jenkins connects to the Kubernetes service and starts Kubernetes pods. These pods require some disks for persistent data (compile results, actually). I’m attaching Azure Premium managed disks to these slaves (pods). Attaching the disks takes at least 10 minutes. I’m using D4s_v3 (4 vCPU, 16 GB RAM) virtual machines.
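
For reference, here is a minimal sketch of the kind of PVC this setup uses; the name, size, access mode and storage class are taken from the describe output further below, but the manifest itself is a reconstruction, not the exact one in use:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: yocto-premium-output
  namespace: jenkins
spec:
  accessModes:
    - ReadWriteOnce            # Azure managed disks only support RWO
  storageClassName: managed-premium
  resources:
    requests:
      storage: 100Gi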

Here are my kubelet logs:

Aug 06 12:09:35 aks-agentpool-12873730-1 kubelet[1756]: I0806 12:09:35.906299    1756 kubelet_node_status.go:491] Using Node Hostname from cloudprovider: "aks-agentpool-12873730-1"

Aug 06 12:09:29 aks-agentpool-12873730-1 kubelet[1756]: E0806 12:09:29.965335    1756 file.go:76] Unable to read manifest path "/etc/kubernetes/manifests": path does not exist, ignoring
Aug 06 12:09:26 aks-agentpool-12873730-1 kubelet[1756]: I0806 12:09:26.613745    1756 server.go:796] GET /stats/summary: (7.883039ms) 200 [[Ruby] 10.244.5.60:44376]
Aug 06 12:09:25 aks-agentpool-12873730-1 kubelet[1756]: I0806 12:09:25.781823    1756 kubelet_node_status.go:491] Using Node Hostname from cloudprovider: "aks-agentpool-12873730-1"
Aug 06 12:09:21 aks-agentpool-12873730-1 kubelet[1756]: E0806 12:09:21.092330    1756 pod_workers.go:186] Error syncing pod f13eb1ad-9970-11e8-8f58-0a58ac1f107d ("yocto-agent-10ssr-3x38j_jenkins(f13eb1ad-9970-11e8-8f58-0a58ac1f107d)"), skipping: timeout expired waiting for volumes to attach or mount for pod "jenkins"/"yocto-agent-10ssr-3x38j". list of unmounted volumes=[output]. list of unattached volumes=[output workspace-volume default-token-vx9mp]
Aug 06 12:09:21 aks-agentpool-12873730-1 kubelet[1756]: E0806 12:09:21.092261    1756 kubelet.go:1640] Unable to mount volumes for pod "yocto-agent-10ssr-3x38j_jenkins(f13eb1ad-9970-11e8-8f58-0a58ac1f107d)": timeout expired waiting for volumes to attach or mount for pod "jenkins"/"yocto-agent-10ssr-3x38j". list of unmounted volumes=[output]. list of unattached volumes=[output workspace-volume default-token-vx9mp]; skipping pod
Aug 06 12:09:15 aks-agentpool-12873730-1 kubelet[1756]: I0806 12:09:15.747747    1756 kubelet_node_status.go:491] Using Node Hostname from cloudprovider: "aks-agentpool-12873730-1"
Aug 06 12:09:09 aks-agentpool-12873730-1 kubelet[1756]: E0806 12:09:09.965090    1756 file.go:76] Unable to read manifest path "/etc/kubernetes/manifests": path does not exist, ignoring
Aug 06 12:09:05 aks-agentpool-12873730-1 kubelet[1756]: I0806 12:09:05.710028    1756 kubelet_node_status.go:491] Using Node Hostname from cloudprovider: "aks-agentpool-12873730-1"

Pod

Name:                      yocto-agent-10ssr-sjcxn
Namespace:                 jenkins
Node:                      aks-agentpool-12873730-1/10.240.0.5
Start Time:                Mon, 06 Aug 2018 14:58:09 +0300
Labels:                    jenkins=slave
                           jenkins/yocto-agent=true
Annotations:               <none>
Status:                    Terminating (lasts 14m)
Termination Grace Period:  30s
IP:
Containers:
  slave:
    Container ID:
    Image:         *********/builder-yocto-slave:3
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      cat
    State:          Terminated
      Exit Code:    0
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /home/jenkins from workspace-volume (rw)
      /output from output (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-vx9mp (ro)
  jnlp:
    Container ID:
    Image:          jenkins/jnlp-slave:alpine
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Exit Code:    0
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Environment:
      JENKINS_SECRET:  **************
      JENKINS_NAME:    yocto-agent-10ssr-sjcxn
      JENKINS_URL:     ***************
      HOME:            /home/jenkins
    Mounts:
      /home/jenkins from workspace-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-vx9mp (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  output:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  yocto-premium-output
    ReadOnly:   false
  workspace-volume:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  default-token-vx9mp:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-vx9mp
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                From                               Message
  ----     ------                  ----               ----                               -------
  Normal   Scheduled               22m                default-scheduler                  Successfully assigned yocto-agent-10ssr-sjcxn to aks-agentpool-12873730-1
  Normal   SuccessfulMountVolume   22m                kubelet, aks-agentpool-12873730-1  MountVolume.SetUp succeeded for volume "workspace-volume"
  Normal   SuccessfulMountVolume   22m                kubelet, aks-agentpool-12873730-1  MountVolume.SetUp succeeded for volume "default-token-vx9mp"
  Normal   SuccessfulAttachVolume  21m                attachdetach-controller            AttachVolume.Attach succeeded for volume "pvc-f2ad9da3-90a6-11e8-a8f5-0a58ac1f149a"
  Warning  FailedMount             15m (x3 over 20m)  kubelet, aks-agentpool-12873730-1  Unable to mount volumes for pod "yocto-agent-10ssr-sjcxn_jenkins(fce1a7aa-996f-11e8-8f58-0a58ac1f107d)": timeout expired waiting for volumes to attach or mount for pod "jenkins"/"yocto-agent-10ssr-sjcxn". list of unmounted volumes=[output]. list of unattached volumes=[output workspace-volume default-token-vx9mp]

PVC description

Name:          yocto-premium-output
Namespace:     jenkins
StorageClass:  managed-premium
Status:        Bound
Volume:        pvc-f2ad9da3-90a6-11e8-a8f5-0a58ac1f149a
Labels:        <none>
Annotations:   pv.kubernetes.io/bind-completed=yes
               pv.kubernetes.io/bound-by-controller=yes
               volume.beta.kubernetes.io/storage-provisioner=kubernetes.io/azure-disk
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      100Gi
Access Modes:  RWO
Events:        <none>
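
For context, this is roughly how the pod references that claim; it is a sketch reconstructed from the mounts in the describe output above (the registry prefix of the image is redacted in the original):

apiVersion: v1
kind: Pod
metadata:
  name: yocto-agent-10ssr-sjcxn
  namespace: jenkins
spec:
  containers:
  - name: slave
    image: builder-yocto-slave:3        # registry prefix redacted in the original
    command: ["cat"]
    volumeMounts:
    - name: output                      # the PVC-backed mount that hits the timeout
      mountPath: /output
    - name: workspace-volume
      mountPath: /home/jenkins
  volumes:
  - name: output
    persistentVolumeClaim:
      claimName: yocto-premium-output
  - name: workspace-volume
    emptyDir: {}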

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.6", GitCommit:"a21fdbd78dde8f5447f5f6c331f7eb6f80bd684e", GitTreeState:"clean", BuildDate:"2018-07-26T10:04:08Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: Azure Kubernetes Service

Most upvoted comments

We are experiencing a similar issue.

We are running jenkins on GKE 1.16.

We are using the official helm chart to deploy it.

Very often, after a restart, the master pod fails to go into the Running state because of a failed mount error:

  Normal   Scheduled    11m                  default-scheduler                                Successfully assigned jenkins/jenkins-cd-cb8b56fb7-ssp2j to gke-environment-nodepool1-3d1a1f31-vw87
  Warning  FailedMount  20s (x5 over 9m23s)  kubelet, gke-environment-nodepool1-3d1a1f31-vw87  Unable to mount volumes for pod "jenkins-cd-cb8b56fb7-ssp2j_jenkins(412467g16-2bd5-48f5-b536-5de124c65dc80)": timeout expired waiting for volumes to attach or mount for pod "jenkins"/"jenkins-cd-cb8b56fb7-ssp2j". list of unmounted volumes=[jenkins-home]. list of unattached volumes=[plugins tmp jenkins-config secrets-dir jenkins-home jenkins-cd-token-83l7h]

The problem usually gets resolved by itself after a long time, often a lot more than 10 minutes.

I am posting in this thread although, if I am not mistaken, the discussion here relates more to the slave pods?

Lost so many hours on that! Thanks! Just a question: is it possible to increase the mount timeout? My pod restarts after 2m and needs a lot of restarts until it stays alive. I think increasing the timeout would help, because the chown Kubernetes runs could then finish and would not have to be re-executed multiple times.

@v-braun for the chown slowness issue, you could use the following workaround. This setting runs the pod as root, making every file accessible to it:

apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo
spec:
  securityContext:
    runAsUser: 0
    fsGroup: 0
  • another workaround is using supplementalGroups (do not set fsGroup in your Pod spec; instead set supplementalGroups to match the gid of the volume):
apiVersion: v1
kind: Pod
...
spec:
  containers:
  - name: ...
    volumeMounts:
    - name: xxx 
      mountPath: /usr/share/... 
  securityContext: 
    supplementalGroups: [xxx] 
  volumes:
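
For completeness, here is a fuller, hypothetical version of that second workaround with the volume section filled in against the PVC from this issue; the gid 1000 is only a placeholder and must match the group that owns the files on the disk:

apiVersion: v1
kind: Pod
metadata:
  name: yocto-agent
  namespace: jenkins
spec:
  securityContext:
    supplementalGroups: [1000]          # placeholder; use the gid of the files on the volume
  containers:
  - name: slave
    image: builder-yocto-slave:3        # registry prefix redacted in the original issue
    command: ["cat"]
    volumeMounts:
    - name: output
      mountPath: /output
  volumes:
  - name: output
    persistentVolumeClaim:
      claimName: yocto-premium-output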

OK… I did some digging around to get a better understanding of the problem and figured I would post it here for others that run into similar issues with long volume mount times. Apologies in advance, I know little to no GoLang.

If I’m reading the Kubernetes code correctly, the Azure Managed Disk code causing the delay appears to be a call to SetVolumeOwnership(). Not sure if azuredd is the right place for Azure Managed Disk.

SetVolumeOwnership() traverses the entire file system when fsGroup is set. This can be avoided if fsGroupChangePolicy is set to OnRootMismatch in the pod security context, assuming the volume’s root permissions already match.

So the more files on the filesystem, the longer it is going to take. In our case we have over 2,000,000 files on the volume on a light day, which is why it would take anywhere from 25 to 40 minutes for the pod to gain access to the volume.
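
As a rough back-of-the-envelope check (my own estimate, not a measurement): if every per-file chown/chmod against a network-attached disk costs on the order of 1 ms, then 2,000,000 files work out to roughly 2,000,000 × 1 ms ≈ 2,000 s, or about 33 minutes, which is right in the middle of the 25 to 40 minute window we see.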

To fix this, the choices are:

  • use supplementalGroups instead of fsGroup in the pod security context.
  • run the pod as root instead of setting fsGroup, which is not recommended for security reasons.
  • if you have to use fsGroup, try setting fsGroupChangePolicy to OnRootMismatch; I have not tested this out, but if I’m reading the code correctly this should work as well (see the sketch at the end of this comment).

Some quick looking around in kubernetes/kubernetes/pkg/volume suggests that SetVolumeOwnership() is used in all the volume types except for: azure_file, cephfs, csimigration, glusterfs, hostpath, nfs, quobyte. That list should not be considered definitive, though.
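
A minimal sketch of the fsGroupChangePolicy option mentioned above, assuming Kubernetes 1.18+ where the field exists (untested here, as noted); the gid and claim name are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-policy-demo
spec:
  securityContext:
    fsGroup: 1000                          # placeholder gid
    fsGroupChangePolicy: "OnRootMismatch"  # skip the recursive chown when the volume root already matches
  containers:
  - name: main
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-claim                  # placeholder claim name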


@pedrohdz: You are right. They introduced fsGroupChangePolicy in Kubernetes 1.18 as an alpha feature, and with 1.20 it became beta.

Fixed by updating securityContext on Alibaba cloud, Kubernetes version 1.9.3:

securityContext:
    ...
    privileged: true
    runAsUser: 0

@hmenn It’s not necessary since it’s a common issue when mounting an ext4 file system, not related to k8s. I have recorded your issue in my doc. Would you close this issue? Thanks.

Seems like the issue was introduced into Google Cloud Platform on 2020-04-20 (🌈 420). Source.

It’s difficult to establish causality here, but I started facing this issue after updating from 1.17.17-gke.3700 to 1.18.16-gke.502, and it got fixed when updating from 1.18.16-gke.502 to 1.19.8-gke.1600.

It looks like your two pods yocto-agent-z0xws-cjl99 and yocto-agent-z0xws-lhn85 are using the same PVC disk: yocto-premium-output. That is not allowed; an Azure disk is RWO.

Hi @hmenn, has it recovered now? It looks like the mount timeout is due to a long disk attachment time (the expected time is less than 1 min). And what’s your k8s region?