aws-ebs-csi-driver: Windows CSI Node DaemonSet is not Running/ErrImagePull

/kind bug

What happened? The Windows CSI Node DaemonSet is not running because the image it references has no manifest for windows/amd64.

It looks to me like it was not tested on EKS Windows nodes.

  Warning  Failed     15s                kubelet            Failed to pull image "k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.1.3": rpc error: code = Unknown desc = no matching manifest for windows/amd64 10.0.17763 in the manifest list entries

What you expected to happen? A running Windows CSI Node DaemonSet.

How to reproduce it (as minimally and precisely as possible)? Enable Windows support in the Helm chart and spin up a Windows node.

https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/charts/aws-ebs-csi-driver/values.yaml#L133

Anything else we need to know?:

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  18s                default-scheduler  Successfully assigned kube-system/ebs-csi-node-windows-pt422 to ip-10-46-67-23.eu-central-1.compute.internal
  Normal   Pulling    16s                kubelet            Pulling image "k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.1.3"
  Warning  Failed     15s                kubelet            Failed to pull image "k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.1.3": rpc error: code = Unknown desc = no matching manifest for windows/amd64 10.0.17763 in the manifest list entries
  Normal   Pulled     15s                kubelet            Container image "k8s.gcr.io/sig-storage/csi-node-driver-registrar:v2.1.0" already present on machine
  Normal   Created    15s                kubelet            Created container node-driver-registrar
  Normal   Started    15s                kubelet            Started container node-driver-registrar
  Normal   Pulled     15s                kubelet            Container image "k8s.gcr.io/sig-storage/livenessprobe:v2.2.0" already present on machine
  Normal   Created    14s                kubelet            Created container liveness-probe
  Normal   Started    14s                kubelet            Started container liveness-probe
  Warning  Failed     11s (x2 over 12s)  kubelet            Error: ImagePullBackOff

Environment

  • Kubernetes version (use kubectl version): 1.20
  • Driver version: v1.1.3
  • Helm Chart v2.0.2

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 36 (16 by maintainers)

Most upvoted comments

@wongma7 Hi, you released a new container image that supports both Windows and Linux (#957), but does it really make sense to put Linux and Windows layers in a single multi-OS container image? We currently run more than 600 Linux EKS nodes (arm64/amd64) and 30 Windows nodes. Supporting multiple architectures in one container image makes absolute sense, but I think supporting multiple OSes in one image is not a good idea; the overhead from the different container architectures is too large. All Linux nodes would need to load the whole container image, including the Windows layer. That is a huge overhead per node. As a short example, 600 Linux nodes x 2.5 GB for the Windows layer = 1.5 TB of overhead for the Windows container layer on Linux nodes. It also takes more bandwidth, storage, and time to load and start a new container image during an update. I think it would be useful to split the images and create separate Linux and Windows container images. That would also make it easier to support or integrate new features without affecting the other OS. What do you think?
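For anyone checking the numbers, here is a minimal sketch of the worst-case arithmetic in the comment above. It assumes that every Linux node really pulls the Windows layer, that "2.5G" means 2.5 GB, and that 1 TB = 1000 GB; the function name is just illustrative.

```python
def windows_layer_overhead_tb(linux_nodes, windows_layer_gb):
    """Worst-case extra storage, in TB, if every Linux node stored the Windows layer."""
    overhead_gb = linux_nodes * windows_layer_gb
    return overhead_gb / 1000  # decimal units: 1 TB = 1000 GB


# Figures taken from the comment: 600 Linux nodes, 2.5 GB Windows layer.
print(windows_layer_overhead_tb(600, 2.5))
```

The result only applies if the container runtime pulls layers for a foreign OS in the first place.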

OK, I thought there were 2 different errors, so I would expect the logs to be different, but we can focus on the original one because that is from the latest version of the driver.

error 1 from driver v1.3.0 “file does not exist”

  Warning  FailedMount             112s (x13 over 12m)  kubelet                  MountVolume.SetUp failed for volume "pvc-462e8ac4-5ce3-4fde-9398-d0b799d1b128" : rpc error: code = Internal desc = Could not mount "\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\pv\\pvc-462e8ac4-5ce3-4fde-9398-d0b799d1b128\\globalmount" at "c:\\var\\lib\\kubelet\\pods\\04a8aea3-fa7a-4165-8260-374a0269e2d6\\volumes\\kubernetes.io~csi\\pvc-462e8ac4-5ce3-4fde-9398-d0b799d1b128\\mount": file does not exist

For this error, I think it is fixed by https://github.com/kubernetes-sigs/aws-ebs-csi-driver/pull/1081. If you are brave, you may try the unstable/development version of the driver containing this fix at gcr.io/k8s-staging-provider-aws/aws-ebs-csi-driver:v20211007-helm-chart-aws-ebs-csi-driver-2.3.0-12-g4d5d7e7f; otherwise I plan to release the fix in v1.3.2 and will update this issue when it releases.

error 2 from driver 8c6c7e0a590da44c635d629a0653dc73aaa5c9e4 “volume id empty”

 Warning  FailedMount  4m20s (x100 over 3h25m)  kubelet  MountVolume.MountDevice failed for volume "pvc-8b2a752d-4938-411f-9586-d629de7642ac" : rpc error: code = Internal desc = could not format "7" and mount it at "\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\pv\\pvc-8b2a752d-4938-411f-9586-d629de7642ac\\globalmount": rpc error: code = Unknown desc = volume id empty

For this error, I may need the CSI node logs to debug further: volume 8 looks fine to me in the above output, so I am not sure why the driver was not able to find it.

BTW, let’s try to figure out why the build isn’t working in https://github.com/kubernetes-csi/csi-proxy/issues. Since the binary isn’t being distributed yet (https://github.com/kubernetes-csi/csi-proxy/issues/83), the build needs to work for everyone; go build with GOARCH and GOOS set should work even if you are on an ARM Mac.

@wongma7 I updated my release to Helm chart 2.3.0, and I can see the Windows image was successfully pulled on the node:

REPOSITORY                                  TAG     IMAGE ID      CREATED     SIZE
k8s.gcr.io/provider-aws/aws-ebs-csi-driver  v1.3.0  93f2632e63a1  6 days ago  5.75GB

But the Pod is in the Error/CrashLoopBackOff state. The ebs-plugin container fails:

I0923 17:21:23.071346   15268 metadata.go:101] retrieving instance data from ec2 metadata
I0923 17:21:23.317911   15268 metadata.go:108] ec2 metadata is available
panic: open \\.\\pipe\\csi-proxy-filesystem-v1: The system cannot find the file specified.

goroutine 1 [running]:
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.newNodeService(0xc000076a00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/node.go:93 +0x2b9
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.NewDriver(0xc00059ff40, 0x7, 0x7, 0x188603b0108, 0x100000000000011, 0xc000078000)
	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/driver.go:94 +0x445
main.main()
	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/cmd/main.go:46 +0x257

@dschunack The container runtime (Docker) should only pull the image that corresponds to the OS your container is running on. Even though ECR or Docker Hub says the image is 2+ GB, your Linux nodes won’t bother to pull the 2 GB Windows image, so there’s no need to worry.

Basically now the tag is referring to a manifest list or “fat manifest”, which links to multiple images. Docker knows how to pull the image for your OS and ignore the other ones. https://docs.docker.com/registry/spec/manifest-v2-2/

For reference https://github.com/moby/moby/blob/7b9275c0da707b030e62c96b679a976f31f929d3/distribution/pull_v2_windows.go#L68
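To illustrate the manifest-list behaviour described above, here is a minimal sketch of how a client might pick one entry from a "fat manifest" by platform. The field names (`manifests`, `platform`, `architecture`, `os`, `os.version`) follow the manifest v2.2 spec linked above; the digests, the Windows revision number, and the `select_manifest` helper are made-up placeholders, and the Windows version matching is a simplification of what a real runtime does.

```python
# Abbreviated manifest list shaped like a Docker v2.2 "fat manifest".
# Digests and the Windows revision are placeholders, not the real driver image entries.
MANIFEST_LIST = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
    "manifests": [
        {"digest": "sha256:placeholder-linux-amd64",
         "platform": {"architecture": "amd64", "os": "linux"}},
        {"digest": "sha256:placeholder-linux-arm64",
         "platform": {"architecture": "arm64", "os": "linux"}},
        {"digest": "sha256:placeholder-windows-amd64",
         "platform": {"architecture": "amd64", "os": "windows",
                      "os.version": "10.0.17763.2183"}},
    ],
}


def windows_build(version):
    """Return the major.minor.build prefix of a Windows version string."""
    return ".".join(version.split(".")[:3])


def select_manifest(manifest_list, os_name, arch, os_version=None):
    """Return the first entry matching the node's platform, or None.

    Assumption: a Windows entry matches when its major.minor.build prefix
    equals the node's; real runtimes apply more detailed rules.
    """
    for entry in manifest_list["manifests"]:
        platform = entry["platform"]
        if platform["os"] != os_name or platform["architecture"] != arch:
            continue
        entry_version = platform.get("os.version")
        if os_version and entry_version:
            if windows_build(entry_version) != windows_build(os_version):
                continue
        return entry
    return None


# A list without any Windows entry, like a Linux-only release.
linux_only = {"manifests": MANIFEST_LIST["manifests"][:2]}

print(select_manifest(MANIFEST_LIST, "linux", "amd64")["digest"])
print(select_manifest(MANIFEST_LIST, "windows", "amd64", "10.0.17763")["digest"])
print(select_manifest(linux_only, "windows", "amd64", "10.0.17763"))
```

In this sketch a Linux node only ever resolves the Linux entry, so the Windows layers are never fetched. The last call returns nothing because the list has no Windows entry, which corresponds to the "no matching manifest for windows/amd64 10.0.17763" error in the original report.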

Sorry, we checked in the node DaemonSet without actually releasing the image for it. I added a disclaimer here that it’s in a pre-release state https://github.com/kubernetes-sigs/aws-ebs-csi-driver/tree/master/examples/kubernetes/windows#windows but ideally the chart shouldn’t offer the option to install something that doesn’t even exist yet.

https://github.com/kubernetes-sigs/aws-ebs-csi-driver/pull/957 should fix it; after that I will release a 1.2.1 manifest list that contains the Windows image.

/assign