aws-ebs-csi-driver: Windows CSI Node DaemonSet is not Running/ErrImagePull
/kind bug
What happened? The Windows CSI Node DaemonSet is not running because the referenced image has no manifest for windows/amd64.
It looks to me like it was not tested on EKS Windows nodes.
Warning Failed 15s kubelet Failed to pull image "k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.1.3": rpc error: code = Unknown desc = no matching manifest for windows/amd64 10.0.17763 in the manifest list entries
What you expected to happen? Running Windows CSI Node DaemonSet
How to reproduce it (as minimally and precisely as possible)? Enable Windows support in the Helm chart and spin up a Windows node.
Anything else we need to know?:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 18s default-scheduler Successfully assigned kube-system/ebs-csi-node-windows-pt422 to ip-10-46-67-23.eu-central-1.compute.internal
Normal Pulling 16s kubelet Pulling image "k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.1.3"
Warning Failed 15s kubelet Failed to pull image "k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.1.3": rpc error: code = Unknown desc = no matching manifest for windows/amd64 10.0.17763 in the manifest list entries
Normal Pulled 15s kubelet Container image "k8s.gcr.io/sig-storage/csi-node-driver-registrar:v2.1.0" already present on machine
Normal Created 15s kubelet Created container node-driver-registrar
Normal Started 15s kubelet Started container node-driver-registrar
Normal Pulled 15s kubelet Container image "k8s.gcr.io/sig-storage/livenessprobe:v2.2.0" already present on machine
Normal Created 14s kubelet Created container liveness-probe
Normal Started 14s kubelet Started container liveness-probe
Warning Failed 11s (x2 over 12s) kubelet Error: ImagePullBackOff
Environment
- Kubernetes version (use kubectl version): 1.20
- Driver version: v1.1.3
- Helm Chart v2.0.2
About this issue
- State: closed
- Created 3 years ago
- Reactions: 2
- Comments: 36 (16 by maintainers)
@wongma7 Hi, you released a new container image that supports both Windows and Linux (#957), but does it really make sense to put a Linux and a Windows layer in a single multi-OS container image? We currently run more than 600 Linux EKS nodes (arm64/amd64) and 30 Windows nodes. Supporting multiple architectures in one image makes absolute sense, but I don’t think supporting multiple operating systems in one image is a good idea; the overhead from the different container architectures is too large. All Linux nodes would need to load the whole container image, including the Windows layer. That is a huge overhead per node. As a quick example: 600 Linux nodes x 2.5 GB for the Windows layer = 1.5 TB of overhead for the Windows container layer on Linux nodes. It also takes more bandwidth, storage, and time to load and start a new container image during an update. I think it would be useful to split the images and create one Linux and one Windows container image. That would also make it easier to support and integrate new features without affecting the other OS. What do you think?
OK, I thought there were 2 different errors, so I would expect the logs to be different, but we can focus on the original one because that is from the latest version of the driver.
error 1 from driver v1.3.0 “file does not exist”
for this error, I think it is fixed by https://github.com/kubernetes-sigs/aws-ebs-csi-driver/pull/1081. If you are brave you may try the unstable/development version of the driver containing this fix at gcr.io/k8s-staging-provider-aws/aws-ebs-csi-driver:v20211007-helm-chart-aws-ebs-csi-driver-2.3.0-12-g4d5d7e7f, otherwise I plan to release the fix in v1.3.2 and will update this issue when it releases.
error 2 from driver 8c6c7e0a590da44c635d629a0653dc73aaa5c9e4 “volume id empty”
for this error, I may need CSI node logs to debug further, as the volume 8 looks fine to me from the above output so I am not sure how come the driver was not able to find it.
BTW, let’s try to figure out why the build isn’t working in https://github.com/kubernetes-csi/csi-proxy/issues. Since the binary isn’t being distributed yet (https://github.com/kubernetes-csi/csi-proxy/issues/83), the build needs to work for everyone (go build with GOARCH and GOOS should work even if you are on an ARM Mac).
@wongma7 I updated my release to Helm Chart 2.3.0 and I can see that the Windows image was successfully pulled on the node.
But the Pod is in Error/CrashLoopBackOff state. The ebs-plugin container fails.
@dschunack the container runtime (docker) should only pull the image that corresponds to the OS your container is running on. even though on ECR or Docker Hub it says the image is 2+GB, your Linux Nodes won’t bother to pull the 2 GB windows images, so there’s no need to worry.
Basically now the tag is referring to a manifest list or “fat manifest”, which links to multiple images. Docker knows how to pull the image for your OS and ignore the other ones. https://docs.docker.com/registry/spec/manifest-v2-2/
For reference https://github.com/moby/moby/blob/7b9275c0da707b030e62c96b679a976f31f929d3/distribution/pull_v2_windows.go#L68
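To illustrate the manifest list behaviour described above, here is a minimal sketch of how a client might pick one entry for the node's platform. It is not Docker's actual implementation: the `select_manifest` helper, the placeholder digests, and the `os.version` value are all hypothetical, and the Windows version check is simplified to a prefix match.

```python
from typing import Optional

# Hypothetical, trimmed-down manifest list. The digests and the Windows
# "os.version" value are placeholders, not taken from the real image.
MANIFEST_LIST = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
    "manifests": [
        {"digest": "sha256:linux-amd64-placeholder",
         "platform": {"architecture": "amd64", "os": "linux"}},
        {"digest": "sha256:linux-arm64-placeholder",
         "platform": {"architecture": "arm64", "os": "linux"}},
        {"digest": "sha256:windows-1809-placeholder",
         "platform": {"architecture": "amd64", "os": "windows",
                      "os.version": "10.0.17763.2237"}},
    ],
}


def select_manifest(manifest_list: dict, os_name: str, arch: str,
                    os_version: Optional[str] = None) -> Optional[str]:
    """Return the digest of the first entry matching the node platform, or None."""
    for entry in manifest_list["manifests"]:
        platform = entry["platform"]
        if platform["os"] != os_name or platform["architecture"] != arch:
            continue
        # Simplification (assumption): a Windows entry counts as compatible
        # when its os.version starts with the node's version string.
        if os_name == "windows" and os_version is not None:
            if not platform.get("os.version", "").startswith(os_version):
                continue
        return entry["digest"]
    # No entry for this platform: the situation behind
    # "no matching manifest for windows/amd64 ... in the manifest list entries".
    return None


if __name__ == "__main__":
    # A Linux node only resolves (and pulls) the Linux entry.
    print(select_manifest(MANIFEST_LIST, "linux", "arm64"))
    # A Windows node resolves the Windows entry.
    print(select_manifest(MANIFEST_LIST, "windows", "amd64", "10.0.17763"))
```

In this sketch a Linux node never touches the Windows entry, and a list without a Windows entry yields `None`, which corresponds to the pull error in the original report.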
Sorry, we checked in the node DaemonSet without actually releasing the image for it. I added a disclaimer that it’s in a pre-release state here: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/tree/master/examples/kubernetes/windows#windows but ideally the chart shouldn’t offer the option to install something that doesn’t even exist yet.
https://github.com/kubernetes-sigs/aws-ebs-csi-driver/pull/957 should fix it. After that I will make a 1.2.1 release with a manifest list that contains the Windows image.
/assign