cluster-api-provider-azure: windows cluster with cloud-provider: external does not work
/kind bug
What steps did you take and what happened:
I created a CAPZ cluster with Windows machine pools and `cloud-provider: external`. I deployed the cloud provider using the official Helm chart.
The DaemonSet for cloud-node-manager-windows is not able to reach the metadata endpoint, and therefore Windows nodes always carry the taint `node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule`.
Cluster resources YAML
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
labels:
cni: calico
ccm: "external"
name: mycluster
namespace: mynamespace
spec:
clusterNetwork:
pods:
cidrBlocks:
- 192.168.0.0/16
controlPlaneRef:
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
name: ctrl-plane
infrastructureRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureCluster
name: mycluster
---
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
name: ctrl-plane
namespace: mynamespace
spec:
kubeadmConfigSpec:
clusterConfiguration:
apiServer:
extraArgs:
cloud-config: /etc/kubernetes/azure.json
cloud-provider: external
extraVolumes:
- hostPath: /etc/kubernetes/azure.json
mountPath: /etc/kubernetes/azure.json
name: cloud-config
readOnly: true
timeoutForControlPlane: 20m0s
controllerManager:
extraArgs:
allocate-node-cidrs: "false"
cloud-config: /etc/kubernetes/azure.json
cloud-provider: external
cluster-name: mycluster
extraVolumes:
- hostPath: /etc/kubernetes/azure.json
mountPath: /etc/kubernetes/azure.json
name: cloud-config
readOnly: true
dns: {}
etcd:
local:
dataDir: /var/lib/etcddisk/etcd
extraArgs:
quota-backend-bytes: "8589934592"
networking: {}
scheduler: {}
diskSetup:
filesystems:
- device: /dev/disk/azure/scsi1/lun0
extraOpts:
- -E
- lazy_itable_init=1,lazy_journal_init=1
filesystem: ext4
label: etcd_disk
- device: ephemeral0.1
filesystem: ext4
label: ephemeral0
replaceFS: ntfs
partitions:
- device: /dev/disk/azure/scsi1/lun0
layout: true
overwrite: false
tableType: gpt
files:
- contentFrom:
secret:
key: control-plane-azure.json
name: ctrl-plane-azure-json
owner: root:root
path: /etc/kubernetes/azure.json
permissions: "0644"
format: cloud-config
initConfiguration:
localAPIEndpoint: {}
nodeRegistration:
kubeletExtraArgs:
azure-container-registry-config: /etc/kubernetes/azure.json
cloud-config: /etc/kubernetes/azure.json
cloud-provider: external
name: '{{ ds.meta_data["local_hostname"] }}'
joinConfiguration:
discovery: {}
nodeRegistration:
kubeletExtraArgs:
azure-container-registry-config: /etc/kubernetes/azure.json
cloud-config: /etc/kubernetes/azure.json
cloud-provider: external
name: '{{ ds.meta_data["local_hostname"] }}'
mounts:
- - LABEL=etcd_disk
- /var/lib/etcddisk
machineTemplate:
infrastructureRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureMachineTemplate
name: ctrl-plane
metadata: {}
replicas: 1
rolloutStrategy:
rollingUpdate:
maxSurge: 1
type: RollingUpdate
version: v1.24.3
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureCluster
metadata:
name: mycluster
namespace: mynamespace
spec:
azureEnvironment: AzurePublicCloud
bastionSpec: {}
identityRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureClusterIdentity
name: mycluster-identity
location: westeurope
networkSpec:
subnets:
- cidrBlocks:
- 10.0.0.0/16
name: control-plane-subnet
role: control-plane
- cidrBlocks:
- 10.1.0.0/16
name: node-subnet
role: node
vnet:
cidrBlocks:
- 10.0.0.0/8
name: mycluster-vnet
resourceGroup: mycluster
subscriptionID: ${SUBSCRIPTION_ID}
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureMachineTemplate
metadata:
name: ctrl-plane
namespace: mynamespace
spec:
template:
metadata: {}
spec:
dataDisks:
- cachingType: ReadWrite
diskSizeGB: 256
lun: 0
nameSuffix: etcddisk
identity: None
image:
sharedGallery:
gallery: ${GALLERY}
name: ${NAME}
offer: ${OFFER}
resourceGroup: ${RESOURCE_GROUP}
subscriptionID: ${SUBSCRIPTION_ID}
sku: ${SKU}
version: ${VERSION}
osDisk:
cachingType: ReadOnly
diskSizeGB: 150
osType: Linux
diffDiskSettings:
option: Local # ephemeral OS disk
sshPublicKey: ""
vmSize: Standard_D4ads_v5
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureClusterIdentity
metadata:
name: mycluster-identity
namespace: mynamespace
spec:
allowedNamespaces:
list:
- mynamespace
clientID: ${CLIENT_ID}
clientSecret:
name: mycluster-identity-secret
namespace: mynamespace
tenantID: ${TENANT_ID}
type: ManualServicePrincipal
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
name: worker
namespace: mynamespace
spec:
clusterName: mycluster
minReadySeconds: 0
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 1
selector:
matchLabels: null
strategy:
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
type: RollingUpdate
template:
spec:
bootstrap:
configRef:
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
name: worker
clusterName: mycluster
infrastructureRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureMachineTemplate
name: worker
version: v1.24.3
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureMachineTemplate
metadata:
name: worker
namespace: mynamespace
spec:
template:
metadata: {}
spec:
identity: None
image:
sharedGallery:
gallery: ${GALLERY}
name: ${NAME}
offer: ${OFFER}
resourceGroup: ${RESOURCE_GROUP}
subscriptionID: ${SUBSCRIPTION_ID}
sku: ${SKU}
version: ${VERSION}
osDisk:
cachingType: ReadOnly
diskSizeGB: 150
osType: Linux
diffDiskSettings:
option: Local # ephemeral OS disk
sshPublicKey: ""
vmSize: Standard_D8ads_v5
---
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
name: worker
namespace: mynamespace
spec:
template:
spec:
files:
- contentFrom:
secret:
key: worker-node-azure.json
name: worker-azure-json
owner: root:root
path: /etc/kubernetes/azure.json
permissions: "0644"
format: cloud-config
joinConfiguration:
discovery: {}
nodeRegistration:
kubeletExtraArgs:
azure-container-registry-config: /etc/kubernetes/azure.json
cloud-config: /etc/kubernetes/azure.json
cloud-provider: external
node-labels: "node.helio.exchange/role=worker"
name: '{{ ds.meta_data["local_hostname"] }}'
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
name: win
namespace: mynamespace
annotations:
cluster.x-k8s.io/replicas-managed-by-autoscaler: "true"
spec:
clusterName: mycluster
minReadySeconds: 0
replicas: 1
template:
spec:
bootstrap:
configRef:
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfig
name: win
clusterName: mycluster
infrastructureRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureMachinePool
name: win
version: v1.24.3
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureMachinePool
metadata:
name: win
namespace: mynamespace
spec:
strategy:
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
type: RollingUpdate
location: westeurope
template:
image:
sharedGallery:
gallery: ${GALLERY}
name: ${WIN_NAME}
offer: ${WIN_OFFER}
resourceGroup: ${RESOURCE_GROUP}
subscriptionID: ${SUBSCRIPTION_ID}
sku: ${WIN_SKU}
version: ${WIN_VERSION}
osDisk:
cachingType: ReadOnly
diskSizeGB: 750
osType: Windows
diffDiskSettings:
option: Local # ephemeral OS disk
sshPublicKey: ""
vmSize: Standard_D96as_v4
spotVMOptions: {}
identity: None
additionalTags:
cluster-autoscaler-enabled: "true"
cluster-autoscaler-name: "mycluster"
min: "0"
max: "5"
---
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfig
metadata:
name: win
namespace: mynamespace
spec:
files:
- contentFrom:
secret:
key: worker-node-azure.json
name: win-azure-json
owner: root:root
path: c:/k/azure.json
permissions: "0644"
- content: |-
Add-MpPreference -ExclusionProcess C:/opt/cni/bin/calico.exe
Add-MpPreference -ExclusionProcess C:/opt/cni/bin/calico-ipam.exe
path: C:/defender-exclude-calico.ps1
permissions: "0744"
format: cloud-config
joinConfiguration:
discovery: {}
nodeRegistration:
criSocket: npipe:////./pipe/containerd-containerd
kubeletExtraArgs:
azure-container-registry-config: c:/k/azure.json
cloud-config: c:/k/azure.json
cloud-provider: external
feature-gates: WindowsHostProcessContainers=true
v: "2"
windows-priorityclass: ABOVE_NORMAL_PRIORITY_CLASS
volume-plugin-dir: "C:\\k\\volumeplugins"
name: '{{ ds.meta_data["local_hostname"] }}'
postKubeadmCommands:
- nssm set kubelet start SERVICE_AUTO_START
- powershell C:/defender-exclude-calico.ps1
Only the following Helm chart values were changed:
parameters:
- name: infra.clusterName
value: "{{name}}"
- name: cloudNodeManager.waitRoutes
value: "true"
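Assuming these parameters map 1:1 onto the chart's values keys, the equivalent values override would look like this (a sketch, not taken from the chart's documentation):

```yaml
# Hedged sketch of the equivalent Helm values override;
# key names assume the parameters above map directly to chart values.
infra:
  clusterName: mycluster
cloudNodeManager:
  waitRoutes: "true"
```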
When deploying this, a timeout happens after a while:
cloud-node-manager-windows-5ml4w cloud-node-manager panic: failed to initialize node NODENAME at cloudprovider: failed to set node provider id: Get "http://169.254.169.254/metadata/instance?api-version=2021-10-01&format=json": dial tcp 169.254.169.254:80: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
cloud-node-manager-windows-5ml4w cloud-node-manager
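For reference, IMDS reachability can be checked from the Windows host itself with a minimal hostProcess probe pod (a sketch — the image tag and namespace are assumptions; note that Azure IMDS requires the `Metadata: true` request header):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: imds-probe
  namespace: kube-system
spec:
  nodeSelector:
    kubernetes.io/os: windows
  hostNetwork: true
  securityContext:
    windowsOptions:
      hostProcess: true
      runAsUserName: "NT AUTHORITY\\SYSTEM"
  restartPolicy: Never
  containers:
  - name: probe
    # assumption: tag must match the node's Windows OS build
    image: mcr.microsoft.com/windows/servercore:ltsc2022
    command:
    - powershell.exe
    - -Command
    - Invoke-RestMethod -Headers @{Metadata='true'} -Uri 'http://169.254.169.254/metadata/instance?api-version=2021-10-01&format=json'
```

If this probe also times out, the problem is host-level connectivity to IMDS rather than the pod network.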
I tried switching cloud-node-manager-windows to hostProcess/hostNetwork, but I wasn't sure of the proper way to propagate a correct kubeconfig.
Hostprocess cloud-node-manager-windows
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: cloud-node-manager-windows
namespace: kube-system
labels:
component: cloud-node-manager
kubernetes.io/cluster-service: "true"
spec:
selector:
matchLabels:
k8s-app: cloud-node-manager-windows
template:
metadata:
labels:
k8s-app: cloud-node-manager-windows
annotations:
cluster-autoscaler.kubernetes.io/daemonset-pod: "true"
spec:
priorityClassName: system-node-critical
serviceAccountName: cloud-node-manager
securityContext:
windowsOptions:
hostProcess: true
runAsUserName: "NT AUTHORITY\\SYSTEM"
hostNetwork: true
nodeSelector:
kubernetes.io/os: windows
tolerations:
- key: CriticalAddonsOnly
operator: Exists
- operator: "Exists"
effect: NoExecute
- operator: "Exists"
effect: NoSchedule
containers:
- name: cloud-node-manager
image: {{ template "image.cloudNodeManager" . | required "you must use a supported version of Kubernetes or provide cloudNodeManager.imageRepository, cloudNodeManager.imageName, and cloudNodeManager.imageTag values" }}
imagePullPolicy: {{ .Values.cloudNodeManager.imagePullPolicy }}
command: ["cloud-node-manager.exe"]
workingDir: "$env:CONTAINER_SANDBOX_MOUNT_POINT/"
args:
- --node-name=$(NODE_NAME)
{{- if hasKey .Values.cloudNodeManager "cloudConfig" }}
- "--cloud-config={{ .Values.cloudNodeManager.cloudConfig }}"
{{- end }}
{{- if hasKey .Values.cloudNodeManager "kubeAPIBurst" }}
- "--kube-api-burst={{ .Values.cloudNodeManager.kubeAPIBurst }}"
{{- end }}
{{- if hasKey .Values.cloudNodeManager "kubeAPIContentType" }}
- "--kube-api-content-type={{ .Values.cloudNodeManager.kubeAPIContentType }}"
{{- end }}
{{- if hasKey .Values.cloudNodeManager "kubeAPIQPS" }}
- "--kube-api-qps={{ .Values.cloudNodeManager.kubeAPIQPS }}"
{{- end }}
{{- if hasKey .Values.cloudNodeManager "kubeconfig" }}
- "--kubeconfig={{ .Values.cloudNodeManager.kubeconfig }}"
{{- end }}
{{- if hasKey .Values.cloudNodeManager "master" }}
- "--master={{ .Values.cloudNodeManager.master }}"
{{- end }}
{{- if hasKey .Values.cloudNodeManager "minResyncPeriod" }}
- "--min-resync-period={{ .Values.cloudNodeManager.minResyncPeriod }}"
{{- end }}
{{- if hasKey .Values.cloudNodeManager "nodeStatusUpdateFrequency" }}
- "--node-status-update-frequency={{ .Values.cloudNodeManager.nodeStatusUpdateFrequency }}"
{{- end }}
{{- if hasKey .Values.cloudNodeManager "useInstanceMetadata" }}
- "--use-instance-metadata={{ .Values.cloudNodeManager.useInstanceMetadata }}"
{{- end }}
{{- if hasKey .Values.cloudNodeManager "waitRoutes" }}
- "--wait-routes={{ .Values.cloudNodeManager.waitRoutes }}"
{{- end }}
env:
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
resources:
requests:
cpu: {{ .Values.cloudNodeManager.containerResourceManagement.requestsCPUWin }}
memory: {{ .Values.cloudNodeManager.containerResourceManagement.requestsMemWin }}
limits:
cpu: {{ .Values.cloudNodeManager.containerResourceManagement.limitsCPUWin }}
memory: {{ .Values.cloudNodeManager.containerResourceManagement.limitsMemWin }}
When using this, the following is logged:
cloud-node-manager-windows-8scmm cloud-node-manager W0823 11:48:33.679405 5288 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
cloud-node-manager-windows-8scmm cloud-node-manager W0823 11:48:33.703905 5288 client_config.go:622] error creating inClusterConfig, falling back to default config: open /var/run/secrets/kubernetes.io/serviceaccount/token: The system cannot find the path specified.
cloud-node-manager-windows-8scmm cloud-node-manager invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable
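The fallback fails because, inside a hostProcess container, pod volumes are surfaced under `$env:CONTAINER_SANDBOX_MOUNT_POINT` rather than at `/var/run/secrets/...`, so the in-cluster service account token is not found at the path client-go expects. One workaround sketch (not the project's fix; the kubeconfig path is an assumption about what exists on the node) is to bypass inClusterConfig and pass an explicit kubeconfig:

```yaml
# Sketch only: point cloud-node-manager at a kubeconfig present on the
# host instead of relying on inClusterConfig. The path below is an
# assumed node-local kubeconfig location; whatever credentials it holds
# would also need RBAC equivalent to the cloud-node-manager SA.
args:
- --node-name=$(NODE_NAME)
- --wait-routes=true
- --kubeconfig=C:\k\config
```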
I believe this could be related to #2132. I needed to patch the Helm chart's RBAC as well.
What did you expect to happen: external cloud-provider-azure to work for Windows clusters too.
Environment:
- cluster-api-provider-azure version: v1.4.1
- Kubernetes version (use `kubectl version`): v1.24.3
- OS (e.g. from `/etc/os-release`): linux/windows mixed
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 32 (29 by maintainers)
@mweibel the e2e tests use the latest helm chart (the CCM and CNM versions are determined from the k8s version)
the change to hostProcess with the script was https://github.com/kubernetes-sigs/cloud-provider-azure/pull/3283