kubernetes: [1.9] kube-controller-manager fails to start with cloud-provider=aws

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug /sig cloud-provider

What happened: Adding --cloud-provider=aws to /etc/kubernetes/manifests/kube-controller-manager.yaml makes kube-controller-manager fail to start with CrashLoopBackOff.

kube-controller-manager.yaml

kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-controller-manager
    - --leader-elect=true
    - --use-service-account-credentials=true
    - --controllers=*,bootstrapsigner,tokencleaner
    - --root-ca-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
    - --address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/controller-manager.conf
    - --service-account-private-key-file=/etc/kubernetes/pki/sa.key
    - --allocate-node-cidrs=true
    - --cluster-cidr=10.244.0.0/16
    - --node-cidr-mask-size=24
    - --cloud-provider=aws     # <----- without this line, there is no issue

Pod state

NAMESPACE     NAME                                                                 READY     STATUS             RESTARTS   AGE
kube-system   kube-controller-manager-ip-172-31-4-117.us-west-1.compute.internal   0/1       CrashLoopBackOff   2          34s

kubelet journalctl output

Dec 31 07:13:58 ip-172-31-4-117.us-west-1.compute.internal kubelet[2055]: I1231 07:13:58.985855    2055 kuberuntime_manager.go:514] Container {Name:kube-controller-manager Image:gcr.io/google_containers/kube-controller-manager-amd64:v1.9.0 Command:[kube-controller-manager --leader-elect=true --use-service-account-credentials=true --controllers=*,bootstrapsigner,tokencleaner --root-ca-file=/etc/kubernetes/pki/ca.crt --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt --cluster-signing-key-file=/etc/kubernetes/pki/ca.key --address=127.0.0.1 --kubeconfig=/etc/kubernetes/controller-manager.conf --service-account-private-key-file=/etc/kubernetes/pki/sa.key --allocate-node-cidrs=true --cluster-cidr=10.244.0.0/16 --node-cidr-mask-size=24 --cloud-provider=aws] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[] Resources:{Limits:map[] Requests:map[cpu:{i:{value:200 scale:-3} d:{Dec:<nil>} s:200m Format:DecimalSI}]} VolumeMounts:[{Name:ca-certs ReadOnly:true MountPath:/etc/ssl/certs SubPath: MountPropagation:<nil>} {Name:kubeconfig ReadOnly:true MountPath:/etc/kubernetes/controller-manager.conf SubPath: MountPropagation:<nil>} {Name:flexvolume-dir ReadOnly:false MountPath:/usr/libexec/kubernetes/kubelet-plugins/volume/exec SubPath: MountPropagation:<nil>} {Name:ca-certs-etc-pki ReadOnly:true MountPath:/etc/pki SubPath: MountPropagation:<nil>} {Name:k8s-certs ReadOnly:true MountPath:/etc/kubernetes/pki SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/healthz,Port:10252,Host:127.0.0.1,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:15,TimeoutSeconds:15,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:8,} ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Dec 31 07:13:58 ip-172-31-4-117.us-west-1.compute.internal kubelet[2055]: I1231 07:13:58.985954    2055 kuberuntime_manager.go:758] checking backoff for container "kube-controller-manager" in pod "kube-controller-manager-ip-172-31-4-117.us-west-1.compute.internal_kube-system(9b9fc44cf056d9e6af795cbe507c7e4f)"
Dec 31 07:13:58 ip-172-31-4-117.us-west-1.compute.internal kubelet[2055]: I1231 07:13:58.986076    2055 kuberuntime_manager.go:768] Back-off 10s restarting failed container=kube-controller-manager pod=kube-controller-manager-ip-172-31-4-117.us-west-1.compute.internal_kube-system(9b9fc44cf056d9e6af795cbe507c7e4f)
Dec 31 07:13:58 ip-172-31-4-117.us-west-1.compute.internal kubelet[2055]: E1231 07:13:58.986105    2055 pod_workers.go:186] Error syncing pod 9b9fc44cf056d9e6af795cbe507c7e4f ("kube-controller-manager-ip-172-31-4-117.us-west-1.compute.internal_kube-system(9b9fc44cf056d9e6af795cbe507c7e4f)"), skipping: failed to "StartContainer" for "kube-controller-manager" with CrashLoopBackOff: "Back-off 10s restarting failed container=kube-controller-manager pod=kube-controller-manager-ip-172-31-4-117.us-west-1.compute.internal_kube-system(9b9fc44cf056d9e6af795cbe507c7e4f)"
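
The kubelet log above only shows the restart back-off; the reason the container exits is usually in the container's own log. A quick way to pull it (a sketch assuming the Docker runtime used here; container names and IDs will differ per node):

$ docker ps -a | grep k8s_kube-controller-manager
$ docker logs --tail 50 $(docker ps -a --filter name=k8s_kube-controller-manager -q | head -1)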

What you expected to happen:

As per the https://kubernetes.io/docs/reference/generated/kube-controller-manager documentation, --cloud-provider=aws can be specified and the controller-manager should start.

The reason for specifying it was that the controller was logging the error below.

$ journalctl -b -f CONTAINER_ID=$(docker ps | grep k8s_kube-controller-manager | awk '{ print $1 }')
Dec 31 04:54:47 ip-172-31-4-117.us-west-1.compute.internal dockerd-current[1136]: E1231 04:54:47.447916       1 reconciler.go:292] attacherDetacher.AttachVolume failed to start for volume "pv-ebs" (UniqueName: "kubernetes.io/aws-ebs/vol-0d275986ce24f4304") from node "ip-172-31-4-61.us-west-1.compute.internal" : AttachVolume.NewAttacher failed for volume "pv-ebs" (UniqueName: "kubernetes.io/aws-ebs/vol-0d275986ce24f4304") from node "ip-172-31-4-61.us-west-1.compute.internal" : Failed to get AWS Cloud Provider. GetCloudProvider returned <nil> instead

How to reproduce it (as minimally and precisely as possible): Add the following line as a command parameter in /etc/kubernetes/manifests/kube-controller-manager.yaml:

    - --cloud-provider=aws

Anything else we need to know?:

Not being able to specify --cloud-provider could potentially cause the issue "K8S Unable to mount AWS EBS as a persistent volume for pod".
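
For context, a minimal sketch of an EBS-backed PersistentVolume of the kind the attach error above refers to (the name and volume ID are taken from the log; the capacity and filesystem are illustrative):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-ebs                            # PV name seen in the reconciler error
  labels:
    type: amazonEBS
spec:
  capacity:
    storage: 5Gi                          # illustrative size
  accessModes:
  - ReadWriteOnce
  awsElasticBlockStore:
    volumeID: vol-0d275986ce24f4304       # EBS volume ID from the error above
    fsType: ext4

Attaching a volume like this is the code path that was failing above with "Failed to get AWS Cloud Provider".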

/etc/systemd/system/kubelet.service.d/10-kubeadm.conf already has "--cloud-provider=aws".

[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true"
Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
Environment="KUBELET_DNS_ARGS=--cluster-dns=10.96.0.10 --cluster-domain=cluster.local"
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
Environment="KUBELET_CADVISOR_ARGS=--cadvisor-port=0"
Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=systemd --runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice"
Environment="KUBELET_CERTIFICATE_ARGS=--rotate-certificates=true --cert-dir=/var/lib/kubelet/pki"
Environment="KUBELET_EXTRA_ARGS=--cloud-provider=aws"
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_CGROUP_ARGS $KUBELET_CERTIFICATE_ARGS $KUBELET_EXTRA_ARGS
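
After editing the drop-in, the usual follow-up is to reload systemd and restart the kubelet so the extra flag takes effect (sketch):

$ sudo systemctl daemon-reload
$ sudo systemctl restart kubelet
$ ps aux | grep kubelet | grep cloud-provider   # verify the flag is on the running command line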

Environment:

  • Kubernetes version (use kubectl version):

    $ kubectl version -o json { "clientVersion": { "major": "1", "minor": "9", "gitVersion": "v1.9.0", "gitCommit": "925c127ec6b946659ad0fd596fa959be43f0cc05", "gitTreeState": "clean", "buildDate": "2017-12-15T21:07:38Z", "goVersion": "go1.9.2", "compiler": "gc", "platform": "linux/amd64" }, "serverVersion": { "major": "1", "minor": "9", "gitVersion": "v1.9.0", "gitCommit": "925c127ec6b946659ad0fd596fa959be43f0cc05", "gitTreeState": "clean", "buildDate": "2017-12-15T20:55:30Z", "goVersion": "go1.9.2", "compiler": "gc", "platform": "linux/amd64" } }

  • Cloud provider or hardware configuration: aws

  • OS (e.g. from /etc/os-release):

    $ cat /etc/centos-release
    CentOS Linux release 7.4.1708 (Core)

  • Kernel (e.g. uname -a):

Linux ip-172-31-4-117.us-west-1.compute.internal 3.10.0-693.11.1.el7.x86_64 #1 SMP Mon Dec 4 23:52:40 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools:

kubeadm and ansible

  • Others:

Pods work fine as long as the AWS cloud provider feature is not used.

Most upvoted comments

Solution

Thanks @dims for pointing me in the right direction.

After looking into the tickets and reading the comments by @kingdonb, I found the documentation below. Is this on kubernetes.io, and if not, why not?

K8S AWS Cloud Provider Notes

Steps

Tag EC2 instances and SGs

---
#--------------------------------------------------------------------------------
# [K8S AWS Cloud Provider Notes]
# https://docs.google.com/document/d/17d4qinC_HnIwrK0GHnRlD1FKkTNdN__VO4TH9-EzbIY/edit
# Set a tag on all resources in the form of KubernetesCluster=<cluster name>
#  All instances
#  One and only one SG for each instance should be tagged.
#  - This will be modified as necessary to allow ELBs to access the instance
#--------------------------------------------------------------------------------
- name: "Get list of instances in the {{ ENV_ID }} environment..."
#  ec2_remote_facts:  # Deprecated
  ec2_instance_facts:
    filters:
      "tag:environment": "{{ ENV_ID }}"
      instance-state-name: running
    region:          "{{ ec2_region }}"
    ec2_access_key:  "{{ ec2_access_key }}"
    ec2_secret_key:  "{{ ec2_secret_key }}"
  register: ec2

- name: "Tag EC2 instances"
  ec2_tag:
    region:               "{{ ec2_region }}"
    resource:             "{{ item.instance_id }}"
    tags:
      environment:        "{{ ENV_ID }}"
      KubernetesCluster:  "{{ K8S_CLUSTER_NAME }}"
    aws_access_key:       "{{ ec2_access_key }}"
    aws_secret_key:       "{{ ec2_secret_key }}"
  with_items:     "{{ ec2.instances }}"

- name: "Get list of SGs in the {{ ENV_ID }} environment..."
  ec2_group_facts:
    filters:
      group_name:         "{{ ec2_security_group }}"
    region:               "{{ ec2_region }}"
    ec2_access_key:       "{{ ec2_access_key }}"
    ec2_secret_key:       "{{ ec2_secret_key }}"
  register: sg

- name: "Tag SG"
  ec2_tag:
    region:               "{{ ec2_region }}"
    resource:             "{{ item.group_id }}"
    tags:
      environment:        "{{ ENV_ID }}"
      KubernetesCluster:  "{{ K8S_CLUSTER_NAME }}"
    aws_access_key:       "{{ ec2_access_key }}"
    aws_secret_key:       "{{ ec2_secret_key }}"
  with_items:             "{{ sg.security_groups }}"
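
For a one-off cluster, roughly the same tagging can also be done directly with the AWS CLI instead of Ansible (a sketch; the instance/SG IDs and the cluster name are placeholders):

$ aws ec2 create-tags \
    --region us-west-1 \
    --resources i-0123456789abcdef0 sg-0123456789abcdef0 \
    --tags Key=KubernetesCluster,Value=<your-cluster-name>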

Run kubeadm init --config <kubeadm config> to pass the cloud provider.

kubeadm init --config kubeadm_conf.yaml

[kubeadm_conf.yaml.j2]
kind: MasterConfiguration
apiVersion: kubeadm.k8s.io/v1alpha1
api:
  advertiseAddress: {{ K8S_ADVERTISE_ADDRESS }}
networking:
  podSubnet:        {{ K8S_SERVICE_ADDRESSES }}
cloudProvider:      {{ K8S_CLOUD_PROVIDER }}
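
Rendered with concrete values, the template above would produce something like the following (the address and pod subnet are just the values already used in this cluster; adjust as needed). With cloudProvider set, kubeadm init should render --cloud-provider=aws into the generated kube-apiserver and kube-controller-manager static pod manifests.

[kubeadm_conf.yaml - rendered example]
kind: MasterConfiguration
apiVersion: kubeadm.k8s.io/v1alpha1
api:
  advertiseAddress: 172.31.4.117
networking:
  podSubnet:        10.244.0.0/16
cloudProvider:      aws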

Result

Jan 02 04:48:28 ip-172-31-4-117.us-west-1.compute.internal dockerd-current[8063]: I0102 04:48:28.752141       1 reconciler.go:287] attacherDetacher.AttachVolume started for volume "kuard-pv" (UniqueName: "kubernetes.io/aws-ebs/vol-0d275986ce24f4304") from node "ip-172-3
Jan 02 04:48:39 ip-172-31-4-117.us-west-1.compute.internal dockerd-current[8063]: I0102 04:48:39.309178       1 operation_generator.go:308] AttachVolume.Attach succeeded for volume "kuard-pv" (UniqueName: "kubernetes.io/aws-ebs/vol-0d275986ce24f4304") from node "ip-172-

$ kubectl describe pod kuard
...
Volumes:
  kuard-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  kuard-pvc
    ReadOnly:   false

$ kubectl describe pv kuard-pv
Name:            kuard-pv
Labels:          failure-domain.beta.kubernetes.io/region=us-west-1
                 failure-domain.beta.kubernetes.io/zone=us-west-1b
                 type=amazonEBS
Annotations:     kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"PersistentVolume","metadata":{"annotations":{},"labels":{"type":"amazonEBS"},"name":"kuard-pv","namespace":""},"spec":{"acce...
                 pv.kubernetes.io/bound-by-controller=yes
StorageClass:    
Status:          Bound
Claim:           default/kuard-pvc
Reclaim Policy:  Retain
Access Modes:    RWO
Capacity:        5Gi
Message:         
Source:
    Type:       AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   vol-0d275986ce24f4304
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
Events:         <none>

$ kubectl version -o json
{
  "clientVersion": {
    "major": "1",
    "minor": "9",
    "gitVersion": "v1.9.0",
    "gitCommit": "925c127ec6b946659ad0fd596fa959be43f0cc05",
    "gitTreeState": "clean",
    "buildDate": "2017-12-15T21:07:38Z",
    "goVersion": "go1.9.2",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "serverVersion": {
    "major": "1",
    "minor": "9",
    "gitVersion": "v1.9.0",
    "gitCommit": "925c127ec6b946659ad0fd596fa959be43f0cc05",
    "gitTreeState": "clean",
    "buildDate": "2017-12-15T20:55:30Z",
    "goVersion": "go1.9.2",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}

Hi @dims, I have read your comments. Still, the point is why the --cloud-provider option does not work even though it is clearly documented in the kube-controller-manager reference document.

--cloud-provider string                                             The provider for cloud services.  Empty string for no provider.

If this option does not work, then it needs to be clarified whether that is expected or not. If it is expected, a documentation fix should be raised; if it is not, the reason needs to be clarified.

If this issue still needs to be closed, I suppose all related issues like the ones below would need to be closed too, making it clear that the --cloud-provider option is not to be set directly but should be given only to kubeadm when kubeadm is used.

Please give a clarification, and explain the mechanism behind why giving the --cloud-provider option to kube-controller-manager does not work although it is clearly documented.

A possible suggestion would be to re-run kubeadm init after kubeadm reset. However, that is also problematic and needs manual clean-up of CNI files, etc., which would otherwise prevent kubeadm init from running.

Initially, giving --cloud-provider=aws in the systemd unit file for the kubelet stopped the error in the kubelet journalctl output, but it did not stop the errors in kube-controller-manager. If we could simply provide --cloud-provider=aws to kube-controller-manager, it would be a good way forward for all of us who run into these --cloud-provider related issues. Also, the API server can take the option with no issue. So why does kube-controller-manager fail, and what is specific about kube-controller-manager?
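
For reference, the API server flag mentioned above is the same --cloud-provider flag, set in the kube-apiserver static pod manifest; a trimmed, illustrative excerpt:

[/etc/kubernetes/manifests/kube-apiserver.yaml - excerpt]
spec:
  containers:
  - command:
    - kube-apiserver
    - --cloud-provider=aws
    # ...remaining kube-apiserver flags unchanged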

For everyone reading this issue: the following comment helped me a lot!

https://github.com/kubernetes/kubernetes/issues/53538#issuecomment-345942305

Hi @dims. I am running into a similar issue. Just a couple of questions/points of confusion about your and @oonisim's proposed solutions:

Set a tag on all resources in the form of KubernetesCluster=<cluster name>:
  • All instances
  • One and only one SG for each instance should be tagged (this will be modified as necessary to allow ELBs to access the instance)

Is it really necessary to tag the resources? (Just asking to make things simpler.) If so, where can I find the <cluster name>?

Specifying --config with kubeadm init with the following contents

Does this config file need to be created? Or is it located somewhere? If so, where?

Making sure /etc/kubernetes/cloud-config is created in each node including master with the correct information needed for the specific cloud provider

Can you please specify exactly what should be the content of the cloud-config for AWS?

Lastly, is it possible to modify the cloud-provider of an existing cluster without doing kubeadm init again with the kubeadm.conf? What I understand right now is that we need to reset the cluster and then run kubeadm init again with the kubeadm.conf to change the cloud-provider. Please correct me if I am wrong.

Thanks

Hi @dims, why is this closed? The original issue of not being able to specify --cloud-provider=aws in /etc/kubernetes/manifests/kube-controller-manager.yaml is still there. Please clarify whether this is the expected behaviour; otherwise, please reopen and clarify why it does not work.

The installation guide Using kubeadm to Create a Cluster does not mention the cloud provider at all, yet you need to specify --cloud-provider, or a configuration file that includes the cloud provider via --config. It would not be uncommon for people to notice only later that cloud provider configuration is required and then try to reconfigure the API server, kubelet, and controller-manager.

In "kubeadm join unknown flag: --cloud-provider" it says:

You should just edit /etc/systemd/system/kubelet.service.d/10-kubeadm.conf and it will work

If adding --cloud-provider=aws in the config file is the way for the kubelet, then it should be the same for the API server and the controller-manager.