kubernetes: Can't get --cloud-provider=aws to work (instance not found)

I’m trying to set up the AWS cloud provider (for ELB provisioning and EBS volume attachment) and keep getting an error from the kubelet. I have a bad feeling it might be related to #9801, as these nodes are CoreOS instances brought up by Terraform with custom service files, but everything else is working fine (we’re running cluster monitoring, DNS, and a number of our own pods without issues). I’ve run awscli on the instance and the IAM privileges definitely work. What am I doing wrong?

logs from kubelet service

I0719 02:22:59.549444    9662 factory.go:234] Registering Docker factory
I0719 02:22:59.549826    9662 factory.go:89] Registering Raw factory
I0719 02:22:59.638468    9662 manager.go:946] Started watching for new ooms in manager
I0719 02:22:59.639054    9662 oomparser.go:183] oomparser using systemd
I0719 02:22:59.640589    9662 manager.go:243] Starting recovery of all containers
E0719 02:22:59.679107    9662 kubelet.go:787] Unable to construct api.Node object for kubelet: failed to get external ID from cloud provider: instance not found
I0719 02:22:59.753816    9662 manager.go:248] Recovery completed
I0719 02:22:59.816369    9662 status_manager.go:76] Starting to sync pod status with apiserver
I0719 02:22:59.816426    9662 kubelet.go:1725] Starting kubelet main sync loop.
E0719 02:22:59.953337    9662 kubelet.go:787] Unable to construct api.Node object for kubelet: failed to get external ID from cloud provider: instance not found
E0719 02:23:01.047929    9662 kubelet.go:787] Unable to construct api.Node object for kubelet: failed to get external ID from cloud provider: instance not found
E0719 02:23:01.975409    9662 kubelet.go:787] Unable to construct api.Node object for kubelet: failed to get external ID from cloud provider: instance not found
E0719 02:23:03.645707    9662 kubelet.go:787] Unable to construct api.Node object for kubelet: failed to get external ID from cloud provider: instance not found
E0719 02:23:06.917503    9662 kubelet.go:787] Unable to construct api.Node object for kubelet: failed to get external ID from cloud provider: instance not found
I0719 02:23:07.790941    9662 server.go:635] POST /stats/container/: (46.254087ms) 0 [[Go 1.1 package http] 10.0.39.222:45497]

kube-kubelet.service

[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/GoogleCloudPlatform/kubernetes

[Service]
Environment="KUBERNETES_BINARY_VERSION=1.0.0"
EnvironmentFile=/etc/environment
ExecStartPre=/usr/bin/curl -L -o /opt/bin/kubelet https://storage.googleapis.com/kubernetes-release/release/v${KUBERNETES_BINARY_VERSION}/bin/linux/amd64/kubelet
ExecStartPre=/usr/bin/chmod +x /opt/bin/kubelet
ExecStart=/opt/bin/kubelet \
  --address=0.0.0.0 \
  --port=10250 \
  --cloud-provider=aws \
  --hostname-override=${COREOS_PRIVATE_IPV4} \
  --api-servers=${KUBE_MASTER_IP}:8080 \
  --allow-privileged=false \
  --cluster_dns=10.2.0.2 \
  --cluster_domain=cluster.local \
  --cadvisor_port=4194 \
  --healthz_bind_address=0.0.0.0 \
  --healthz_port=10248 \
  --v=2 \
  --logtostderr=true
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

IAM policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:AttachVolume",
        "ec2:DetachVolume"
      ],
      "Resource": [
        "arn:aws:ec2:*:*:instance/*"
      ]
    }
  ]
}

About this issue

  • State: closed
  • Created 9 years ago
  • Comments: 60 (33 by maintainers)

Most upvoted comments

Well I found the cause at least:

The kubelet attempts to get its ExternalID. To do this it uses the misnamed aws.findInstanceByNodeName, which actually looks up the node by private-dns-name, not node name (potentially a good thing, since node names won’t be unique across autoscaling groups, but a confusing name nonetheless). This would probably work in a normal setup, but we’re using a VPC DHCP option set that assigns an internal management domain instead of the standard <region>.compute.internal. When the kubelet is configured for AWS it uses the metadata service to get its local-hostname, which in our case doesn’t match the private DNS name; it matches our management domain.

I’m not really sure what the right fix is. My instinct is that --hostname-override should actually do something even when you’re using --cloud-provider=aws (it currently seems to do nothing). I notice the comment // TODO: ExternalID is deprecated, we'll have to drop this code, but I couldn’t find out what that specifically means. Also interesting to me is that the kubernetes.io/hostname label is inconsistent when using a cloud provider unless the hostname exactly matches the “ExternalID”. I’m not actually sure this entire section of code (kubelet.go lines 739-761) does anything useful versus just using kl.hostname directly for the ExternalID. If ExternalID really is deprecated, I’m inclined to submit a patch to rip it out (well, replace it with just kl.hostname).
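
To make the failure mode concrete, here is a minimal sketch of a lookup keyed on private-dns-name using aws-sdk-go. This is illustrative only, not the actual Kubernetes source; it just shows why a custom DHCP domain breaks the query:

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/ec2metadata"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession())

	// The kubelet derives its node name from the metadata service's
	// local-hostname. With a custom DHCP option set this returns e.g.
	// "ip-10-0-1-5.my.company.net", NOT the EC2 private-dns-name.
	localHostname, err := ec2metadata.New(sess).GetMetadata("local-hostname")
	if err != nil {
		log.Fatal(err)
	}

	// Filtering DescribeInstances on private-dns-name with that value
	// matches nothing, because EC2 only knows the instance as
	// "ip-10-0-1-5.us-west-2.compute.internal".
	out, err := ec2.New(sess).DescribeInstances(&ec2.DescribeInstancesInput{
		Filters: []*ec2.Filter{{
			Name:   aws.String("private-dns-name"),
			Values: []*string{aws.String(localHostname)},
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
	if len(out.Reservations) == 0 {
		fmt.Println("instance not found") // the error the kubelet surfaces
	}
}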

Just to add to this… we’ve set up K8S 1.0.6 inside our VPC and just added --cloud-provider=aws (and the accompanying IAM profiles on our nodes). While things like kubectl work fine, we noticed that the kubelet processes now register themselves with the kube-apiserver under their EC2 names rather than the hostnames we use internally.

Now, we “allow” use of the .internal domain name from Amazon, in the sense that we can resolve those names just fine. The only real issue is that the nodeName in Kubernetes is now wrong and misleading for our engineers:

ip-10-16-80-224.us-west-1.compute.internal   kubernetes.io/hostname=tools-k8s-node-uswest1-17-i-905b2250   Ready
ip-10-16-82-31.us-west-1.compute.internal    kubernetes.io/hostname=tools-k8s-node-uswest1-15-i-67621ba7   Ready
ip-10-16-86-186.us-west-1.compute.internal   kubernetes.io/hostname=tools-k8s-node-uswest1-16-i-4eff2bfc   Ready

I also tried adding --hostname-override=tools-k8s-node-uswest1-17-i-905b2250 to the kubelet’s startup parameters, and it had no impact on the name actually registered in the API server.

@thockin @justinsb

This also breaks the AWS cloud provider when using private hosted DNS zones.

Essentially, the kubelet thinks the node name is whatever the local hostname is (“ip-xx-xx-xx-xx.my.custom.domain.com”) and then tries to get the instance details by querying the EC2 API by private-dns-name, which is actually something like “ip-xx-xx-xx-xx.us-west-2.compute.internal”. This fails, of course.

This makes it impossible to use the AWS cloud provider and therefore automatic ELB provisioning, etc.

I’m not familiar with GitHub, but how can this be re-opened? Having read through the entire thread, I don’t understand why this was closed. I’m using 1.13.1 and this is still causing a problem, and it is actually a bigger problem than most people here are describing.

Going back to the original post: our AWS environment is confined to a VPC and our company’s domain, so hostnames have to be in my_hostname.my_company.net format. We can’t use ip-10-xx-xx-xx names, but setting --cloud-provider=aws makes the kubelet look for a hostname in ip-10-xx-xx-xx format. To create persistent volumes in AWS (EBS or EFS types) I need to set --cloud-provider=aws, but I can’t do that because node names would have to be in ip-10-xx-xx-xx format, and around the circle it goes. I say this is a bigger problem because I cannot create any persistent volume in my cluster as long as node names are in my_host.my_company.net format, and being able to create persistent volumes is close to an absolute necessity for any reasonably useful cluster.

I think this is still an issue with kubeadm v1.8.2. There needs to be far more documentation for AWS as a cloud provider, or maybe something out of the box with kubeadm. At the moment it seems severely under-documented, especially when so many people have this use case, right?

I like the idea of switching to instanceId as the node name, but I can see issues there. Ideally I would like my node names to stay DNS-resolvable, as they currently are. What about having the findInstanceByNodeName method look for instanceId, private-dns-name, OR a tag called node-name (see the sketch at the end of this comment)?

Currently I am looking at running OpenShift on Kubernetes, and the only feature of --cloud-provider=aws that I really need is the EBS volume plugin. It seems overkill to require all my node names to be instanceIds just to get that plugin working.

(This used to work without using --cloud-provider=aws)
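
A minimal sketch of that fallback chain, assuming aws-sdk-go; the helper name findInstanceByAnyName and the node-name tag key are illustrative, not anything that exists in Kubernetes:

package awsprovider

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// findInstanceByAnyName (hypothetical) tries instance-id, then
// private-dns-name, then a "node-name" tag, returning the first match.
func findInstanceByAnyName(svc *ec2.EC2, nodeName string) (*ec2.Instance, error) {
	// Filters are tried in order; a non-matching filter simply returns
	// zero reservations rather than an error.
	attempts := []string{"instance-id", "private-dns-name", "tag:node-name"}
	for _, filterName := range attempts {
		out, err := svc.DescribeInstances(&ec2.DescribeInstancesInput{
			Filters: []*ec2.Filter{{
				Name:   aws.String(filterName),
				Values: []*string{aws.String(nodeName)},
			}},
		})
		if err != nil {
			return nil, err
		}
		for _, r := range out.Reservations {
			if len(r.Instances) > 0 {
				return r.Instances[0], nil
			}
		}
	}
	return nil, fmt.Errorf("instance not found for node name %q", nodeName)
}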

Forgive me if I’m missing something obvious here, and for repeating much of what @philk and @bkeroackdsc have already said, but I’m confused by the above conclusions. Using an instance-id as a node name (as suggested in #11883) is not ideal; non-default hostnames are that way for a reason, and it’s going to be much clearer to users and administrators if they see the hostname rather than an instance id (imho). When securing a dynamic infrastructure it’s often a necessity to work with a private domain in order to supply certificates, since the IP addresses of nodes are unpredictable. Cloud providers need not be aware of this.

Kubernetes internally should not conflate the node name (the hostname of a node in the cluster) with the instance-id (the identifier it uses to communicate with the AWS cloud provider).

The call to ExternalID eventually ends up calling getInstanceByNodeName (see below), with the implication that the node name is the private-dns-name. Can’t this simply be changed to the more correct behavior of retrieving the instance id from the metadata service and adjusting the query to use that? The change would entail fetching the instance-id from the metadata service, caching it, and changing the query to use instance-id instead of private-dns-name. This should always succeed, as the instance-id is unique to an instance.

https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L2083
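
As a rough sketch of that proposal (again aws-sdk-go; none of this is the actual patch, and the function names are made up), fetching and caching the instance-id from the metadata service and then querying by it might look like:

package awsprovider

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/ec2metadata"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

var cachedInstanceID string // populated once from the metadata service

// describeSelf looks the local instance up by its own instance-id,
// which is unique and independent of any DNS or hostname configuration.
func describeSelf(sess *session.Session) (*ec2.Instance, error) {
	if cachedInstanceID == "" {
		id, err := ec2metadata.New(sess).GetMetadata("instance-id")
		if err != nil {
			return nil, err
		}
		cachedInstanceID = id
	}
	out, err := ec2.New(sess).DescribeInstances(&ec2.DescribeInstancesInput{
		InstanceIds: []*string{aws.String(cachedInstanceID)},
	})
	if err != nil {
		return nil, err
	}
	if len(out.Reservations) == 0 || len(out.Reservations[0].Instances) == 0 {
		return nil, fmt.Errorf("instance %s not found", cachedInstanceID)
	}
	return out.Reservations[0].Instances[0], nil
}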

The only potential complication I foresee is if another member of the cluster needs to use the Kubernetes node name in order to operate on the cloud provider, e.g. when attaching a volume to an instance. As far as I can tell those operations are done locally, which is further implied by the IAM permissions applied to kubelet nodes (ec2:AttachVolume and ec2:DetachVolume), but I haven’t read the entire codebase. If this is a necessary use case, I’d suggest that a DNS lookup of the hostname, followed by finding the instance with the cloud provider based on its private IP, is more correct than assuming that the cloud provider has any knowledge of the instance’s hostname.
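
That fallback might look roughly like this (purely illustrative, same assumptions as above): resolve the node name to an IP and query EC2 by private-ip-address, which EC2 does know, rather than by a hostname it may not:

package awsprovider

import (
	"fmt"
	"net"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// findInstanceByHostname (hypothetical) resolves the node name via DNS
// and then looks the instance up by its private IP address.
func findInstanceByHostname(svc *ec2.EC2, nodeName string) (*ec2.Instance, error) {
	ips, err := net.LookupHost(nodeName)
	if err != nil || len(ips) == 0 {
		return nil, fmt.Errorf("could not resolve %q: %v", nodeName, err)
	}
	out, err := svc.DescribeInstances(&ec2.DescribeInstancesInput{
		Filters: []*ec2.Filter{{
			Name:   aws.String("private-ip-address"),
			Values: []*string{aws.String(ips[0])},
		}},
	})
	if err != nil {
		return nil, err
	}
	for _, r := range out.Reservations {
		if len(r.Instances) > 0 {
			return r.Instances[0], nil
		}
	}
	return nil, fmt.Errorf("no instance with private IP %s", ips[0])
}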

I’d be happy to put this change together but would like some feedback on the idea - @thockin @justinsb thoughts?