kops: kops ami might have a problem with k8s 1.8 (kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08)

https://kubernetes.slack.com/archives/C3QUFP0QM/p1521739991000074

Summary: Updating the AMI to kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08 on the Kubernetes 1.8 masters and nodes seems to cause them to fail the AWS EC2 instance reachability check and never become healthy; AWS restarts them repeatedly.

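A quick way to confirm which image each instance group is running, and to roll back to the previous AMI, is sketched below. This is only illustrative; <cluster-name> and the instance group name are placeholders:

  # Show the image currently configured for each instance group
  kops get ig --name <cluster-name> -o yaml | grep image

  # Edit an instance group and set spec.image back to the previous AMI,
  # e.g. kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-01-14
  kops edit ig master-us-east-1a --name <cluster-name>

  # Apply the change and roll the affected instances
  kops update cluster --name <cluster-name> --yes
  kops rolling-update cluster --name <cluster-name> --yes
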
  1. What kops version are you running? The command kops version will display this information. Version 1.8.1 (git-94ef202)
  2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
--- kubernetes/kops ‹master› » kubectl version
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.10", GitCommit:"044cd262c40234014f01b40ed7b9d09adbafe9b1", GitTreeState:"clean", BuildDate:"2018-03-19T17:51:28Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.10", GitCommit:"044cd262c40234014f01b40ed7b9d09adbafe9b1", GitTreeState:"clean", BuildDate:"2018-03-19T17:44:09Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  1. What cloud provider are you using? aws
  2. What commands did you run? What is the simplest way to reproduce this issue? kops rolling update cluster --yes
  3. What happened after the commands executed?
The instance reachability check fails and the instances are restarted many times. This happened when changing only the AMI (kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-01-14 -> kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08), and also when changing both the AMI and the k8s version from 1.8.8 to 1.8.10.
  1. What did you expect to happen? The master to come back up.
  2. Please provide your cluster manifest. Execute kops get --name my.example.com -oyaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2017-02-09T23:47:31Z
  name: <redacted>
spec:
  additionalPolicies:
    node: |
      [
        {
          "Effect": "Allow",
          "Action": ["ec2:AttachVolume"],
          "Resource": ["*"]
        },
        {
          "Effect": "Allow",
          "Action": ["ec2:DetachVolume"],
          "Resource": ["*"]
        }
      ]
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: <redacted>
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1c
      name: c
    - instanceGroup: master-us-east-1d
      name: d
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1c
      name: c
    - instanceGroup: master-us-east-1d
      name: d
    name: events
  iam:
    legacy: false
  kubernetesApiAccess:
  - <redacted>
  kubernetesVersion: 1.8.10
  masterInternalName: <redacted>
  masterPublicName: <redacted>
  networkCIDR: 10.101.0.0/16
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - <redacted>
  subnets:
  - cidr: 10.101.32.0/19
    name: us-east-1a
    type: Public
    zone: us-east-1a
  - cidr: 10.101.64.0/19
    name: us-east-1c
    type: Public
    zone: us-east-1c
  - cidr: 10.101.96.0/19
    name: us-east-1d
    type: Public
    zone: us-east-1d
  - cidr: 10.101.128.0/19
    name: us-east-1e
    type: Public
    zone: us-east-1e
  topology:
    dns:
      type: Public
    masters: public
    nodes: public
  1. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

  2. Anything else do we need to know?

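Note that the image is not part of the Cluster manifest above; it is set per InstanceGroup. For illustration only, a minimal InstanceGroup manifest looks roughly like the following, with spec.image being the field that was changed (the machine type, sizes, and cluster label are placeholders):

  apiVersion: kops/v1alpha2
  kind: InstanceGroup
  metadata:
    labels:
      kops.k8s.io/cluster: <redacted>
    name: master-us-east-1a
  spec:
    image: kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08
    machineType: m3.large
    maxSize: 1
    minSize: 1
    role: Master
    subnets:
    - us-east-1a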

Most upvoted comments

I ran into this problem as well. I have the problem when using m3.large, but not when using m3.medium.

I see the following crash when I look at the instance system log in AWS: https://gist.github.com/wendorf/91f5a2c77c3cdc277e48c2c22fc0b46b

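For reference, the system log can also be pulled without the console; a minimal sketch with the AWS CLI, where the instance ID is a placeholder:

  # Fetch the EC2 console/system log for a failing instance
  aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text
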
We were facing the same issue with r3.large instance types. The issue is fixed by upgrading the kernel in the image.

Image: kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08

Error in the EC2 system log:

  [    1.132048] Kernel panic - not syncing: Fatal exception
  [    1.133919] Kernel Offset: disabled

The steps below fix the issue, and the instance status check passes afterwards (a consolidated shell sketch follows the list).

1. Launched an r3.large instance in the ap-south-1 region with ami-640d5f0b.
The instance did not boot and failed the instance status checks, as expected.
2. Stopped the instance from the console and, once it was stopped, changed the instance type to r3.xlarge.
3. Started the instance; it came up and passed the health checks.
4. Connected to the instance via SSH and switched to the root user (sudo su).
5. Checked the kernel version (uname -r) and confirmed it was 4.4.115.
6. Searched for and installed a newer kernel (4.4.121) with the steps below:
        a. apt-cache search linux-image
        b. apt-get install linux-image-4.4.121
        c. apt-get update
7. Rebooted the operating system to let the kernel upgrade take effect.
8. Confirmed that the kernel was upgraded (uname -r showed 4.4.121).
9. Stopped the instance and switched it back to the r3.large instance type.
10. Started the instance; it came up and passed the instance status checks this time.

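A consolidated sketch of the workaround above, assuming the AWS CLI and SSH access; the instance ID, region, and exact kernel package name are placeholders and may differ per image (the package index is refreshed here before installing):

  # Stop the failing instance and bump it to a type that boots (r3.large -> r3.xlarge)
  aws ec2 stop-instances --instance-ids i-0123456789abcdef0 --region ap-south-1
  aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0 --region ap-south-1
  aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 \
      --instance-type "{\"Value\": \"r3.xlarge\"}" --region ap-south-1
  aws ec2 start-instances --instance-ids i-0123456789abcdef0 --region ap-south-1

  # On the instance (via SSH, as root): upgrade the kernel and reboot
  apt-get update
  apt-cache search linux-image              # find the newer kernel package
  apt-get install linux-image-4.4.121       # package name as reported above
  reboot

  # Back on the workstation: revert the instance type and start the instance again
  aws ec2 stop-instances --instance-ids i-0123456789abcdef0 --region ap-south-1
  aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0 --region ap-south-1
  aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 \
      --instance-type "{\"Value\": \"r3.large\"}" --region ap-south-1
  aws ec2 start-instances --instance-ids i-0123456789abcdef0 --region ap-south-1
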
@dmcnaught @chrislovecnm @justinsb This seems critical and should be fixed soon, since it is the latest version of the publicly recommended AMI and is picked up by default in every k8s installation done with kops. Would love to help fix this.