terraform-provider-rke: RC4 - Cannot provision cluster due to rke-network-plugin-deploy-job

So I just downloaded the new provider (RC4) and attempted to deploy a new cluster with it. It is a single-node cluster, nothing fancy.

It runs through the "Still creating..." output for about a minute and a half, and then errors out with the following:

time="2020-03-17T17:58:28Z" level=info msg="[network] Setting up network plugin: weave"
time="2020-03-17T17:58:28Z" level=info msg="[addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes"
time="2020-03-17T17:58:28Z" level=info msg="[addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes"
time="2020-03-17T17:58:28Z" level=info msg="[addons] Executing deploy job rke-network-plugin"
time="2020-03-17T17:58:28Z" level=debug msg="[k8s] waiting for job rke-network-plugin-deploy-job to complete.."

Failed running cluster err:Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system

As the log shows, I am using the Weave CNI plugin, which I have always used and which has worked fine until now.

I apply some hardening defaults to the Kubernetes components; these worked with the earlier provider release, which supported Kubernetes 1.15.3.

I am trying to use Kubernetes version v1.17.2-rancher1-2. I did add debug = true and log_path (see the short sketch after the config below), but the extra output doesn't surface any obvious errors to troubleshoot from.

This is the contents of my rke.tf file:

# ---------------------------------------------------------------------
# RKE configuration

resource "rke_cluster" "cluster" {
  depends_on = [azurestack_public_ip.vmpip, azurestack_virtual_machine.vm]

  dynamic "nodes" {
    for_each = azurestack_public_ip.vmpip[*]
    iterator = nodes

    content {
      address = nodes.value.ip_address
      user    = "testuser"
      role    = ["controlplane","etcd", "worker"]
      ssh_key = file("/opt/${var.deployment_name}/${var.deployment_name}")
    }
  }

  ignore_docker_version = true
  cluster_name = "${var.deployment_name}-cluster"

  # Kubernetes version
  kubernetes_version = "v1.17.2-rancher1-2"

  private_registries {
    url      = "myprivateregistry"
  }

  #########################################################
  # Network(CNI) - supported: flannel/calico/canal/weave
  #########################################################
  # There are several network plug-ins that work, but we default to canal
  network {
    plugin = "weave"
  }

  ingress {
    provider = "nginx"
    options = {
      proxy-buffer-size = "16k"
      http2 = "true"
    }
    extra_args = {
      default-ssl-certificate = "ingress-nginx/wildcard-ingress"
    }
  }

  services {
    kube_api {
      pod_security_policy = false
      extra_args = {
        anonymous-auth = "false"
        admission-control-config-file = "/opt/kubernetes/admission.yaml"
        profiling = "false"
        service-account-lookup = "true"
        audit-log-maxage = "30"
        audit-log-maxbackup = "10"
        audit-log-maxsize = "100"
        audit-log-format = "json"
        audit-policy-file = "/opt/kubernetes/audit.yaml"
        audit-log-path = "/var/log/kube-audit/audit-log.json"
        enable-admission-plugins = "ServiceAccount,PodPreset,NamespaceLifecycle,LimitRanger,PersistentVolumeLabel,DefaultStorageClass,ResourceQuota,DefaultTolerationSeconds,AlwaysPullImages,SecurityContextDeny,PodSecurityPolicy,NodeRestriction,EventRateLimit"
        runtime-config = "batch/v2alpha1,authentication.k8s.io/v1beta1=true,settings.k8s.io/v1alpha1=true"
        tls-cipher-suites = "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256"
      }
      # Optionally, define extra binds for the api server
      extra_binds = [
        "/var/log/kube-audit:/var/log/kube-audit",
        "/opt/kubernetes:/opt/kubernetes"
      ]
    }

    scheduler {
      extra_args = {
        address = "127.0.0.1"
      }
    }

    kube_controller {
      extra_args = {
        profiling = "false"
        address = "127.0.0.1"
        terminated-pod-gc-threshold = "1000"
        feature-gates = "RotateKubeletServerCertificate=true"
      }
    }

    kubelet {
      extra_args = {
        volume-plugin-dir = "/usr/libexec/kubernetes/kubelet-plugins/volume/exec"
        #protect-kernel-defaults = true # requires additional config
        streaming-connection-idle-timeout = "1800s"
        authorization-mode = "Webhook"
        make-iptables-util-chains = "true"
        event-qps = "0"
        anonymous-auth = "false"
        feature-gates = "RotateKubeletServerCertificate=true"
        tls-cipher-suites = "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256"
      }
      # Optionally define additional volume binds to a service
      extra_binds = [
        "/usr/libexec/kubernetes/kubelet-plugins/volume/exec:/usr/libexec/kubernetes/kubelet-plugins/volume/exec",
      ]
    }
  }
}
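
For completeness: the debug and log_path arguments mentioned above sit at the top level of the rke_cluster resource and are not shown in the config above. Roughly, they look like this (argument names are as I used them and the log file path is only a placeholder, so treat this as a sketch rather than exact provider syntax):

resource "rke_cluster" "cluster" {
  # ... nodes, network, services as above ...

  # Troubleshooting output; the path below is only an example value
  debug    = true
  log_path = "/tmp/rke-debug.log"
}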

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 23 (10 by maintainers)

Most upvoted comments

Yeah, the issue is definitely the debug argument. Deleting the debug argument, or setting it to false, addresses the issue.
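
In other words, either drop the argument or disable it explicitly. A minimal sketch of the relevant part of the rke_cluster resource:

resource "rke_cluster" "cluster" {
  # ... rest of the configuration unchanged ...

  # Workaround for the deploy-job hang: remove the debug argument
  # entirely, or set it to false explicitly.
  debug = false
}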

The problem is caused by rke's docker pull-image function when debug is set (https://github.com/rancher/rke/blob/v1.0.4/docker/docker.go#L266). It breaks rke execution in non-TTY environments and hangs the process.