consul-k8s: Add /quitquitquit endpoint to lifecycle sidecar/envoy so Jobs can self-terminate their sidecars

Overview of the Issue

When a workload runs as a Job, the injected consul-connect-envoy-sidecar container keeps running forever, even after the Job's main container has reached Terminated status, so the Pod never completes.

Reproduction Steps

  1. Create a cluster and install Consul Connect via Helm with this values override (an example install command follows the override below):

    ---
    global:
      enabled: true
    
    server:
      replicas: 1
      bootstrapExpect: 1
      connect: true
      storageClass: nfs-client
    
    client:
      grpc: true
    
    ui:
      enabled: true
    
    connectInject:
      enabled: true
      default: true
      centralConfig:
        enabled: true
        defaultProtocol: http
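
     The install itself is just a standard Helm 2 chart install with the override above as the values file. A command along these lines should reproduce it; the release name and the local consul-helm chart path are assumptions, not part of the original report:

    helm install --name consul -f values.yaml ./consul-helm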
    
  2. Create a Job file job.yml:

    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: pi
    spec:
      template:
        spec:
          containers:
          - name: pi
            image: perl
            command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
          restartPolicy: Never
      backoffLimit: 4
    
  3. Apply: kubectl apply -f job.yml

  4. Wait for the pi container to complete

  5. Inspect the Pod with kubectl describe po -l job-name=pi:

Name:           pi-7znmn
Namespace:      default
Priority:       0
Node:           compute02-test-2/10.253.0.9
Start Time:     Thu, 24 Oct 2019 06:39:05 +0000
Labels:         controller-uid=a5b8429c-44ed-48ae-bb81-82ff12772ffe
                job-name=pi
Annotations:    consul.hashicorp.com/connect-inject-status: injected
                consul.hashicorp.com/connect-service: pi
                consul.hashicorp.com/connect-service-protocol: http
Status:         Running
IP:             10.233.67.186
Controlled By:  Job/pi
Init Containers:
  consul-connect-inject-init:
    Container ID:  docker://a7ef1cf9a296912b66110691345e1872f86368caaab8b3a88e200f4a71b5b8eb
    Image:         consul:1.6.1
    Image ID:      docker-pullable://consul@sha256:94cdbd83f24ec406da2b5d300a112c14cf1091bed8d6abd49609e6fe3c23f181
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -ec
      export CONSUL_HTTP_ADDR="${HOST_IP}:8500"
      export CONSUL_GRPC_ADDR="${HOST_IP}:8502"

      # Register the service. The HCL is stored in the volume so that
      # the preStop hook can access it to deregister the service.
      cat <<EOF >/consul/connect-inject/service.hcl
      services {
        id   = "${POD_NAME}-pi-sidecar-proxy"
        name = "pi-sidecar-proxy"
        kind = "connect-proxy"
        address = "${POD_IP}"
        port = 20000

        proxy {
          destination_service_name = "pi"
          destination_service_id = "pi"
        }

        checks {
          name = "Proxy Public Listener"
          tcp = "${POD_IP}:20000"
          interval = "10s"
          deregister_critical_service_after = "10m"
        }

        checks {
          name = "Destination Alias"
          alias_service = "pi"
        }
      }

      services {
        id   = "${POD_NAME}-pi"
        name = "pi"
        address = "${POD_IP}"
        port = 0
      }
      EOF
      # Create the central config's service registration
      cat <<EOF >/consul/connect-inject/central-config.hcl
      kind = "service-defaults"
      name = "pi"
      protocol = "http"
      EOF
      /bin/consul config write -cas -modify-index 0 \
        /consul/connect-inject/central-config.hcl || true

      /bin/consul services register \
        /consul/connect-inject/service.hcl

      # Generate the envoy bootstrap code
      /bin/consul connect envoy \
        -proxy-id="${POD_NAME}-pi-sidecar-proxy" \
        -bootstrap > /consul/connect-inject/envoy-bootstrap.yaml

      # Copy the Consul binary
      cp /bin/consul /consul/connect-inject/consul
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 24 Oct 2019 06:39:08 +0000
      Finished:     Thu, 24 Oct 2019 06:39:08 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      HOST_IP:         (v1:status.hostIP)
      POD_IP:          (v1:status.podIP)
      POD_NAME:       pi-7znmn (v1:metadata.name)
      POD_NAMESPACE:  default (v1:metadata.namespace)
    Mounts:
      /consul/connect-inject from consul-connect-inject-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-p4ntz (ro)
Containers:
  pi:
    Container ID:  docker://f57e3a3abd0ef2cc35b563e65e047398eba8126e68ce6e6ddd4cfb7835d02733
    Image:         perl
    Image ID:      docker-pullable://perl@sha256:b3f356876d5615e91b808cbdcba0ff618a7ba0c167326bd013c15b2194db03c9
    Port:          <none>
    Host Port:     <none>
    Command:
      perl
      -Mbignum=bpi
      -wle
      print bpi(2000)
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 24 Oct 2019 06:39:24 +0000
      Finished:     Thu, 24 Oct 2019 06:39:30 +0000
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-p4ntz (ro)
  consul-connect-envoy-sidecar:
    Container ID:  docker://9d8ec302199ad5271196356140dd20b204e67c56f8036a6b51173a1395d81b77
    Image:         envoyproxy/envoy-alpine:v1.9.1
    Image ID:      docker-pullable://envoyproxy/envoy-alpine@sha256:04ed416733b49260db0a346565ab523c6d2a362cfd29a1ab23a926af77849ecb
    Port:          <none>
    Host Port:     <none>
    Command:
      envoy
      --max-obj-name-len
      256
      --config-path
      /consul/connect-inject/envoy-bootstrap.yaml
    State:          Running
      Started:      Thu, 24 Oct 2019 06:39:25 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      HOST_IP:   (v1:status.hostIP)
    Mounts:
      /consul/connect-inject from consul-connect-inject-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-p4ntz (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-p4ntz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-p4ntz
    Optional:    false
  consul-connect-inject-data:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:   <unset>
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age    From                       Message
  ----    ------     ----   ----                       -------
  Normal  Scheduled  4m44s  default-scheduler          Successfully assigned default/pi-7znmn to compute02-test-2
  Normal  Pulled     4m41s  kubelet, compute02-test-2  Container image "consul:1.6.1" already present on machine
  Normal  Created    4m41s  kubelet, compute02-test-2  Created container consul-connect-inject-init
  Normal  Started    4m41s  kubelet, compute02-test-2  Started container consul-connect-inject-init
  Normal  Pulling    4m41s  kubelet, compute02-test-2  Pulling image "perl"
  Normal  Pulled     4m25s  kubelet, compute02-test-2  Successfully pulled image "perl"
  Normal  Created    4m25s  kubelet, compute02-test-2  Created container pi
  Normal  Started    4m25s  kubelet, compute02-test-2  Started container pi
  Normal  Pulled     4m25s  kubelet, compute02-test-2  Container image "envoyproxy/envoy-alpine:v1.9.1" already present on machine
  Normal  Created    4m25s  kubelet, compute02-test-2  Created container consul-connect-envoy-sidecar
  Normal  Started    4m24s  kubelet, compute02-test-2  Started container consul-connect-envoy-sidecar

Consul info for both Client and Server

Client info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 62
	services = 62
build:
	prerelease =
	revision = 9be6dfc3
	version = 1.6.1
consul:
	acl = disabled
	known_servers = 1
	server = false
runtime:
	arch = amd64
	cpu_count = 32
	goroutines = 590
	max_procs = 32
	os = linux
	version = go1.12.1
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 5
	members = 4
	query_queue = 0
	query_time = 1
Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease =
	revision = 9be6dfc3
	version = 1.6.1
consul:
	acl = disabled
	bootstrap = true
	known_datacenters = 1
	leader = true
	leader_addr = 10.233.65.27:8300
	server = true
raft:
	applied_index = 10031
	commit_index = 10031
	fsm_pending = 0
	last_contact = 0
	last_log_index = 10031
	last_log_term = 2
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:53bfea2a-aa00-2ee2-13c5-c9b682e903f9 Address:10.233.65.27:8300}]
	latest_configuration_index = 1
	num_peers = 0
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 2
runtime:
	arch = amd64
	cpu_count = 32
	goroutines = 345
	max_procs = 32
	os = linux
	version = go1.12.1
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 5
	members = 4
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1

Operating system and Environment details

  • OS: Debian 9
  • Kubernetes: 1.15.3
  • Helm: 2.14.3
  • Installer: Kubespray
  • CNI: cilium

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 20 (13 by maintainers)

Most upvoted comments

We came across this issue and this is the (not ideal) way we’re dealing with it at the moment. Definitely curious what other ways there might be to solve it!

We are sharing process namespaces and using the primary container to kill the sidecars before exit.

shareProcessNamespace: true
containers:
- name: "something"
  image: "busybox"
  # Run the workload, then signal the Envoy and consul-k8s lifecycle sidecars so the
  # Pod can complete. Note that busybox ships /bin/sh, not /bin/bash.
  command:
  [
    "/bin/sh",
    "-c",
    "something && kill $(pidof envoy) && kill -2 $(pidof consul-k8s) && exit 0",
  ]
  securityContext:
    capabilities:
      add:
        - SYS_PTRACE  # needed to see and signal processes from the other containers
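
Applied to the pi Job from the reproduction steps, that workaround looks roughly like the sketch below. This is only an illustration combining the two snippets above; whether pidof is available in the perl image, and the exact sidecar process names, are assumptions rather than something verified in this issue.

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      # Let the main container see (and signal) the injected sidecar processes.
      shareProcessNamespace: true
      containers:
      - name: pi
        image: perl
        command:
        - /bin/sh
        - -c
        # Run the workload, then kill Envoy and the consul-k8s lifecycle sidecar
        # so the Pod can move to Completed.
        - "perl -Mbignum=bpi -wle 'print bpi(2000)'; kill $(pidof envoy); kill -2 $(pidof consul-k8s)"
        securityContext:
          capabilities:
            add:
            - SYS_PTRACE
      restartPolicy: Never
  backoffLimit: 4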

As part of consul-k8s releases 1.0.8 and 1.1.3 we added support for graceful shutdown in the proxy lifecycle. The next 1.2.x release that comes out will also have this feature enabled.

With this feature, you can call /graceful_shutdown on the proxy and it will terminate consul-dataplane.

An example of how to use /graceful_shutdown can be seen in the jobs example on our website.

The most important piece is that you can curl the endpoint:

curl --max-time 2 -s -f -XPOST http://127.0.0.1:20600/graceful_shutdown
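
In a Job, that call can simply run after the workload's main process exits. Below is a minimal sketch of the pi Job from the reproduction steps doing this; it assumes graceful shutdown is enabled as described above, that the lifecycle endpoint listens on port 20600 as in the curl command, and that curl is available in the image.

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command:
        - /bin/sh
        - -c
        # Run the workload, then ask the proxy lifecycle endpoint to shut down
        # consul-dataplane so the Pod can complete.
        - |
          perl -Mbignum=bpi -wle 'print bpi(2000)'
          curl --max-time 2 -s -f -XPOST http://127.0.0.1:20600/graceful_shutdown
      restartPolicy: Never
  backoffLimit: 4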

Hi, we actually don’t have the lifecycle sidecar anymore. There’s just the Envoy sidecar, so I don’t think this is required now.

Gotcha, yeah, this makes sense. We should probably implement /quitquitquit in the lifecycle sidecar, and then it would also make the call to Envoy.