linkerd2: Pod controlled by a Job does not exit after the main container completes

Bug Report

What is the issue?

Pods controlled by Jobs do not terminate when the main container exits: the injected linkerd-proxy sidecar keeps running, so the pod never completes.

How can it be reproduced?

Create a Job with an injected linkerd sidecar container; when the main container finishes, the pod keeps running instead of completing.
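A minimal reproduction sketch (the Job name and image are illustrative, not from the original report): a Job whose pod template carries the linkerd.io/inject: enabled annotation, or that is passed through linkerd inject, and whose only container exits quickly. After injection, the pod never reaches the Completed state because the proxy sidecar keeps running.

apiVersion: batch/v1
kind: Job
metadata:
  name: linkerd-job-repro  # hypothetical name, for illustration only
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox
          # The main container finishes almost immediately,
          # but the injected proxy sidecar keeps the pod alive.
          command: ["sh", "-c", "echo done"]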

Logs, error output, etc

Main container logs are normal; sidecar container logs are normal.

linkerd check output

Status check results are [ok]

Environment

  • Kubernetes Version: 1.11.4
  • Cluster Environment: AWS (kops)
  • Host OS: Container Linux by CoreOS 1911.3.0 (Rhyolite)
  • Linkerd version: edge-18.11.2 (client and server)

Possible solution

I think that when the main container within a pod controlled by a Job completes, the sidecar should exit as well.

Additional context

The sidecar was created using linkerd inject
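For reference, a typical way to inject the sidecar with the CLI (the manifest file name job.yaml is illustrative):

linkerd inject job.yaml | kubectl apply -f -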

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 27 (7 by maintainers)

Most upvoted comments

@alexklibisz thanks for the workaround! I got it working on a k8s job, but couldn’t figure out how to make it so in an Argo flow (shareProcessNamespace doesn’t seem to fit anywhere). Any chance you’d have time to share a gist for it?

I updated the comment above.

Can confirm the original solution I posted is still working fine after about three months.

We also have some crons running on Argo Workflows with linkerd sidecars. shareProcessNamespace doesn’t seem to be an available option in Argo workflow specifications. We were able to get Argo to kill the sidecars only after setting the right annotations on the job template:

templates:
  - name: job
    metadata:
      annotations:
        linkerd.io/inject: enabled
        config.linkerd.io/skip-outbound-ports: "443"  # annotation values must be strings
    container:
      image: ...

The key part is the skip-outbound-ports… I set this up a while ago so I don’t remember the precise reasoning. It was some sort of deadlock where the argo sidecar container couldn’t kill the linkerd sidecar container because argo was trying to communicate over 443, which was proxied by linkerd, so linkerd refused to die because it still had open connections over 443, etc. Fun stuff!
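If it helps anyone debugging a similar setup, one quick way to confirm the annotations actually made it onto the injected pod (the pod name is a placeholder):

kubectl get pod <my-workflow-pod> -o yaml | grep -E 'linkerd.io/inject|skip-outbound-ports'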

I believe I found a solution to this that doesn’t require waiting for a new k8s feature or significantly altering the main job process.

Pods have a shareProcessNamespace setting. This lets containers in a pod see and kill the processes running in other containers.

The solution: Assume you can identify the process id for the main workload in your job/cronjob. Then you can add your own sidecar container that checks to see if your job process is running, sleeps, and repeats until the job process exits. Once it exits, you kill the linkerd2-proxy process, which makes that container exit, and successfully ends the job/cronjob.

Here’s an example which assumes your job process is called java. I assume it would work for any other process; you just have to be able to return the process id by running pgrep <name-of-my-process>.

apiVersion: batch/v1
kind: Job
metadata:
  name: my-java-job-that-uses-linkerd2-injection
spec:
  template:
    metadata:
      annotations:
        # Inject linkerd2 proxy sidecar.
        linkerd.io/inject: enabled
    spec:
      containers:
        # This is your main workload. In this case let's assume it's a java process.
        - name: job
          image: com.foo.bar/my-java-job:latest
          resources:
            limits:
              memory: ...
              cpu: ...
            requests:
              memory: ...
              cpu: ...
        # This sidecar monitors the java process that runs the main job and kills the linkerd-proxy once java exits.
        # Note that it's necessary to set `shareProcessNamespace: true` in `spec.template.spec` for this to work.
        - name: linkerd-terminator
          image: ubuntu:19.04
          command:
            - sh
            - "-c"
            - |
              /bin/bash <<'EOSCRIPT'
              set -e
              # Check for the java process and sleep 5 seconds until the java process exits.
              while true; do pgrep java || break; sleep 5; done
              # After the java process exits, kill the linkerd2-proxy process so this container exits too.
              kill $(pgrep linkerd2-proxy)
              EOSCRIPT
          resources:
            limits:
              cpu: 10m
              memory: 20M
            requests:
              cpu: 10m
              memory: 20M
      shareProcessNamespace: true # Don't forget this part!

For context, we are running k8s version 1.15.7.
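As a quick sanity check (assuming the manifest above is saved as job.yaml), applying it should leave you with a pod whose containers all terminate and a Job that completes:

kubectl apply -f job.yaml
kubectl get pods -w   # watch until the job pod finishes and the Job completes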

Provide the linkerd.io/inject annotation and set it to disabled on the job’s pod spec.
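For example, a rough sketch of what that looks like on a Job (names and placeholders are illustrative):

apiVersion: batch/v1
kind: Job
metadata:
  name: my-job  # illustrative name
spec:
  template:
    metadata:
      annotations:
        # Opt this pod out of proxy injection so the Job can complete normally.
        linkerd.io/inject: disabled
    spec:
      restartPolicy: Never
      containers:
        - name: job
          image: ...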

@wmorgan @laukaichung I didn’t find an existing one, so I created an issue about the missing documentation.

I’ll update once I try 😉

On Mon, Mar 9, 2020, 5:59 PM Alex Klibisz wrote:

Yeah I didn’t set up the termination process to run as root.

Got it. So it’s working now?
