pipelines: Official artifact passing example fails with PNS executor

What steps did you take:

I am using the official example https://github.com/argoproj/argo/blob/master/examples/artifact-passing.yaml, which runs fine out of the box with the Argo docker executor.

Then I changed the executor to pns in the workflow-controller ConfigMap:

apiVersion: v1
data:
  config: |
    {
    namespace: kubeflow,
    containerRuntimeExecutor: pns,
    executorImage: gcr.io/ml-pipeline/argoexec:v2.7.5-license-compliance,
    ...
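For reference, I made the change by editing the workflow-controller ConfigMap directly. A rough equivalent using the Kubernetes Python client is sketched below; the ConfigMap name is taken from the KFP kustomize manifests, so adjust it if your deployment differs:

# Sketch only: flip the Argo executor from docker to pns by patching
# the embedded config blob of the workflow-controller ConfigMap.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# ConfigMap name assumed from the KFP 1.0.0 manifests.
cm = core.read_namespaced_config_map('workflow-controller-configmap', 'kubeflow')
cm.data['config'] = cm.data['config'].replace(
    'containerRuntimeExecutor: docker',
    'containerRuntimeExecutor: pns',
)
core.patch_namespaced_config_map('workflow-controller-configmap', 'kubeflow', cm)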

What happened:

Every pipeline that passes outputs (including the official example) is now failing. The problem seems to be that the main container exits properly, but the wait container can no longer chroot into it:

"executor error: could not chroot into main for artifact collection: container may have exited too quickly"

The docker executor works around this by abusing docker.sock to copy the outputs out of the terminated main container, which is obviously infeasible in production.

The funny thing is that you can manually mount an emptyDir under tmp/outputs and add the proper output paths (e.g. tmp/outputs/numbers/data) to op.output_artifact_paths:

def add_emptydir(op):
    '''Mount an emptyDir at tmp/outputs and redirect the op's output artifacts into it.'''
    from kubernetes import client as k8s_client
    # Back the output directory with an emptyDir volume, so the files end up
    # on a mirrored mount instead of only inside the main container's filesystem.
    op.add_volume(k8s_client.V1Volume(
        name='outputs', empty_dir=k8s_client.V1EmptyDirVolumeSource()))
    op.container.add_volume_mount(k8s_client.V1VolumeMount(
        name='outputs', mount_path='tmp/outputs'))
    # Point all output artifacts at paths inside the mounted directory.
    op.output_artifact_paths = {
        'mlpipeline-ui-metadata': 'tmp/outputs/mlpipeline-ui-metadata.json',
        'mlpipeline-metrics': 'tmp/outputs/mlpipeline-metrics.json',
        'extract-as-artifact': 'tmp/outputs/numbers/data',
    }
    return op

Then the output file (tmp/outputs/numbers/data) is successfully extracted via the mirrored mounts functionality, but extracting the same file with chroot fails.
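For completeness, this is roughly how I wire the helper into a pipeline; the pipeline name and the compile step are illustrative, and write_numbers_1 is the component shown further below:

import kfp
from kfp import dsl

@dsl.pipeline(name='emptydir-workaround')
def numbers_pipeline():
    # Apply the emptyDir workaround to the op that produces outputs.
    add_emptydir(write_numbers_1(start=0, count=10))

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(numbers_pipeline, 'numbers_pipeline.yaml')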

What did you expect to happen:

I expect PNS to extract the output successfully.

Environment:

I tried Kubeflow Pipelines on Kubernetes 1.17 (Azure) and 1.18 (minikube) clusters, with Docker as the container engine.

How did you deploy Kubeflow Pipelines (KFP)?

Download and extract https://github.com/kubeflow/pipelines/archive/1.0.0.zip, then install with:

kubectl apply -k '/home/julius/Schreibtisch/kubeflow/pipelines-1.0.0/manifests/kustomize/cluster-scoped-resources'
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k '/home/julius/Schreibtisch/kubeflow/pipelines-1.0.0/manifests/kustomize/env/dev'

KFP version: I am using the 1.0.0 release https://github.com/kubeflow/pipelines/releases/tag/1.0.0.

KFP SDK version:

[julius@julius-asus ~]$ pip list | grep kfp
kfp             1.0.0
kfp-server-api  1.0.0

Anything else you would like to add:

I also experimented with op.file_outputs, with emptyDir plus the k8sapi executor, and with newer Argo workflow and exec images (2.8.3 and 2.9.3 in deployment/workflow-controller), all without success.

So I am wondering why PNS works for others.

Besides the official examples, I am also using some very simple pipelines:

from kfp.components import func_to_container_op, OutputPath

@func_to_container_op
def write_numbers_1(numbers_path: OutputPath(str), start: int = 0, count: int = 10):
    '''Write numbers to file.'''
    import time, datetime
    time.sleep(30)  # should not be necessary with newer versions of argo
    print('numbers_path:', numbers_path)
    with open(numbers_path, 'w') as writer:
        for i in range(start, count):
            writer.write(str(i) + '\n')
    print('finished', datetime.datetime.now())

which work perfectly fine with the docker executor and fail miserably with PNS.
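A minimal downstream consumer makes the failure easy to reproduce, since the artifact has to be passed between the two steps. This is only a sketch; the consumer component and pipeline name are mine:

from kfp import dsl
from kfp.components import func_to_container_op, InputPath

@func_to_container_op
def print_numbers(numbers_path: InputPath(str)):
    '''Read the numbers artifact produced by the upstream step.'''
    with open(numbers_path) as reader:
        print(reader.read())

@dsl.pipeline(name='artifact-passing-repro')
def repro_pipeline():
    # The output named 'numbers' is derived from the numbers_path parameter.
    numbers_task = write_numbers_1(start=0, count=10)
    print_numbers(numbers_task.outputs['numbers'])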

See also https://github.com/kubeflow/pipelines/issues/1654#issuecomment-661997829 #1654

/kind bug


Most upvoted comments

This is my current status:

curl -Lo minikube https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 && chmod +x minikube
./minikube version  # 1.12.1, Kubernetes 1.18.3, Docker 19.03.2, CRI-O 1.17.3
./minikube delete && rm -rf ~/.kube ~/.minikube  # make sure there is no old cluster
./minikube start --memory 8192 --cpus 4 --container-runtime=cri-o --driver=docker --extra-config=apiserver.enable-admission-plugins=PodSecurityPolicy --addons=pod-security-policy
# --driver=kvm2 did not work

# Container runtimes available: Docker, rkt, CRI-O and containerd
# between tests: ./minikube delete && rm -rf ~/.kube ~/.minikube
# ./minikube start --container-runtime=containerd --driver=docker works with pns
# ./minikube start --container-runtime=docker --driver=docker works with docker and pns
# ./minikube start --container-runtime=cri-o --driver=docker works with pns
# k3s (containerd) works with pns and is very fast

The only cluster not working is the one on Azure. I will have to check back with the maintainer of the Azure cluster and report back here. By the way, the Azure cluster is a bit outdated compared to k3s and minikube:

  Container Runtime Version:  docker://18.9.1
  Kubelet Version:            v1.17.7

“Humm by chance did you have any psp in the previous deployment? I noticed that psp without SYS_PTRACE throw that error”

You will get an error on pod startup if SYS_PTRACE is not enabled. This is the PSP that works for minikube and PNS:

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: kubeflow
spec:
  allowPrivilegeEscalation: true
  allowedCapabilities:
  - 'SYS_PTRACE'
  fsGroup:
    rule: RunAsAny
  hostIPC: false
  hostNetwork: false
  hostPID: false
  privileged: false
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:
  - configMap
  - emptyDir
  - projected
  - secret
  - downwardAPI
  - persistentVolumeClaim

Maybe it helps to also add SYS_CHROOT to allowedCapabilities (Kubernetes capability names drop the CAP_ prefix)? Maybe then allowPrivilegeEscalation: true becomes unnecessary.

“Can you give more details about the infeasibility?”

Well, hostPath and docker.sock access are a security issue: mounting the Docker socket effectively gives root on the node. You cannot expect anyone to operate a production cluster with that security hole.