argo-workflows: Windows-based pod runs indefinitely

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issues exists when I tested with :latest
  • I’d like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

When I submit a workflow with a single task that runs in a windows/servercore:ltsc2022 container, the pod is stuck in Running phase indefinitely, although the command in the container succeeds.

Before we updated from v3.4.8, we used ltsc2019 images without problems. The Windows nodes were upgraded as well, they have the label node.kubernetes.io/windows-build=10.0.20348.

(Below snippet is an extraction from our workflow with modifications to hide sensitive data.)

Version

v3.4.11

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don’t enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: test
  namespace: dev
spec:
  templates:
    - name: whoami-template
      nodeSelector:
        kubernetes.io/os: windows
      metadata:
        annotations:
          sidecar.istio.io/inject: 'false'
      container:
        name: whoami-container
        image: >-
          mcr.microsoft.com/windows/servercore:ltsc2022
        command:
          - whoami
        resources: {}
        securityContext:
          capabilities:
            drop:
              - ALL
          runAsUser: 10001
          runAsNonRoot: true
          allowPrivilegeEscalation: false
          seccompProfile:
            type: RuntimeDefault
      tolerations:
        - key: windows
          operator: Exists
          effect: NoSchedule
    - name: main
      dag:
        tasks:
          - name: whoami
            template: whoami-template
  entrypoint: main
  arguments: {}
  serviceAccountName: workflow-execution
  podGC:
    strategy: OnWorkflowCompletion
  securityContext:
    seccompProfile:
      type: RuntimeDefault

Logs from the workflow controller

time="2023-09-13T12:38:16.903Z" level=info msg="Processing workflow" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"
time="2023-09-13T12:38:16.909Z" level=info msg="resolved artifact repository" artifactRepositoryRef=default-artifact-repository
time="2023-09-13T12:38:16.909Z" level=info msg="Updated phase  -> Running" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=info msg="DAG node test initialized Running" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=warning msg="was unable to obtain the node for test-2372842584, taskName whoami"
time="2023-09-13T12:38:16.909Z" level=warning msg="was unable to obtain the node for test-2372842584, taskName whoami"
time="2023-09-13T12:38:16.909Z" level=info msg="All of node test.whoami dependencies [] completed" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=info msg="Pod node test-2372842584 initialized Pending" namespace=dev workflow=test
time="2023-09-13T12:38:16.984Z" level=info msg="Created pod: test.whoami (test-whoami-template-2372842584)" namespace=dev workflow=test
time="2023-09-13T12:38:16.984Z" level=info msg="TaskSet Reconciliation" namespace=dev workflow=test
time="2023-09-13T12:38:16.984Z" level=info msg=reconcileAgentPod namespace=dev workflow=test
time="2023-09-13T12:38:16.984Z" level=info msg="Workflow to be dehydrated" Workflow Size=2213
time="2023-09-13T12:38:17.002Z" level=info msg="Workflow update successful" namespace=dev phase=Running resourceVersion=93379873 workflow=test
time="2023-09-13T12:38:26.904Z" level=info msg="Processing workflow" namespace=dev workflow=test
time="2023-09-13T12:38:26.904Z" level=info msg="Task-result reconciliation" namespace=dev numObjs=0 workflow=test
time="2023-09-13T12:38:26.904Z" level=info msg="node changed" namespace=dev new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=test-2372842584 old.message= old.phase=Pending old.progress=0/1 workflow=test
time="2023-09-13T12:38:26.905Z" level=info msg="TaskSet Reconciliation" namespace=dev workflow=test
time="2023-09-13T12:38:26.905Z" level=info msg=reconcileAgentPod namespace=dev workflow=test
time="2023-09-13T12:38:26.905Z" level=info msg="Workflow to be dehydrated" Workflow Size=2498
time="2023-09-13T12:38:26.928Z" level=info msg="Workflow update successful" namespace=dev phase=Running resourceVersion=93379981 workflow=test
time="2023-09-13T12:38:36.929Z" level=info msg="Processing workflow" namespace=dev workflow=test
time="2023-09-13T12:38:36.929Z" level=info msg="Task-result reconciliation" namespace=dev numObjs=0 workflow=test
time="2023-09-13T12:38:36.929Z" level=info msg="node unchanged" namespace=dev nodeID=test-2372842584 workflow=test
time="2023-09-13T12:38:36.930Z" level=info msg="TaskSet Reconciliation" namespace=dev workflow=test
time="2023-09-13T12:38:36.930Z" level=info msg=reconcileAgentPod namespace=dev workflow=test
time="2023-09-13T12:38:48.121Z" level=info msg="Processing workflow" namespace=dev workflow=test
time="2023-09-13T12:38:48.121Z" level=info msg="Task-result reconciliation" namespace=dev numObjs=0 workflow=test
time="2023-09-13T12:38:48.121Z" level=info msg="node changed" namespace=dev new.message= new.phase=Running new.progress=0/1 nodeID=test-2372842584 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=test
time="2023-09-13T12:38:48.121Z" level=info msg="TaskSet Reconciliation" namespace=dev workflow=test
time="2023-09-13T12:38:48.121Z" level=info msg=reconcileAgentPod namespace=dev workflow=test
time="2023-09-13T12:38:48.121Z" level=info msg="Workflow to be dehydrated" Workflow Size=2482
time="2023-09-13T12:38:48.141Z" level=info msg="Workflow update successful" namespace=dev phase=Running resourceVersion=93380211 workflow=test
time="2023-09-13T12:38:56.430Z" level=info msg="Received Workflow Controller config map dev/argo-workflows-configmap-c9btt7bg2m update"
time="2023-09-13T12:38:56.435Z" level=info msg="Configuration:\nartifactRepository:\n  archiveLogs: true\n  azure:\n    accountKeySecret:\n      key: storage-account-access-key\n      name: storage-account-credentials\n    container: workflow-artifacts\n    endpoint: https://*******************.blob.core.windows.net\ninitialDelay: 0s\nmetricsConfig: {}\nnamespaceParallelism: 5\nnodeEvents: {}\nparallelism: 10\npodSpecLogStrategy: {}\nresourceRateLimit:\n  burst: 1\n  limit: 10\nsso:\n  clientId:\n    key: \"\"\n  clientSecret:\n    key: \"\"\n  issuer: \"\"\n  redirectUrl: \"\"\n  sessionExpiry: 0s\ntelemetryConfig: {}\n"
time="2023-09-13T12:38:56.435Z" level=info msg="Persistence configuration disabled"
time="2023-09-13T12:38:56.435Z" level=info executorImage="quay.io/argoproj/argoexec:v3.4.11" executorImagePullPolicy= managedNamespace=dev
time="2023-09-13T12:38:58.142Z" level=info msg="Processing workflow" namespace=dev workflow=test
time="2023-09-13T12:38:58.142Z" level=info msg="Task-result reconciliation" namespace=dev numObjs=0 workflow=test
time="2023-09-13T12:38:58.142Z" level=info msg="node unchanged" namespace=dev nodeID=test-2372842584 workflow=test
time="2023-09-13T12:38:58.142Z" level=info msg="TaskSet Reconciliation" namespace=dev workflow=test
time="2023-09-13T12:38:58.142Z" level=info msg=reconcileAgentPod namespace=dev workflow=test
time="2023-09-13T12:38:59.505Z" level=info msg="Received Workflow Controller config map dev/argo-workflows-configmap-c9btt7bg2m update"
time="2023-09-13T12:38:59.509Z" level=info msg="Configuration:\nartifactRepository:\n  archiveLogs: true\n  azure:\n    accountKeySecret:\n      key: storage-account-access-key\n      name: storage-account-credentials\n    container: workflow-artifacts\n    endpoint: https://********************.blob.core.windows.net\ninitialDelay: 0s\nmetricsConfig: {}\nnamespaceParallelism: 5\nnodeEvents: {}\nparallelism: 10\npodSpecLogStrategy: {}\nresourceRateLimit:\n  burst: 1\n  limit: 10\nsso:\n  clientId:\n    key: \"\"\n  clientSecret:\n    key: \"\"\n  issuer: \"\"\n  redirectUrl: \"\"\n  sessionExpiry: 0s\ntelemetryConfig: {}\n"
time="2023-09-13T12:38:59.509Z" level=info msg="Persistence configuration disabled"
time="2023-09-13T12:38:59.509Z" level=info executorImage="quay.io/argoproj/argoexec:v3.4.11" executorImagePullPolicy= managedNamespace=dev

Logs from in your workflow’s wait container

time="2023-09-13T12:38:35.970Z" level=info msg="Starting Workflow Executor" version=latest+v3.4.11.dirty
time="2023-09-13T12:38:36.124Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-09-13T12:38:36.124Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=dev podName=test-whoami-template-2372842584 template="{\"name\":\"whoami-template\",\"inputs\":{},\"outputs\":{},\"nodeSelector\":{\"kubernetes.io/os\":\"windows\"},\"metadata\":{\"annotations\":{\"sidecar.istio.io/inject\":\"false\"}},\"container\":{\"name\":\"whoami-container\",\"image\":\"mcr.microsoft.com/windows/servercore:ltsc2022\",\"command\":[\"whoami\"],\"resources\":{},\"securityContext\":{\"capabilities\":{\"drop\":[\"ALL\"]},\"runAsUser\":10001,\"runAsNonRoot\":true,\"allowPrivilegeEscalation\":false,\"seccompProfile\":{\"type\":\"RuntimeDefault\"}}},\"archiveLocation\":{\"archiveLogs\":true,\"azure\":{\"endpoint\":\"https://****************.blob.core.windows.net\",\"container\":\"workflow-artifacts\",\"accountKeySecret\":{\"name\":\"storage-account-credentials\",\"key\":\"storage-account-access-key\"},\"blob\":\"test/test-whoami-template-2372842584\"}},\"tolerations\":[{\"key\":\"windows\",\"operator\":\"Exists\",\"effect\":\"NoSchedule\"}]}" version="&Version{Version:latest+v3.4.11.dirty,BuildDate:2023-09-08T00:11:10Z,GitCommit:v3.4.11,GitTag:unknown,GitTreeState:,GoVersion:go1.20,Compiler:gc,Platform:windows/amd64,}"
time="2023-09-13T12:38:36.124Z" level=info msg="Starting deadline monitor"

About this issue

  • Original URL
  • State: closed
  • Created 10 months ago
  • Reactions: 1
  • Comments: 20 (12 by maintainers)

Commits related to this issue

Most upvoted comments

@boiledfroginthewell yeah, that’s the issue indeed.

I contributed a fix in #11993 which just removes the Wait again for windows. Also I opened #11994 for generally improving Windows CI runs since they don’t seem to be run automatically, currently.

Alright I did some testing, here’s what I found:

Use Case

Argo is deployed on Azure Kubernetes, I am testing with our devtest cluster - currently running Kubernetes version 1.26.6.

We’re using argo as a cron scheduler to run executables that used to be on-prem. The exe’s are pulled from an azure fileshare and executed inside the windows container. The workflows themselves are pretty basic. I know, they should be natively containerized, but we have to work with what we got sometimes lol.

Node OS Versions

Windows 2022 node version - AKSWindows-2022-containerd-20348.1970.230914 Windows 2019 node version - AKSWindows-2019-containerd-17763.4851.230914

Version 3.4.8

  • Windows 2019 image / 2019 node OS - pods schedule, run, complete, and exit
  • Windows 2022 image / 2019 node os - Image pull error argoexec 3.4.8
  • Windows 2019 image / 2022 node os - Error: failed to create containerd task: failed to create shim task: hcs::CreateComputeSystem main: The container operating system does not match the host operating system.: unknown
  • 2022 image / 2022 node os - Image pull error argoexec 3.4.8

Version 3.4.9

  • Win 2019 image / 2019 node os - Failed to pull image "quay.io/argoproj/argoexec:v3.4.9": rpc error: code = NotFound desc = failed to pull and unpack image "quay.io/argoproj/argoexec:v3.4.9": no match for platform in manifest: not found
  • Win 2022 image / 2019 node os - Failed to pull image "quay.io/argoproj/argoexec:v3.4.9": rpc error: code = NotFound desc = failed to pull and unpack image "quay.io/argoproj/argoexec:v3.4.9": no match for platform in manifest: not found
  • Win 2019 image / 2022 node OS - Schedules but doesn’t run exe - time="2023-09-30T05:00:21.780Z" level=info msg="finished streaming call with code Unauthenticated" error="rpc error: code = Unauthenticated desc = token not valid for running mode" grpc.code=Unauthenticated grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2023-09-30T05:00:21Z" grpc.time_ms=0.059 span.kind=server system=grpc error in argo-server pod logs
  • Win 2022 image / 2022 node os - schedules, runs, completes, and exits

Version 3.4.10

  • win 22 image / 2022 node os - schedules, runs, completes but does not exit

It appears that the jump from 3.4.9 to 3.4.10 has the breaking change - what that is I haven’t looked into yet this evening. I feel like all of this info might prove to be useful.

Let me know if you want me to run any more tests - we’re on 3.4.9 for now using 22/22 base images and node os versions and things seem to be running okay. I will keep an eye on our windows based workflows this weekend and report back. I don’t anticipate any issues but I will update this thread regardless.

@cbuchli @terrytangyuan @pizza-prosciutto

I’ll look into that. thanks!

I found Wait() is called twice on Windows; below and https://github.com/argoproj/argo-workflows/pull/11368

https://github.com/argoproj/argo-workflows/blob/b2e3676503999e699cd3aaca62ff7bfff64464af/workflow/executor/os-specific/signal_windows.go#L33-L34

I suspect multiple Wait() call causes this issue. The following script hung on Windows 8.1 running inside VirualBox.

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	cmd := exec.Command("whoami")
	cmd.Start()

	cmd.Process.Wait()
	fmt.Printf("wait 1\n")
	cmd.Wait()
	fmt.Printf("wait 2\n")

	fmt.Printf("done\n")
}

@boiledfroginthewell Do you have any insights maybe?

@Freddybob4244 Yes, this is very useful, thank you very much.