argo-workflows: Windows-based pod runs indefinitely
Pre-requisites
- I have double-checked my configuration
- I can confirm the issue exists when I tested with :latest
- I’d like to contribute the fix myself (see contributing guide)
What happened/what you expected to happen?
When I submit a workflow with a single task that runs in a windows/servercore:ltsc2022 container, the pod is stuck in Running phase indefinitely, although the command in the container succeeds.
Before we updated from v3.4.8, we used ltsc2019 images without problems. The Windows nodes were upgraded as well; they have the label node.kubernetes.io/windows-build=10.0.20348.
(The snippet below is an extract from our workflow, modified to hide sensitive data.)
Version
v3.4.11
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don’t enter a workflow that uses private images.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: test
  namespace: dev
spec:
  templates:
    - name: whoami-template
      nodeSelector:
        kubernetes.io/os: windows
      metadata:
        annotations:
          sidecar.istio.io/inject: 'false'
      container:
        name: whoami-container
        image: >-
          mcr.microsoft.com/windows/servercore:ltsc2022
        command:
          - whoami
        resources: {}
        securityContext:
          capabilities:
            drop:
              - ALL
          runAsUser: 10001
          runAsNonRoot: true
          allowPrivilegeEscalation: false
          seccompProfile:
            type: RuntimeDefault
      tolerations:
        - key: windows
          operator: Exists
          effect: NoSchedule
    - name: main
      dag:
        tasks:
          - name: whoami
            template: whoami-template
  entrypoint: main
  arguments: {}
  serviceAccountName: workflow-execution
  podGC:
    strategy: OnWorkflowCompletion
  securityContext:
    seccompProfile:
      type: RuntimeDefault
Logs from the workflow controller
time="2023-09-13T12:38:16.903Z" level=info msg="Processing workflow" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"
time="2023-09-13T12:38:16.909Z" level=info msg="resolved artifact repository" artifactRepositoryRef=default-artifact-repository
time="2023-09-13T12:38:16.909Z" level=info msg="Updated phase -> Running" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=info msg="DAG node test initialized Running" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=warning msg="was unable to obtain the node for test-2372842584, taskName whoami"
time="2023-09-13T12:38:16.909Z" level=warning msg="was unable to obtain the node for test-2372842584, taskName whoami"
time="2023-09-13T12:38:16.909Z" level=info msg="All of node test.whoami dependencies [] completed" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=info msg="Pod node test-2372842584 initialized Pending" namespace=dev workflow=test
time="2023-09-13T12:38:16.984Z" level=info msg="Created pod: test.whoami (test-whoami-template-2372842584)" namespace=dev workflow=test
time="2023-09-13T12:38:16.984Z" level=info msg="TaskSet Reconciliation" namespace=dev workflow=test
time="2023-09-13T12:38:16.984Z" level=info msg=reconcileAgentPod namespace=dev workflow=test
time="2023-09-13T12:38:16.984Z" level=info msg="Workflow to be dehydrated" Workflow Size=2213
time="2023-09-13T12:38:17.002Z" level=info msg="Workflow update successful" namespace=dev phase=Running resourceVersion=93379873 workflow=test
time="2023-09-13T12:38:26.904Z" level=info msg="Processing workflow" namespace=dev workflow=test
time="2023-09-13T12:38:26.904Z" level=info msg="Task-result reconciliation" namespace=dev numObjs=0 workflow=test
time="2023-09-13T12:38:26.904Z" level=info msg="node changed" namespace=dev new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=test-2372842584 old.message= old.phase=Pending old.progress=0/1 workflow=test
time="2023-09-13T12:38:26.905Z" level=info msg="TaskSet Reconciliation" namespace=dev workflow=test
time="2023-09-13T12:38:26.905Z" level=info msg=reconcileAgentPod namespace=dev workflow=test
time="2023-09-13T12:38:26.905Z" level=info msg="Workflow to be dehydrated" Workflow Size=2498
time="2023-09-13T12:38:26.928Z" level=info msg="Workflow update successful" namespace=dev phase=Running resourceVersion=93379981 workflow=test
time="2023-09-13T12:38:36.929Z" level=info msg="Processing workflow" namespace=dev workflow=test
time="2023-09-13T12:38:36.929Z" level=info msg="Task-result reconciliation" namespace=dev numObjs=0 workflow=test
time="2023-09-13T12:38:36.929Z" level=info msg="node unchanged" namespace=dev nodeID=test-2372842584 workflow=test
time="2023-09-13T12:38:36.930Z" level=info msg="TaskSet Reconciliation" namespace=dev workflow=test
time="2023-09-13T12:38:36.930Z" level=info msg=reconcileAgentPod namespace=dev workflow=test
time="2023-09-13T12:38:48.121Z" level=info msg="Processing workflow" namespace=dev workflow=test
time="2023-09-13T12:38:48.121Z" level=info msg="Task-result reconciliation" namespace=dev numObjs=0 workflow=test
time="2023-09-13T12:38:48.121Z" level=info msg="node changed" namespace=dev new.message= new.phase=Running new.progress=0/1 nodeID=test-2372842584 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=test
time="2023-09-13T12:38:48.121Z" level=info msg="TaskSet Reconciliation" namespace=dev workflow=test
time="2023-09-13T12:38:48.121Z" level=info msg=reconcileAgentPod namespace=dev workflow=test
time="2023-09-13T12:38:48.121Z" level=info msg="Workflow to be dehydrated" Workflow Size=2482
time="2023-09-13T12:38:48.141Z" level=info msg="Workflow update successful" namespace=dev phase=Running resourceVersion=93380211 workflow=test
time="2023-09-13T12:38:56.430Z" level=info msg="Received Workflow Controller config map dev/argo-workflows-configmap-c9btt7bg2m update"
time="2023-09-13T12:38:56.435Z" level=info msg="Configuration:\nartifactRepository:\n archiveLogs: true\n azure:\n accountKeySecret:\n key: storage-account-access-key\n name: storage-account-credentials\n container: workflow-artifacts\n endpoint: https://*******************.blob.core.windows.net\ninitialDelay: 0s\nmetricsConfig: {}\nnamespaceParallelism: 5\nnodeEvents: {}\nparallelism: 10\npodSpecLogStrategy: {}\nresourceRateLimit:\n burst: 1\n limit: 10\nsso:\n clientId:\n key: \"\"\n clientSecret:\n key: \"\"\n issuer: \"\"\n redirectUrl: \"\"\n sessionExpiry: 0s\ntelemetryConfig: {}\n"
time="2023-09-13T12:38:56.435Z" level=info msg="Persistence configuration disabled"
time="2023-09-13T12:38:56.435Z" level=info executorImage="quay.io/argoproj/argoexec:v3.4.11" executorImagePullPolicy= managedNamespace=dev
time="2023-09-13T12:38:58.142Z" level=info msg="Processing workflow" namespace=dev workflow=test
time="2023-09-13T12:38:58.142Z" level=info msg="Task-result reconciliation" namespace=dev numObjs=0 workflow=test
time="2023-09-13T12:38:58.142Z" level=info msg="node unchanged" namespace=dev nodeID=test-2372842584 workflow=test
time="2023-09-13T12:38:58.142Z" level=info msg="TaskSet Reconciliation" namespace=dev workflow=test
time="2023-09-13T12:38:58.142Z" level=info msg=reconcileAgentPod namespace=dev workflow=test
time="2023-09-13T12:38:59.505Z" level=info msg="Received Workflow Controller config map dev/argo-workflows-configmap-c9btt7bg2m update"
time="2023-09-13T12:38:59.509Z" level=info msg="Configuration:\nartifactRepository:\n archiveLogs: true\n azure:\n accountKeySecret:\n key: storage-account-access-key\n name: storage-account-credentials\n container: workflow-artifacts\n endpoint: https://********************.blob.core.windows.net\ninitialDelay: 0s\nmetricsConfig: {}\nnamespaceParallelism: 5\nnodeEvents: {}\nparallelism: 10\npodSpecLogStrategy: {}\nresourceRateLimit:\n burst: 1\n limit: 10\nsso:\n clientId:\n key: \"\"\n clientSecret:\n key: \"\"\n issuer: \"\"\n redirectUrl: \"\"\n sessionExpiry: 0s\ntelemetryConfig: {}\n"
time="2023-09-13T12:38:59.509Z" level=info msg="Persistence configuration disabled"
time="2023-09-13T12:38:59.509Z" level=info executorImage="quay.io/argoproj/argoexec:v3.4.11" executorImagePullPolicy= managedNamespace=dev
Logs from your workflow’s wait container
time="2023-09-13T12:38:35.970Z" level=info msg="Starting Workflow Executor" version=latest+v3.4.11.dirty
time="2023-09-13T12:38:36.124Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-09-13T12:38:36.124Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=dev podName=test-whoami-template-2372842584 template="{\"name\":\"whoami-template\",\"inputs\":{},\"outputs\":{},\"nodeSelector\":{\"kubernetes.io/os\":\"windows\"},\"metadata\":{\"annotations\":{\"sidecar.istio.io/inject\":\"false\"}},\"container\":{\"name\":\"whoami-container\",\"image\":\"mcr.microsoft.com/windows/servercore:ltsc2022\",\"command\":[\"whoami\"],\"resources\":{},\"securityContext\":{\"capabilities\":{\"drop\":[\"ALL\"]},\"runAsUser\":10001,\"runAsNonRoot\":true,\"allowPrivilegeEscalation\":false,\"seccompProfile\":{\"type\":\"RuntimeDefault\"}}},\"archiveLocation\":{\"archiveLogs\":true,\"azure\":{\"endpoint\":\"https://****************.blob.core.windows.net\",\"container\":\"workflow-artifacts\",\"accountKeySecret\":{\"name\":\"storage-account-credentials\",\"key\":\"storage-account-access-key\"},\"blob\":\"test/test-whoami-template-2372842584\"}},\"tolerations\":[{\"key\":\"windows\",\"operator\":\"Exists\",\"effect\":\"NoSchedule\"}]}" version="&Version{Version:latest+v3.4.11.dirty,BuildDate:2023-09-08T00:11:10Z,GitCommit:v3.4.11,GitTag:unknown,GitTreeState:,GoVersion:go1.20,Compiler:gc,Platform:windows/amd64,}"
time="2023-09-13T12:38:36.124Z" level=info msg="Starting deadline monitor"
About this issue
- State: closed
- Created 10 months ago
- Reactions: 1
- Comments: 20 (12 by maintainers)
Commits related to this issue
- fix(windows): prevent infinite run. Fixes #11810 windows based workflows already call .Wait() in signal_windows.go, calling it twice will result in not exiting at all. Unix based workflows prevent t... — committed to helio/argo-workflows by mweibel 9 months ago
- fix(windows): prevent infinite run. Fixes #11810 (#11993) Signed-off-by: Michael Weibel <michael@helio.exchange> — committed to argoproj/argo-workflows by mweibel 9 months ago
@boiledfroginthewell yeah, that’s the issue indeed.
I contributed a fix in #11993, which simply removes the extra Wait on Windows. I also opened #11994 to improve Windows CI runs in general, since they currently don’t seem to run automatically.
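For readers following the thread, here is a minimal Go sketch of one defensive pattern against a double wait: route every wait through sync.Once so the underlying Wait runs exactly once and later callers get the cached result. This is not the change made in #11993, which removed the duplicate call instead. The names waiter, onceWaiter and fakeProcess are hypothetical and do not exist in Argo Workflows; fakeProcess only simulates a handle that misbehaves when waited on twice.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// waiter is a hypothetical stand-in for anything with a blocking Wait,
// such as a process handle. It is not an Argo Workflows type.
type waiter interface {
	Wait() error
}

// onceWaiter forwards the first Wait call to the wrapped waiter and
// returns the cached result on every later call.
type onceWaiter struct {
	inner waiter
	once  sync.Once
	err   error
}

func (w *onceWaiter) Wait() error {
	w.once.Do(func() { w.err = w.inner.Wait() })
	return w.err
}

// fakeProcess counts Wait invocations and fails from the second call on,
// simulating a handle that must only be waited on once.
type fakeProcess struct {
	calls int
}

func (p *fakeProcess) Wait() error {
	p.calls++
	if p.calls > 1 {
		return errors.New("wait called more than once")
	}
	return nil
}

func main() {
	p := &fakeProcess{}
	w := &onceWaiter{inner: p}

	// Both calls succeed because only the first one reaches fakeProcess.
	fmt.Println(w.Wait(), w.Wait(), p.calls)
}
```

The wrapper trades a small amount of state for safety when several code paths may each try to wait; removing the redundant call, as the fix does, is simpler when the call sites are known.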
Might be related to https://github.com/argoproj/argo-workflows/commit/1bcdba4295125812cc27c0fed5ad831472988597 cc @cbuchli
Alright I did some testing, here’s what I found:
Use Case
Argo is deployed on Azure Kubernetes, I am testing with our devtest cluster - currently running Kubernetes version 1.26.6.
We’re using argo as a cron scheduler to run executables that used to be on-prem. The exe’s are pulled from an azure fileshare and executed inside the windows container. The workflows themselves are pretty basic. I know, they should be natively containerized, but we have to work with what we got sometimes lol.
Node OS Versions
Windows 2022 node version - AKSWindows-2022-containerd-20348.1970.230914
Windows 2019 node version - AKSWindows-2019-containerd-17763.4851.230914
Version 3.4.8
pods schedule, run, complete, and exit
Image pull error argoexec 3.4.8
Error: failed to create containerd task: failed to create shim task: hcs::CreateComputeSystem main: The container operating system does not match the host operating system.: unknown
Image pull error argoexec 3.4.8
Version 3.4.9
Failed to pull image "quay.io/argoproj/argoexec:v3.4.9": rpc error: code = NotFound desc = failed to pull and unpack image "quay.io/argoproj/argoexec:v3.4.9": no match for platform in manifest: not found
Failed to pull image "quay.io/argoproj/argoexec:v3.4.9": rpc error: code = NotFound desc = failed to pull and unpack image "quay.io/argoproj/argoexec:v3.4.9": no match for platform in manifest: not found
Schedules but doesn’t run the exe; the argo-server pod logs show this error: time="2023-09-30T05:00:21.780Z" level=info msg="finished streaming call with code Unauthenticated" error="rpc error: code = Unauthenticated desc = token not valid for running mode" grpc.code=Unauthenticated grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2023-09-30T05:00:21Z" grpc.time_ms=0.059 span.kind=server system=grpc
schedules, runs, completes, and exits
Version 3.4.10
schedules, runs, completes but does not exit
It appears that the jump from 3.4.9 to 3.4.10 introduced the breaking change - what that is I haven’t looked into yet this evening. I feel like all of this info might prove to be useful.
Let me know if you want me to run any more tests - we’re on 3.4.9 for now using 22/22 base images and node os versions and things seem to be running okay. I will keep an eye on our windows based workflows this weekend and report back. I don’t anticipate any issues but I will update this thread regardless.
@cbuchli @terrytangyuan @pizza-prosciutto
I’ll look into that. thanks!
I found Wait() is called twice on Windows: in https://github.com/argoproj/argo-workflows/pull/11368 and in https://github.com/argoproj/argo-workflows/blob/b2e3676503999e699cd3aaca62ff7bfff64464af/workflow/executor/os-specific/signal_windows.go#L33-L34
I suspect the multiple Wait() calls cause this issue. The following script hung on Windows 8.1 running inside VirtualBox.
@boiledfroginthewell Do you have any insights maybe?
@Freddybob4244 Yes, this is very useful, thank you very much.