argo-workflows: Successfully finished pods shown as stuck in Pending phase indefinitely in GKE + Kubernetes v1.20
Summary
We had been running reasonably complex Argo workflows without issues for a long time. However, around the time we updated the Kubernetes version to 1.19.10-gke.1000 (running Argo in GKE) we started experiencing frequent problems with workflows getting stuck: a pod that was successfully started by Argo and finished is shown as stuck in the Pending state in Argo, even though we can see from the logs that the main container finished successfully. We have tried the PNS and k8sapi executors, but that did not fix the issue. We have also deleted the Argo namespace and recreated it, and the issue still happens. We updated from Argo 2.x to 3.0.8, to 3.1.1, and to stable 3.0.3, and it still occurred. Currently we are at the latest tag (argoproj/workflow-controller:v3.0.3).
Diagnostics
What Kubernetes provider are you using? We are using GKE with Kubernetes version 1.19.10-gke.1000
What version of Argo Workflows are you running? Tested on multiple versions. Currently running 3.0.3, but it started in a 2.x version and also happened in 3.1.1 and 3.0.8.
What executor are you running? Docker/K8SAPI/Kubelet/PNS/Emissary. It failed with the default (I assume Docker), K8SAPI, and PNS.
Did this work in a previous version? I.e. is it a regression? We are not sure if that was the cause, but 2.6 worked without issues on GKE before we updated Kubernetes to 1.19.
# Simplified failing part of workflow.
Parent's parent
our-workflow-1625018400-118234800:
  boundaryID: our-workflow-1625018400-2940179300
  children:
  - our-workflow-1625018400-1820417972
  - our-workflow-1625018400-1463043834
  displayName: '[2]'
  finishedAt: null
  id: our-workflow-1625018400-118234800
  name: our-workflow-1625018400[2].fill-supervised-product-clusters[0].fill-supervised-product-cluster(19:configs:["things"],products:["params"],supervised-configs:[])[2]
  phase: Running
  progress: 0/2
  startedAt: "2021-06-30T09:34:44Z"
  templateScope: local/our-workflow-1625018400
  type: StepGroup
Parent stuck running
our-workflow-1625018400-1820417972:
  boundaryID: our-workflow-1625018400-2940179300
  children:
  - our-workflow-1625018400-655392735
  displayName: fill-products(0:params)
  finishedAt: null
  id: our-workflow-1625018400-1820417972
  inputs:
    parameters:
    - name: products
      value: send_val
  name: our-workflow-1625018400[2].fill-supervised-product-clusters[0].fill-supervised-product-cluster(19:configs:["things"],products:["params"],supervised-configs:[])[2].fill-products(0:params)
  phase: Running
  progress: 0/1
  startedAt: "2021-06-30T09:34:44Z"
  templateName: fill-products
  templateScope: local/our-workflow-1625018400
  type: Retry
Finished pod stuck pending
our-workflow-1625018400-655392735:
  boundaryID: our-workflow-1625018400-2940179300
  displayName: fill-products(0:params)(0)
  finishedAt: null
  id: our-workflow-1625018400-655392735
  inputs:
    parameters:
    - name: products
      value: send_val
  name: our-workflow-1625018400[2].fill-supervised-product-clusters[0].fill-supervised-product-cluster(19:configs:["things"],products:["params"],supervised-configs:[])[2].fill-products(0:params)(0)
  phase: Pending
  progress: 0/1
  startedAt: "2021-06-30T09:34:44Z"
  templateName: fill-products
  templateScope: local/our-workflow-1625018400
  type: Pod
# Logs from the workflow controller:
# The workflow-controller container occasionally restarts; logs from one of those failures:
# Edit: NOTE: Pods get stuck even if the workflow controller doesn't restart.
time="2021-06-30T08:03:26.894Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=349445621 workflow=workflow-working-fine-1625040000
time="2021-06-30T08:03:26.898Z" level=info msg="SG Outbound nodes of affected-workflow-1625018400-3050482061 are [affected-workflow-1625018400-3894231772]" namespace=default workflow=affected-workflow-1625018400
time="2021-06-30T08:03:26.898Z" level=info msg="SG Outbound nodes of affected-workflow-1625018400-483040858 are [affected-workflow-1625018400-3675982209]" namespace=default workflow=affected-workflow-1625018400
time="2021-06-30T08:03:26.899Z" level=info msg="SG Outbound nodes of affected-workflow-1625018400-744350190 are [affected-workflow-1625018400-952860445]" namespace=default workflow=affected-workflow-1625018400
E0630 08:03:27.386046 1 leaderelection.go:361] Failed to update lock: Put "https://10.47.240.1:443/apis/coordination.k8s.io/v1/namespaces/argo/leases/workflow-controller": context deadline exceeded
I0630 08:03:27.681192 1 leaderelection.go:278] failed to renew lease argo/workflow-controller: timed out waiting for the condition
time="2021-06-30T08:03:28.071Z" level=info msg="Update leases 409"
time="2021-06-30T08:03:29.474Z" level=info msg="Create events 201"
time="2021-06-30T08:03:31.003Z" level=info msg="cleaning up pod" action=labelPodCompleted key=default/workflow-working-fine-1625040000-957740190/labelPodCompleted
E0630 08:03:31.031922 1 leaderelection.go:301] Failed to release lock: Operation cannot be fulfilled on leases.coordination.k8s.io "workflow-controller": the object has been modified; please apply your changes to the latest version and try again
time="2021-06-30T08:03:31.477Z" level=info msg="stopped leading" id=workflow-controller-7c568955d7-bxtdj
E0630 08:03:31.640872 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 78 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1b6eb00, 0x2c9aa80)
/go/pkg/mod/k8s.io/apimachinery@v0.19.6/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/go/pkg/mod/k8s.io/apimachinery@v0.19.6/pkg/util/runtime/runtime.go:48 +0x89
panic(0x1b6eb00, 0x2c9aa80)
/usr/local/go/src/runtime/panic.go:969 +0x1b9
github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run.func2()
/go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:256 +0x7c
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc0006547e0)
/go/pkg/mod/k8s.io/client-go@v0.19.6/tools/leaderelection/leaderelection.go:200 +0x29
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc0006547e0, 0x206c3e0, 0xc0002a48c0)
/go/pkg/mod/k8s.io/client-go@v0.19.6/tools/leaderelection/leaderelection.go:210 +0x15d
k8s.io/client-go/tools/leaderelection.RunOrDie(0x206c3e0, 0xc000522d40, 0x2078780, 0xc00085e000, 0x37e11d600, 0x2540be400, 0x12a05f200, 0xc000cb9c00, 0xc000a6a700, 0xc001096060, ...)
/go/pkg/mod/k8s.io/client-go@v0.19.6/tools/leaderelection/leaderelection.go:222 +0x9c
created by github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run
/go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:241 +0xfb8
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x19cf4fc]
goroutine 78 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/go/pkg/mod/k8s.io/apimachinery@v0.19.6/pkg/util/runtime/runtime.go:55 +0x10c
panic(0x1b6eb00, 0x2c9aa80)
/usr/local/go/src/runtime/panic.go:969 +0x1b9
github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run.func2()
/go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:256 +0x7c
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc0006547e0)
/go/pkg/mod/k8s.io/client-go@v0.19.6/tools/leaderelection/leaderelection.go:200 +0x29
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc0006547e0, 0x206c3e0, 0xc0002a48c0)
/go/pkg/mod/k8s.io/client-go@v0.19.6/tools/leaderelection/leaderelection.go:210 +0x15d
k8s.io/client-go/tools/leaderelection.RunOrDie(0x206c3e0, 0xc000522d40, 0x2078780, 0xc00085e000, 0x37e11d600, 0x2540be400, 0x12a05f200, 0xc000cb9c00, 0xc000a6a700, 0xc001096060, ...)
/go/pkg/mod/k8s.io/client-go@v0.19.6/tools/leaderelection/leaderelection.go:222 +0x9c
created by github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run
/go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:241 +0xfb8
# Other than that, it is acting like this:
time="2021-06-30T11:59:37.960Z" level=info msg="template (node failing-workflow-1625018400-2046183375) active children parallelism reached 3/3" namespace=default workflow=failing-workflow-1625018400
time="2021-06-30T11:59:37.960Z" level=info msg="template (node failing-workflow-1625018400-2046183375) active children parallelism reached 3/3" namespace=default workflow=failing-workflow-1625018400
time="2021-06-30T11:59:37.961Z" level=info msg="template (node failing-workflow-1625018400-2046183375) active children parallelism reached 3/3" namespace=default workflow=failing-workflow-1625018400
time="2021-06-30T11:59:37.962Z" level=info msg="template (node failing-workflow-1625018400-2046183375) active children parallelism reached 3/3" namespace=default workflow=failing-workflow-1625018400
time="2021-06-30T11:59:37.963Z" level=info msg="template (node failing-workflow-1625018400-2046183375) active children parallelism reached 3/3" namespace=default workflow=failing-workflow-1625018400
time="2021-06-30T11:59:37.963Z" level=info msg="template (node failing-workflow-1625018400-2046183375) active children parallelism reached 3/3" namespace=default workflow=failing-workflow-1625018400
time="2021-06-30T11:59:37.964Z" level=info msg="template (node failing-workflow-1625018400-2046183375) active children parallelism reached 3/3" namespace=default workflow=failing-workflow-1625018400
time="2021-06-30T11:59:37.965Z" level=info msg="template (node failing-workflow-1625018400-2046183375) active children parallelism reached 3/3" namespace=default workflow=failing-workflow-1625018400
time="2021-06-30T11:59:37.965Z" level=info msg="template (node failing-workflow-1625018400-2046183375) active children parallelism reached 3/3" namespace=default workflow=failing-workflow-1625018400
time="2021-06-30T11:59:37.966Z" level=info msg="Patch events 200"
time="2021-06-30T11:59:37.966Z" level=info msg="template (node failing-workflow-1625018400-2046183375) active children parallelism reached 3/3" namespace=default workflow=failing-workflow-1625018400
time="2021-06-30T11:59:37.968Z" level=info msg="template (node failing-workflow-1625018400-2046183375) active children parallelism reached 3/3" namespace=default workflow=failing-workflow-1625018400
time="2021-06-30T11:59:37.968Z" level=info msg="template (node failing-workflow-1625018400-2046183375) active children parallelism reached 3/3" namespace=default workflow=failing-workflow-1625018400
time="2021-06-30T11:59:37.969Z" level=info msg="template (node failing-workflow-1625018400-2046183375) active children parallelism reached 3/3" namespace=default workflow=failing-workflow-1625018400
time="2021-06-30T11:59:37.970Z" level=info msg="template (node failing-workflow-1625018400-2046183375) active children parallelism reached 3/3" namespace=default workflow=failing-workflow-1625018400
time="2021-06-30T11:59:37.970Z" level=info msg="Workflow step group node failing-workflow-1625018400-1253377659 not yet completed" namespace=default workflow=failing-workflow-1625018400
time="2021-06-30T11:59:37.970Z" level=info msg="Workflow step group node failing-workflow-1625018400-1369709231 not yet completed" namespace=default workflow=failing-workflow-1625018400
time="2021-06-30T11:59:37.972Z" level=info msg="Patch events 200"
time="2021-06-30T11:59:37.977Z" level=info msg="Patch events 200"
time="2021-06-30T11:59:39.413Z" level=info msg="Get leases 200"
time="2021-06-30T11:59:39.420Z" level=info msg="Update leases 200"
time="2021-06-30T11:59:43.564Z" level=info msg="Enforcing history limit
# Pending pod main logs:
The main container logs everything as expected and finishes.
# Logs from pending workflow's wait container, something like:
time="2021-06-30T09:34:45.973Z" level=info msg="Starting Workflow Executor" version="{v3.0.3 2021-05-11T21:14:20Z 02071057c082cf295ab8da68f1b2027ff8762b5a v3.0.3 clean go1.
15.7 gc linux/amd64}"
time="2021-06-30T09:34:45.979Z" level=info msg="Creating a docker executor"
time="2021-06-30T09:34:45.979Z" level=info msg="Executor (version: v3.0.3, build_date: 2021-05-11T21:14:20Z) initialized (pod: default/failing-workflow-workfl
ow-1625018400-531201185) with template:\n{\"name\":\"fill-products\",\"inputs\":{\"parameters\":[{\"name\":\"products\",\"value\":\"products\"}]},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"i
mage\":\"us.gcr.io/spaceknow-backend/failing-workflow-workflow-base-delivery:287\",\"command\":[\"python3\"],\"args\":[\"deliveries/datacube-fill-products.py\
",\"--products\",\"products\"],\"envF
rom\":[{\"secretRef\":{\"name\":\"failing-workflow-workflow\"}}],\"resources\":{\"limits\":{\"memory\":\"2Gi\"},\"requests\":{\"cpu\":\"2\",\"memory\":\"2Gi\"
}}},\"retryStrategy\":{\"limit\":2,\"retryPolicy\":\"Always\"}}"
time="2021-06-30T09:34:45.979Z" level=info msg="Starting annotations monitor"
time="2021-06-30T09:34:45.979Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=
label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=failing-workflow-workflow-1625018400-531201185"
time="2021-06-30T09:34:45.980Z" level=info msg="Starting deadline monitor"
time="2021-06-30T09:34:46.039Z" level=info msg="mapped container name \"main\" to container ID \"bcd7bcab1bd8ab36553933c1b41cb81deacfbab26a9a440f36360aecef06af6f\" (created
at 2021-06-30 09:34:45 +0000 UTC, status Created)"
time="2021-06-30T09:34:46.039Z" level=info msg="mapped container name \"wait\" to container ID \"34831534a3aefb25a5744dfd102c8060dc2369767cab729b343f3b18d375828e\" (created
at 2021-06-30 09:34:45 +0000 UTC, status Up)"
time="2021-06-30T09:34:46.980Z" level=info msg="docker wait bcd7bcab1bd8ab36553933c1b41cb81deacfbab26a9a440f36360aecef06af6f"
time="2021-06-30T09:35:15.026Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=
label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=failing-workflow-workflow-1625018400-531201185"
time="2021-06-30T09:35:16.066Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=
label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=failing-workflow-workflow-1625018400-531201185"
time="2021-06-30T09:35:26.101Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=
label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=failing-workflow-workflow-1625018400-531201185"
time="2021-06-30T09:35:36.135Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=
label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=failing-workflow-workflow-1625018400-531201185"
time="2021-06-30T09:35:46.168Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=
label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=failing-workflow-workflow-1625018400-531201185"
time="2021-06-30T09:35:56.200Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=
label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=failing-workflow-workflow-1625018400-531201185"
time="2021-06-30T09:36:06.234Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=
label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=failing-workflow-workflow-1625018400-531201185"
time="2021-06-30T09:36:16.266Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=
label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=failing-workflow-workflow-1625018400-531201185"
time="2021-06-30T09:36:26.299Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=
label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=failing-workflow-workflow-1625018400-531201185"
time="2021-06-30T09:36:36.333Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=
label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=failing-workflow-workflow-1625018400-531201185"
time="2021-06-30T09:36:46.372Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=
label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=failing-workflow-workflow-1625018400-531201185"
time="2021-06-30T09:36:56.406Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=
label=io.kubernetes.pod.namespace=default --filter=label=io.kubernetes.pod.name=failing-workflow-workflow-1625018400-531201185"
time="2021-06-30T09:37:04.406Z" level=info msg="Main container completed"
time="2021-06-30T09:37:04.406Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2021-06-30T09:37:04.406Z" level=info msg="Capturing script exit code"
time="2021-06-30T09:37:04.438Z" level=info msg="No output parameters"
time="2021-06-30T09:37:04.438Z" level=info msg="No output artifacts"
time="2021-06-30T09:37:04.438Z" level=info msg="Annotating pod with output"
time="2021-06-30T09:37:04.454Z" level=info msg="Patch pods 200"
time="2021-06-30T09:37:04.464Z" level=info msg="Killing sidecars []"
time="2021-06-30T09:37:04.464Z" level=info msg="Alloc=5435 TotalAlloc=11825 Sys=73041 NumGC=4 Goroutines=10"
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 85
- Comments: 55 (32 by maintainers)
Hello,
We are on AKS 1.20.5 with containerd and the PNS executor, running Argo 3.1, and we experience the same problem. Updating the workflow's status via kubectl also times out with the same message. We hotfixed it by splitting the workflow in two and having a step submit both workflows.
I have updated K8s 1.19 -> 1.20 (the fix doesn't work in 1.19 yet) and have been able to successfully test the atomic map-type fix! It seems to work well with the test example, and the Argo workflow successfully finishes with 1000 nodes. I will have to do further testing in our production run, but I believe this will fix the issue. It might take a while until the proper fix propagates to the Kubernetes versions offered by cloud providers, but this atomic workaround seems to be working as expected.
For anyone experiencing this issue, you can try simply editing the existing Argo Workflow CustomResourceDefinition and adding
x-kubernetes-map-type: atomic
to the spec and status properties in the openAPIV3Schema. @alexec
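To make that concrete, here is a minimal sketch of what the relevant part of the workflows.argoproj.io CRD might look like after the edit. This assumes the stock CRD layout shipped with the Argo manifests; verify the field paths against your installed CRD (for example with kubectl get crd workflows.argoproj.io -o yaml) before applying anything.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: workflows.argoproj.io
spec:
  # ...group, names, scope left unchanged...
  versions:
  - name: v1alpha1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            x-kubernetes-preserve-unknown-fields: true
            x-kubernetes-map-type: atomic   # workaround described above
          status:
            type: object
            x-kubernetes-preserve-unknown-fields: true
            x-kubernetes-map-type: atomic   # workaround described above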
We have a very similar situation that we also started encountering when we went from Kubernetes 1.19 to Kubernetes 1.20.
After an upgrade to Kubernetes 1.20, we hit an issue where Argo workflows with a node fan-out of more than about 100 were not getting their status updated properly. For example, multiple Argo nodes would have the erroneous status of "Pending" in the UI or CLI while they were actively working and logging content, and therefore should have been in a "Running" state. Upon completion, the nodes did not register the correct "Completed" state either, in the UI or in the argo CLI.
We saw this error in the logs for the workflow controller (see below):
time="2021-06-23T14:18:15.798Z" level=warning msg="Error updating workflow: Timeout: request did not complete within requested timeout 34s Timeout" namespace=my-namespace workflow=my-workflow-v4f8t
Large workflows require a large amount of status information to be sent to the Kubernetes API server, to be stored in its datastore, etcd. The Kubernetes endpoint would error on this transaction and fail to record the node statuses, and the Argo workflow controller would then not have the right information on subsequent requests. Without proper node status information, the workflow controller does not advance the workflow to completion.
A workaround has been found, at least temporarily: start the Argo workflow controller with the env var ALWAYS_OFFLOAD_NODE_STATUS set to true. This lets the workflow controller store node status information directly in a Postgres database instead of in etcd via the Kubernetes API.
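For anyone wanting to try the same thing, a minimal sketch of setting that env var on the workflow-controller Deployment. The Deployment and namespace names below are the defaults from the Argo install manifests and may differ in your setup; offloading also requires persistence to be configured in the workflow-controller ConfigMap (e.g. the Postgres settings mentioned above).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  template:
    spec:
      containers:
      - name: workflow-controller
        env:
        # Offload large node status to the configured persistence DB
        # instead of storing it all on the Workflow object in etcd.
        - name: ALWAYS_OFFLOAD_NODE_STATUS
          value: "true"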
However, this change does require that the argo CLI now connect directly to the argo server to get information about workflows in the cluster. But even with a port forward to the argo server and the necessary env vars set, we are getting inconsistent results on the workflow status from the UI and the argo CLI. The argo CLI returns the following error on an argo get:
argo get my-workflow-wt967 -n my-namespace
FATA[0001] rpc error: code = Unknown desc = upper: no more rows in this result set
The workflows do complete now, however, but we can't view them in the UI or get information on the nodes using the CLI.
Aside: We also suspect the activation of FlowSchema in Kubernetes 1.20 as another contributor to the Kube API erroring - https://kubernetes.io/docs/concepts/cluster-administration/flow-control/ - since the "API Priority and Fairness" feature is now always on. Because we are on GKE, we cannot turn that feature back off. Creating a FlowSchema that exempts the Argo service account from throttling seemed to have some positive impact (fewer Kube API server errors with large workflows) but did not fully resolve the issue.
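For reference, a rough sketch of the kind of FlowSchema meant here. The service account name and namespace are placeholders for whatever your workflow controller actually runs as, and you may prefer a high (non-exempt) priority level such as workload-high instead of exempt.
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: FlowSchema
metadata:
  name: argo-workflow-controller
spec:
  # Route requests from the Argo controller's service account to a priority
  # level that is not throttled by API Priority and Fairness.
  priorityLevelConfiguration:
    name: exempt            # or e.g. workload-high
  matchingPrecedence: 500
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: argo          # placeholder: controller's service account
        namespace: argo     # placeholder: namespace where Argo is installed
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      clusterScope: true
      namespaces: ["*"]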
UPDATE: More information on inconsistent error
FATA[0001] rpc error: code = Unknown desc = upper: no more rows in this result set
This seems to only happen while the workflow is not complete. Once complete, we are able to view the workflow in the UI and run
argo get
against the workflow using the argo CLI (with the appropriate env vars set and port forwarding).
yes. Same same.
I'd expect we'd back-port the work-around to v3.1.3.
You'll also need to update your argoexec image. That will fix the error.