argo-workflows: Invalid node IDs in workflow.status
Pre-requisites
- I have double-checked my configuration
- I can confirm the issue exists when I tested with :latest
- I’d like to contribute the fix myself (see contributing guide)
What happened/what you expected to happen?
Since #8748, pod names use the v2 naming format, which includes the template name in the pod name. However, the implementation did not update the Workflow.Status.Nodes map to reflect the new pod names, so there is now a disconnect between node IDs and pod names that did not exist before. This makes it impossible to look at an Argo workflow status, take the node ID, and use it to know which pod it belongs to.
This is a follow-up of #9906. I initially thought this would be the same case, but apparently it is not.
Submitting the reproduction workflow (see below) results in the following Workflow object:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  annotations:
    workflows.argoproj.io/pod-name-format: v2
  creationTimestamp: "2022-11-25T11:33:41Z"
  generateName: nodename-
  generation: 3
  labels:
    workflows.argoproj.io/phase: Running
  name: nodename-bvd45
  namespace: argo
  resourceVersion: "15649"
  uid: ea233eef-210d-4394-a238-ef847b104458
spec:
  activeDeadlineSeconds: 300
  arguments: {}
  entrypoint: render
  podSpecPatch: |
    terminationGracePeriodSeconds: 3
  templates:
  - inputs: {}
    metadata: {}
    name: render
    outputs: {}
    steps:
    - - arguments:
          parameters:
          - name: frames
            value: '{{item.frames}}'
        name: run-blender
        template: blender
        withItems:
        - frames: 1
  - container:
      args:
      - /argosay echo 0/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s && /argosay echo 50/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s
      command:
      - /bin/sh
      - -c
      image: argoproj/argosay:v2
      name: ""
      resources: {}
    inputs:
      parameters:
      - name: frames
    metadata: {}
    name: blender
    outputs: {}
    retryStrategy:
      limit: 2
      retryPolicy: Always
status:
  artifactGCStatus:
    notSpecified: true
  artifactRepositoryRef:
    artifactRepository:
      archiveLogs: true
      s3:
        accessKeySecret:
          key: accesskey
          name: my-minio-cred
        bucket: my-bucket
        endpoint: minio:9000
        insecure: true
        secretKeySecret:
          key: secretkey
          name: my-minio-cred
    configMap: artifact-repositories
    key: default-v1
    namespace: argo
  conditions:
  - status: "False"
    type: PodRunning
  finishedAt: null
  nodes:
    nodename-bvd45:
      children:
      - nodename-bvd45-701773242
      displayName: nodename-bvd45
      finishedAt: null
      id: nodename-bvd45
      name: nodename-bvd45
      phase: Running
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateName: render
      templateScope: local/nodename-bvd45
      type: Steps
    nodename-bvd45-701773242:
      boundaryID: nodename-bvd45
      children:
      - nodename-bvd45-3728066428
      displayName: '[0]'
      finishedAt: null
      id: nodename-bvd45-701773242
      name: nodename-bvd45[0]
      phase: Running
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateScope: local/nodename-bvd45
      type: StepGroup
    nodename-bvd45-3728066428:
      boundaryID: nodename-bvd45
      children:
      - nodename-bvd45-3928099255
      displayName: run-blender(0:frames:1)
      finishedAt: null
      id: nodename-bvd45-3728066428
      inputs:
        parameters:
        - name: frames
          value: "1"
      name: nodename-bvd45[0].run-blender(0:frames:1)
      phase: Running
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateName: blender
      templateScope: local/nodename-bvd45
      type: Retry
    nodename-bvd45-3928099255:
      boundaryID: nodename-bvd45
      displayName: run-blender(0:frames:1)(0)
      finishedAt: null
      hostNodeName: k3d-argowf-server-0
      id: nodename-bvd45-3928099255
      inputs:
        parameters:
        - name: frames
          value: "1"
      message: PodInitializing
      name: nodename-bvd45[0].run-blender(0:frames:1)(0)
      phase: Pending
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateName: blender
      templateScope: local/nodename-bvd45
      type: Pod
  phase: Running
  progress: 0/1
  startedAt: "2022-11-25T11:33:41Z"
The pod to run is named nodename-bvd45-blender-3928099255, but the node ID in workflow.status.nodes is just nodename-bvd45-3928099255.
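Based purely on the pattern visible in this status (node ID = workflow name plus hash suffix, pod name = workflow name plus template name plus the same suffix), here is a minimal Go sketch of how a client could reconstruct the expected v2 pod name from a node entry. podNameV2ForNode is a hypothetical helper, not an Argo API, and it ignores the name-length truncation the real controller performs:

```go
package main

import (
	"fmt"
	"strings"
)

// podNameV2ForNode reconstructs the expected v2 pod name for a "type: Pod"
// node, based only on the pattern visible in this issue:
//   node ID  = <workflow-name>-<hash>
//   pod name = <workflow-name>-<template-name>-<hash>
// The real controller also truncates names that exceed the Kubernetes length
// limit, which this sketch deliberately ignores.
func podNameV2ForNode(workflowName, templateName, nodeID string) string {
	if nodeID == workflowName || templateName == "" {
		// The root node (and nodes without a template) keep the plain ID.
		return nodeID
	}
	suffix := strings.TrimPrefix(nodeID, workflowName+"-")
	return fmt.Sprintf("%s-%s-%s", workflowName, templateName, suffix)
}

func main() {
	// Values taken from the workflow.status shown above.
	fmt.Println(podNameV2ForNode("nodename-bvd45", "blender", "nodename-bvd45-3928099255"))
	// Prints: nodename-bvd45-blender-3928099255
}
```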
Version
latest
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: nodename-
spec:
  arguments: {}
  entrypoint: render
  templates:
  - inputs: {}
    metadata: {}
    name: render
    steps:
    - - arguments:
          parameters:
          - name: frames
            value: '{{item.frames}}'
        name: run-blender
        template: blender
        withItems:
        - frames: 1
  - container:
      image: argoproj/argosay:v2
      command: ["/bin/sh", "-c"]
      args:
      - /argosay echo 0/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s && /argosay echo 50/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s
      name: ""
    inputs:
      parameters:
      - name: frames
    name: blender
    retryStrategy:
      limit: 2
      retryPolicy: Always
Logs from the workflow controller
irrelevant
Logs from your workflow's wait container
irrelevant
About this issue
- Original URL
- State: open
- Created 2 years ago
- Reactions: 5
- Comments: 22 (13 by maintainers)
Let’s reopen this since this blocks upgrade for KFP.
+1 on what @mweibel said. This describes the issue precisely. This regression is blocking us from upgrading to v3.4.
@alexec We have been relying on the node ID being equal to the pod name for the past few years, until we recently tried to upgrade from v3.3.8 to v3.4.7 and hit this issue. So this is a breaking change at minimum. Is there any reason why we can't or shouldn't use the pod name as the node ID?
Also, if I'm not mistaken, workflow.status contains the pod name only in the failure case. If the node ID is different from the pod name, how can we get the pod name in the non-failure case?
Thanks in advance!
We use Argo heavily at ZG, and when upgrading to v3.4 of Argo Workflows we noticed breaking changes this causes beyond the nodeId in the workflow. The v2 naming convention also breaks the upstream Kubernetes HOSTNAME env variable for the pod. For instance, when we take the HOSTNAME in the workflow pod and make the Kubernetes API call get_namespace_pod with that HOSTNAME, the pod name returned by the API server does not match the pod name in the actual pod metadata.name. Not sure if there is some weird concatenation going on that is not persisting to etcd, but the downward API does not match what is in etcd. I reverted the env var POD_NAMES to v1 and everything works in v3.4. I feel that, with all these bugs, the v2 pod naming should be reverted, because the scope of breaking changes extends beyond Argo itself and into Kubernetes.
Related to #10267.
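For anyone who needs to map a pod name back to a node entry while staying on v2 names, a possible client-side workaround is to strip the template-name segment out of the pod name again to recover the node ID. nodeIDForPodV2 below is a hypothetical helper, not part of Argo, and it assumes the workflow-template-hash pattern shown earlier in this issue plus knowledge of which template the pod ran:

```go
package main

import (
	"fmt"
	"strings"
)

// nodeIDForPodV2 undoes the v2 naming for a "type: Pod" node by removing the
// template-name segment: <workflow>-<template>-<hash> -> <workflow>-<hash>.
// If the pod name does not match that pattern (e.g. v1 names), it is returned
// unchanged.
func nodeIDForPodV2(workflowName, templateName, podName string) string {
	prefix := workflowName + "-" + templateName + "-"
	if !strings.HasPrefix(podName, prefix) {
		return podName
	}
	return workflowName + "-" + strings.TrimPrefix(podName, prefix)
}

func main() {
	// Example using the pod name from earlier in this issue.
	fmt.Println(nodeIDForPodV2("nodename-bvd45", "blender", "nodename-bvd45-blender-3928099255"))
	// Prints: nodename-bvd45-3928099255
}
```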
@alexec #10124 done 😃
Ah. That is a bug. Can you raise a separate issue?
Node IDs are opaque identifiers, not pod names. Not a bug.