argo-workflows: Invalid node IDs in workflow.status

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I’d like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

Since #8748, pod names use the v2 naming scheme, which includes the template name in the pod name. However, the implementation did not update the Workflow.Status.Nodes map to reflect the new pod names. This creates a disconnect between node IDs and pod names that did not exist before: it is no longer possible to look at an Argo workflow status, take a node ID, and know which pod it belongs to.

This is a follow-up to #9906. I initially thought it was the same issue, but apparently it is not.
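For context, the node IDs that show up in workflow.status.nodes below appear to be the workflow name plus a 32-bit FNV-1a hash of the fully qualified node name. A minimal sketch under that assumption (nodeIDSketch is illustrative, not the controller's actual code):

package main

import (
	"fmt"
	"hash/fnv"
)

// nodeIDSketch is a hypothetical helper: workflow name plus an FNV-1a 32-bit
// hash of the fully qualified node name; the root node reuses the workflow name.
func nodeIDSketch(workflowName, nodeName string) string {
	if nodeName == workflowName {
		return workflowName
	}
	h := fnv.New32a()
	_, _ = h.Write([]byte(nodeName))
	return fmt.Sprintf("%s-%v", workflowName, h.Sum32())
}

func main() {
	// If the assumption holds, this prints the ID of the Pod node in the
	// status dump below: nodename-bvd45-3928099255.
	fmt.Println(nodeIDSketch("nodename-bvd45", "nodename-bvd45[0].run-blender(0:frames:1)(0)"))
}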

The reproduction workflow pasted below produces the following Workflow object and status:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  annotations:
    workflows.argoproj.io/pod-name-format: v2
  creationTimestamp: "2022-11-25T11:33:41Z"
  generateName: nodename-
  generation: 3
  labels:
    workflows.argoproj.io/phase: Running
  name: nodename-bvd45
  namespace: argo
  resourceVersion: "15649"
  uid: ea233eef-210d-4394-a238-ef847b104458
spec:
  activeDeadlineSeconds: 300
  arguments: {}
  entrypoint: render
  podSpecPatch: |
    terminationGracePeriodSeconds: 3
  templates:
  - inputs: {}
    metadata: {}
    name: render
    outputs: {}
    steps:
    - - arguments:
          parameters:
          - name: frames
            value: '{{item.frames}}'
        name: run-blender
        template: blender
        withItems:
        - frames: 1
  - container:
      args:
      - /argosay echo 0/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s && /argosay
        echo 50/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s
      command:
      - /bin/sh
      - -c
      image: argoproj/argosay:v2
      name: ""
      resources: {}
    inputs:
      parameters:
      - name: frames
    metadata: {}
    name: blender
    outputs: {}
    retryStrategy:
      limit: 2
      retryPolicy: Always
status:
  artifactGCStatus:
    notSpecified: true
  artifactRepositoryRef:
    artifactRepository:
      archiveLogs: true
      s3:
        accessKeySecret:
          key: accesskey
          name: my-minio-cred
        bucket: my-bucket
        endpoint: minio:9000
        insecure: true
        secretKeySecret:
          key: secretkey
          name: my-minio-cred
    configMap: artifact-repositories
    key: default-v1
    namespace: argo
  conditions:
  - status: "False"
    type: PodRunning
  finishedAt: null
  nodes:
    nodename-bvd45:
      children:
      - nodename-bvd45-701773242
      displayName: nodename-bvd45
      finishedAt: null
      id: nodename-bvd45
      name: nodename-bvd45
      phase: Running
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateName: render
      templateScope: local/nodename-bvd45
      type: Steps
    nodename-bvd45-701773242:
      boundaryID: nodename-bvd45
      children:
      - nodename-bvd45-3728066428
      displayName: '[0]'
      finishedAt: null
      id: nodename-bvd45-701773242
      name: nodename-bvd45[0]
      phase: Running
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateScope: local/nodename-bvd45
      type: StepGroup
    nodename-bvd45-3728066428:
      boundaryID: nodename-bvd45
      children:
      - nodename-bvd45-3928099255
      displayName: run-blender(0:frames:1)
      finishedAt: null
      id: nodename-bvd45-3728066428
      inputs:
        parameters:
        - name: frames
          value: "1"
      name: nodename-bvd45[0].run-blender(0:frames:1)
      phase: Running
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateName: blender
      templateScope: local/nodename-bvd45
      type: Retry
    nodename-bvd45-3928099255:
      boundaryID: nodename-bvd45
      displayName: run-blender(0:frames:1)(0)
      finishedAt: null
      hostNodeName: k3d-argowf-server-0
      id: nodename-bvd45-3928099255
      inputs:
        parameters:
        - name: frames
          value: "1"
      message: PodInitializing
      name: nodename-bvd45[0].run-blender(0:frames:1)(0)
      phase: Pending
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateName: blender
      templateScope: local/nodename-bvd45
      type: Pod
  phase: Running
  progress: 0/1
  startedAt: "2022-11-25T11:33:41Z"

The pod that runs is named nodename-bvd45-blender-3928099255, but the node ID in workflow.status.nodes is just nodename-bvd45-3928099255.
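In other words, the two identifiers seem to share the same hash suffix and differ only by the template-name segment. A rough sketch of the v2-style pod name under that assumption (podNameV2Sketch is illustrative; the real implementation also truncates over-long prefixes, which is omitted here):

package main

import (
	"fmt"
	"hash/fnv"
)

// podNameV2Sketch assembles a v2-style pod name from the workflow name, the
// template name, and the same hash suffix used for the node ID.
func podNameV2Sketch(workflowName, templateName, nodeName string) string {
	h := fnv.New32a()
	_, _ = h.Write([]byte(nodeName))
	return fmt.Sprintf("%s-%s-%v", workflowName, templateName, h.Sum32())
}

func main() {
	// Expected to print nodename-bvd45-blender-3928099255, while the node ID
	// in workflow.status.nodes stays nodename-bvd45-3928099255.
	fmt.Println(podNameV2Sketch("nodename-bvd45", "blender", "nodename-bvd45[0].run-blender(0:frames:1)(0)"))
}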

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don’t enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: nodename-
spec:
  arguments: {}
  entrypoint: render
  templates:
    - inputs: {}
      metadata: {}
      name: render
      steps:
        - - arguments:
              parameters:
                - name: frames
                  value: '{{item.frames}}'
            name: run-blender
            template: blender
            withItems:
              - frames: 1
    - container:
        image: argoproj/argosay:v2
        command: ["/bin/sh", "-c"]
        args:
          - /argosay echo 0/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s && /argosay echo 50/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s
        name: ""
      inputs:
        parameters:
          - name: frames
      name: blender
      retryStrategy:
        limit: 2
        retryPolicy: Always

Logs from the workflow controller

irrelevant

Logs from in your workflow’s wait container

irrelevant

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Reactions: 5
  • Comments: 22 (13 by maintainers)

Most upvoted comments

Let’s reopen this since this blocks upgrade for KFP.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

Since https://github.com/argoproj/argo-workflows/pull/8748, pod names use the v2 naming scheme, which includes the template name in the pod name. However, the implementation did not update the Workflow.Status.Nodes map to reflect the new pod names. This creates a disconnect between node IDs and pod names that did not exist before: it is no longer possible to look at an Argo workflow status, take a node ID, and know which pod it belongs to.

+1 on what @mweibel said. This describes the issue precisely. This regression is blocking us from upgrading to v3.4.

Node IDs are opaque identifiers, not pod names. Not a bug.

@alexec We have been relying on the node ID being equal to the pod name for the past few years, until we recently tried to upgrade from v3.3.8 to v3.4.7 and hit this issue. So this is a breaking change at minimum. Is there any reason why we can’t or shouldn’t use the pod name as the node ID?

Also, if I’m not mistaken, workflow.status contains the pod name only in failure cases. If the node ID is different from the pod name, how can we get the pod name in the non-failure case?

Thanks in advance!
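Assuming the hash suffix of the node ID matches the pod-name suffix, as in the example above, one rough workaround is to splice the node’s templateName into its ID. podNameFromStatusNode below is only an illustration, not an official API; it ignores the root node, onExit nodes, and the prefix truncation applied to very long names.

package main

import (
	"fmt"
	"strings"
)

// podNameFromStatusNode is a hypothetical helper that rebuilds a v2-style pod
// name from fields already present in workflow.status.nodes: it takes the hash
// suffix of the node ID and inserts the node's templateName after the workflow
// name.
func podNameFromStatusNode(workflowName, nodeID, templateName string) string {
	hash := nodeID[strings.LastIndex(nodeID, "-")+1:]
	return fmt.Sprintf("%s-%s-%s", workflowName, templateName, hash)
}

func main() {
	// Values taken from the Pod node in the status above.
	fmt.Println(podNameFromStatusNode("nodename-bvd45", "nodename-bvd45-3928099255", "blender"))
	// -> nodename-bvd45-blender-3928099255
}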

We use Argo heavily at ZG, and when upgrading to v3.4 of Argo Workflows we noticed breaking changes this causes beyond the nodeId in the workflow. The v2 naming convention also breaks the upstream Kubernetes HOSTNAME env variable for the pod. For instance, when we get the HOSTNAME in the workflow pod and run the Kubernetes API call get_namespace_pod with that HOSTNAME, the pod name returned from the Kubernetes API server does not match the pod name in the actual pod metadata.name. I’m not sure if there is some odd concatenation going on that is not persisted to etcd, but the downward API does not match what is in etcd. I reverted the env var POD_NAMES to v1 and everything works in v3.4. Given all the bugs, I feel the v2 pod naming should be reverted, because the scope of the breaking changes extends beyond Argo itself into Kubernetes.

related to #10267

Ah. That is a bug. Can you raise a separate issue?
