argo-workflows: nodeAntiAffinity is not working as expected.

Summary

What happened/what you expected to happen?

I wanted to use retryStrategy with nodeAntiAffinity to prevent retries from running on the same host. I used the following small workflow for testing, but all retries were started on the same host (node).

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: random-fail-
spec:
  entrypoint: random
  templates:
  - name: random
    retryStrategy:
      limit: 10
      retryPolicy: "Always"
      affinity:
        nodeAntiAffinity: {}
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import random
        import time
        random.seed(time.time())
        i = random.randint(0, 10)
        print(i)
        exit(i)

I'm not sure whether this is expected behavior, or whether in my case I should use RetryNodeAntiAffinity, which, as mentioned in the documentation, is a placeholder for future expansion.
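To make the expectation concrete, here is a minimal sketch of the rule the workflow above is meant to produce: each new attempt is kept off every host where an earlier attempt failed. The `Requirement` and `Attempt` types and the `excludeFailedHosts` function are hypothetical stand-ins written for this illustration; they are not the Argo controller's code. The only real-world names used are the well-known `kubernetes.io/hostname` node label and the `NotIn` selector operator.

```go
package main

import "fmt"

// hostnameLabel is the well-known Kubernetes node label that holds the host name.
const hostnameLabel = "kubernetes.io/hostname"

// Requirement is a simplified, hypothetical stand-in for a Kubernetes
// node selector requirement (key, operator, values).
type Requirement struct {
	Key      string
	Operator string
	Values   []string
}

// Attempt is a hypothetical record of one retry attempt.
type Attempt struct {
	HostNodeName string
	Failed       bool
}

// excludeFailedHosts returns a requirement that keeps the next attempt off
// every host where an earlier attempt failed, or nil if no such host exists.
func excludeFailedHosts(attempts []Attempt) *Requirement {
	seen := map[string]bool{}
	var hosts []string
	for _, a := range attempts {
		if a.Failed && a.HostNodeName != "" && !seen[a.HostNodeName] {
			seen[a.HostNodeName] = true
			hosts = append(hosts, a.HostNodeName)
		}
	}
	if len(hosts) == 0 {
		return nil
	}
	return &Requirement{Key: hostnameLabel, Operator: "NotIn", Values: hosts}
}

func main() {
	attempts := []Attempt{
		{HostNodeName: "node-a", Failed: true},
		{HostNodeName: "node-b", Failed: true},
		{HostNodeName: "node-a", Failed: true},
	}
	fmt.Printf("%+v\n", *excludeFailedHosts(attempts))
}
```

If a requirement like this never reaches the pod spec, the scheduler is free to place every retry on the same node, which matches what I observed.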

What version are you running? Tested with v3.2.9 and v3.3.3


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 11
  • Comments: 26 (12 by maintainers)

Most upvoted comments

Bumping this to hopefully get some headway (also, don’t want it to be marked stale again. lol)

@goock could you look into this pls?

I ran the example from the ticket, and from 30,000 feet it looks like the controller stopped adding Affinity to the pod’s spec, or it’s removed at some point. I’ll start debugging to find the cause.

Any news on this? 🙄

@caelan-io I would like to finish this ticket. I’m busy at the moment, but I can probably find some spare time to work on it in the next 2-3 weeks. The crucial part of this ticket and its implementation is finding an efficient way to locate the retry node, as in my original question https://github.com/argoproj/argo-workflows/issues/9193#issuecomment-1241901976. Thanks.

@alexec would you be able to lend us your thoughts on the questions below, which @goock raised earlier?

@alexec I started working on this. In this function https://github.com/argoproj/argo-workflows/blob/9d66b69f0bca92d7ef0c9aa67e87a2e334797530/workflow/controller/retry_tweak.go#L15 I’m taking the template from the boundary node and then finding the retry node with the same template. This works for more complicated scenarios but not for very simple ones like the one in this ticket, where the BoundaryID is empty.
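A simplified sketch of that lookup shows why an empty BoundaryID is a dead end. The `Node` type below is a hypothetical, cut-down stand-in for NodeStatus, and `findRetryNodeViaBoundary` only mirrors the description above; it is not the code from retry_tweak.go, and the node data in `main` is made up for illustration.

```go
package main

import "fmt"

// Node is a simplified, hypothetical stand-in for NodeStatus.
type Node struct {
	ID           string
	Type         string // e.g. "Retry" or "Pod"
	TemplateName string
	BoundaryID   string
}

// findRetryNodeViaBoundary reads the template name from the boundary node of
// the given node, then returns a Retry node that uses the same template.
// It returns nil when the node has no boundary, so there is no template to match.
func findRetryNodeViaBoundary(nodes map[string]Node, id string) *Node {
	n, ok := nodes[id]
	if !ok || n.BoundaryID == "" {
		return nil // no boundary node to read the template from
	}
	boundary, ok := nodes[n.BoundaryID]
	if !ok {
		return nil
	}
	for _, c := range nodes {
		if c.Type == "Retry" && c.TemplateName == boundary.TemplateName {
			found := c // copy so the returned pointer does not alias the loop variable
			return &found
		}
	}
	return nil
}

func main() {
	// Simple case, as in this ticket: the retry node is the root and its
	// child pod has an empty BoundaryID, so the lookup finds nothing.
	nodes := map[string]Node{
		"wf":   {ID: "wf", Type: "Retry", TemplateName: "random"},
		"wf-0": {ID: "wf-0", Type: "Pod", TemplateName: "random"},
	}
	fmt.Println(findRetryNodeViaBoundary(nodes, "wf-0") == nil)
}
```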

I see four possible solutions to the issue, but I need advice on this:

  1. Modify FindRetryNode: if the BoundaryID is empty, take the template from the node itself and then find the retry node. I would need confirmation that TemplateName is non-empty in all cases, or at least when BoundaryID is empty.

  2. Create another traversal function (like this one https://github.com/argoproj/argo-workflows/blob/master/workflow/util/retry/retry.go#L10), walk over the node tree, and find the retry node. This solution would be clumsy and inefficient.

  3. Add a Parent field to the NodeStatus struct and just walk the tree upward to find the retry node. It makes everything super easy, but I’m unsure of possible drawbacks and whether we want to introduce another field to the NodeStatus struct.

  4. Some other way to find the retry node for the current node that I am not aware of.
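To make option 3 easier to judge, here is a rough sketch of the upward walk. The `Node` type is a hypothetical, simplified stand-in for NodeStatus, and its `Parent` field is the proposed addition, which does not exist in the real struct today. `findRetryAncestor` is likewise a name invented for this example.

```go
package main

import "fmt"

// Node is a simplified, hypothetical stand-in for NodeStatus. The Parent
// field is the addition proposed in option 3.
type Node struct {
	ID     string
	Type   string // e.g. "Retry" or "Pod"
	Parent string // ID of the parent node; empty for the root
}

// findRetryAncestor starts at the node with the given ID and follows Parent
// links upward, returning the first node of type "Retry" it meets. It returns
// nil if it reaches the root, hits an unknown ID, or detects a cycle.
func findRetryAncestor(nodes map[string]Node, id string) *Node {
	visited := map[string]bool{}
	for id != "" && !visited[id] {
		visited[id] = true
		n, ok := nodes[id]
		if !ok {
			return nil
		}
		if n.Type == "Retry" {
			return &n
		}
		id = n.Parent
	}
	return nil
}

func main() {
	nodes := map[string]Node{
		"wf":   {ID: "wf", Type: "Retry"},
		"wf-0": {ID: "wf-0", Type: "Pod", Parent: "wf"},
	}
	if r := findRetryAncestor(nodes, "wf-0"); r != nil {
		fmt.Println("retry node:", r.ID)
	}
}
```

The walk costs one map lookup per level of nesting and does not depend on BoundaryID or TemplateName at all, which is the appeal. The open question from option 3 stays the same: every place that creates a node would have to set Parent.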