argo-workflows: nodeAntiAffinity is not working as expected.
Summary
What happened/what you expected to happen?
I wanted to use retryStrategy with nodeAntiAffinity in order to prevent retries from running on the same hosts. I was using the following small workflow for testing, but all retries were started on the same host (node).
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: random-fail-
spec:
  entrypoint: random
  templates:
    - name: random
      retryStrategy:
        limit: 10
        retryPolicy: "Always"
        affinity:
          nodeAntiAffinity: {}
      script:
        image: python:alpine3.6
        command: [python]
        source: |
          import random
          import time
          random.seed(time.time())
          i = random.randint(0, 10)
          print(i)
          exit(i)
I'm not sure whether this is expected behavior, or whether in my case I should use RetryNodeAntiAffinity, which, as mentioned in the documentation, is a placeholder for future expansion.
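For context, here is a minimal sketch of what a retry anti-affinity would be expected to add to a retried pod. It assumes the intent is a required Kubernetes node affinity that excludes the hosts of earlier attempts via the well-known kubernetes.io/hostname label. The function names and the plain-dict layout are hypothetical and are not the controller's actual code.

```python
# Hypothetical helpers (not part of Argo Workflows): they illustrate the
# scheduling constraint a retried pod would need to avoid earlier hosts.
HOSTNAME_LABEL = "kubernetes.io/hostname"  # well-known Kubernetes node label


def exclude_hosts_affinity(previous_hosts):
    """Return a pod `affinity` dict that forbids scheduling on previous_hosts."""
    hosts = sorted(set(previous_hosts))
    if not hosts:
        return {}  # first attempt: nothing to avoid yet
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": HOSTNAME_LABEL,
                                "operator": "NotIn",
                                "values": hosts,
                            }
                        ]
                    }
                ]
            }
        }
    }


def retries_reused_a_host(attempt_hosts):
    """True if any two attempts ran on the same node (the symptom reported here)."""
    return len(set(attempt_hosts)) < len(attempt_hosts)
```

If a retried pod's spec carries no such term, the scheduler is free to place it on the same node again, which would match the behavior described above.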
What version are you running? Tested with v3.2.9 and v3.3.3
About this issue
- State: open
- Created 2 years ago
- Reactions: 11
- Comments: 26 (12 by maintainers)
Bumping this to hopefully get some headway (also, don’t want it to be marked stale again. lol)
I ran the example from the ticket, and from 30,000 feet it looks like the controller stopped adding Affinity to the pod's spec, or it is removed at some point. I'll start debugging and find the cause.
@goock could you look into this pls?
Any news on this? 🙄
@caelan-io I would like to finish this ticket. I'm busy at the moment, but I can probably find some spare time to work on it in the next 2-3 weeks. The crucial part of this ticket and its implementation is finding an efficient way to locate the retry node, as in my original question https://github.com/argoproj/argo-workflows/issues/9193#issuecomment-1241901976. Thanks.
@alexec would you be able to lend us your thoughts on the questions @goock raised in this thread?
@alexec I started working on this. In this function https://github.com/argoproj/argo-workflows/blob/9d66b69f0bca92d7ef0c9aa67e87a2e334797530/workflow/controller/retry_tweak.go#L15 I'm taking the template from the boundary node and then finding the retry node with the same template. This works for more complicated scenarios but not for very simple ones like the one in this ticket, where the BoundaryID is empty.
I see four possible solutions to the issue, but I need advice on this:
1. Modify FindRetryNode: if the BoundaryID is empty, take the template from the node and then find the retry node. I would need confirmation that TemplateName is non-empty in all cases, or at least when BoundaryID is empty.
2. Create another traversal function (like this one https://github.com/argoproj/argo-workflows/blob/master/workflow/util/retry/retry.go#L10), walk over the node tree and find the retry node. This solution would be clumsy and inefficient.
3. Add a Parent field to the NodeStatus struct and just walk the tree upward to find the retry node. It makes everything super easy, but I'm unsure of possible drawbacks and whether we want to introduce another field to the NodeStatus struct.
4. Another way to find the retry node for the current one which I am not aware of.
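To make the trade-off between options 2 and 3 concrete, here is a rough sketch in Python rather than the controller's Go. The Node class is a simplified, hypothetical stand-in for NodeStatus, the parent field is the addition proposed in option 3 and does not exist today, and both function names are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Simplified stand-in for a workflow node; not the real NodeStatus."""
    id: str
    type: str                                  # e.g. "Retry", "Pod", "Steps"
    children: List[str] = field(default_factory=list)
    parent: Optional[str] = None               # hypothetical field (option 3)


def find_retry_by_traversal(nodes: Dict[str, Node], node_id: str) -> Optional[Node]:
    """Option 2: search downward from every Retry node for node_id.

    With nested retries this returns the first match in iteration order,
    which is not necessarily the nearest enclosing Retry node.
    """
    for candidate in nodes.values():
        if candidate.type != "Retry":
            continue
        stack = list(candidate.children)
        seen = set()
        while stack:
            current = stack.pop()
            if current == node_id:
                return candidate
            if current in seen or current not in nodes:
                continue
            seen.add(current)
            stack.extend(nodes[current].children)
    return None


def find_retry_by_parent(nodes: Dict[str, Node], node_id: str) -> Optional[Node]:
    """Option 3: follow parent links upward until a Retry node is reached."""
    seen = set()
    current = nodes.get(node_id)
    while current is not None and current.parent is not None:
        if current.parent in seen:
            return None  # defensive guard against a malformed, cyclic tree
        seen.add(current.parent)
        current = nodes.get(current.parent)
        if current is not None and current.type == "Retry":
            return current
    return None
```

In this sketch the traversal may visit a large part of the tree for each lookup, while the parent walk only touches the ancestors of the node. The cost of the parent walk is an extra field that has to be kept consistent.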