argo-rollouts: Promote-full with traffic splitting doesn't wait for new pods to be ready

Summary

When using the promote-full option with traffic splitting, 100% of traffic is immediately sent to the new ReplicaSet, even if it is not scaled up enough to handle the traffic.

Instead, it should immediately scale the new RS up to full size, but shift the traffic split only as the new pods become ready. I had thought this was the difference between the set weight and the actual weight, but that doesn’t seem to be the case (maybe someone could explain what the difference is?).
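
To illustrate the behavior I expected, here is a minimal sketch in which the canary weight is capped by the share of ready pods. The helper cappedCanaryWeight and its parameters are hypothetical names made up for this example; this is not how Argo Rollouts computes weights.

```go
package main

import "fmt"

// cappedCanaryWeight is a hypothetical helper, not part of Argo Rollouts.
// It returns the desired weight, capped by the percentage of the full-size
// replica count that is currently ready in the new ReplicaSet.
func cappedCanaryWeight(desiredWeight, readyCanaryPods, specReplicas int32) int32 {
	if specReplicas <= 0 || readyCanaryPods <= 0 {
		return 0
	}
	readyPercent := readyCanaryPods * 100 / specReplicas
	if readyPercent > 100 {
		readyPercent = 100
	}
	if readyPercent < desiredWeight {
		return readyPercent
	}
	return desiredWeight
}

func main() {
	// Promote-full asks for 100%, but traffic follows readiness.
	for _, ready := range []int32{0, 2, 5, 10} {
		fmt.Printf("ready=%d/10 -> canary weight %d%%\n",
			ready, cappedCanaryWeight(100, ready, 10))
	}
}
```

With 10 desired replicas, the weight would move from 0 to 100 in step with readiness instead of jumping to 100 at once.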

As a side effect, when using the new dynamicStableScale feature in 1.1, the old RS is scaled down immediately, and we could be left with very few running/ready pods. I think this is a symptom of the root cause above, though, and the traffic split is set to send all traffic to the new RS anyway.

Diagnostics

Rollouts 1.1


About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

Discussed with @jessesuen. The fix will be to wait for all pods to come up and only then update the weight. In the following block, we need another check for promote-full:

if !atDesiredReplicaCount && !promoteFull {
	// Use the previous weight since the new RS is not ready for a new weight
	for i := *index - 1; i >= 0; i-- {
		step := c.rollout.Spec.Strategy.Canary.Steps[i]
		if step.SetWeight != nil {
			desiredWeight = *step.SetWeight
			break
		}
	}
}
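
A rough, self-contained sketch of the intended behavior (not the actual patch): during promote-full, keep the last weight that was set until the new RS is fully available, then switch to 100. CanaryStep, previousSetWeight, promoteFullWeight, and the newRSFullyAvailable flag are simplified, hypothetical stand-ins, and the assumption that the held weight is the most recent SetWeight before the current step index is ours.

```go
package main

import "fmt"

// CanaryStep is a simplified stand-in for the real step type.
// Only SetWeight is modeled; a nil SetWeight represents e.g. a pause step.
type CanaryStep struct {
	SetWeight *int32
}

func int32Ptr(v int32) *int32 { return &v }

// previousSetWeight walks backwards from index-1 and returns the most
// recent SetWeight, or 0 if no earlier step set one.
func previousSetWeight(steps []CanaryStep, index int) int32 {
	if index > len(steps) {
		index = len(steps)
	}
	for i := index - 1; i >= 0; i-- {
		if steps[i].SetWeight != nil {
			return *steps[i].SetWeight
		}
	}
	return 0
}

// promoteFullWeight holds the previous weight until every new pod is
// available, and only then sends all traffic to the new ReplicaSet.
func promoteFullWeight(steps []CanaryStep, index int, newRSFullyAvailable bool) int32 {
	if newRSFullyAvailable {
		return 100
	}
	return previousSetWeight(steps, index)
}

func main() {
	steps := []CanaryStep{
		{SetWeight: int32Ptr(20)},
		{}, // pause
		{SetWeight: int32Ptr(50)},
	}
	// Promote-full requested while paused at step index 1.
	fmt.Println(promoteFullWeight(steps, 1, false)) // new RS still scaling up
	fmt.Println(promoteFullWeight(steps, 1, true))  // new RS fully available
}
```

In this model the first call keeps the weight at 20 and the second moves it to 100.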

Without dynamicStableScale, the old ReplicaSet does not scale down immediately. That is expected behavior.