kubernetes: Rolling upgrade of a deployment conflicts with the pod anti-affinity policy

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: I have a kube-dns deployment that creates multiple DNS pods, with a required pod anti-affinity policy to make sure the DNS pods are spread across the cluster nodes (one per node).

  selector:
    matchLabels:
      k8s-app: kube-dns
  template:
    metadata:
      labels:
        k8s-app: kube-dns
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: k8s-app
                operator: In
                values:
                - kube-dns
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: kubedns

Then I wanted to upgrade kube-dns to the latest version, 1.14.4. The new pod stays in the Pending state forever: because of the required pod anti-affinity policy, no node can be selected for it, since every node already runs an old pod carrying the k8s-app: kube-dns label.

What you expected to happen: The rolling upgrade of the kube-dns deployment should complete successfully.

How to reproduce it (as minimally and precisely as possible):

  1. Create a deployment with required pod anti-affinity set and as many replicas as there are schedulable nodes.
  2. Upgrade the deployment (for example, change the image tag); a minimal manifest sketch follows.
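
For reference, a minimal manifest sketch that reproduces the conflict, assuming a 3-node cluster, the default RollingUpdate strategy, and the apps/v1 API being available; the names, image, and replica count are illustrative:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: anti-affinity-demo            # hypothetical name
spec:
  replicas: 3                         # assumed to equal the number of schedulable nodes
  selector:
    matchLabels:
      app: anti-affinity-demo
  template:
    metadata:
      labels:
        app: anti-affinity-demo
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - anti-affinity-demo
            topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: nginx:1.14             # change this tag to trigger the rolling update

With the default strategy (maxSurge: 25%), the rollout creates one extra pod, and the required anti-affinity leaves that surge pod with no node to land on.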

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.8.3
  • Cloud provider or hardware configuration: None
  • OS (e.g. from /etc/os-release): Ubuntu 16.04
  • Kernel (e.g. uname -a):
  • Install tools: self-defined
  • Others:

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 15
  • Comments: 21 (5 by maintainers)

Most upvoted comments

It would be good if k8s were aware of the difference between new and old versions when evaluating preferred scheduling as part of a rolling deployment. As it stands, preferred anti-affinity doesn't help much when the cluster size is roughly equal to the number of pod replicas.

I am having the same problem with preferredDuringSchedulingIgnoredDuringExecution. After a rolling update of a deployment, pod anti-affinity is no longer respected. I want the deployment to be spread across many nodes (not necessarily one pod per node) to achieve high availability. A rolling update skews the pod anti-affinity in such a way that I sometimes end up with all pods scheduled on one node. This invalidates one of the use cases of pod anti-affinity: high availability with rolling updates.

My deployment (helm chart):

spec:
  replicas: {{ .Values.replicaCount }}
  strategy:
    rollingUpdate:
      maxSurge: 50%
      maxUnavailable: 10%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: {{ template "name" . }}
        release: {{ .Release.Name }}
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - {{ template "name" . }}
              topologyKey: kubernetes.io/hostname

replicaCount is a variable and can change from deployment to deployment.

@k82cn I have tried the trick with maxUnavailable: 1 and maxSurge: 0, but it doesn't work as expected; I still end up with unevenly spread pods.

Did you try strategy.rollingUpdate.maxUnavailable: 1? That kills an old pod first when doing the rolling upgrade, freeing its node before the replacement is scheduled.
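
A minimal sketch of that suggestion, applied to a deployment like the kube-dns one above (the field values are the ones proposed in this comment):

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # delete one old pod first, freeing its node
      maxSurge: 0         # never create a surge pod that has nowhere to schedule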

/sig apps

/kind feature

This can be worked around by adding another label to the pod template in the deployment file and using that newly added label for the anti-affinity selector; see the sketch below.
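
A sketch of that workaround, assuming a hypothetical rollout label whose value is bumped on every upgrade (the label name and value are illustrative):

  template:
    metadata:
      labels:
        k8s-app: kube-dns
        rollout: v1-14-4              # hypothetical label; change it for each upgrade
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: rollout          # repel only pods from the same rollout
                operator: In
                values:
                - v1-14-4
            topologyKey: kubernetes.io/hostname

The trade-off is that old and new pods no longer repel each other, so they can briefly share a node during the rollout.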

@ErikLundJensen you are talking about different regions here… we need to understand what can be done in the simple case of a single region, where the anti-affinity conflicts with the rolling upgrade of the deployment.