moby: Usage of max-replicas-per-node not compatible with start-first update_config

Description

The max-replicas-per-node placement constraint is not compatible with order: start-first in update_config. The per-node replica limit prevents the new replacement containers from starting.

Steps to reproduce the issue:

  1. Define a docker-compose.yml with a max_replicas_per_node constraint:
version: "3.8"

services:
  web:
    image: nginx:1.16
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.labels.role == web]
        max_replicas_per_node: 1
      update_config:
        parallelism: 1
        order: start-first
        failure_action: rollback
        delay: 10s
  2. Deploy the stack to the swarm:
$ docker stack deploy -c docker-compose.yml repro_max_replicas_bug
Creating network repro_max_replicas_bug_default
Creating service repro_max_replicas_bug_web

$ docker service ls
ID                  NAME                         MODE                REPLICAS               IMAGE               PORTS
v2yd365lmwvb        repro_max_replicas_bug_web   replicated          1/1 (max 1 per node)   nginx:1.16          
  3. Update the image version in the docker-compose.yml and deploy the stack again:
$ sed -i 's/nginx:1.16/nginx:1.17/' docker-compose.yml
$ docker stack deploy -c docker-compose.yml repro_max_replicas_bug
Updating service repro_max_replicas_bug_web (id: v2yd365lmwvbbu0i4r0g6f026)
  4. Verify the deployment status with docker service ls and docker service ps:
$ docker service ls
ID                  NAME                         MODE                REPLICAS               IMAGE               PORTS
v2yd365lmwvb        repro_max_replicas_bug_web   replicated          1/1 (max 1 per node)   nginx:1.17          

$ docker service ps v2yd365lmwvb --no-trunc
ID                          NAME                               IMAGE                                                                                NODE                DESIRED STATE       CURRENT STATE                ERROR                                                     PORTS
qaheo9ufk24tged4biq9hlvfo   repro_max_replicas_bug_web.1       nginx:1.17@sha256:282530fcb7cd19f3848c7b611043f82ae4be3781cb00105a1d593d7e6286b596                       Running             Pending 32 seconds ago       "no suitable node (max replicas per node limit exceed)"   
n0z0c2dpt4kih8x91h8vm2e35    \_ repro_max_replicas_bug_web.1   nginx:1.16@sha256:8723f69d18865756716b1b6a7cebae0107c39c7ad9b9b310875a3a0a5be235a1   aldebaran           Running             Running about a minute ago                                                             

Describe the results you received:

The new container with the updated image is never started and therefore does not replace the old container.

Describe the results you expected:

The new container is started and replaces the old one.

Additional information you deem important (e.g. issue happens only occasionally):

The issue happens every time, on multiple servers.

Output of docker version:

Client:
 Version:           19.03.6
 API version:       1.40
 Go version:        go1.13.8
 Git commit:        369ce74
 Built:             Wed, 26 Feb 2020 11:20:11 +1100
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.6
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.8
  Git commit:       369ce74 
  Built:            Wed Feb 26 00:20:11 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          19.03.6
  GitCommit:        7c1e88399ec0b0b077121d9d5ad97e647b11c870
 runc:
  Version:          1.0.0~rc10+dfsg1
  GitCommit:        1.0.0~rc10+dfsg1-1
 docker-init:
  Version:          0.18.0
  GitCommit:        

Output of docker info:

Client:
 Debug Mode: false

Server:
 Containers: 10
  Running: 9
  Paused: 0
  Stopped: 1
 Images: 52
 Server Version: 19.03.6
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: mwh8whayw7caqnr543ypvgmgt
  Is Manager: true
  ClusterID: bx8s8ubbmvevmcpsc3fnsamdo
  Managers: 1
  Nodes: 1
  Default Address Pool: 10.0.0.0/8  
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 192.168.1.160
  Manager Addresses:
   192.168.1.160:2377
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7c1e88399ec0b0b077121d9d5ad97e647b11c870
 runc version: 1.0.0~rc10+dfsg1-1
 init version: 
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-4-amd64
 Operating System: Debian GNU/Linux bullseye/sid
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 15.58GiB
 Name: aldebaran
 ID: DB6L:27Z6:HDCS:XJND:WUFH:UZ5R:53TZ:COAN:PMMP:X75E:DFBE:3IOI
 Docker Root Dir: /home/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):

Verified on:

  • Physical machine:
$ lsb_release -a
Distributor ID:	Debian
Description:	Debian GNU/Linux bullseye/sid
Release:	unstable
Codename:	sid
  • Google Cloud VM:
$ lsb_release -a
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.3 LTS
Release:	18.04
Codename:	bionic

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 1
  • Comments: 16 (5 by maintainers)

Most upvoted comments

In my opinion, the max-replicas-per-node option should not prevent services from updating. We should distinguish between the intended behavior and what's needed to get the intended state running. I think everyone would blindly accept it if it just worked that way, and it's absolutely logical that this would result in temporarily more replicas per host when using start-first.

Why should I spare system power just to make the update work? System power costs money.

At least, consider something like a surge flag for that.

  update_config:
    order: start-first
    max_replicas_surge: true
  placement:
    max_replicas_per_node: 10

I still think it's not much more than an ignore/skip check at the right place in the code.

From a very quick look, whoever implements this needs to:

  1. Add a new field to the API in https://github.com/moby/swarmkit/blob/master/api/types.proto#L446
  2. Generate an updated version of the API as described in https://github.com/moby/swarmkit/blob/master/api/README.md; to get it right, use exactly the same version of protoc as used by CI: https://github.com/moby/swarmkit/blob/294d56efc21ecd7fb185eef7e551e725aa628c50/Dockerfile#L11-L17
  3. Add the updated logic to https://github.com/moby/swarmkit/blob/master/manager/scheduler/filter.go#L380
  4. Update the test cases to include this use case.

The original PR can be found at: https://github.com/moby/swarmkit/pull/2758

If you need more tips or help with that work, you can find me on the Docker community Slack.

I like that surge idea. For the sake of possibly getting that feature in this century, I'd suggest this as a flag for the time being. Hence I updated my original answer to use this terminology (surge).

https://github.com/moby/moby/issues/40797#issuecomment-1289995338

I still think it's not much more than an ignore/skip check at the right place in the code.

We too stumbled across this issue with max-replicas-per-node.

Perhaps a sensible way to go about this would be a mechanism similar to k8s' (sorry) maxSurge functionality. It essentially allows replicas to surge above the defined replica count during deployment rollouts, thus allowing a solution without causing issues with start-first.
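
For reference, a minimal sketch of what that looks like in a Kubernetes Deployment (the name and image below are placeholders, not taken from this report):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # allow one extra pod above the desired count during a rollout
      maxUnavailable: 0  # never drop below the desired count (start-first-like behaviour)
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.17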

@sgohl @GCSBOSS To be honest, I don't understand what the problem is with the default stop-first option. In a fault-tolerant system you need at least 2 replicas of each application running on two different nodes anyway, which is why you can also update them one by one while the application stays up the whole time.

Just make sure that you have included:

update_config:
  parallelism: 1

in your config.

If your application is slow to start, then also make sure that you use a reasonable delay between updates. A longer healthcheck start period can also help.
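
A minimal sketch of that kind of configuration, assuming the default stop-first order (the delay and start_period values and the curl-based healthcheck are placeholders; the healthcheck assumes curl is available in the image):

version: "3.8"

services:
  web:
    image: nginx:1.17
    deploy:
      mode: replicated
      replicas: 2                 # at least two replicas so the service stays up during the update
      placement:
        max_replicas_per_node: 1  # keep the two replicas on different nodes
      update_config:
        parallelism: 1            # update one replica at a time
        order: stop-first         # default: stop the old task before starting the new one
        delay: 30s                # wait between update batches
    healthcheck:
      test: ["CMD-SHELL", "curl -fsS http://localhost/ || exit 1"]  # placeholder check; assumes curl exists in the image
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 60s           # give a slow-starting application time before failures count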

As for @sgohl's proposal: it is reasonable, but someone needs to step up and implement it if you want to see it, or you need a support contract with Mirantis and can ask them to do it. That is because Docker Inc. is no longer doing active feature development in swarmkit.

Perhaps someone is interested in contributing to the documentation to describe this scenario (i.e., with start-first, the swarm cluster must have a node with fewer than max-replicas-per-node replicas of the service, so the new instance can start before the old one is stopped).
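
For illustration only (the values below are examples, not taken from the original report), one way to leave that headroom with today's options is to keep max_replicas_per_node above the number of replicas each node normally runs, so a start-first update can temporarily co-locate the old and the new task on the same node:

version: "3.8"

services:
  web:
    image: nginx:1.17
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.labels.role == web]
        max_replicas_per_node: 2  # headroom: the node may briefly run the old and the new task together
      update_config:
        parallelism: 1
        order: start-first
        failure_action: rollback
        delay: 10s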