moby: Swarm does not consider health checks while updating

Hi,

I'm playing around with Docker 1.12 and the HEALTHCHECK directive in Dockerfiles. What I do is:

docker swarm init

Dockerfile

FROM nginx:latest

RUN touch /marker

ADD ./check_running.sh /check_running.sh
RUN chmod +x /check_running.sh

HEALTHCHECK --interval=5s --timeout=3s CMD ./check_running.sh

check_running.sh

#!/usr/bin/env bash

# Fail the very first probe and remove the marker; every later probe succeeds.
# This simulates a container that needs some time before it becomes healthy.
if [[ -e /marker ]]; then
    rm /marker
    exit 2
else
    exit 0
fi

Build some images from this Dockerfile:

$ docker build -t mynginx:1 .
$ docker build -t mynginx:2 .
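
As a quick sanity check outside of swarm mode, the health status of a single container can be inspected directly (health-test is just an arbitrary container name used here):

$ docker run -d --name health-test mynginx:1
$ docker inspect --format '{{.State.Health.Status}}' health-test
$ docker rm -f health-test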

Start a service with 5 replicas

$ docker service create --name web --replicas=5 -p 8080:80 mynginx:1
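
The health state of each replica shows up in the STATUS column of docker ps, so readiness can be watched with something like this (the name filter is only a convenience):

$ watch -n1 'docker ps --filter name=web'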

Wait some time (5s) until all replicas are healthy. Now I want to update the image to v2. The update procedure should be rolling with a parallelism of 1. So I run:

$ docker service update --image mynginx:2 --update-parallelism 1 web
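
The settings stored on the service can be double-checked with docker service inspect; the parallelism should appear in the update config section of the output:

$ docker service inspect --pretty web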

From my understanding, swarm should execute the update roughly like this (sketched below):

  1. swarm stops one container
  2. swarm starts a container with mynginx:2 and waits until it becomes healthy again (not just starting)
  3. repeat from 1. until all replicas are updated
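
In pseudo-shell, that per-replica loop would look roughly like this. NEW_CONTAINER is a placeholder for the ID of the freshly started replacement task; swarm would of course do all of this internally, the snippet only illustrates the expectation:

# hypothetical sketch of the expected rolling update, not what swarm 1.12 actually does
for i in 1 2 3 4 5; do
    # stop one old task and start its replacement from mynginx:2, then
    # block until that replacement reports healthy before touching the next one
    until [ "$(docker inspect --format '{{.State.Health.Status}}' "$NEW_CONTAINER")" = "healthy" ]; do
        sleep 1
    done
done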

But what I get is:

$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED                  STATUS                                     PORTS               NAMES
eee2b1b184ac        mynginx:2           "nginx -g 'daemon off"   Less than a second ago   Up Less than a second (health: starting)   80/tcp, 443/tcp     web.3.6kn82dul8gnyvzgz6cje2byfj
b3797cdb5e05        mynginx:2           "nginx -g 'daemon off"   3 seconds ago            Up 2 seconds (health: starting)            80/tcp, 443/tcp     web.5.46ll6b2lhrhr98utowuifccei
0c85128b031e        mynginx:2           "nginx -g 'daemon off"   6 seconds ago            Up 5 seconds (health: starting)            80/tcp, 443/tcp     web.1.apenirwkjp96xiyhhspx134wx
9dcec24ae32a        mynginx:2           "nginx -g 'daemon off"   9 seconds ago            Up 8 seconds (health: starting)            80/tcp, 443/tcp     web.2.5oi6fk1c6u25s3rpqbal39i3b
f37332162d40        mynginx:2           "nginx -g 'daemon off"   12 seconds ago           Up 11 seconds (healthy)                    80/tcp, 443/tcp     web.4.7mgs62e6s78iijcqwv4cose11

Sometimes all replicas are in the starting state at the same time, which results in downtime for the service. My expectation was that I could get β€œnative” zero-downtime deployments with HEALTHCHECK and rolling updates on my swarm cluster πŸ˜ƒ

What's wrong with my attempt?

Thanks!

Update: Sorry for closing and reopening this issue; my mobile is acting up today πŸ˜•

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Comments: 18 (11 by maintainers)

Most upvoted comments

@runshenzhu Yeah, this is my non-native workaround. But it is very hard to predict the bootstrap time in some circumstances, so real health checking would be much better.
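
(Presumably the workaround in question is a fixed delay between task updates rather than a real health check; the 30s below is just a guess at the bootstrap time.)

$ docker service update --image mynginx:2 --update-parallelism 1 --update-delay 30s web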

It is not mentioned explicitly in the docs, but it was my expectation. πŸ˜ƒ

When exiting with 1 it works for me. Swarm will reschedule unhealthy containers.
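
For completeness, with that change the check script from above is identical except for the exit code:

#!/usr/bin/env bash

# fail the first probe with exit 1 (unhealthy) instead of 2, succeed afterwards
if [[ -e /marker ]]; then
    rm /marker
    exit 1
else
    exit 0
fi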

This is my most wanted feature because it allows native zero downtime deployments for the first time.