moby: trigger restart from unhealthy status

what

It would be helpful if an action could be triggered when a service status is unhealthy.

why

It would be sane to trigger an action for a container when its status is unhealthy. Tying this into the restart policy would make sense.

Current alternatives appear to be listening to event changes from a third-party script. That doesn't scale well.

ideal configuration

combining

  • HEALTHCHECK command to toggle the health status
  • a restart argument passed with on-failure:N that restarts the container once the failing streak (failStreak) reaches N
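
For reference, the pieces that exist today can be expressed roughly like this (a minimal sketch; the image name and probe URL are placeholders, and note that the restart policy does NOT currently react to health status — tying the two together is exactly what this issue asks for):

# HEALTHCHECK can be declared in the Dockerfile or passed at run time:
docker run -d \
  --name myapp \
  --health-cmd 'curl -fsS http://localhost:8080/ || exit 1' \
  --health-interval 30s \
  --health-timeout 5s \
  --health-retries 3 \
  --restart on-failure:5 \
  myapp:latest   # placeholder image

# Today, on-failure:5 only restarts the container after its main process
# exits non-zero; a failing streak of 3 merely flips the status shown in
# `docker ps` / `docker inspect` to "unhealthy", nothing more.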

About this issue

  • State: open
  • Created 8 years ago
  • Reactions: 256
  • Comments: 47 (12 by maintainers)

Most upvoted comments

There are many uses of Docker that do not include Swarm/K8s and still require healthchecks and a way to recover from failing ones.

Now that docker has its own healthcheck, this should be brought full circle.

Please add a way for docker to restart a container if its own healthcheck fails.

If you want something that is going to take action on a health check, that is swarm.

You can run swarm in a single node context.

Again… this is literally what swarm does. “I don’t want to use swarm” == “I don’t want healthchecks to trigger automated actions”
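
For anyone who does want to go that route, a minimal single-node sketch (service and image names are placeholders):

docker swarm init                      # turn the single host into a one-node swarm

docker service create \
  --name web \
  --replicas 1 \
  --health-cmd 'curl -fsS http://localhost/ || exit 1' \
  --health-interval 10s \
  --health-retries 3 \
  nginx:alpine                         # placeholder image

# Swarm's orchestrator treats an unhealthy task as failed and schedules a
# replacement container, which is the behaviour being requested here.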

This is an obvious and sorely needed feature that was included in the original HEALTHCHECK proposal but then dropped and replaced by nothing. The proposed restart option was rejected as not being flexible enough, but it seems it is exactly what many are looking to do.

Ohh, 2 years passed by.

To have at least some progress: if we cannot get “--exit-on-unhealthy + auto-restart policy” at the same time, we should get at least one improvement.

Next step should be to implement and merge --exit-on-unhealthy as the default behaviour for “unhealthy” containers. This would at least avoid wasting compute resources on an unhealthy container (which the healthcheck already knows has stopped working). Green IT, you know… 😉

If someone does not want this to happen, it should be possible to set “--exit-on-unhealthy=false”.
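
Until something like that exists, one way people approximate --exit-on-unhealthy today (a sketch, not an official mechanism) is to let the health probe kill the container's main process when it fails, so that an ordinary on-failure restart policy takes over. The image and probe URL are placeholders; the image must ship curl and a shell, and note that signals sent to PID 1 from inside the container are ignored unless the main process installs a handler:

docker run -d \
  --health-cmd 'curl -fsS http://localhost:8080/healthz || kill 1' \
  --health-interval 30s \
  --health-retries 3 \
  --restart on-failure \
  myapp:latest

# When the probe fails, `kill 1` asks the main process to exit (SIGTERM);
# the non-zero exit then triggers the on-failure restart policy.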

Ideological answers are not very useful when people ask for pragmatic solutions to real-life problems.

If Docker did not support any restarting functionality, and provided a healthcheck command only executed upon an external agent’s query, then sure, one could argue it is not the container’s business to handle restarting or unhealthy logic. But Docker has restarting rules and relatively complex logic for health status determination; it is only missing one small feature to tie them together.

The already proposed --exit-on-unhealthy option is logically consistent with the healthcheck feature, without implying restart or orchestration. Restart on unhealthy may not always be the desired function.

Some people see a solution, others just see problems. As an engineer of mechanics, electronics and software, my opinion is that most issues have nothing to do with technology itself but with the will to come up with a solution. Because people wanted to find solutions, humanity built pyramids, airplanes, rockets and everything else. It is not that the arguments above, like code complexity and so on, are invalid. It is just that there are always problems to solve.

5 years for this issue and it is still not resolved. Hopefully you docker people learn to do your homework: take the people who use the software more seriously. I wish you to do better in this area in the future.

Isn’t this already dealt with in docker swarm? Unhealthy containers are shut down and new ones are spawned.

If you are just using docker or docker-compose you can add a cron job. At least, I am doing it this way.

2 * * * * docker restart $(docker ps | grep unhealthy | cut -c -12) 2>/dev/null
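
A variant of the same idea that avoids parsing docker ps output is the health filter (a sketch; xargs -r is the GNU "no run if empty" flag, so adjust for your platform):

*/5 * * * * docker ps -q --filter health=unhealthy | xargs -r docker restart 2>/dev/null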

Personally I think the orchestration engine should handle the restarting, not Docker itself.

I don’t understand why this ability isn’t already a part of the existing restart policy. As a regular user of docker, I would have expected an additional restart: on-unhealthy, in complement to on-failure. Hijacking the healthcheck with || exit 1 is honestly hacking an on-failure/always/unless-stopped restart.

If you wanted to keep all this within the orchestration layer, then that’s where restart should exist as well. A failed container should stay failed, marked as failed rather than just exiting for the orchestration software to pick it up and attempt to restart it. Now, if the mindset is “You’re right, but we already have restart so we’re stuck with it”, then why not add an on-unhealthy addition to resolve this very specific request, while keeping the more complex restart policies within the orchestration layer.

This implementation shouldn’t be a huge deal for the developer, but one giant leap for mankind.

As podman has this functionality, and podman-compose is now roughly on par with docker-compose, yes, it’s time to leave docker and join the podman family.

@rishiloyola, you can use autoheal as a workaround.
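
For context, autoheal is a small container that watches the Docker API and restarts containers whose status flips to unhealthy. A minimal sketch, assuming the willfarrell/autoheal image and its AUTOHEAL_CONTAINER_LABEL option (check the project's README for current usage):

# AUTOHEAL_CONTAINER_LABEL=all watches every container rather than only
# those carrying the autoheal=true label.
docker run -d \
  --name autoheal \
  --restart always \
  -e AUTOHEAL_CONTAINER_LABEL=all \
  -v /var/run/docker.sock:/var/run/docker.sock \
  willfarrell/autoheal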

@thaJeztah The failing streak is the number of consecutive failed health probes. As soon as a probe succeeds, the count is reset to zero.

The behaviour suggested above was in the original healthcheck PR, but was rejected (https://github.com/docker/docker/pull/22719, --exit-on-unhealthy + auto-restart policy).

I have been watching this issue for some time now, and I initially upvoted it. At the time, I had little experience with containers. I have just now retracted my upvote and agree with Brian (@cpuguy83). What seems like a simple feature goes a lot deeper than just running a curl script for basic fail checks.

If implemented as an extension of dockerd itself, it brings unnecessary complexity to that code, which can lead to a myriad of resource handling and stability issues. Do we really expect dockerd’s runtime to fork any sort of random healthcheck process while keeping things simple? That’s not just inefficient, it’s arguably dangerous at large scale. That’s a lot of responsibility to embrace for an OCI implementation alone.

On the other hand, it is not really possible to expect the monitored process itself to keep tabs on its own health status. More often than not, an unhealthy process becomes unresponsive and unable to perform a self-check.

Robust and reliable healthchecks only make sense from the point of view of an external agent. This is more of an orchestration issue than a problem with the OCI spec itself, which leads me to believe that the initial HEALTHCHECK directive was misplaced to begin with.

If your solution relies entirely on using Docker and nothing else, I would argue you have not delivered a complete solution. You need more tools, and I don’t expect dockerd itself to handle this. That is why platforms such as Swarm or Kubernetes implement those at the orchestration level.

At the moment, swarm is still not a drop-in replacement for docker-compose.

swarm lacks the capabilities for:

These are native docker engine capabilities that are exposed quite nicely by docker-compose. When greater orchestration capabilities are desired, sure, swarm or another orchestrator can be beneficial, but for some single-host applications it may be overkill.

I do agree that restart policy and healthchecks are really more in the domain of orchestration, but it is odd that the container engine exposes these features, but does not “close the loop” as many people have mentioned.

Ohh, another 1.5 years have passed by.

As there is no roadmap and no long-term commitment to docker swarm clusters anymore, I think swarm is a dead horse.

If you run your workload on a single server (where there is no need to run it highly available, scaled up and load balanced), you may still want to keep container downtime to a bare minimum. A crash-on-every-problem approach could trigger a lot of restarts, and restarts take time and keep your service down until it is up again. With a health check you can skip restarting containers, e.g. when health checks are flapping; effective downtime can be reduced that way. I do not see a reason why health-check-based container restarts should be an exclusive feature of orchestrators (or multi-node clusters).

The developers of podman decided to move restart handling to systemd instead of making it part of the container engine. Following that approach, we could define systemd timers to schedule health checks (generated by podman). See https://developers.redhat.com/blog/2019/04/18/monitoring-container-vitality-and-availability-with-podman/ for details.

Maybe it is time to leave docker and move to podman.
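
For comparison, a rough sketch of what that looks like on the podman side (container/image names and the probe URL are placeholders, and --health-on-failure only appeared in later 4.x podman releases, so treat its availability as an assumption and check your man page):

# Older approach: run the probe yourself from a systemd timer or cron job;
# `podman healthcheck run` exits non-zero when the container is unhealthy.
podman healthcheck run mycontainer || podman restart mycontainer

# Newer approach: let podman act on a failing healthcheck directly.
podman run -d \
  --name mycontainer \
  --health-cmd 'curl -fsS http://localhost:8080/ || exit 1' \
  --health-interval 30s \
  --health-on-failure restart \
  myapp:latest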

@juliohm1978 I don’t understand the reasoning; dockerd already implements health checks and it already has the concept of restart policies. We don’t want it to become an orchestrator, just to restart the container when it is unhealthy. Similarly to how it is done with on-failure[:max-retries], an on-unhealthy[:max-retries] restart policy could be implemented.
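
To make the suggestion concrete, this is roughly how such a policy might look if it mirrored on-failure (purely hypothetical syntax; on-unhealthy does not exist in Docker today, and the image and probe URL are placeholders):

# Hypothetical: restart when the health status flips to unhealthy,
# giving up after 5 attempts. NOT a real flag as of this writing.
docker run -d \
  --health-cmd 'curl -fsS http://localhost:8080/healthz || exit 1' \
  --health-retries 3 \
  --restart on-unhealthy:5 \
  myapp:latest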

I don’t understand why this ability isn’t already a part of the [existing restart policy].

It’s either complete incompetence or a conspiracy for enshittification.

Who needs to get fired to make this right?

I am at a total loss with Docker. I started using it just a month ago, but I have wasted countless hours reading and trying Docker stuff because everything just works a bit… inconsistently. For example, the word “volume” is used for at least three features that work completely differently: a volume in a Dockerfile can only define the internal path (the external one is fixed); a volume in a compose service is a bind, where you can define both the internal and external folder; and a named volume only lets you define the external one (I might have switched this and the Dockerfile version). But if I want to use the same bind in multiple services, I can’t use it as a named volume when I want to define both the external and internal paths. Without ChatGPT helping me around Docker I would have already lost my mind. But I digress.

In my years of software development I have never had to deal with documentation as convoluted and hard to comprehend as Docker’s. Examples are not minimal, there is always so much stuff that’s not relevant to the current topic, etc.

And now I find a 5-year discussion that just deals with restarting unhealthy containers. There are tons of threads on the web about how to get unhealthy containers restarted, and the go-to solution is apparently to bring up another container that just restarts unhealthy containers (which seems to become unhealthy itself after some time, with no one there to restart it). This is insane.

Docker has cost me years of my life in this single month.

Anyway, I digress again. It’s been a few years, and I agree with a lot that has been said here, but I still feel there should not be any healthchecks or restart policies at all if the ultra-logical step of connecting the two is missing. I hadn’t even considered that they are not connected already. I will look into swarm, but just for the little feature of restarting unhealthy containers it seems like overkill.

You want stable things that fix themselves. Make your software intuitive enough to reach that goal; for me it was not. With features for restart and health checks, it is only natural to assume that you can restart when the check fails. I expected this to be the default. That means this will cost many developers a lot of research. What a waste of time.

Who can do a pull request? $40 from me for the developer who writes that code for us. I’m not a Go developer, but I found this file, where the health checks are handled:

https://github.com/moby/moby/blob/c2cc352355d4c26be60be8ea3a1acfddc20fdfd3/daemon/health.go#L117

Would it be enough to just check there whether some action was defined? I propose this format:

version: "3.7"
services:
  mongo:
    image: mongo:4.0-xenial
    healthcheck:
      test: "curl -sS http://127.0.0.1:3000 || exit 1"
      interval: 5s
      timeout: 5s
      retries: 3
      start_period: 5s
      # here come the additions:
      restart: on-failure  # enable restart
      action: "send mail to me and inform me"  # alternative action

It would be great if we did not have to wait another 4 years.

shame

Making a new on-unhealthy restart was my own thought for this before I read this thread / suggestion. 😅

I think the biggest issue with it being a classical “restart policy” is that all the other restart policies are about when/whether to start an already-stopped container, whereas what we’re discussing here is when to actively stop (and eventually/potentially even kill) an actively running container.

I still do not believe we should even have healthchecking in the core runtime. Its existence, and the existence of restart policies, are not valid arguments that the functionality should be extended further. For that matter, to this day healthchecks are an extension of the OCI image spec (that is, the image format that Docker uses extends the OCI format to support them).

We do support restarting unhealthy containers, via swarm. This is not some ideological answer, it is a fact. Why is using swarm to obtain this functionality problematic?

Agree with @vRobM here. I think the whole point of a health-check is pre-emptive action including but certainly not limited to notification.

@felipecrs

Is this a realistic thing? I mean, Compose isn’t a daemon and a daemon would be required to keep watching the state of the containers, right?

You’re right it isn’t realistic.

@thaJeztah

Why orchestration keeps being brought into the conversation is beyond me. It’s simply not relevant to the docker-daemon side of the conversation, and it keeps throwing the problem over the wall, to nowhere.

Bringing any other aspect other than the daemon functionality into the conversation including Compose is inappropriate and out of scope. Would you agree?

There are many layers for checks in a stack, so let’s keep the relevant checks in the appropriate layers without crossing layers.

Implementing this needs investigating, and a good look at the impact (both from an engineering perspective, as well as impact on resources). Healthchecks … ???

Please… this is not high impact and has nothing to do with healthchecks in any new way (unless you want to rewrite them to be smarter; see the ideal part of the OP request). It’s a trigger, i.e. a glorified IF/THEN statement. Status is already kept, yet it isn’t actionable.

The “scope” has been defined in the initial request and clarified many times over in the comments. I’m not sure how else to point out that this is a >>daemon only<< feature, which has been ignored and missing for over 4 years. It is the root cause of years of workarounds and frustrations. Just look at those comments and upvotes minus the irrelevant, misguided and out of scope ones.

The addition of the proposed feature solves a deep hierarchy of daemon issues that stem from it, and adds zero complications for any orchestration layers above, since it is a configurable flag. By default ON or OFF, either way. Easy peasy, right?

Can you now kindly guide a developer/team to the appropriate zoom level and clarity to get this resolved within the daemon-only scope?

Do let me know if you need additional clarity.

  • For “developer” scenarios, it could potentially be something implemented in Compose (have compose act as “orchestrator”) - that’s just from the top of my head though, would have to look at impact
  • So it may need to be looked at “where” the best place is for this (which could be client-side in compose)

Is this a realistic thing? I mean, Compose isn’t a daemon and a daemon would be required to keep watching the state of the containers, right?

Implementing this needs investigating, and a good look at the impact (both from an engineering perspective, as well as impact on resources). Healthchecks (and monitoring them) have been known to have issues in the past, so we must be careful when making changes in this area.

As far as I can see, the “push for Swarm” in this context was mainly because this feature is already implemented (but through swarm), which may address some of the use-cases (not all), but with a limited amount of maintainers/engineers and many tickets, priorities have to be set. This project is open-source, and contributions are always welcome (but some research and designing may be needed on this one to decide).

This ticket was brought up internally 2 weeks ago by our CEO, who asked about it. Let me post my reply here as well;

The ticket itself would probably be the best location, but it’s a somewhat tricky one;

  • the request in itself may seem reasonable at a glance, but there’s many potential caveats, and we need to have a good inventory of the use-cases and expectations
  • it’s a feature (on the daemon side) that quickly enters the “orchestration” layer, which is out of scope for docker run, so we must define its scope.
  • For “production” scenarios there are already solutions (swarm or kubernetes), and there are documented workarounds on the tickets for other solutions.
  • For “developer” scenarios, it could potentially be something implemented in Compose (have compose act as “orchestrator”) - that’s just from the top of my head though, would have to look at impact
  • So it may need to be looked at “where” the best place is for this (which could be client-side in compose)
  • Overall: it’s been open for 6 years, with ~20 (non-maintainer) commenters in total, and with “finite time” among maintainers it never reached the top of the list; but if a contributor wants to look at the scope of the work and a possible design to implement it, it could still be discussed whether it falls within scope based on that.

@vRobM I don’t argue with your point. That’s exactly the reason I’m subscribed to this issue. Just noting that there’s a working and quite easy external solution.

Just in case anyone wants a ready script: it was one of the first posts; look for docker restart.

@CHerSun of course, yet the point is that you should never have had to spend engineering hours/weeks/months developing external systems and further external integrations because this one pivotal feature has been ignored since 2016.

Shameful.

How many man hours could have been saved wondering why this behavior isn’t the default?

I’m in a situation where I can’t use swarm.

I don’t expect it to restart when the status switches to unhealthy. I expect to be able to configure an action when the status becomes unhealthy.

Me personally, I would create a rule that tries to restart it once while simultaneously logging the event and surrounding information (which would be sent to me) telling me that the container had to restart. Increase uptime as much as possible without losing the accountability to fix the underlying issue (hence the log). Best of both worlds for me personally. Just my two cents.
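
For what it’s worth, something close to that rule can be scripted today against the daemon’s event stream (a sketch; it assumes docker events emits health_status actions for containers that define a HEALTHCHECK, and the log path is a placeholder):

#!/bin/sh
# Restart a container when it transitions to unhealthy, and keep a record of it.
docker events --filter 'event=health_status' --format '{{.ID}} {{.Status}}' |
while read -r id status; do
  case "$status" in
    *unhealthy*)
      echo "$(date) $id went unhealthy, restarting" >> /var/log/unhealthy-restarts.log
      docker restart "$id"
      ;;
  esac
done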

@ZealousMacwan swarm automatically kills and starts new containers for unhealthy services.