moby: Health checks are hard to implement because of return codes

I mentioned this during the review but it didn’t get addressed so reposting this as an issue.

We ask health checks to return 0 if they were successful (good), 1 if the failed (good) and 2 if the container is still initializing (weird).

A health check scripts has no way to know whether it’s in initialization period or not. This basically means that rather than being a simple command, a proper health check has to keep state and implement its own grace period.

I propose we get rid of 2. Actually, we get rid of all return codes - in the HEALTHCHECK instruction we could provide what a “good” exit code is (and we default to 0).

In order to achieve the “it’s still initializing” behavior, we put a grace period.

/cc @stevvooe @icecrime @tiborvass @runshenzhu

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Reactions: 1
  • Comments: 27 (26 by maintainers)

Most upvoted comments

Unless we have health checks for the health checks, but see where this would lead us?

The case that was made for adding state 2 (my original version worked as you describe) was that of a database that is performing recovery after being restarted. There’s no reasonable way to set a grace period that will detect when the container has failed reasonably soon but won’t also kill the recovery.

For many services you can detect starting (e.g. postgres returns a special response for this I think). Or, e.g. if the pid file hasn’t been written yet it would be reasonable to assume the container is still starting.

@stevvooe would we loose much now if we decide to skip the “starting” stage for 1.12, and look into possibilities for 1.13, as @tiborvass suggested? (https://github.com/docker/docker/issues/24396#issuecomment-233017131). Given the amount of discussion, I think it’s safe to say it needs more investigation (insert “no is temporary…” here).

I think dropping it now is the safe option, and it will still give us the chance to add it in the next release. There’s a sh*tload of issues to be looked into for 1.12…

The part where I concur with @aluzzardi is that if we’re not sure and still debating this I’d rather have something that does less and is future proof. So if we validated exit codes and accept only 0 and 1 for now it gives us the option to add 2 in the future. Agree? On Fri, Jul 8, 2016 at 4:52 PM Andrea Luzzardi notifications@github.com wrote:

@aluzzardi https://github.com/aluzzardi if you have no way of knowing if the container is initialising, just return good

The corollary of that is “if you have no way of knowing if the container is initializing and you can’t reach it, return bad”.

Point being, we don’t have a grace period.

/cc @psftw https://github.com/psftw @tianon https://github.com/tianon @jpetazzo https://github.com/jpetazzo

I think we need to write a few examples for some of the library images and see how it works.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/docker/docker/issues/24396#issuecomment-231498778, or mute the thread https://github.com/notifications/unsubscribe/AAye-8cYDdIY3ZZkEw6JY1bPXQSBZnNUks5qTuK2gaJpZM4JGpon .

@aluzzardi if you have no way of knowing if the container is initialising, just return good. The aim is for things that are slow start to be able to indicate they are not ready yet so staggered start can work - many clients have asked for something as eg mysql can take 15 seconds to start, and they want services that depend on it to start after, rather than be unable to connect to the database.

I don’t think we should allow defining what an exit code means, this is too much control. Simple 0 or 1 should be ok and is extremely friendly to anyone who has dealt with exit codes (which is anyone who has done system monitoring I should think).

Agree 2 is rather useless. A “starting” state would just be a container that has not returned a successful healthcheck yet.