skaffold: Skaffold fails when a container's status code is 1

I have several pods that depend on MySQL to start. Each of those pods implements the proper readinessProbe and livenessProbe to report the pod's status. If the application cannot connect to MySQL, the container exits with code 1, because the connection is required for the application to work. Since Kubernetes has no built-in dependency management between workloads, the recommended way to handle this is to let the pod restart its containers until the dependency is available (a sketch of the probe setup follows below).
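For reference, the probe setup looks roughly like this. This is a minimal sketch, not the real manifests: the container name, port 8080, and /healthz path are placeholders.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a
spec:
  selector:
    matchLabels:
      app: service-a
  template:
    metadata:
      labels:
        app: service-a
    spec:
      containers:
        - name: service-a
          image: service-a   # placeholder image
          ports:
            - containerPort: 8080
          # readiness: only receive traffic once the app has connected to MySQL
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
          # liveness: restart the container if the app stops responding
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10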

Currently, when I run skaffold dev, it fails and deletes the resources immediately instead of waiting. If I run skaffold run, it also reports a failure, but 2-3 seconds later the pods created by Skaffold are running properly in the cluster.

Expected behavior

Skaffold waits for Kubernetes to get the pod running and does not fail on an individual container's exit code. It should only stop and delete the resources if the pod is still failing after the statusCheckDeadlineSeconds deadline.

Actual behavior

Skaffold does not start dev mode and immediately deletes the resources from the cluster:

 - deployment/mysql is ready. [2/3 deployment(s) still pending]
 - deployment/service-b: container service-b terminated with exit code 1
    - pod/service-b-59995f6f5f-bmg74: container service-b terminated with exit code 1
      > [service-b-59995f6f5f-bmg74 service-b] 2021/01/07 07:46:00 Failed to initialize the migration driver dial tcp 10.97.179.35:3306: connect: connection refused
 - deployment/service-b failed. Error: container service-b terminated with exit code 1.
 - deployment/service-a: container service-a terminated with exit code 1
    - pod/service-a-7697f8bdd4-gg9rv: container service-a terminated with exit code 1
      > [service-a-7697f8bdd4-gg9rv service-a] 2021/01/07 07:46:00 Failed to initialize the migration driver dial tcp 10.97.179.35:3306: connect: connection refused
 - deployment/service-a failed. Error: container service-a terminated with exit code 1.
 Cleaning up...
 - configmap "mysql" deleted
 - deployment.apps "mysql" deleted
 - service "mysql" deleted
 - deployment.apps "service-a" deleted
 - service "service-a" deleted
 - deployment.apps "service-b" deleted
 - service "service-b" deleted
exiting dev mode because first deploy failed: 2/3 deployment(s) failed

Information

  • Skaffold version: bleeding edge 35214eb
  • Operating system: Debian testing / minikube
  • Contents of skaffold.yaml:
apiVersion: skaffold/v2beta10
kind: Config
build:
  local:
    concurrency: 0
    useBuildkit: true
  artifacts:
    [confidential]
deploy:
  statusCheckDeadlineSeconds: 60
  kubectl:
    manifests:
      - ./build/kubernetes/*
test:
  - image: [confidential]
    structureTests:
      - './build/tests/*'

Steps to reproduce the behavior

Have a pod that exits with code 1 before MySQL starts, and then runs successfully once MySQL is available.
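A minimal manifest that reproduces the flapping behaviour could look like this. This is a sketch: the busybox image and the nc-based check (assuming an nc build that supports -z) stand in for the real application and its migration driver.

# Hypothetical reproduction: the container exits 1 while mysql:3306 is
# unreachable, and only stays up once the MySQL service is ready.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flaky-service
spec:
  selector:
    matchLabels:
      app: flaky-service
  template:
    metadata:
      labels:
        app: flaky-service
    spec:
      containers:
        - name: flaky-service
          image: busybox:1.36
          command: ["sh", "-c"]
          args:
            - |
              # fail fast if MySQL is not reachable yet
              nc -z mysql 3306 || exit 1
              # otherwise pretend to serve traffic
              sleep 3600

Deploy it next to a mysql Deployment that takes a while to become ready and run skaffold dev: the status check sees the exit-code-1 terminations during the restart cycles and fails the deployment.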

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 6
  • Comments: 17 (9 by maintainers)

Most upvoted comments

@foobarbecue thanks for the context. I think what we should probably do here is expose an allowPodRestart flag (or something similar) that backs off the status check when it sees a failure, giving the pods time to go through some restart cycles before we actually call the deployments failed.

Recently this PR was merged: https://github.com/GoogleContainerTools/skaffold/pull/8047

It adds the --tolerate-failures-until-deadline=[true|false] flag to Skaffold, as well as the skaffold.yaml config option below:

deploy:
  tolerateFailuresUntilDeadline: true # false is the default

This has not been added to our docs site yet; there is an issue tracking that here: https://github.com/GoogleContainerTools/skaffold/issues/8060

With the option enabled, Skaffold keeps waiting for all containers to become successful until the statusCheckDeadlineSeconds timeout (versus the normal behaviour of failing as soon as a single deployment/container fails). This way, “flapping” deployments can be better supported for dev and CI/CD usage.
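For anyone landing here, combining this with the deadline from the reporter's config would look roughly like the sketch below (based only on the options mentioned above):

deploy:
  statusCheckDeadlineSeconds: 60        # keep checking for up to 60s
  tolerateFailuresUntilDeadline: true   # don't fail on the first exit code 1
  kubectl:
    manifests:
      - ./build/kubernetes/*

or, equivalently, on the command line: skaffold dev --tolerate-failures-until-deadline=true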

I believe the feature above should resolve this issue. I'll wait until the docs issue/PR is closed and then close this one.