skaffold: Skaffold fails when a container's status code is 1

I have several pods that depend on MySQL to start. Each of those pods implements the proper readinessProbe and livenessProbe to report the pod's status. If the application cannot connect to MySQL, the container exits with code 1, because the connection is required for the application to work. Since Kubernetes has no built-in dependency management between workloads, the recommended way to handle this is to let the pod restart its containers until the dependency is available (a sketch of the probe setup follows below).
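For reference, the probe setup looks roughly like this. This is a minimal sketch, not the real manifests: the container name, port 8080, and /healthz path are placeholders.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a
spec:
  selector:
    matchLabels:
      app: service-a
  template:
    metadata:
      labels:
        app: service-a
    spec:
      containers:
        - name: service-a
          image: service-a   # placeholder image
          ports:
            - containerPort: 8080
          # readiness: only receive traffic once the app has connected to MySQL
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
          # liveness: restart the container if the app stops responding
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10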

Currently, when I run skaffold dev, it fails and deletes the resources immediately instead of waiting. If I run skaffold run, it also reports a failure, but 2-3 seconds later the pods created by Skaffold are running properly in the cluster.

Expected behavior

Skaffold waits for Kubernetes to get the pod running and does not fail on an individual container's exit code. It should only stop and delete the resources if the pod is still failing after the statusCheckDeadlineSeconds deadline.

Actual behavior

Skaffold does not start dev mode and immediately deletes the resources from the cluster:

 - deployment/mysql is ready. [2/3 deployment(s) still pending]
 - deployment/service-b: container service-b terminated with exit code 1
    - pod/service-b-59995f6f5f-bmg74: container service-b terminated with exit code 1
      > [service-b-59995f6f5f-bmg74 service-b] 2021/01/07 07:46:00 Failed to initialize the migration driver dial tcp 10.97.179.35:3306: connect: connection refused
 - deployment/service-b failed. Error: container service-b terminated with exit code 1.
 - deployment/service-a: container service-a terminated with exit code 1
    - pod/service-a-7697f8bdd4-gg9rv: container service-a terminated with exit code 1
      > [service-a-7697f8bdd4-gg9rv service-a] 2021/01/07 07:46:00 Failed to initialize the migration driver dial tcp 10.97.179.35:3306: connect: connection refused
 - deployment/service-a failed. Error: container service-a terminated with exit code 1.
 Cleaning up...
 - configmap "mysql" deleted
 - deployment.apps "mysql" deleted
 - service "mysql" deleted
 - deployment.apps "service-a" deleted
 - service "service-a" deleted
 - deployment.apps "service-b" deleted
 - service "service-b" deleted
exiting dev mode because first deploy failed: 2/3 deployment(s) failed

Information

  • Skaffold version: bleeding edge 35214eb
  • Operating system: Debian testing / minikube
  • Contents of skaffold.yaml:
apiVersion: skaffold/v2beta10
kind: Config
build:
  local:
    concurrency: 0
    useBuildkit: true
  artifacts:
    [confidential]
deploy:
  statusCheckDeadlineSeconds: 60
  kubectl:
    manifests:
      - ./build/kubernetes/*
test:
  - image: [confidential]
    structureTests:
      - './build/tests/*'

Steps to reproduce the behavior

Have a pod that exits with code 1 before MySQL starts, and then runs successfully once MySQL is available.
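A minimal manifest that reproduces the flapping behaviour could look like this. This is a sketch: the busybox image and the nc-based check (assuming an nc build that supports -z) stand in for the real application and its migration driver.

# Hypothetical reproduction: the container exits 1 while mysql:3306 is
# unreachable, and only stays up once the MySQL service is ready.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flaky-service
spec:
  selector:
    matchLabels:
      app: flaky-service
  template:
    metadata:
      labels:
        app: flaky-service
    spec:
      containers:
        - name: flaky-service
          image: busybox:1.36
          command: ["sh", "-c"]
          args:
            - |
              # fail fast if MySQL is not reachable yet
              nc -z mysql 3306 || exit 1
              # otherwise pretend to serve traffic
              sleep 3600

Deploy it next to a mysql Deployment that takes a while to become ready and run skaffold dev: the status check sees the exit-code-1 terminations during the restart cycles and fails the deployment.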

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 6
  • Comments: 17 (9 by maintainers)

Most upvoted comments

@foobarbecue thanks for the context. I think what we should probably do here is expose an allowPodRestart flag (or something similar) that backs off the status check when it sees a failure, giving the pods time to go through some restart cycles before we actually call the deployments failed.

Recently this PR was merged: https://github.com/GoogleContainerTools/skaffold/pull/8047

It adds the --tolerate-failures-until-deadline=[true|false] flag to Skaffold, as well as the skaffold.yaml config option below:

deploy:
  tolerateFailuresUntilDeadline: true # false is the default

This has not been added to our docs site yet; there is an issue tracking that here: https://github.com/GoogleContainerTools/skaffold/issues/8060

With the option enabled, Skaffold keeps waiting for all containers to become successful until the statusCheckDeadlineSeconds timeout (versus the normal behaviour of failing as soon as a single deployment/container fails). This way, “flapping” deployments can be better supported for dev and CI/CD usage.
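For anyone landing here, combining this with the deadline from the reporter's config would look roughly like the sketch below (based only on the options mentioned above):

deploy:
  statusCheckDeadlineSeconds: 60        # keep checking for up to 60s
  tolerateFailuresUntilDeadline: true   # don't fail on the first exit code 1
  kubectl:
    manifests:
      - ./build/kubernetes/*

or, equivalently, on the command line: skaffold dev --tolerate-failures-until-deadline=true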

I believe the feature above should resolve this issue. I'll wait until the docs issue/PR is closed and then close this one.