skaffold: Skaffold fails when a container's status code is 1
I have several pods that depend on MySQL being up before they can start.
Each of those pods implements the proper readinessProbe and livenessProbe to report the status of the pod.
If the application cannot connect to MySQL, the container exits with code 1, because a MySQL connection is required for the application to work.
Since Kubernetes has no built-in dependency management between resources, the recommended way to handle this is to let the pod restart its containers until they succeed.
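For context, the probes on each dependent pod look roughly like this (a minimal sketch; the `/healthz` path and port 8080 are illustrative assumptions, not taken from the real manifests):

```yaml
# Hypothetical probe configuration for a pod that depends on MySQL.
# The /healthz endpoint and port 8080 are assumptions for illustration.
containers:
  - name: service-a
    image: service-a:latest
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 20
```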
Currently, when I run `skaffold dev`, it fails and deletes the resources immediately instead of waiting. If I do `skaffold run`, it also reports a failure, but 2-3 seconds later the pods created by Skaffold are running properly in the cluster.
Expected behavior
Skaffold waits for Kubernetes to get the pod running and does not treat an individual container's exit code as fatal.
It should only stop and delete the resources if the pod still fails after `statusCheckDeadlineSeconds` seconds.
Actual behavior
Skaffold does not start dev mode and deletes the resources from the cluster:
```
 - deployment/mysql is ready. [2/3 deployment(s) still pending]
 - deployment/service-b: container service-b terminated with exit code 1
    - pod/service-b-59995f6f5f-bmg74: container service-b terminated with exit code 1
      > [service-b-59995f6f5f-bmg74 service-b] 2021/01/07 07:46:00 Failed to initialize the migration driver dial tcp 10.97.179.35:3306: connect: connection refused
 - deployment/service-b failed. Error: container service-b terminated with exit code 1.
 - deployment/service-a: container service-a terminated with exit code 1
    - pod/service-a-7697f8bdd4-gg9rv: container service-a terminated with exit code 1
      > [service-a-7697f8bdd4-gg9rv service-a] 2021/01/07 07:46:00 Failed to initialize the migration driver dial tcp 10.97.179.35:3306: connect: connection refused
 - deployment/service-a failed. Error: container service-a terminated with exit code 1.
Cleaning up...
 - configmap "mysql" deleted
 - deployment.apps "mysql" deleted
 - service "mysql" deleted
 - deployment.apps "service-a" deleted
 - service "service-a" deleted
 - deployment.apps "service-b" deleted
 - service "service-b" deleted
exiting dev mode because first deploy failed: 2/3 deployment(s) failed
```
Information
- Skaffold version: bleeding edge 35214eb
- Operating system: Debian testing / minikube
- Contents of skaffold.yaml:

```yaml
apiVersion: skaffold/v2beta10
kind: Config
build:
  local:
    concurrency: 0
    useBuildkit: true
  artifacts:
    [confidential]
deploy:
  statusCheckDeadlineSeconds: 60
  kubectl:
    manifests:
      - ./build/kubernetes/*
test:
  - image: [confidential]
    structureTests:
      - './build/tests/*'
```
Steps to reproduce the behavior
Have a pod that exits with code 1 while MySQL is unavailable, and then runs successfully once MySQL is up.
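A minimal way to reproduce such a container is a Deployment whose command fails until MySQL is reachable (a hypothetical sketch: the busybox image and the `nc` probe are illustrative, assuming a `mysql` Service listening on port 3306):

```yaml
# Hypothetical reproduction: the container exits 1 until the `mysql`
# Service accepts TCP connections on port 3306; Kubernetes restarts it
# (with backoff) until the connection succeeds.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: service-a
  template:
    metadata:
      labels:
        app: service-a
    spec:
      containers:
        - name: service-a
          image: busybox:1.36
          command: ["sh", "-c", "nc -z mysql 3306 || exit 1; sleep infinity"]
```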
About this issue
- Original URL
- State: open
- Created 3 years ago
- Reactions: 6
- Comments: 17 (9 by maintainers)
Commits related to this issue
- Fix for #5210 Remove these errors from the unrecoverable list. Container errors are recoverable in a K8S environment, they may be waiting for another resource to become stable e.g. — committed to casret/skaffold by deleted user 3 years ago
- Fix for #5210 Remove these errors from the unrecoverable list. Container errors are recoverable in a K8S environment, they may be waiting for another resource to become stable e.g. — committed to casret/skaffold by casret 3 years ago
- Fix for #5210 Remove these from the unrecoverable errors list. Containers are ephemeral in k8s, so errors in them may be recoverable at a system level. E.g. when they are waiting for another resour... — committed to casret/skaffold by casret 3 years ago
@foobarbecue thanks for the context. I think probably what we should do here is expose an `allowPodRestart` flag or something similar that backs off the status check when it sees a failure, to give the pods time to go through some restart cycles before actually calling the deployments failed.

Recently this PR was merged: https://github.com/GoogleContainerTools/skaffold/pull/8047

It adds the `--tolerate-failures-until-deadline=[true|false]` flag to Skaffold, as well as the corresponding skaffold.yaml config option. This has not been added to our docs site yet; there is an issue tracking that here: https://github.com/GoogleContainerTools/skaffold/issues/8060
With the option enabled, Skaffold will wait for all containers to become successful until the given `statusCheckDeadlineSeconds` timeout (vs. the normal behaviour of failing as soon as a single deploy/container fails). This way "flapping" deployments can be supported better for dev and CI/CD usage.

I believe the feature above should resolve this issue. Will wait until the docs issue/PR is closed and then I will close this.
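For anyone trying the config-file variant, it would look roughly like this (a sketch only: the `tolerateFailuresUntilDeadline` key name and schema version are assumptions based on PR #8047; verify against the skaffold.yaml reference for your schema version before relying on it):

```yaml
# Hypothetical sketch; field name assumed from PR #8047.
apiVersion: skaffold/v4beta1
kind: Config
deploy:
  statusCheckDeadlineSeconds: 60
  # Assumed option: treat container failures as recoverable until the
  # status-check deadline expires, instead of failing immediately.
  tolerateFailuresUntilDeadline: true
```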