kubernetes: kubectl exec can unexpectedly exit 0
What happened:
A little while after running `kubectl exec -it <pod> bash`, in the middle of running a long command, I got kicked out of the `kubectl exec` session back into the outer shell, with kubectl exiting 0. The pod didn’t restart or anything like that.
What you expected to happen:
`kubectl exec` should never exit 0 unless the command it was executing exits 0 itself.
How to reproduce it (as minimally and precisely as possible):
Not sure.
Anything else we need to know?:
Here is an excerpt of the logs that is relevant:
docker-desktop:/var/lib/log# fgrep 2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee *
docker.log:2019-01-17T23:49:37Z docker time="2019-01-17T23:49:37.025234900Z" level=debug msg="Calling POST /v1.38/exec/2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee/start"
docker.log:2019-01-17T23:49:37Z docker time="2019-01-17T23:49:37.025894500Z" level=debug msg="starting exec command 2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee in container e41e3d8c4373adbb199b0a287191427391b789e7a42d8ee90609f416bbbe3de8"
docker.log:2019-01-17T23:49:37Z docker time="2019-01-17T23:49:37.033666200Z" level=debug msg="Calling POST /v1.38/exec/2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee/resize?h=56&w=204"
docker.log:2019-01-17T23:56:05Z docker time="2019-01-17T23:56:05.619356500Z" level=debug msg="Calling GET /v1.38/exec/2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee/json"
docker.log:2019-01-17T23:56:07Z docker time="2019-01-17T23:56:07.619562100Z" level=debug msg="Calling GET /v1.38/exec/2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee/json"
docker.log:2019-01-17T23:56:09Z docker time="2019-01-17T23:56:09.619597700Z" level=debug msg="Calling GET /v1.38/exec/2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee/json"
docker.log:2019-01-17T23:56:11Z docker time="2019-01-17T23:56:11.620721000Z" level=debug msg="Calling GET /v1.38/exec/2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee/json"
docker.log:2019-01-17T23:56:13Z docker time="2019-01-17T23:56:13.619711500Z" level=debug msg="Calling GET /v1.38/exec/2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee/json"
docker.log:2019-01-17T23:56:13Z docker E0117 23:56:13.621199 1281 exec.go:127] Exec session 2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee in container e41e3d8c4373adbb199b0a287191427391b789e7a42d8ee90609f416bbbe3de8 terminated but process still running!
kubelet.log:2019-01-17T23:56:13Z kubelet E0117 23:56:13.621199 1281 exec.go:127] Exec session 2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee in container e41e3d8c4373adbb199b0a287191427391b789e7a42d8ee90609f416bbbe3de8 terminated but process still running!
It’s not clear why `kubectl exec` suddenly exited; however, looking at the code, it’s obvious that it can potentially exit 0 when this weird situation occurs (from `kubelet/dockershim/exec.go`):
```go
err = client.StartExec(execObj.ID, startOpts, streamOpts)
// ...
for {
	inspect, err2 := client.InspectExec(execObj.ID)
	// ...
	count++
	if count == 5 {
		klog.Errorf("Exec session %s in container %s terminated but process still running!", execObj.ID, container.ID)
		break // <--- bug if we break here with err == nil
	}
	<-ticker.C
}
return err
```
If we hit the `break` annotated above, I believe that will cause `kubectl exec` to exit 0, despite this being an error condition. This means that error handling at the level above is not possible. We need to be able to trust that `kubectl exec` only ever exits 0 if the underlying command it ran exited 0.

So I believe if `err == nil` before hitting the `break`, it should be set to something that will make `kubectl exec` exit with a non-zero exit code.
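A minimal sketch of what such a fix could look like, written as a standalone simulation (the function name `waitForExec` and the `stillRunning` callback are mine, not the kubelet's; the real change would live in `kubelet/dockershim/exec.go`):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForExec simulates the inspect/retry loop from dockershim.
// stillRunning stands in for the InspectExec "is the process alive?"
// check. The key change: err is set before the break, so the caller
// never sees a silent success on this error path.
func waitForExec(stillRunning func() bool) error {
	var err error
	ticker := time.NewTicker(2 * time.Millisecond)
	defer ticker.Stop()

	count := 0
	for stillRunning() {
		count++
		if count == 5 {
			// Proposed fix: surface the condition as an error instead
			// of breaking with err == nil.
			err = errors.New("exec session terminated but process still running")
			break
		}
		<-ticker.C
	}
	return err
}

func main() {
	fmt.Println(waitForExec(func() bool { return true })) // prints the error, not <nil>
}
```

With `err` guaranteed non-nil on this path, the layers above (CRI, API server, kubectl) can translate it into a non-zero exit code instead of reporting success.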
Environment:
- Kubernetes version (use `kubectl version`):
  Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.1", GitCommit:"eec55b9ba98609a46fee712359c7b5b365bdd920", GitTreeState:"clean", BuildDate:"2018-12-13T19:44:10Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"darwin/amd64"}
  Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0", GitCommit:"ddf47ac13c1a9483ea035a79cd7c10005ff21a6d", GitTreeState:"clean", BuildDate:"2018-12-03T20:56:12Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: Docker for Mac version 2.0.1.0 (30090)
- OS (e.g. from /etc/os-release): macOS 10.13.6
- Kernel (e.g. `uname -a`): Darwin hactar.localdomain 17.7.0 Darwin Kernel Version 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64 x86_64
- Install tools: k8s through Docker for Mac
- Others: Docker 18.09.1
/sig cli
About this issue
- State: closed
- Created 5 years ago
- Reactions: 6
- Comments: 55 (17 by maintainers)
Hey,
I think that I am encountering this issue as well. My use case is running a long query on Mongo via `kubectl exec`, and after about 20 minutes I am kicked out of the pod. In the syslog of the node I have this error:
Would be great to know what the solution is for this.
Thanks, Richard
Fight with the “automation” begins 😃
/remove-lifecycle stale
It does not seem to have anything to do with the runtime, or whether data is sent or not. Still trying to figure out what triggers it. In any case, a PR that ensures that `kubectl` does proper error reporting will go a long way in figuring out what makes it bail.

Notes

Below are some tests on my Drupal pod. But first, some notes:
- I use `pv` on my machine to see how much data is transferred.
- Every command is actually run as `kubectl -n ns -it exec deploy/drupal-deployment --container=drupal -- <command>`, but I leave the prefix out for brevity.
- “Has the issue” means `kubectl exec` stops with code 0, but the process keeps running in the container.
- In e.g. `kubectl -n ns -it exec deploy/drupal-deployment --container=drupal -- cat /dev/urandom | pv > /dev/null`, the pipe to `pv` is done on my local machine, but `cat /dev/urandom` runs in the container.

The pipe tests

- `drush sql-dump --gzip | pv > tmp/db.sql.gz` — has the issue. The `.sql.gz` file has “unexpected end of file”. Transfers about 50MiB in 23s before it dies.
- `drush sql-dump | pv > tmp/db.sql` — has the issue. The `.sql` file also stops in the middle of a table, so we don’t have the entire database. Transfers about 387MiB in 3min before it dies.
- `watch date | pv > /dev/null` — does NOT have the issue. It keeps running for a very long time without issues, so it rules out the possibility of a timeout somewhere. I’ll edit this if it dies at some point.
- `cat /dev/urandom | pv > /dev/null` — does NOT have the issue. Same as `watch date`. Transferred about 1GiB in 0:07:42 at the time of writing, and still going strong. So that rules out a network bandwidth limitation.

Exit code test
This works as expected, so it rules out the possibility of the exec’d command completing with non-zero exit code:
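As an aside, the exit-code propagation being checked here can be sketched in Go with a generic process wrapper (illustrative only — this is not kubectl's actual code): a well-behaved exec wrapper surfaces the child's exit status rather than swallowing it.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// exitCode runs a command and returns the child's exit status,
// the way an exec wrapper like kubectl exec is expected to.
func exitCode(name string, args ...string) int {
	err := exec.Command(name, args...).Run()
	if err == nil {
		return 0
	}
	var ee *exec.ExitError
	if errors.As(err, &ee) {
		return ee.ExitCode() // child ran and exited non-zero
	}
	return -1 // command could not be started at all
}

func main() {
	fmt.Println(exitCode("sh", "-c", "exit 3")) // prints 3
}
```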
Running the same commands interactively
I opened an interactive bash shell, and tried the failing commands from above:
No problem here. The database dump is complete, and kubectl stays connected:
Also works with gzip. kubectl stays connected:
So that rules out the exec’d command using too much CPU or memory, because these are the same commands as in the above tests.
`cat` existing file test

This left me wondering if the exec’d command writes something to the stream that makes kubectl choke up, and it just hasn’t come up in `/dev/urandom` yet, but it’s part of my database dump. So let’s test that…

I can see in `less` that the file is cut off before the end! It should have transferred 394MiB, but we only have 392MiB.

This is cut short again: 48.1MiB instead of 49.9MiB. But this is where it gets really weird. Compare the first few lines of the file on the source, vs. the file we got by downloading it through `kubectl exec ... cat > ...`:

Not surprisingly, the downloaded file can’t be extracted completely:

Trying to verify that the problem is not with different versions of gzip or something like that, I tried `kubectl cp`, and that works fine. The `.sql.gz` file has the proper size, and can be extracted on my local machine just like in the container.

Verdict
kubectl exec has issues transferring stdout data to the local machine.
I’m surprised this has gone on for so long. Apparently, this is by design: setting `--streaming-connection-idle-timeout=30m` (see https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/) in my deployment gives me the observed exact 30 minutes before disconnecting. What’s worse is that it exits with an exit code of 0. This may be fine in some cases, but problematic in others. It seems, for example, that GitLab’s Kubernetes executor sends the entire job script into its pod as a single `kubectl exec`; should it time out, it aborts mid-job but reports “Job succeeded”.
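For reference, the same timeout can also be set via the kubelet config file rather than the command-line flag. A sketch (field name per the `KubeletConfiguration` v1beta1 API; the `30m` value here just mirrors the deployment described above):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Idle streaming connections (exec/attach/port-forward) are closed
# after this duration. Setting "0" disables the timeout, which has
# its own resource and security trade-offs.
streamingConnectionIdleTimeout: "30m"
```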
I have v1.16.3 on macOS and I still see this issue nearly on a daily basis. It generally appears when I leave my laptop to sleep overnight. Seems like something under the hood is timing out, but `kubectl exec` exits zero and the command it started keeps running in the container.