kubernetes: kubectl exec can unexpectedly exit 0

What happened:

A little while after running kubectl exec -it <pod> bash, in the middle of a long-running command, I got kicked out of the kubectl exec session back into the outer shell, with kubectl exiting 0. The pod didn’t restart or anything like that.

What you expected to happen:

kubectl exec should never exit 0 unless the command it was executing exits 0 itself.

How to reproduce it (as minimally and precisely as possible):

Not sure.

Anything else we need to know?:

Here is the relevant excerpt from the logs:

docker-desktop:/var/lib/log# fgrep 2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee *
docker.log:2019-01-17T23:49:37Z docker time="2019-01-17T23:49:37.025234900Z" level=debug msg="Calling POST /v1.38/exec/2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee/start"
docker.log:2019-01-17T23:49:37Z docker time="2019-01-17T23:49:37.025894500Z" level=debug msg="starting exec command 2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee in container e41e3d8c4373adbb199b0a287191427391b789e7a42d8ee90609f416bbbe3de8"
docker.log:2019-01-17T23:49:37Z docker time="2019-01-17T23:49:37.033666200Z" level=debug msg="Calling POST /v1.38/exec/2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee/resize?h=56&w=204"
docker.log:2019-01-17T23:56:05Z docker time="2019-01-17T23:56:05.619356500Z" level=debug msg="Calling GET /v1.38/exec/2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee/json"
docker.log:2019-01-17T23:56:07Z docker time="2019-01-17T23:56:07.619562100Z" level=debug msg="Calling GET /v1.38/exec/2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee/json"
docker.log:2019-01-17T23:56:09Z docker time="2019-01-17T23:56:09.619597700Z" level=debug msg="Calling GET /v1.38/exec/2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee/json"
docker.log:2019-01-17T23:56:11Z docker time="2019-01-17T23:56:11.620721000Z" level=debug msg="Calling GET /v1.38/exec/2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee/json"
docker.log:2019-01-17T23:56:13Z docker time="2019-01-17T23:56:13.619711500Z" level=debug msg="Calling GET /v1.38/exec/2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee/json"
docker.log:2019-01-17T23:56:13Z docker E0117 23:56:13.621199    1281 exec.go:127] Exec session 2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee in container e41e3d8c4373adbb199b0a287191427391b789e7a42d8ee90609f416bbbe3de8 terminated but process still running!
kubelet.log:2019-01-17T23:56:13Z kubelet E0117 23:56:13.621199    1281 exec.go:127] Exec session 2d15caa40e0b44d0640ff2bd8a679107f8a9cd888d157fe22c42debf620b29ee in container e41e3d8c4373adbb199b0a287191427391b789e7a42d8ee90609f416bbbe3de8 terminated but process still running!

It’s not clear why kubectl exec suddenly exited, but looking at the code, it’s clear that it can exit 0 when this weird situation occurs (from kubelet/dockershim/exec.go):

	err = client.StartExec(execObj.ID, startOpts, streamOpts)
	// ...
	for {
		inspect, err2 := client.InspectExec(execObj.ID)
		// ...

		count++
		if count == 5 {
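			// err still holds the (nil) result of client.StartExec at this point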
			klog.Errorf("Exec session %s in container %s terminated but process still running!", execObj.ID, container.ID)
			break // <--- bug if we break here with err == nil
		}

		<-ticker.C
	}

	return err

If we hit the break annotated above, I believe kubectl exec will exit 0, despite this being an error condition. That makes error handling at the level above impossible: we need to be able to trust that kubectl exec only ever exits 0 if the underlying command it ran exited 0.

So I believe that if err == nil when we reach that break, it should be set to an error that makes kubectl exec exit with a non-zero exit code.
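
For illustration, here is a minimal sketch of the change I have in mind (the error construction and its wording are mine, not the actual dockershim code):

		count++
		if count == 5 {
			klog.Errorf("Exec session %s in container %s terminated but process still running!", execObj.ID, container.ID)
			if err == nil {
				// Hypothetical: report the stuck session as an error so the layers
				// above (and ultimately kubectl exec) see a non-zero exit instead
				// of a silent success.
				err = fmt.Errorf("exec session %s in container %s terminated but process still running", execObj.ID, container.ID)
			}
			break
		}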

Environment:

  • Kubernetes version (use kubectl version)
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.1", GitCommit:"eec55b9ba98609a46fee712359c7b5b365bdd920", GitTreeState:"clean", BuildDate:"2018-12-13T19:44:10Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0", GitCommit:"ddf47ac13c1a9483ea035a79cd7c10005ff21a6d", GitTreeState:"clean", BuildDate:"2018-12-03T20:56:12Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: Docker for Mac version 2.0.1.0 (30090)
  • OS (e.g. from /etc/os-release): macOS 10.13.6
  • Kernel (e.g. uname -a): Darwin hactar.localdomain 17.7.0 Darwin Kernel Version 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64 x86_64
  • Install tools: k8s through Docker for Mac
  • Others: Docker 18.09.1

/sig cli

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 6
  • Comments: 55 (17 by maintainers)

Most upvoted comments

Hey,

I think I am encountering this issue as well. My use case is running a long query on Mongo via kubectl exec, and after about 20 minutes I am kicked out of the pod. In the syslog of the node I have this error:

Exec session in container <id> terminated but process still running!

Would be great to know what the solution is for this.

Thanks, Richard

Fight with the “automation” begins 😃

/remove-lifecycle stale

It does not seem to have anything to do with the runtime, or whether data is sent or not. Still trying to figure out what triggers it. In any case, a PR that ensures that kubectl does proper error reporting will go a long way in figuring out what makes it bail.

Notes

Below are some tests on my Drupal pod. But first, some notes:

  • I’m piping it through pv on my machine to see how much data is transferred.
  • Each command in this list is prefixed with kubectl -n ns -it exec deploy/drupal-deployment --container=drupal --, but I leave it out for brevity.
  • Wherever I say a test “has the issue”, I mean that kubectl exec stops with code 0, but the process keeps running in the container.
  • When I do something like kubectl -n ns -it exec deploy/drupal-deployment --container=drupal -- cat /dev/urandom | pv > /dev/null, the pipe to pv is done on my local machine, but cat /dev/urandom runs in the container.

The pipe tests

  • drush sql-dump --gzip | pv > tmp/db.sql.gz — has the issue. The .sql.gz file has “unexpected end of file”. Transfers about 50MiB in 23s before it dies.
  • drush sql-dump | pv > tmp/db.sql — has the issue. The .sql file also stops in the middle of a table, so we don’t have the entire database. Transfers about 387MiB in 3min before it dies.
  • watch date | pv > /dev/null — does NOT have the issue. It keeps running for a very long time without issues, so it rules out the possibility of a timeout somewhere. I’ll edit this if it dies at some point.
  • cat /dev/urandom | pv > /dev/null — does NOT have the issue. Same as watch date. Transferred about 1GiB in 0:07:42 at the time of writing, and still going strong, so that rules out a network bandwidth limitation.

Exit code test

This works as expected, so it rules out the possibility of the exec’d command completing with a non-zero exit code:

$ kubectl -n ns -it exec deploy/drupal-deployment --container=drupal -- /bin/sh -c "exit 1"
command terminated with exit code 1
$ echo $?
1

Running the same commands interactively

I opened an interactive bash shell, and tried the failing commands from above:

$ kubectl -n ns -it exec deploy/drupal-deployment --container=drupal -- /bin/bash

No problem here. The database dump is complete, and kubectl stays connected:

bash-5.1# drush sql:dump | pv > /tmp/db.sql
 394MiB 0:00:11 [35.5MiB/s]
bash-5.1# tail -n 1 /tmp/db.sql
-- Dump completed on 2022-02-04  9:31:24

Also works with gzip. kubectl stays connected:

bash-5.1# drush sql:dump --gzip | pv > /tmp/db.sql.gz
49.9MiB 0:00:18 [2.69MiB/s]
bash-5.1# zcat /tmp/db.sql.gz | tail -n 1
-- Dump completed on 2022-02-04  9:33:50

So that rules out the exec’d command using too much CPU or memory, because these are the same commands as in the above tests.

cat existing file test

This left me wondering if the exec’d command writes something to the stream that makes kubectl choke up, and it just hasn’t come up in /dev/urandom yet, but it’s part of my database dump. So let’s test that…

$ kubectl -n ns -it exec deploy/drupal-deployment --container=drupal -- cat /tmp/db.sql | pv > tmp/db.sql
 392MiB 0:02:59 [2.18MiB/s]
$ less tmp/db.sql

I can see in less that the file is cut off before the end! It should have transferred 394MiB, but we only have 392MiB.

$ kubectl -n ns -it exec deploy/drupal-deployment --container=drupal -- cat /tmp/db.sql.gz | pv > tmp/db.sql.gz
48.1MiB 0:00:21 [2.20MiB/s]

This is cut short again: 48.1MiB instead of 49.9MiB. But this is where it gets really weird. Compare the first few lines of the file on the source, vs. the file we got by downloading it through kubectl exec ... cat > ...:

$ kubectl -n ns -it exec deploy/drupal-deployment --container=drupal -- /bin/sh -c "zcat /tmp/db.sql.gz | head"
-- MariaDB dump 10.19  Distrib 10.5.13-MariaDB, for Linux (x86_64)
--
-- Host: db-primary    Database: drupal
-- ------------------------------------------------------
-- Server version       10.5.13-MariaDB-log

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8mb4 */;
$ zcat tmp/db.sql.gz | head
-- MariaDB dump 10.19  Distrib 10.5.13-MariaDB, for Linux (x86_64)
--
-- Host: db-primary    Database: drupal
-- ------------------------------------------------------
-- Server version       10.5.13-MariaDB-log

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTIONb-primary    Databrupal
--UeES utf8mb4ry    Databr3LD_COLLATITIME_ZONEATITIME_ZONEry    Databr3LD_COTIME_ZONEA'+00:00'ry    Data014LD_COLLATIUNIQUETERECKRACTUNIQUETERECKR, UNIQUETERECKRA0ry    Data014LD_COLLATIFOREIGN_KEYTERECKRACTFOREIGN_KEYTERECKR, FOREIGN_KEYTERECKRA0!40101 SET @OLD_COLLATISQL_MODEATISQL_MODE, SQL_MODEA'NO_AUTO_VALUETECTZERO'ry    Data11OLD_COLLATISQL_NOTESATISQL_NOTES, SQL_NOTESA0!4010Host: dTable 0.5ucturenux (table `batch`Host

Not surprisingly, the downloaded file can’t be extracted completely:

$ zcat tmp/db.sql.gz > /dev/null

gzip: tmp/db.sql.gz: invalid compressed data--format violated

Trying to verify that the problem is not with different versions of gzip or something like that, I tried kubectl cp, and that works fine. The .sql.gz file has the proper size, and can be extracted on my local machine just like in the container.

$ kubectl cp ns/drupal-deployment-66795dfd9d-ft56s:/tmp/db.sql.gz -c drupal tmp/cp_db.sql.gz
tar: Removing leading `/' from member names
$ zcat tmp/cp_db.sql.gz | tail -n 1
-- Dump completed on 2022-02-04  9:33:50

Verdict

kubectl exec has issues transferring stdout data to the local machine.

I’m surprised this has gone on for so long. Apparently, this is by design:

"--streaming-connection-idle-timeout=30m" (see https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)

set in my deployment gives exactly the observed 30 minutes before disconnecting. What’s worse is that it exits with exit code 0. That may be fine in some cases, but it is problematic in others: for example, GitLab’s Kubernetes executor sends the entire job script into its pod as a single kubectl exec, and if that times out, the job aborts mid-run yet is reported as “Job succeeded”.
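
If this idle timeout is what is cutting the stream, it can be raised through the kubelet configuration. A sketch of where the knob lives, assuming the KubeletConfiguration file format (the field corresponds to the --streaming-connection-idle-timeout flag; the value here is only an example):

	apiVersion: kubelet.config.k8s.io/v1beta1
	kind: KubeletConfiguration
	# Idle timeout after which the kubelet tears down exec/attach/port-forward
	# streams; 0 disables the timeout entirely.
	streamingConnectionIdleTimeout: 4h0m0s

That only avoids the disconnect, of course; it does not fix kubectl exec reporting exit code 0 when the stream is cut.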

I have v1.16.3 on macOS and I still see this issue nearly on a daily basis. It generally appears when I leave my laptop asleep overnight. It seems like something under the hood is timing out, but kubectl exec exits 0 and the command it started keeps running in the container.