cri-o: Container status cannot be retrieved when pod is in Terminating state

What happened?

While debugging why a cAdvisor (version 0.45.0) pod could not start on a node of a Kubernetes cluster, it was detected that it was not possible to get the status of a container belonging to a pod in the Terminating state. Commands like crictl stats and crictl inspect hung when used on the container associated with this pod. Commands on other containers worked as expected.

The issue was reproduced in both CRI-O 1.23 and CRI-O 1.25.

What did you expect to happen?

crictl commands should either return the status of the container or return a specific error indicating why the status cannot be retrieved.

How can we reproduce it (as minimally and precisely as possible)?

  1. Deploy a pod that ignores SIGTERM when it is deleted. Set the terminationGracePeriodSeconds parameter in the spec to a very high value so the kubelet does not send a SIGKILL to the container associated with the pod:
apiVersion: v1
kind: Pod
metadata:
  name: nginx-example
spec:
  containers:
  - name: nginx-example
    image: nginx
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;" ]
  terminationGracePeriodSeconds: 3600
  2. Wait for the pod to be running and then try to delete it (in the example, kubectl delete pod nginx-example). The delete command is synchronous but the SIGTERM is sent immediately, so the user needs to cancel the command manually with Ctrl+C; after that the pod is left in the Terminating state.
# kubectl  get pods  nginx-example
NAME            READY   STATUS        RESTARTS   AGE
nginx-example   1/1     Terminating   0          14m
  3. Try to inspect the container associated with the pod. The command hangs and, as with the kubectl delete command, the user needs to cancel it with Ctrl+C.
# crictl pods | grep -i "nginx"
165bf13473d07       8 minutes ago       Ready               nginx-example                                           default                             0                   (default)

# crictl ps | grep -i 165bf13473d07
d418ba4dddd0a       nginx@sha256:63b44e8ddb83d5dd8020327c1f40436e37a6fffd3ef2498a6204df23be6e7e94                                            8 minutes ago       Running             nginx-example                                   0                   165bf13473d07


# time crictl inspect d418ba4dddd0a
^C

real    0m12.100s
user    0m0.012s
sys     0m0.019s

Anything else we need to know?

Taking a look at the code and debugging CRI-O with pprof enabled (the --profile flag), it seems the status of the container is locked while the container is being stopped, in StopContainer:

1 @ 0x55b008e939d6 0x55b008ea37d2 0x55b00a483d31 0x55b00a48453c 0x55b00a47b922 0x55b00a4c6b31 0x55b00a528c5e 0x55b00a54ec86 0x55b0098b83fb 0x55b009c5cd43 0x55b00a55003a 0x55b00a316e12 0x55b00a55003a 0x55b00a54fedf 0x55b0098b82b8 0x55b009886c0f 0x55b00988a9ea 0x55b009884738 0x55b008ec4e01
#	0x55b00a483d30	github.com/cri-o/cri-o/internal/oci.WaitContainerStop+0x270				/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/internal/oci/runtime_oci.go:685
#	0x55b00a48453b	github.com/cri-o/cri-o/internal/oci.(*runtimeOCI).StopContainer+0x37b			/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/internal/oci/runtime_oci.go:771
#	0x55b00a47b921	github.com/cri-o/cri-o/internal/oci.(*Runtime).StopContainer+0x61			/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/internal/oci/oci.go:326
#	0x55b00a4c6b30	github.com/cri-o/cri-o/internal/lib.(*ContainerServer).StopContainer+0x50		/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/internal/lib/stop.go:14
#	0x55b00a528c5d	github.com/cri-o/cri-o/server.(*Server).StopContainer+0x29d				/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/server/container_stop.go:34
#	0x55b00a54ec85	github.com/cri-o/cri-o/server/cri/v1.(*service).StopContainer+0x25			/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/server/cri/v1/rpc_stop_container.go:12
#	0x55b0098b83fa	k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_StopContainer_Handler.func1+0x7a	/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/k8s.io/cri-api/pkg/apis/runtime/v1/api.pb.go:9019
#	0x55b009c5cd42	github.com/cri-o/cri-o/internal/log.UnaryInterceptor.func1+0xc2				/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/internal/log/interceptors.go:56
#	0x55b00a550039	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39		/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0x55b00a316e11	github.com/cri-o/cri-o/server/metrics.UnaryInterceptor.func1+0xb1			/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/server/metrics/interceptors.go:24
#	0x55b00a550039	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39		/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0x55b00a54fede	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1+0xbe		/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34
#	0x55b0098b82b7	k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_StopContainer_Handler+0x137		/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/k8s.io/cri-api/pkg/apis/runtime/v1/api.pb.go:9021
#	0x55b009886c0e	google.golang.org/grpc.(*Server).processUnaryRPC+0xcce					/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/google.golang.org/grpc/server.go:1282
#	0x55b00988a9e9	google.golang.org/grpc.(*Server).handleStream+0xa29					/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/google.golang.org/grpc/server.go:1616
#	0x55b009884737	google.golang.org/grpc.(*Server).serveStreams.func1.2+0x97				/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/google.golang.org/grpc/server.go:921

After that, the goroutines serving the crictl commands try to acquire the same lock, but since the StopContainer goroutine already holds it, they get stuck:

2 @ 0x55b008e939d6 0x55b008ea48cc 0x55b008ea48a6 0x55b008ec0b05 0x55b00a47817d 0x55b00a478153 0x55b00a5286b1 0x55b00a52838f 0x55b00a54f106 0x55b00a3da1b8 0x55b009c5cd43 0x55b00a55003a 0x55b00a316e12 0x55b00a55003a 0x55b00a54fedf 0x55b00a3da078 0x55b009886c0f 0x55b00988a9ea 0x55b009884738 0x55b008ec4e01
#	0x55b008ec0b04	sync.runtime_SemacquireMutex+0x24								/usr/lib64/go-1.17/src/runtime/sema.go:71
#	0x55b00a47817c	sync.(*RWMutex).RLock+0x5c									/usr/lib64/go-1.17/src/sync/rwmutex.go:63
#	0x55b00a478152	github.com/cri-o/cri-o/internal/oci.(*Container).State+0x32					/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/internal/oci/container.go:372
#	0x55b00a5286b0	github.com/cri-o/cri-o/server.(*Server).createContainerInfo+0xf0				/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/server/container_status.go:127
#	0x55b00a52838e	github.com/cri-o/cri-o/server.(*Server).ContainerStatus+0x7ee					/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/server/container_status.go:104
#	0x55b00a54f105	github.com/cri-o/cri-o/server/cri/v1alpha2.(*service).ContainerStatus+0x25			/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/server/cri/v1alpha2/rpc_container_status.go:14
#	0x55b00a3da1b7	k8s.io/cri-api/pkg/apis/runtime/v1alpha2._RuntimeService_ContainerStatus_Handler.func1+0x77	/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/k8s.io/cri-api/pkg/apis/runtime/v1alpha2/api.pb.go:9082
#	0x55b009c5cd42	github.com/cri-o/cri-o/internal/log.UnaryInterceptor.func1+0xc2					/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/internal/log/interceptors.go:56
#	0x55b00a550039	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39			/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0x55b00a316e11	github.com/cri-o/cri-o/server/metrics.UnaryInterceptor.func1+0xb1				/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/server/metrics/interceptors.go:24
#	0x55b00a550039	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39			/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0x55b00a54fede	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1+0xbe			/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34
#	0x55b00a3da077	k8s.io/cri-api/pkg/apis/runtime/v1alpha2._RuntimeService_ContainerStatus_Handler+0x137		/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/k8s.io/cri-api/pkg/apis/runtime/v1alpha2/api.pb.go:9084
#	0x55b009886c0e	google.golang.org/grpc.(*Server).processUnaryRPC+0xcce						/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/google.golang.org/grpc/server.go:1282
#	0x55b00988a9e9	google.golang.org/grpc.(*Server).handleStream+0xa29						/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/google.golang.org/grpc/server.go:1616
#	0x55b009884737	google.golang.org/grpc.(*Server).serveStreams.func1.2+0x97					/home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/google.golang.org/grpc/server.go:921

Full stacktrace

CRI-O and Kubernetes version

# crio --version
crio version 1.23.4
Version:          1.23.4
GitCommit:        a6a1e6ebf4c4baca7ca7d57f069a4e0cd89056cd
GitTreeState:     clean
BuildDate:        2022-10-21T16:26:30Z
GoVersion:        go1.17.5
Compiler:         gc
Platform:         linux/amd64
Linkmode:         dynamic
BuildTags:        exclude_graphdriver_devicemapper, seccomp
SeccompEnabled:   true
AppArmorEnabled:  false
# kubectl  version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.12", GitCommit:"c6939792865ef0f70f92006081690d77411c8ed5", GitTreeState:"clean", BuildDate:"2022-09-21T12:20:29Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.12", GitCommit:"c6939792865ef0f70f92006081690d77411c8ed5", GitTreeState:"clean", BuildDate:"2022-09-21T12:13:07Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}

OS version

[root@sc4-r14-u10 ~]# cat /etc/os-release
NAME="Oracle Linux Server"
VERSION="8.6"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="8.6"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Oracle Linux Server 8.6"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:8:6:server"
HOME_URL="https://linux.oracle.com/"
BUG_REPORT_URL="https://bugzilla.oracle.com/"

ORACLE_BUGZILLA_PRODUCT="Oracle Linux 8"
ORACLE_BUGZILLA_PRODUCT_VERSION=8.6
ORACLE_SUPPORT_PRODUCT="Oracle Linux"
ORACLE_SUPPORT_PRODUCT_VERSION=8.6

# uname -a
Linux sc4-r14-u10 4.18.0-372.32.1.0.1.el8_6.x86_64 #1 SMP Tue Oct 25 10:51:38 PDT 2022 x86_64 x86_64 x86_64 GNU/Linux

Additional environment details (AWS, VirtualBox, physical, etc.)

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Comments: 18 (9 by maintainers)

Most upvoted comments

Yes, I can begin the process. The main version has merged, so I'll start cherry-picking it back.

To fix the stats piece though, you can set stats_collection_period to something other than 0 and stats will be collected asynchronously. Then, when you call crictl stats, it will return cached information instead of gathering it on demand, which requires taking the container's opLock, which is held during termination.
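For reference, a sketch of what that workaround might look like in crio.conf. The section and key names below are assumptions based on the maintainer's comment; check the crio.conf documentation for your CRI-O version before applying.

```toml
# /etc/crio/crio.conf.d/10-stats.conf (hypothetical drop-in path)
[crio.stats]
# Collect container stats asynchronously every 10 seconds instead of
# on demand (0 = gather on demand, which takes the container's opLock).
stats_collection_period = 10
```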