cri-o: Container status cannot be retrieved when pod is in Terminating state
What happened?
While debugging why a cAdvisor (version 0.45.0) pod could not start on a node of a Kubernetes cluster, we found that it was not possible to get the status of a container belonging to a pod in the Terminating state. Commands such as crictl stats and crictl inspect hang when run against the container of such a pod; commands against other containers work as expected.
The issue was reproduced with both CRI-O 1.23 and CRI-O 1.25.
What did you expect to happen?
crictl commands should either return the status of the container or return a specific error indicating why the status cannot be retrieved.
How can we reproduce it (as minimally and precisely as possible)?
- Deploy a pod that ignores SIGTERM when it is deleted. Set terminationGracePeriodSeconds in the spec to a very high value to prevent the kubelet from sending SIGKILL to the container associated with the pod:
apiVersion: v1
kind: Pod
metadata:
  name: nginx-example
spec:
  containers:
  - name: nginx-example
    image: nginx
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;" ]
  terminationGracePeriodSeconds: 3600
- Wait for the pod to be running and then delete it (in this example, run kubectl delete pod nginx-example). The delete command is synchronous, but SIGTERM is sent immediately; the user has to cancel the command with Ctrl+C, and after that the pod is left in the Terminating state.
# kubectl get pods nginx-example
NAME READY STATUS RESTARTS AGE
nginx-example 1/1 Terminating 0 14m
- Try to inspect the container associated with the pod. The command hangs and, as with the kubectl delete command, the user has to cancel it with Ctrl+C:
# crictl pods | grep -i "nginx"
165bf13473d07 8 minutes ago Ready nginx-example default 0 (default)
# crictl ps | grep -i 165bf13473d07
d418ba4dddd0a nginx@sha256:63b44e8ddb83d5dd8020327c1f40436e37a6fffd3ef2498a6204df23be6e7e94 8 minutes ago Running nginx-example 0 165bf13473d07
# time crictl inspect d418ba4dddd0a
^C
real 0m12.100s
user 0m0.012s
sys 0m0.019s
Anything else we need to know?
Taking a look at the code and debugging CRI-O with pprof enabled (--profile flag), it seems the container's state is locked while the container is being stopped after the delete, here:
1 @ 0x55b008e939d6 0x55b008ea37d2 0x55b00a483d31 0x55b00a48453c 0x55b00a47b922 0x55b00a4c6b31 0x55b00a528c5e 0x55b00a54ec86 0x55b0098b83fb 0x55b009c5cd43 0x55b00a55003a 0x55b00a316e12 0x55b00a55003a 0x55b00a54fedf 0x55b0098b82b8 0x55b009886c0f 0x55b00988a9ea 0x55b009884738 0x55b008ec4e01
# 0x55b00a483d30 github.com/cri-o/cri-o/internal/oci.WaitContainerStop+0x270 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/internal/oci/runtime_oci.go:685
# 0x55b00a48453b github.com/cri-o/cri-o/internal/oci.(*runtimeOCI).StopContainer+0x37b /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/internal/oci/runtime_oci.go:771
# 0x55b00a47b921 github.com/cri-o/cri-o/internal/oci.(*Runtime).StopContainer+0x61 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/internal/oci/oci.go:326
# 0x55b00a4c6b30 github.com/cri-o/cri-o/internal/lib.(*ContainerServer).StopContainer+0x50 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/internal/lib/stop.go:14
# 0x55b00a528c5d github.com/cri-o/cri-o/server.(*Server).StopContainer+0x29d /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/server/container_stop.go:34
# 0x55b00a54ec85 github.com/cri-o/cri-o/server/cri/v1.(*service).StopContainer+0x25 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/server/cri/v1/rpc_stop_container.go:12
# 0x55b0098b83fa k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_StopContainer_Handler.func1+0x7a /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/k8s.io/cri-api/pkg/apis/runtime/v1/api.pb.go:9019
# 0x55b009c5cd42 github.com/cri-o/cri-o/internal/log.UnaryInterceptor.func1+0xc2 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/internal/log/interceptors.go:56
# 0x55b00a550039 github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
# 0x55b00a316e11 github.com/cri-o/cri-o/server/metrics.UnaryInterceptor.func1+0xb1 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/server/metrics/interceptors.go:24
# 0x55b00a550039 github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
# 0x55b00a54fede github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1+0xbe /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34
# 0x55b0098b82b7 k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_StopContainer_Handler+0x137 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/k8s.io/cri-api/pkg/apis/runtime/v1/api.pb.go:9021
# 0x55b009886c0e google.golang.org/grpc.(*Server).processUnaryRPC+0xcce /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/google.golang.org/grpc/server.go:1282
# 0x55b00988a9e9 google.golang.org/grpc.(*Server).handleStream+0xa29 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/google.golang.org/grpc/server.go:1616
# 0x55b009884737 google.golang.org/grpc.(*Server).serveStreams.func1.2+0x97 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/google.golang.org/grpc/server.go:921
After that, the goroutine serving the crictl command tries to acquire the same lock, but since the other goroutine already holds it, this goroutine gets stuck (see the sketch after this trace):
2 @ 0x55b008e939d6 0x55b008ea48cc 0x55b008ea48a6 0x55b008ec0b05 0x55b00a47817d 0x55b00a478153 0x55b00a5286b1 0x55b00a52838f 0x55b00a54f106 0x55b00a3da1b8 0x55b009c5cd43 0x55b00a55003a 0x55b00a316e12 0x55b00a55003a 0x55b00a54fedf 0x55b00a3da078 0x55b009886c0f 0x55b00988a9ea 0x55b009884738 0x55b008ec4e01
# 0x55b008ec0b04 sync.runtime_SemacquireMutex+0x24 /usr/lib64/go-1.17/src/runtime/sema.go:71
# 0x55b00a47817c sync.(*RWMutex).RLock+0x5c /usr/lib64/go-1.17/src/sync/rwmutex.go:63
# 0x55b00a478152 github.com/cri-o/cri-o/internal/oci.(*Container).State+0x32 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/internal/oci/container.go:372
# 0x55b00a5286b0 github.com/cri-o/cri-o/server.(*Server).createContainerInfo+0xf0 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/server/container_status.go:127
# 0x55b00a52838e github.com/cri-o/cri-o/server.(*Server).ContainerStatus+0x7ee /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/server/container_status.go:104
# 0x55b00a54f105 github.com/cri-o/cri-o/server/cri/v1alpha2.(*service).ContainerStatus+0x25 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/server/cri/v1alpha2/rpc_container_status.go:14
# 0x55b00a3da1b7 k8s.io/cri-api/pkg/apis/runtime/v1alpha2._RuntimeService_ContainerStatus_Handler.func1+0x77 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/k8s.io/cri-api/pkg/apis/runtime/v1alpha2/api.pb.go:9082
# 0x55b009c5cd42 github.com/cri-o/cri-o/internal/log.UnaryInterceptor.func1+0xc2 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/internal/log/interceptors.go:56
# 0x55b00a550039 github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
# 0x55b00a316e11 github.com/cri-o/cri-o/server/metrics.UnaryInterceptor.func1+0xb1 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/server/metrics/interceptors.go:24
# 0x55b00a550039 github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
# 0x55b00a54fede github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1+0xbe /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34
# 0x55b00a3da077 k8s.io/cri-api/pkg/apis/runtime/v1alpha2._RuntimeService_ContainerStatus_Handler+0x137 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/k8s.io/cri-api/pkg/apis/runtime/v1alpha2/api.pb.go:9084
# 0x55b009886c0e google.golang.org/grpc.(*Server).processUnaryRPC+0xcce /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/google.golang.org/grpc/server.go:1282
# 0x55b00988a9e9 google.golang.org/grpc.(*Server).handleStream+0xa29 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/google.golang.org/grpc/server.go:1616
# 0x55b009884737 google.golang.org/grpc.(*Server).serveStreams.func1.2+0x97 /home/abuild/rpmbuild/BUILD/cri-o-1.23.4/vendor/google.golang.org/grpc/server.go:921
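In other words, this looks like reader/writer lock contention rather than a data race: the stop path holds the container's opLock for the whole (potentially huge) grace period in WaitContainerStop, while the status path blocks trying to take a read lock on the same mutex. Below is a minimal, self-contained Go sketch of that pattern; the type and method bodies are illustrative only and are not CRI-O's actual code.

package main

import (
    "fmt"
    "sync"
    "time"
)

// container stands in for cri-o's oci.Container: a single RWMutex (opLock)
// guards both state changes (stop) and state reads (status). Illustrative only.
type container struct {
    opLock sync.RWMutex
    state  string
}

// stop takes the write lock and holds it while waiting for the process to
// exit -- simulated here with a sleep, analogous to WaitContainerStop waiting
// out a large terminationGracePeriodSeconds.
func (c *container) stop(grace time.Duration) {
    c.opLock.Lock()
    defer c.opLock.Unlock()
    c.state = "stopping"
    time.Sleep(grace)
    c.state = "stopped"
}

// status only needs a read lock, but it cannot be acquired while stop holds
// the write lock, so the caller (ContainerStatus / crictl inspect) hangs.
func (c *container) status() string {
    c.opLock.RLock()
    defer c.opLock.RUnlock()
    return c.state
}

func main() {
    c := &container{state: "running"}

    // The StopContainer RPC issued for the Terminating pod.
    go c.stop(time.Hour)
    time.Sleep(100 * time.Millisecond) // let stop acquire the lock first

    // The ContainerStatus RPC issued by crictl inspect.
    done := make(chan string, 1)
    go func() { done <- c.status() }()

    select {
    case s := <-done:
        fmt.Println("status:", s)
    case <-time.After(2 * time.Second):
        fmt.Println("status call is still blocked on the container lock")
    }
}

Running this prints the "still blocked" message: the status call only returns once stop releases the write lock, which is exactly what crictl inspect experiences while the pod is Terminating.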
CRI-O and Kubernetes version
# crio --version
crio version 1.23.4
Version: 1.23.4
GitCommit: a6a1e6ebf4c4baca7ca7d57f069a4e0cd89056cd
GitTreeState: clean
BuildDate: 2022-10-21T16:26:30Z
GoVersion: go1.17.5
Compiler: gc
Platform: linux/amd64
Linkmode: dynamic
BuildTags: exclude_graphdriver_devicemapper, seccomp
SeccompEnabled: true
AppArmorEnabled: false
# kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.12", GitCommit:"c6939792865ef0f70f92006081690d77411c8ed5", GitTreeState:"clean", BuildDate:"2022-09-21T12:20:29Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.12", GitCommit:"c6939792865ef0f70f92006081690d77411c8ed5", GitTreeState:"clean", BuildDate:"2022-09-21T12:13:07Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}
OS version
# On Linux:
$ cat /etc/os-release
NAME="Oracle Linux Server"
VERSION="8.6"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="8.6"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Oracle Linux Server 8.6"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:8:6:server"
HOME_URL="https://linux.oracle.com/"
BUG_REPORT_URL="https://bugzilla.oracle.com/"
ORACLE_BUGZILLA_PRODUCT="Oracle Linux 8"
ORACLE_BUGZILLA_PRODUCT_VERSION=8.6
ORACLE_SUPPORT_PRODUCT="Oracle Linux"
ORACLE_SUPPORT_PRODUCT_VERSION=8.6
# uname -a
Linux sc4-r14-u10 4.18.0-372.32.1.0.1.el8_6.x86_64 #1 SMP Tue Oct 25 10:51:38 PDT 2022 x86_64 x86_64 x86_64 GNU/Linux
Additional environment details (AWS, VirtualBox, physical, etc.)
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 18 (9 by maintainers)
yes I can begin the process. the main version has merged, so I’ll start cherry-picking it back
to fix the stats piece though, you can set the stats_collection_period to something other than 0 and it will collect asynchronously. then, when you call crictl stats it will return cached information and not gather on demand, which requires taking the container’s opLock, which is held during termination
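For reference, that workaround is a CRI-O configuration change. A sketch of what it might look like as a crio.conf drop-in follows; only stats_collection_period is quoted from the comment above, while the [crio.stats] section name and the drop-in path are assumptions based on the CRI-O documentation, so verify them against your version.

# /etc/crio/crio.conf.d/01-stats.conf  (example drop-in path)
[crio.stats]
# 0 (the default) gathers stats on demand, which needs the container opLock;
# a positive value collects stats in the background every N seconds, so
# crictl stats returns cached data instead of blocking on a stopping container.
stats_collection_period = 10

Note that the comment scopes this workaround to crictl stats only; crictl inspect (ContainerStatus) still has to take the opLock held during termination.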