tests: CI: Sanity check fails after the CI run ends.
21:34:43 Ran 214 of 234 Specs in 1886.406 seconds
21:34:43 SUCCESS! -- 214 Passed | 0 Failed | 0 Pending | 20 Skipped PASS
21:34:43
21:34:43 Ginkgo ran 1 suite in 40m6.048808247s
21:34:43 Test Suite Passed
21:34:43 bash sanity/check_sanity.sh
21:34:44 ERROR: 4 pods left and found at /var/lib/vc/sbs
21:34:44 make: *** [Makefile:51: docker] Error 1
21:34:44 Build step 'Execute shell' marked build as failure
21:34:44 Performing Post build task...
21:34:44 Match found for :.* : True
21:34:44 Logical operation result is TRUE
21:34:44 Running script : #!/bin/bash
Context: https://github.com/kata-containers/proxy/pull/141#issuecomment-461093056
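The failing sanity check essentially asserts that no per-sandbox state is left on the host once the test suite finishes. Conceptually it does something like the sketch below; this is only an illustration, not the actual sanity/check_sanity.sh from the tests repository, and the path is taken from the error message above.

#!/bin/bash
# Illustration only: assert that no per-sandbox directories remain after the run.
sbs_dir=/var/lib/vc/sbs
left=$(sudo find "$sbs_dir" -mindepth 1 -maxdepth 1 -type d 2>/dev/null | wc -l)
if [ "$left" -ne 0 ]; then
    echo "ERROR: $left pods left and found at $sbs_dir"
    exit 1
fi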
On the Jenkins Slave I could find:
jenkins@kata1:~$ sudo ls -lrt /var/lib/vc/sbs
total 16
drwxr-x--- 3 root root 4096 Feb 6 21:06 0863fbd4febc35edcdccb3923ba5e95479865765d60edbf44d98df482ff64733
drwxr-x--- 3 root root 4096 Feb 6 21:11 8a17f2c890d363ac98839389588dad7bb071bf720510b8a79ab382da81ef8d9c
drwxr-x--- 3 root root 4096 Feb 6 21:11 e3bc0a6de9e1db3739ef400540360313d4abc7bf797f871343f18538075b49a7
drwxr-x--- 3 root root 4096 Feb 6 21:25 1a862a76df0752154f26bf8cd97d047685885971adec70b91769d6d1b39f8ca0
From the CI logs, some of these seem to originate from tests such as the following (cc @GabyCT):
docker run --cidfile /tmp/cid715320257/IWHOTNSj36wMY6ES2X9AMESZY30fN6 --runtime kata-runtime --name IWHOTNSj36wMY6ES2X9AMESZY30fN6 -dt busybox sh -c trap "exit 29" 29; echo TRAP_RUNNING; while :; do sleep 1; done
Side note: when I ran the test locally, I could not send it a kill signal (kill -s) as it had already exited. However, it did not leave any files in /var/lib/vc/sbs after exiting.
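For reference, a minimal local repro sketch is below. It assumes kata-runtime is configured as a Docker runtime and that /var/lib/vc/sbs is the sandbox state directory; the container name is just a placeholder, not part of the actual test harness.

#!/bin/bash
# Run a container similar to the one the CI test creates, remove it, and then
# verify that no sandbox directory was left behind.
docker run --runtime kata-runtime --name leak-check -dt busybox \
    sh -c 'trap "exit 29" 29; echo TRAP_RUNNING; while :; do sleep 1; done'
docker rm -f leak-check
# After the container is gone, this listing should be empty.
sudo ls -l /var/lib/vc/sbs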
I am opening this issue to document my findings, as I am not sure how to stop leaky pods in the CI runs.
About this issue
- State: closed
- Created 5 years ago
- Comments: 21 (21 by maintainers)
@nitkon Yes, I agree with you. If there is no VM to stop (i.e., we fail to establish a QMP connection) in stopSandbox, we should go ahead and clean up the rest of the sandbox artifacts on the host. This can happen if, for example, the qemu process crashed, and we should clean things up properly in such cases.
/me thinks @nitkon will need a 🔍 here 😉
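At the host level, the cleanup being discussed amounts to removing the per-sandbox state that a crashed or already-dead VM leaves behind. A hedged sketch of a manual cleanup is below; it is not the proposed stopSandbox fix itself, and the path and the qemu check are assumptions based on this issue.

#!/bin/bash
# Manual cleanup sketch: remove sandbox state directories that no running
# qemu process references any more. Illustration only.
for sbs in /var/lib/vc/sbs/*; do
    [ -d "$sbs" ] || continue
    id=$(basename "$sbs")
    # If no qemu command line mentions this sandbox ID, treat it as stale.
    if ! pgrep -af qemu | grep -q "$id"; then
        echo "removing stale sandbox state: $id"
        sudo rm -rf "$sbs"
    fi
done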
Right, np. Heh, when searching the logs you might have to do some deeper investigation… that UUID in the sbs dir is probably not the container or pod ID. But if you look inside that directory you may find something like a .json file, and inside that it might reference the container or pod UUID, which you can then track in the logs. Or you can look in the Jenkins attached artifacts, which I think contain the full debug logs from journald; those might give you, say, the qemu command line that uses that sbs UUID, and from that you can relate it back to which test was running. I know, slightly more complex and some digging, but I think it is the only way (as we do not print sbs UUIDs in the stdout logs etc.).
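A hedged example of the kind of digging described above, assuming the persisted .json state lives under the sandbox directory and that the runtime/qemu command lines end up in journald (both assumptions; the exact layout may differ):

#!/bin/bash
# Trace one leftover sandbox ID back to the test that created it.
sbs=1a862a76df0752154f26bf8cd97d047685885971adec70b91769d6d1b39f8ca0

# 1. Look for a persisted .json file that may reference the container/pod ID.
sudo find /var/lib/vc/sbs/"$sbs" -name '*.json' -exec cat {} \;

# 2. Search the journal for the sandbox ID, e.g. in the qemu command line
#    logged by the runtime, and correlate the timestamp with the CI test log.
sudo journalctl --no-pager | grep "$sbs"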