amazon-vpc-cni-k8s: Failing with SELinux Enabled
What happened:
RHEL 7 image with the EKS binaries installed. When joining the instance to the cluster, the aws-node
init container runs successfully because it runs in privileged mode, but the long-running daemon container fails to move and place files in the host directory.
# aws-node logs
"level":"info","ts":"2020-11-19T21:03:44.454Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
install: cannot remove '/host/opt/cni/bin/aws-cni': Permission denied
What you expected to happen:
# aws-node logs with privileged mode or running on non-selinux host
{"level":"info","ts":"2020-11-19T20:51:21.267Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2020-11-19T20:51:21.279Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}{"level":"info","ts":"2020-11-19T20:51:21.281Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}{"level":"info","ts":"2020-11-19T20:51:23.299Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
{"level":"info","ts":"2020-11-19T20:51:23.302Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}{"level":"info","ts":"2020-11-19T20:51:23.303Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}
How to reproduce it (as minimally and precisely as possible):
$ sestatus
SELinux status: enabled
SELinuxfs mount: /sys/fs/selinux
SELinux root directory: /etc/selinux
Loaded policy name: targeted
Current mode: enforcing
Mode from config file: enforcing
Policy MLS status: enabled
Policy deny_unknown status: allowed
Max kernel policy version: 31
Anything else we need to know?: When running in privileged mode, the DaemonSet functions properly:
securityContext:
  privileged: true
  capabilities:
    add:
      - NET_ADMIN
If the mounted host directories are configured with the container_file_t label (a relabeling sketch follows the log output below), then the CNI is able to copy the files but is never able to communicate with the ipamd agent:
{"level":"info","ts":"2020-11-19T20:51:21.267Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2020-11-19T20:51:21.279Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}{"level":"info","ts":"2020-11-19T20:51:21.281Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
Environment:
- Kubernetes version: v1.18.8-eks-7c9bda
- CNI Version: v0.8.6
- OS: RHEL 7.9
- Kernel: Linux ip-192-168-104-223.us-east-2.compute.internal 3.10.0-1160.6.1.el7.x86_64
About this issue
- State: closed
- Created 4 years ago
- Comments: 15 (7 by maintainers)
Quick update: I am able to replicate the issue on AL2 with selinux-enabled set on the Docker daemon. Things I have done so far to replicate it on AL2: enabled SELinux support on the Docker daemon (sketched below). After that, I see the aws-node DaemonSet crash with the same permissions issue.
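For anyone reproducing this: selinux-enabled is a documented Docker daemon option; the exact daemon.json contents below are an assumed minimal sketch.

# /etc/docker/daemon.json (assumed minimal contents for the repro)
{
  "selinux-enabled": true
}

# restart the daemon to pick up the change
sudo systemctl restart docker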
Here, the host SELinux config isn't to blame. The aws-node pod starts and, as part of the entrypoint script, installs the CNI binary and copies the 10-aws.conflist file; the written files carry the SELinux context of that container (a random MCS pair). This breaks subsequent container starts, since the host MCS pair is associated with a different user/group (probably)?
A quick workaround is to set the spc_t type in selinuxOptions, or to run the containers with privileged: true.
However, if you run as spc_t by default, you'll break current Bottlerocket releases, since we don't define that label.
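For reference, a minimal sketch of the spc_t workaround on a container spec (seLinuxOptions and type are the standard Kubernetes SecurityContext fields; wiring this into the aws-node manifest itself is assumed):

securityContext:
  seLinuxOptions:
    type: spc_t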
Very roughly, SELinux is about answering questions like "can <subject> do <action> to <object>?", where subjects are process labels like container_t and objects are files with labels like container_file_t. So the issue is that your CNI pod has the container_t subject label and is trying to create (or move into) a file in a directory with the object label usr_t, or something like that. And the policy is set up so container_t can't do most file actions on usr_t. spc_t fixes it by changing the subject to one that's allowed to do most actions. Relabeling the directory so it's container_file_t instead of usr_t fixes it by changing the object to one that container_t subjects are allowed to act on.

I am still working on the issue (with internal teams) to come up with a better solution and will update next steps here!
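Following up on the subject/object explanation above, a sketch of how to inspect those labels on a node and make a relabel persistent (paths as earlier in this issue; on RHEL 7, semanage comes from policycoreutils-python):

# object labels on the host directories the CNI writes to
ls -Z /opt/cni/bin /etc/cni/net.d

# subject label of the running CNI process (process name is an assumption)
ps -eZ | grep aws

# persistent relabel: record the context, then apply it
sudo semanage fcontext -a -t container_file_t "/opt/cni/bin(/.*)?"
sudo restorecon -R -v /opt/cni/bin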
Our issue stemmed from comparing two "like" nodes and getting different results. Both were using SELinux enforcing, but we failed to catch that one node had selinux-enabled in Docker and the other did not (I should have read @nithu0115's response more closely on how he replicated). We are good now.
@nithu0115 - Reopening the issue since it is seen with AL2 as well.