rke: Canal containers give selinux related error message

RKE version: 0.3.0

Docker version: (docker version, docker info preferred)

Client: Docker Engine - Community
 Version:           19.03.3
 API version:       1.39 (downgraded from 1.40)
 Go version:        go1.12.10
 Git commit:        a872fc2f86
 Built:             Tue Oct  8 00:58:10 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.1
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.6
  Git commit:       4c52b90
  Built:            Wed Jan  9 19:06:30 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Docker daemon.json:

{
  "selinux-enabled": true,
  "userland-proxy": false,
  "bip": "10.10.0.1/24",
  "fixed-cidr": "10.10.0.1/24"
}

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

NAME="Red Hat Enterprise Linux"
VERSION="8.0 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.0"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.0:GA"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.0
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.0"

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) Doesn’t matter

cluster.yml file:

cluster_name: name

nodes:
  - address: node1
    user: user
    ssh_key_path: /home/user/.ssh/id_rsa
    role: [controlplane,etcd,worker]
  - address: node2
    user: user
    ssh_key_path: /home/user/.ssh/id_rsa
    role: [controlplane,etcd,worker]
  - address: node3
    user: user
    ssh_key_path: /home/user/.ssh/id_rsa
    role: [controlplane,etcd,worker]

private_registries:
  - url: internal-registry
    is_default: true # All system images will be pulled using this registry. 

services:
  etcd:
    snapshot: true
    creation: 6h
    retention: 24h

Steps to Reproduce: rke up

When the cluster is built, I see problems with the canal pods:

kubectl -n kube-system get pods
NAME                                      READY   STATUS                  RESTARTS   AGE
canal-9vg2d                               1/2     Running                 0          45h
canal-ftfrv                               0/2     Init:CrashLoopBackOff   197        16h
canal-l5g2d                               2/2     Running                 0          147m
coredns-5c98fc7769-wbscd                  0/1     CrashLoopBackOff        487        45h
coredns-autoscaler-64c857cf7-qgqwc        1/1     Running                 0          167m
metrics-server-7cf4dfc846-2vvbl           1/1     Running                 34         167m
rke-coredns-addon-deploy-job-kn952        0/1     Completed               0          45h
rke-ingress-controller-deploy-job-f29cv   0/1     Completed               0          45h
rke-metrics-addon-deploy-job-hfsxx        0/1     Completed               0          45h
rke-network-plugin-deploy-job-lfnj4       0/1     Completed               0          45h

Looking into the install-cni init container, I see this error message:

mv: inter-device move failed: '/calico.conf.tmp' to '/host/etc/cni/net.d/10-canal.conflist'; unable to remove target: Permission denied
Failed to mv files. This may be caused by selinux configuration on the host, or something else.
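
To confirm that SELinux is what blocks the move, one can check the enforcement mode and look for recent AVC denials on the affected host (a diagnostic sketch; ausearch is provided by the audit package on RHEL):

```shell
# Current SELinux mode: Enforcing, Permissive, or Disabled
getenforce

# Recent SELinux denials; a denial touching /etc/cni/net.d would
# confirm the "Permission denied" above is SELinux-related
ausearch -m avc -ts recent
```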

Results: The cluster doesn’t work properly. Setting SELinux to permissive is not really an option.

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 15 (6 by maintainers)

Most upvoted comments

From the discussion in projectcalico/calico#2704 it seems that

securityContext:
  privileged: true

is needed in order to properly handle SELinux systems. Thus, I edited the running canal daemonset with kubectl -n kube-system edit daemonset/canal and added those lines to the init container named install-cni.

After saving, the pods immediately reached the running state, and no more errors were logged. Maybe this suggests that those lines are missing in the template?
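
The manual edit above can also be applied as a one-off patch (a sketch; it assumes install-cni is the first entry in initContainers, as in the stock canal manifest):

```shell
# Add privileged: true to the install-cni init container of the
# canal DaemonSet (assumption: install-cni is initContainers[0])
kubectl -n kube-system patch daemonset canal --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/initContainers/0/securityContext",
   "value": {"privileged": true}}
]'
```

After the patch, the DaemonSet controller rolls the canal pods, so the result should match the kubectl edit described above.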

Success using v1.1.0-rc11 and K8s 1.15.10-rancher1-2 on CentOS 7.7 with enforcing SELinux! Note, however:

  • I had to use the CoreDNS image from v1.16.7-rancher1-2, rancher/coredns-coredns:1.6.2, instead of rancher/coredns-coredns:1.3.1, because the latter was failing with an error (the pod logs reported that the --nodelabel option was incorrect; I assume it was introduced later.)
  • I had to specify the calico_flexvol and canal_flexvol images in the config.yml because the nodes were trying to get them from the Internet, not sure why (that failed because this is an air-gapped setup.) I used the values from v1.16.7-rancher1-2.
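
For reference, the overrides described above would look roughly like this in cluster.yml (a hypothetical fragment: the coredns tag is the one quoted above, while the flexvol tags must be taken from the v1.16.7-rancher1-2 system-images list and are left as placeholders here):

```yaml
system_images:
  coredns: rancher/coredns-coredns:1.6.2
  canal_flexvol: rancher/calico-pod2daemon-flexvol:<tag-from-v1.16.7-rancher1-2>
  calico_flexvol: rancher/calico-pod2daemon-flexvol:<tag-from-v1.16.7-rancher1-2>
```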

Next test is upgrading from 1.15.5 to 1.15.10, and will report back in this very comment to avoid further noise.

EDIT: A cluster upgrade into 1.15.10 from 1.15.5 was successful! The canal pods are privileged and running properly.

@carloscarnero If you can test this change on some lab machines which are identical to the ones that were exhibiting the problem, that would be appreciated

@superseb I’m not clear what I should test. I mean… should I use rke v1.1.0-rc11? If that’s the case, should I test against one of that version’s supported K8s?

EDIT: based on the previous comment, I will test with v1.1.0-rc11 and K8s 1.15.10-rancher1-2. The operating system is CentOS 7.7, completely updated, with SELinux enabled and enforcing. This will take some time because all my setups are air-gapped and I have to prime the internal registry.

@leodotcloud I have tried the fix above in another different cluster, and it seems to work.

While trying to reproduce the problem on a couple of different cloud providers, I see that the ip_tables module is not loaded by default on RHEL 8/CentOS 8 VMs.

[root@ip-172-31-16-240 ~]# lsmod | grep ip_tables
[root@ip-172-31-16-240 ~]#

This causes problems with the install. Running modprobe ip_tables loads the module, and the installation then goes through fine with SELinux set to ‘Enforcing’.
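
The check and fix can be sketched as follows (run as root; persisting the module via modules-load.d is an assumption, not something stated above):

```shell
# Load ip_tables now if it is not already present
lsmod | grep -q '^ip_tables' || modprobe ip_tables

# Persist the module across reboots (systemd reads /etc/modules-load.d/)
echo ip_tables > /etc/modules-load.d/ip_tables.conf
```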

@nheinemans and @carloscarnero could you check if this step resolves your problem?