rancher: Cannot execute shell on CentOS 7.2

Rancher Version: v1.2.0
Docker Version: 1.12.3
OS and where are the hosts located? (cloud, bare metal, etc): CentOS 7.2, DigitalOcean (but also on-prem)
Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB) single node rancher, internal DB
Environment Type: (Cattle/Kubernetes/Swarm/Mesos) Cattle
Steps to Reproduce:

  • Install Rancher by starting the rancher/server container, then join a host by starting a rancher/agent container
  • Wait for the infrastructure stack to become active, then use Execute Shell on the scheduler container

Results: Popup opens and closes immediately
Expected: Popup with a running shell

More info: I tried to debug this today. I haven't found the root cause yet, but I'm filing a ticket early. I tried to reproduce it on Ubuntu 16.04, and it works fine there. This is what I get on the command line:

[root@centos-02 ~]# docker ps | grep scheduler
a49f85151ae2        rancher/scheduler:v0.4.0                       "/.r/r scheduler"        14 minutes ago      Up 14 minutes                               r-scheduler-scheduler-1-5ed5a22e
[root@centos-02 ~]# docker exec -ti a49 bash
rpc error: code = 13 desc = invalid header field value "oci runtime error: exec failed: cannot exec a container that has run and stopped\n"

This was fixed in https://github.com/docker/docker/issues/27540 but still appears in this situation. Digging a little deeper, I found that only containers with a Path of /.r/r seem to be affected. On a clean install, these are:

22b7df37bf0c        rancher/healthcheck:v0.1.0                     "/.r/r /tini -- healt"   14 minutes ago      Up 14 minutes                           r-healthcheck-healthcheck-1-3ccae477
a49f85151ae2        rancher/scheduler:v0.4.0                       "/.r/r scheduler"        15 minutes ago      Up 15 minutes                           r-scheduler-scheduler-1-5ed5a22e
701043eecb1c        rancher/net:v0.7.5                             "/.r/r start.sh"         15 minutes ago      Up 13 minutes                           r-ipsec-ipsec-1-48f985f7
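To find the affected containers without reading the COMMAND column by eye, something like the following could be used. This is only a sketch: the function names ids_with_path and list_wrapped_containers are made up for illustration, and it assumes the docker CLI is on the PATH and that the Path field reported by docker inspect is what distinguishes the affected containers.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of Rancher or Docker.

# Read "<id> <path>" pairs on stdin and print the IDs whose path equals $1.
ids_with_path() {
  awk -v want="$1" '$2 == want { print $1 }'
}

# Print the IDs of running containers whose entrypoint Path is /.r/r.
list_wrapped_containers() {
  docker ps -q | while read -r id; do
    echo "$id $(docker inspect --format '{{.Path}}' "$id")"
  done | ids_with_path "/.r/r"
}
```

Running list_wrapped_containers as root on the host should print one short container ID per line for the healthcheck, scheduler and ipsec containers listed above, if the assumption about Path holds.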

If you diagnose this the same way as in the Docker ticket, you get the same results. For instance, for the scheduler:

[root@centos-02 ~]# docker ps | grep scheduler
a49f85151ae2        rancher/scheduler:v0.4.0                       "/.r/r scheduler"        17 minutes ago      Up 17 minutes                           r-scheduler-scheduler-1-5ed5a22e
[root@centos-02 ~]# ps -ef | grep a49f
root      3126  9121  0 19:40 pts/0    00:00:00 grep --color=auto a49f
root     11291  9280  0 19:23 ?        00:00:00 docker-containerd-shim a49f85151ae25475d1d77945f03e5e1862f1d964b1d5f03b02ea61b31af03628 /var/run/docker/libcontainerd/a49f85151ae25475d1d77945f03e5e1862f1d964b1d5f03b02ea61b31af03628 docker-runc
[root@centos-02 ~]# docker-runc state a49f85151ae25475d1d77945f03e5e1862f1d964b1d5f03b02ea61b31af03628
{
  "ociVersion": "1.0.0-rc2-dev",
  "id": "a49f85151ae25475d1d77945f03e5e1862f1d964b1d5f03b02ea61b31af03628",
  "pid": 0,
  "status": "stopped",
  "bundle": "/run/docker/libcontainerd/a49f85151ae25475d1d77945f03e5e1862f1d964b1d5f03b02ea61b31af03628",
  "rootfs": "/var/lib/docker/devicemapper/mnt/e506c366628e380183932d047e7d4b9e60af64d59e3ab977cf13576df7c1ae17/rootfs",
  "created": "2016-12-01T19:23:21.019392097Z"
}
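The manual check above (Docker says "Up", runc says "stopped") can be repeated for every running container with a small script. This is a rough sketch, not a supported tool: runc_status and check_containers are hypothetical names, and it assumes docker and docker-runc are on the PATH and that docker-runc state works with the full container ID, as it does in the output above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Extract the "status" value from `docker-runc state` JSON read on stdin.
runc_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# For each container Docker lists as running, print "<id> <runc status>"
# when runc does not also report it as running.
check_containers() {
  local id status
  for id in $(docker ps -q --no-trunc); do
    status=$(docker-runc state "$id" 2>/dev/null | runc_status)
    if [ "$status" != "running" ]; then
      echo "$id ${status:-unknown}"
    fi
  done
}
```

Calling check_containers as root prints nothing on a healthy host. On the CentOS 7.2 host described here it would be expected to list the scheduler container with the status "stopped".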

I haven't had time to dig deeper. I need a way to reproduce this on vanilla Docker before I can open a new issue there; maybe someone can help out.

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 26 (9 by maintainers)

Most upvoted comments

@superseb @tobowers For docker@1.3.0, we can update the kernel to 3.10.0-514.6.1.el7.x86_64 first; that works for me.

For the sake of completeness, I can successfully open a shell with Rancher v1.2.2 on CentOS Linux release 7.3.1611 (Core) with the stock kernel 3.10.0-514.2.2.el7.x86_64.

OK, a quick update on something I just thought of checking, since this only happens on CentOS 7. Because there seems to be some mismatch in /proc, possibly in combination with runC lookups, I wondered about the kernel version. CentOS/RHEL keep the base kernel version the same and backport updates. So I replaced the 3.10 kernel with kernel-ml from elrepo, which provides 4.8.12-1.el7.elrepo.x86_64, and rebooted; this fixed the problem. Containers I cannot exec into on 3.10, I can exec into on 4.8.12-1.el7.elrepo.x86_64. I know this is not a real solution for production machines, but it might help in debugging to find the root cause.
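The kernel reports in these comments can be turned into a quick host check. The sketch below is a heuristic built only from what is reported in this thread: it assumes that 3.10.0 kernels with a build number of 514 or higher, and any 4.x or later kernel, behave like the working ones mentioned above. That threshold is an assumption, not a confirmed root cause, and kernel_looks_ok is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: the 514 threshold is inferred from reports in this thread.

# Return 0 if the given kernel release string resembles one reported as working.
kernel_looks_ok() {
  local release=$1 base build
  base=${release%%-*}     # e.g. 3.10.0
  build=${release#*-}     # e.g. 514.6.1.el7.x86_64
  build=${build%%.*}      # e.g. 514
  case "$base" in
    3.10.0)
      # Reject anything that is not a plain build number.
      case "$build" in
        ''|*[!0-9]*) return 1 ;;
      esac
      [ "$build" -ge 514 ]
      ;;
    *)
      # Treat any 4.x or newer kernel as working.
      [ "${base%%.*}" -ge 4 ] 2>/dev/null
      ;;
  esac
}
```

A host could then be checked with: kernel_looks_ok "$(uname -r)" || echo "kernel may be affected".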