rancher: Rancher Cattle Cluster Agent Could not Resolve Host

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (least amount of steps as possible):

  1. Provision a new HA cluster using RKE. The cluster.yml file should only have the nodes stanza, with 3 nodes and all roles assigned.
  2. Delete one of the 3 rancher pods.

root@massimo-server:~# kubectl delete pod/rancher-6dc68bb996-95rbw -n cattle-system

Result:

The rancher pod is recreated due to the ReplicaSet but the cattle cluster agent fails and goes into CrashLoopBackOff state.

root@massimo-server:~# kubectl logs --follow \
  pod/cattle-cluster-agent-7bcbf99f56-vdrbs -n cattle-system

INFO: Environment: CATTLE_ADDRESS=10.42.1.5 CATTLE_CA_CHECKSUM= CATTLE_CLUSTER=true CATTLE_INTERNAL_ADDRESS= CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-7bcbf99f56-vdrbs CATTLE_SERVER=https://massimo.rnchr.nl
INFO: Using resolv.conf: nameserver 10.43.0.10 search cattle-system.svc.cluster.local svc.cluster.local cluster.local options ndots:5
ERROR: https://massimo.rnchr.nl/ping is not accessible (Could not resolve host: massimo.rnchr.nl)

Using curl from outside of the cattle agent pod

root@massimo-server:~# curl https://massimo.rnchr.nl/ping
pong

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): rancher/rancher v2.1.1
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): RKE provisioned
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): cloud, 2 CPU/4 GB
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-24T06:54:59Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T17:53:03Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
  • Docker (use docker info):
Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 1
Server Version: 17.03.2-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 5
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-138-generic
Operating System: Ubuntu 16.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.859 GiB
Name: massimo-server
ID: 747D:4N33:DFHF:IWOG:Z3RK:G22I:3663:ZSHQ:5HKF:SOGK:M3V4:EKQA
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

gz#11051

About this issue

  • State: open
  • Created 6 years ago
  • Reactions: 50
  • Comments: 133 (1 by maintainers)

Most upvoted comments

Same problem, 7 clusters all set up in the same way, but 1 cluster gets lookup errors. Solved this by editing the deployment and changing the DNS policy from ClusterFirst to Default:

kubectl edit deployment cattle-cluster-agent -n cattle-system
...
dnsPolicy: Default
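
For reference, the same change can be applied without an interactive edit; a minimal sketch of the equivalent patch, assuming the deployment lives in the cattle-system namespace:

kubectl -n cattle-system patch deployment cattle-cluster-agent \
  --patch '{"spec": {"template": {"spec": {"dnsPolicy": "Default"}}}}'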

Deleting the cattle cluster agent pod fixes it. Why does the cattle cluster agent have issues resolving the host when a rancher pod is recreated?

Unfortunately this issue is still there in 2.4.5

I am managing 3 clusters (1 is local) using rancher and 2 out of 3 had this problem. I didn’t see this issue on my development cluster but both local and production had this issue (both of them are on AWS).

The workaround that we applied is to edit the “cattle-cluster-agent” workload and add a host alias that maps your Rancher URL to its local IP:

[screenshot]
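
For readers who cannot see the screenshot, the edit amounts to adding a host alias to the pod spec; a sketch, where rancher.example.com and 192.168.1.10 are placeholder values for your Rancher URL and its local IP:

spec:
  template:
    spec:
      hostAliases:
        - ip: "192.168.1.10"           # placeholder: local/internal IP of the Rancher endpoint
          hostnames:
            - "rancher.example.com"    # placeholder: your Rancher server hostname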

This worked like a charm for us, and no more failures/restarts of the cluster-agent were observed after that.

Hope this helps you too!

I also have run into this issue after performing a restore from etcd snapshot. For me, cattle-cluster-agent is fine during cluster initialization, but after restoring from etcd snapshot, the cattle-cluster-agent will not be able to resolve the hostname.

It’s a bit confusing because I am unsure how it was able to do so during cluster initialization. It is as if the agent uses Default dnsPolicy during cluster initialization but then afterward, if it is ever rebooted, it resorts to the default configuration of ClusterFirst dnsPolicy.

As a workaround, I edit the cattle-cluster-agent workload after initializing such that my dns nameservers are configured (could also just set dnsPolicy to Default)
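
A minimal sketch of that pod-spec edit, assuming a placeholder upstream nameserver address:

      dnsPolicy: "None"
      dnsConfig:
        nameservers:
          - 10.0.0.2    # placeholder: your upstream DNS server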

Would be nice to see this as a possible configuration when configuring cluster at initialization.

I have been having this issue as well. I think I have worked out why it is happening and have found two temporary fixes for it. The real solution will be for Rancher to change their deployment logic or to not specify a default cluster domain name.

This issue, at least for me, has been caused by specifying a domain in the RKE config file like:

services:
  kubelet:
    cluster_domain: mycluster.com

The problem comes down to the fact that the cattle-cluster-agent pod is querying kube-dns to resolve the name of your Rancher server (let us say it is rancher.mycluster.com). kube-dns does not have an entry for rancher.mycluster.com and, thanks to the cluster_domain specification in the RKE config file, believes it is authoritative for mycluster.com, so it answers cattle-cluster-agent that there are no records for that name. This causes the cattle-cluster-agent pod to bootloop and fail.
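
One way to confirm this from inside the cluster is to query kube-dns from a throwaway pod; a sketch, with rancher.mycluster.com as the placeholder name:

kubectl run -it --rm dnstest --image=busybox:1.36 --restart=Never -- nslookup rancher.mycluster.com

If kube-dns believes it is authoritative for mycluster.com, this returns NXDOMAIN even though external resolvers answer the same query fine.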

To fix this you can change the YAML for cattle-cluster-agent and either edit the dnsPolicy or the value the CATTLE_SERVER variable provides to the pod.

dnsPolicy: If you change the dnsPolicy from ClusterFirst to Default, the pod will go to external DNS servers (not kube-dns) for name resolution. This makes resolution of rancher.mycluster.com work; however, it will break attempts by the cattle-cluster-agent pod to contact other pods (I do not know if it needs to do so).

CATTLE_SERVER: The other fix is to edit the YAML entry for CATTLE_SERVER and change the value from https://rancher.mycluster.com to https://rancher. This makes the pod send a request to kube-dns for resolution of “rancher”, which it does have an entry for, as that is the service name which is inserted into kube-dns thanks to the magic of Kubernetes, resulting in a working pod.
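
A sketch of applying that change without opening an editor, assuming the default cattle-system namespace:

kubectl -n cattle-system set env deployment/cattle-cluster-agent CATTLE_SERVER=https://rancher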

Long term, Rancher should probably make the second fix presented here the default value for CATTLE_SERVER. That should prevent this issue from occurring.

The following is reposted from this issue: #18832 (comment)

Adding host alias to the cattle agent can be done in this way:


kubectl -n cattle-system patch  deployments cattle-cluster-agent --patch '{
    "spec": {
        "template": {
            "spec": {
                "hostAliases": [
                    {
                      "hostnames":
                      [
                        "{{ rancher_server_hostname }}"
                      ],
                      "ip": "{{ rancher_server_ip }}"
                    }
                ]
            }
        }
    }
}'

kubectl -n cattle-system patch daemonsets cattle-node-agent --patch '{
    "spec": {
        "template": {
            "spec": {
                "hostAliases": [
                    {
                      "hostnames":
                      [
                        "{{ rancher_server_hostname }}"
                      ],
                      "ip": "{{ rancher_server_ip }}"
                    }
                ]
            }
        }
    }
}'

I almost gave up, this saved me as well, thanks!

Solved this by editing the deployment, setting hostNetwork to true and changing the DNS policy from ClusterFirst to ClusterFirstWithHostNet:

kubectl edit deployment cattle-cluster-agent -n cattle-system

hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet
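
The same change expressed as a patch, for reference (a sketch):

kubectl -n cattle-system patch deployment cattle-cluster-agent \
  --patch '{"spec": {"template": {"spec": {"hostNetwork": true, "dnsPolicy": "ClusterFirstWithHostNet"}}}}'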

I have the same issue on some clusters while others work perfectly fine, even though they are set up exactly the same and use the same rancher2 server. The server is located externally, using a normal, official domain name (Google's 8.8.8.8 resolves it, for example). cattle-cluster-agent still cannot resolve the hostname, while the cluster-internal kube-dns can resolve it without problems, as can the host system. When I switch dnsPolicy to "None" and manually add an external DNS like this, everything works fine:

dnsPolicy: "None"
      dnsConfig:
        nameservers:
          - 8.8.8.8
          - 8.8.4.4

I’ve also noticed that Events shows (no matter if it’s working or not) "Scaled down replica set cattle-cluster-agent-128cfc8df7 to 0", although the state is “running” and I can execute a shell in the container. So it’s not scaled to 0.

The following is reposted from https://github.com/rancher/rancher/issues/18832#issuecomment-547728856 (the same hostAliases patch for cattle-cluster-agent and cattle-node-agent quoted earlier in this thread).

Same problem, 7 clusters all set up in the same way, but 1 cluster gets lookup errors. Solved this by editing the deployment and changing the DNS policy from ClusterFirst to Default:

kubectl edit deployment cattle-cluster-agent -n cattle-system
...
dnsPolicy: Default

Thank you, bro, that really worked.

Same problem here with an imported EKS cluster: Rancher 2.4.12, Ubuntu 20, Docker 19.03.8. Log for cattle-cluster-agent:

INFO: Environment: CATTLE_ADDRESS=10.50.6.156 CATTLE_CA_CHECKSUM= CATTLE_CLUSTER=true CATTLE_FEATURES= CATTLE_INTERNAL_ADDRESS= CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-754b8c6c8d-gq8wb CATTLE_SERVER=https://rch2.vetpro.us
INFO: Using resolv.conf: nameserver 172.20.0.10 search cattle-system.svc.cluster.local svc.cluster.local cluster.local vetpro.us options ndots:5
ERROR: https://rch2.vetpro.us/ping is not accessible (Could not resolve host: rch2.vetpro.us)

I managed to solve the problem by doing the following procedure.

systemctl disable firewalld
systemctl stop firewalld
iptables -P FORWARD ACCEPT
modprobe br_netfilter
sysctl net.ipv4.ip_forward
sysctl net.bridge.bridge-nf-call-iptables=1
sysctl --system
sudo iptables -P FORWARD ACCEPT

Set the parameters in /etc/sysctl.d/50-docker-forward.conf:

net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1

for mod in ip_tables ip_vs_sh ip_vs ip_vs_rr ip_vs_wrr; do sudo modprobe $mod; echo $mod | sudo tee -a /etc/modules-load.d/iptables.conf; done

sudo systemctl enable network
sudo systemctl disable NetworkManager

reboot;

This is my fix.

Rancher server: 192.168.110.210 rancher.t.org

[screenshots]

Click the Edit button.

[screenshot]

Click the Save button, then return to the Workload: cattle-cluster-agent.

[screenshot]

Success.

Hi, having the same issue on CentOS Linux release 7.7.1908 (Core). The nameserver in the cattle-cluster-agent is set to 10.43.0.10 when using ClusterFirst.

Kernel version: 3.10.0-1062.9.1.el7.x86_64

RKE version: v1.0.4

Rancher version: v2.3.5

Kubernetes version:

Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:07:13Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}

Docker version:

Client: Docker Engine - Community
 Version:           19.03.5
 API version:       1.40
 Go version:        go1.12.12
 Git commit:        633a0ea
 Built:             Wed Nov 13 07:25:41 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.5
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.12
  Git commit:       633a0ea
  Built:            Wed Nov 13 07:24:18 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.10
  GitCommit:        b34a5c8af56e510852c35414db4c1f4fa6172339
 runc:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Logs cattle-cluster-agent:

INFO: Using resolv.conf: nameserver 10.43.0.10 search cattle-system.svc.cluster.local svc.cluster.local cluster.local options ndots:5
ERROR: https://rancher-tooling/ping is not accessible (Could not resolve host: rancher-tooling)

Workaround: changed the dnsPolicy on the cattle-cluster-agent deployment. From:

dnsPolicy: ClusterFirst

To:

dnsPolicy: Default

Regards Matthias

I had exactly the same issue with Ubuntu server 18.04 LTS and docker 18.06.1-ce. Cluster was created with rke. Deleting and recreating the cattle-cluster-agent pod did not help.

I set the CATTLE_SERVER env of cattle-cluster-agent to the configured Rancher ingress service, so kube-dns can resolve it: rancher.cattle-system.svc.cluster.local, i.e. <service>.<namespace>.svc.<cluster-domain>. Afterwards the pod started, but I don’t know if this is correct. Normally you would go over the external LB, but in my case this did not work because my cluster DNS domain overlaps with the DNS domain outside in the real network…
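
For reference, a sketch of setting that variable, assuming the Rancher service is named rancher in the cattle-system namespace:

kubectl -n cattle-system set env deployment/cattle-cluster-agent \
  CATTLE_SERVER=https://rancher.cattle-system.svc.cluster.local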

Solved this by editing the deployment, setting hostNetwork to true and changing the DNS policy from ClusterFirst to ClusterFirstWithHostNet:

kubectl edit deployment cattle-cluster-agent -n cattle-system

hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet

works for me:

dnsPolicy: Default
hostNetwork: true

Thanks !!

Hi all, I faced the same issue and resolved it by going to the “cattle-system” namespace in the Kubernetes cluster you provisioned, editing the deployment “cattle-cluster-agent”, clicking Show advanced options, expanding Networking, and under "Host Aliases (/etc/hosts entries)" entering the IP address (the load balancer IP or any node IP of the RKE cluster on which the Rancher server is deployed) and the DNS name of your Rancher server endpoint.

The “cattle-cluster-agent” will change status from CrashLoopBackOff to Running. [screenshots]

When I deployed a workload with a NodePort, the NodePort wasn’t generated because the “cattle-cluster-agent” wasn’t in running status. As soon as I made the above change, the NodePort was immediately visible under “netstat -tnlup”.

Attached a couple of screenshots that may help you.

I have found a way to apply the ‘hack’ without the use of kubectl.

  1. SSH to your masters and find the one that is attempting to run k8s_cluster-register_cattle-cluster-agent-****** using docker ps -a.
  2. Run docker container inspect <container> and look for this segment (a shortcut is shown after this list):
      "Mounts": [
           .....
           {
               "Type": "bind",
               "Source": "/var/lib/kubelet/pods/f179aff5-b149-4cbc-8d0f-09f1ca0d54e7/etc-hosts",
               "Destination": "/etc/hosts",
               "Mode": "Z",
               "RW": true,
               "Propagation": "rprivate"
           },
     .....
  3. Open that source file with a text editor and add the host entry for your rancher.example.com.

  4. Restart the container. At this point, your cluster should continue putting itself together.
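
As a shortcut for step 2, the hosts-file path can be read directly from the container metadata; a sketch, where <container> is the ID from docker ps -a:

docker inspect --format '{{ .HostsPath }}' <container>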

But this does not address the underlying issue regarding CoreDNS. What I am currently testing is adding a config to CoreDNS which declares our DNS server in it, and it seems to pick up the slack.

But some internal communications remain broken. For instance, for a Redis deployment the replica cannot connect to the Redis master over ClusterIP. The cause of my woes is attempting to get iptables+firewalld to behave with k8s 1.19 and CentOS 8.3. Your issue may be different.

TLDR; the ‘hack’ only addresses a side effect of the root cause, not the actual issue itself, unfortunately 😢

I got the same error where it could not resolve the host. However in my situation it happens during cluster creation, so I cannot use kubectl since the cluster is not ready yet.

What can I do to fix this issue (without kubectl)?

Unbelievable that this is not resolved yet. I faced this issue more than a year back (I don’t remember how I resolved it back then). Now I am installing Rancher 2.4 on a brand new Windows 10 (docker-desktop) and I’m facing the exact same issue.

I’m installing using Helm, and I can’t find a way to apply the workaround:

Workaround: changed the dnsPolicy on the cattle-cluster-agent deployment. From:

dnsPolicy: ClusterFirst

To:

dnsPolicy: Default

same issue

Rancher 2.4.4, Kubernetes v1.18.3, minikube 1.11, Docker 19.03.8

Could this be related to the CNI used in the downstream clusters?

I’m running Rancher 2.7.0 with Docker Compose:

version: '3.8'

services:
  rancher:
    privileged: true
    container_name: rancher
    image: rancher/rancher
    ports:
      - "80:80"
      - "443:443"
    environment:
      - TZ=Europe/Paris
    volumes:
      - rancher_cni:/var/lib/cni
      - rancher_kubelet:/var/lib/kubelet
      - rancher_data:/var/lib/rancher
      - rancher_logs:/var/log
    restart: unless-stopped

volumes:
  rancher_cni:
  rancher_kubelet:
  rancher_data:
  rancher_logs:

My test environment is based on K3s clusters:

  • Flannel (K3s’ default) CNI
# curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=stable INSTALL_K3S_VERSION=v1.24.7+k3s1 \
        K3S_TOKEN="*************" sh -s - server --cluster-init

No issue when importing the cluster into Rancher.

  • Antrea (1.9.0) CNI
# curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=stable INSTALL_K3S_VERSION=v1.24.7+k3s1 \
        K3S_TOKEN="*************" sh -s - server --cluster-init \
        --flannel-backend=none --disable-network-policy
# helm repo add antrea https://charts.antrea.io
# helm repo update antrea
# helm install antrea antrea/antrea --namespace kube-system

No issue when importing the cluster into Rancher.

  • Cilium (1.12.4) CNI
# curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=stable INSTALL_K3S_VERSION=v1.24.7+k3s1 \
        K3S_TOKEN="*************" sh -s - server --cluster-init \
        --flannel-backend=none --disable-network-policy
# helm repo add cilium https://helm.cilium.io
# helm repo update cilium
# helm install cilium cilium/cilium --set operator.replicas=1

No issue when importing the cluster into Rancher.

  • Calico (3.24.5) CNI
# curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=stable INSTALL_K3S_VERSION=v1.24.7+k3s1 \
        K3S_TOKEN="*************" sh -s - server --cluster-init \
        --flannel-backend=none --disable-network-policy
# helm repo add projectcalico https://projectcalico.docs.tigera.io/charts
# helm repo update projectcalico
# kubectl create namespace tigera-operator
# helm install calico projectcalico/tigera-operator --namespace tigera-operator

When importing the cluster into Rancher:

INFO: Using resolv.conf: search home.lab nameserver 192.168.3.254
ERROR: https://rancher.foo.com/ping is not accessible (Could not resolve host: rancher.foo.com)

Resolved by adding this to the cattle-cluster-agent deployment:

hostNetwork: true

I did not have to modify the default DNS policy:

dnsPolicy: ClusterFirst                   
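
For reference, a sketch of applying just that field as a patch:

kubectl -n cattle-system patch deployment cattle-cluster-agent \
  --patch '{"spec": {"template": {"spec": {"hostNetwork": true}}}}'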

It’s been 3 years and the issue still surfaces once in a while, depending on the DNS setup (CNAME or A records, *. prefix, etc.). The fact is something in the cattle-agent code does not resolve DNS properly, while at the same time other containers on the same node resolve it fine.

I’m able to fix the issue on RHEL 8 after running the commands below:

sudo iptables -P FORWARD ACCEPT

echo 'net.ipv4.ip_forward = 1' | sudo tee -a /etc/sysctl.d/50-docker-forward.conf

for mod in ip_tables ip_vs_sh ip_vs ip_vs_rr ip_vs_wrr; do sudo modprobe $mod; echo $mod | sudo tee -a /etc/modules-load.d/iptables.conf; done

sudo dnf -y install network-scripts

sudo systemctl enable network

sudo systemctl disable NetworkManager

Got this from here https://github.com/superseb/rancher/commit/cacc5837ab50db42b078f32ac236dc3367130db1

On Debian 10, this issue is caused by the change from iptables to nftables.

This is resolved with:

update-alternatives --set iptables /usr/sbin/iptables-legacy
reboot

Regards Jason

We are having this issue too after an upgrade to k8s 1.16.4. Our workaround, which seems to work, was to set up a hostAlias:

      hostAliases:
      - ip: "xx.xx.xx.xx"
        hostnames:
        - "my.host.name"

This ended up being DNS for me as well - the cattle-cluster-agent was constantly rebooting and not allowing for items like ingress to work properly. I had tried the dnsPolicy setting as suggested, tried putting in entries in the hosts file, tried variations on the host OS, Kubernetes version, and Rancher version, all to no avail.

What did work was setting up DNS “properly.” Using the “roll your own / Custom” option in the Rancher Cluster Creation, not specifying any additional settings (such as private /public addresses), on a greenfield Ubuntu 18.04 setup (with both ipv4 and ipv6 addresses) / Rancher 2.3.3 / Kubernetes 1.16, I started by setting a DNS ‘search’ setting (in the netplan yaml, ethernets/eth0/nameservers/search) for all of the boxes in the mix. I then stood up an internal DNS server with a DNS zone to match, and then ensured that all of the boxes had host (A/AAAA) entries for both IPv4 and IPv6. With that in place, everything magically started working. Did NOT adjust the dnsPolicy setting in the end.
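
A sketch of the netplan fragment described above, with placeholder values for the search domain and the internal DNS server:

network:
  version: 2
  ethernets:
    eth0:
      nameservers:
        search: [example.internal]    # placeholder: your internal DNS zone
        addresses: [192.168.1.53]     # placeholder: your internal DNS server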

Dear @armanriazi, did you find the solution for it?

I have the same problem.

@onkar-dhuri please help out on this. And if anyone from Rancher is reading this… this is a problem so many people have had for a really long time. Can you dedicate a page in your docs to addressing this one particular problem? It’s a humble request. Thanks.

still here

Same issue.

I am using rancher v2.4.3 with latest RKE v1.2.0-rc1 and docker 19.03.9

Still there in v2.4.3 as well.

I ran into this issue when I used a pfsense firewall as a load balancer in front of my cluster. I was able to resolve it by changing my outbound NAT as recommended here in the pfsense docs:

https://docs.netgate.com/pfsense/en/latest/book/loadbalancing/troubleshooting-server-load-balancing.html#unable-to-reach-a-virtual-server-from-a-client-in-the-same-subnet-as-the-pool-server

Essentially the fact that my nodes were on the same subnet as my load balancer caused traffic to not route properly because of incorrect source and destination IPs. This led to the failure in deployment described in this issue even though I was able to use my cluster for the most part normally otherwise. It may not necessarily represent what others have seen here, but could be a potential item to look at for others.

Note: This change does result in you losing some information regarding what host traffic originates from in traffic that hits your load balancer. This was not important to me so I did not worry about that loss of information, but others may.

Restarting all worker and master nodes one by one, and finally restarting the VM where Rancher is installed, helped resolve the issue for us. In our case it was caused by a Calico network update.

Hi folks,

I faced the same issue and just solved it after patching as follows:

kubectl -n cattle-system patch deployments cattle-cluster-agent --patch '{"spec": {"template": {"spec": {"hostAliases": [{"hostnames": ["<Rancher Hostname>"], "ip": "<Rancher IP>"}]}}}}'

Hope that it works for you as well.

Cheers. Bruno Almeida

I think that there are several possible reasons, but just to add one possible solution: allow masquerading:

sudo firewall-cmd --zone=public --add-masquerade --permanent

There is some discussion in this reference: https://stackoverflow.com/a/61256515/17052885

The background is that some (not all?) CNIs will use masquerade rules when doing SNAT in order to reach the DNS server.

In my setup, this didn’t require any of the other workarounds mentioned here.

I hit this issue sometimes with Rancher HA setups on AWS. Usually downstream clusters or even the local HA agent fail to resolve the name. The security policy is open for both incoming and outgoing traffic, and there is no firewall on the Ubuntu image used for the RKE cluster setup.

In my case I don’t think it is a firewall issue. If I change the DNS server in the cattle-cluster-agent to 1.1.1.1, I usually get it fixed.

I also ended up deploying a dns server in my local network.

Oh my God, I’m tired. I re-ran it, this time with a VirtualBox VM, and it still fails in the same way!

unbelievable

@brunobertechini, it’s really weird: whatever the provider (AWS, minikube, docker-desktop), I still get a host resolution error.

INFO: Environment: CATTLE_ADDRESS=10.42.0.26 CATTLE_CA_CHECKSUM=04629d3294e6f84c25d055f13539346540170ad380a43375cbedf47f9aa5ca54 CATTLE_CLUSTER=true CATTLE_FEATURES=dashboard=true CATTLE_INTERNAL_ADDRESS= CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-6d85bc69d-fkx6r CATTLE_SERVER=https://k3s

INFO: Using resolv.conf: search home nameserver 192.168.64.1

ERROR: https://k3s/ping is not accessible (Could not resolve host: k3s)

After changing the dnsPolicy on the cattle-cluster-agent deployment from:

dnsPolicy: ClusterFirst

to:

dnsPolicy: Default

I still get the same DNS resolution error from cattle-cluster-agent.

https://k3s/ping is reachable from my browser; I get a pong.

still here

Workaround: changed the dnsPolicy on the cattle-cluster-agent deployment. From:

dnsPolicy: ClusterFirst

To:

dnsPolicy: Default

I had this issue because I wanted to create a master node with the controlplane + etcd roles only. It seems the worker role is compulsory anyway. So I created the master node with all 3 roles and have no more worries! Afterwards I could create additional nodes with the worker role only. Then I tainted my master node to prevent new pods from being scheduled on it.

I think Rancher should be more explicit about role requirements for RKE, because it’s an outstanding key feature but the UX is debatable for custom node creation.
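
A sketch of the taint step mentioned above, assuming a hypothetical node name; the exact taint key depends on your distribution's convention:

kubectl taint nodes master-1 node-role.kubernetes.io/master=:NoSchedule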

For me, changing CATTLE_SERVER solved it at first, but other places seem to point to the same address inside the pod.

I fixed it by adding a host alias (/etc/hosts entries) to point to the internal IP address for both cattle-cluster-agent and cattle-node-agent.

Looks like rancher should have an internal IP setting that it should use for this. In AWS, this causes additional cost unnecessarily.

Glad it at least partly worked for you @TheAifam5! Always good to hear a post helps someone.