kubernetes: SkyDNS throws I/O errors on large cluster, slow query performance.

A production cluster of 40 nodes and ~1100 pods seems to be bottlenecked on DNS queries. Looking at the skydns container logs, I see lots of I/O errors talking to etcd:

2016/01/14 02:09:45 skydns: error from backend: 501: All the given peers are not reachable (failed to propose on members [http://localhost:4001] twice [last error: Get http://localhost:4001/v2/keys/skydns/local/cluster/svc/default/com/foo/bar?quorum=false&recursive=true&sorted=false: dial tcp 127.0.0.1:4001: i/o timeout]) [0]
2016/01/14 02:09:45 skydns: error from backend: 501: All the given peers are not reachable (failed to propose on members [http://localhost:4001] twice [last error: Get http://localhost:4001/v2/keys/skydns/local/cluster/svc/default/com/foo/bar?quorum=false&recursive=true&sorted=false: dial tcp 127.0.0.1:4001: i/o timeout]) [0]
2016/01/14 02:09:45 skydns: error from backend: 501: All the given peers are not reachable (failed to propose on members [http://localhost:4001] twice [last error: Get http://localhost:4001/v2/keys/skydns/local/cluster/svc/default/com/foo/bar?quorum=false&recursive=true&sorted=false: dial tcp 127.0.0.1:4001: i/o timeout]) [0]
2016/01/14 02:31:14 skydns: failure to return reply "write udp: invalid argument"
2016/01/14 02:31:31 skydns: failure to return reply "write udp: invalid argument"
2016/01/14 02:31:45 skydns: failure to return reply "write udp: invalid argument"
2016/01/14 02:33:16 skydns: failure to return reply "write udp: invalid argument"
2016/01/14 02:33:21 skydns: failure to return reply "write udp: invalid argument"
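
For context on the etcd URLs in those errors: SkyDNS builds its etcd key by reversing the labels of the queried name under the /skydns prefix, which is why a lookup for bar.foo.com.default.svc.cluster.local. shows up as a GET on /v2/keys/skydns/local/cluster/svc/default/com/foo/bar (an external-looking name expanded with the default.svc.cluster.local search suffix). A minimal Go sketch of that mapping, purely for illustration and not SkyDNS's actual code:

package main

import (
	"fmt"
	"strings"
)

// domainToEtcdKey mirrors the key scheme visible in the log above:
// drop the trailing dot, reverse the labels, and join them under /skydns.
func domainToEtcdKey(name string) string {
	labels := strings.Split(strings.TrimSuffix(strings.ToLower(name), "."), ".")
	for i, j := 0, len(labels)-1; i < j; i, j = i+1, j-1 {
		labels[i], labels[j] = labels[j], labels[i]
	}
	return "/skydns/" + strings.Join(labels, "/")
}

func main() {
	// Prints /skydns/local/cluster/svc/default/com/foo/bar
	fmt.Println(domainToEtcdKey("bar.foo.com.default.svc.cluster.local."))
}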

We see that pods are being given ndots:5 in /etc/resolv.conf, which, if I'm not mistaken, generates five queries per external lookup: one for each of the four entries in the search path, plus a final query for the name as given. The large-cluster docs suggest scaling up the kube-dns pod's resources in this case. I've also confirmed that scaling out to multiple kube-dns pods (each with its own kube2sky) puts high load on the Docker daemon on every node in the cluster. However, the errors above still occur with a scaled-up pod (see below). Is there a misconfiguration here?

# sample pod /etc/resolv.conf
nameserver 10.100.100.100
nameserver 10.99.0.2
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5
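
To illustrate how ndots:5 multiplies traffic: with the four-entry search path above, any name with fewer than five dots is tried against every search suffix before being queried as-is, i.e. five lookups per external name (and each of those may be issued for both A and AAAA). A rough Go sketch of that resolver behaviour, assuming the resolv.conf above; it is only an approximation, not the actual stub resolver:

package main

import (
	"fmt"
	"strings"
)

// expandQueries approximates how a stub resolver applies "options ndots:5"
// with a search list: names with fewer than ndots dots are tried against
// every search suffix first and only then queried as an absolute name.
// (Names with >= ndots dots are tried absolute first; the fallback to the
// search list on NXDOMAIN is omitted here for brevity.)
func expandQueries(name string, search []string, ndots int) []string {
	if strings.HasSuffix(name, ".") {
		return []string{name} // already fully qualified: a single query
	}
	var queries []string
	if strings.Count(name, ".") < ndots {
		for _, suffix := range search {
			queries = append(queries, name+"."+suffix+".")
		}
	}
	return append(queries, name+".")
}

func main() {
	search := []string{"default.svc.cluster.local", "svc.cluster.local", "cluster.local", "ec2.internal"}
	// A two-dot external name generates five queries, four of which miss
	// in the cluster zones before the absolute name is finally tried.
	for _, q := range expandQueries("bar.foo.com", search, 5) {
		fmt.Println(q)
	}
}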

# kube-dns-v9 ReplicationController manifest
{
  "apiVersion": "v1",
  "kind": "ReplicationController",
  "metadata": {
    "labels": {
      "k8s-app": "kube-dns",
      "kubernetes.io/cluster-service": "true",
      "version": "v9"
    },
    "name": "kube-dns-v9",
    "namespace": "kube-system"
  },
  "spec": {
    "replicas": 1,
    "selector": {
      "k8s-app": "kube-dns",
      "version": "v9"
    },
    "template": {
      "metadata": {
        "labels": {
          "k8s-app": "kube-dns",
          "kubernetes.io/cluster-service": "true",
          "version": "v9"
        }
      },
      "spec": {
        "containers": [
          {
            "command": [
              "/usr/local/bin/etcd",
              "-data-dir",
              "/var/etcd/data",
              "-listen-client-urls",
              "http://127.0.0.1:2379,http://127.0.0.1:4001",
              "-advertise-client-urls",
              "http://127.0.0.1:2379,http://127.0.0.1:4001",
              "-initial-cluster-token",
              "skydns-etcd"
            ],
            "image": "gcr.io/google_containers/etcd:2.0.9",
            "name": "etcd",
            "resources": {
              "limits": {
                "cpu": "1000m",
                "memory": "500Mi"
              }
            },
            "volumeMounts": [
              {
                "mountPath": "/var/etcd/data",
                "name": "etcd-storage"
              }
            ]
          },
          {
            "args": [
              "-domain=cluster.local"
            ],
            "image": "gcr.io/google_containers/kube2sky:1.11",
            "name": "kube2sky",
            "resources": {
              "limits": {
                "cpu": "200m",
                "memory": "100Mi"
              }
            }
          },
          {
            "args": [
              "-machines=http://localhost:4001",
              "-addr=0.0.0.0:53",
          "-ns-rotate=false",
              "-domain=cluster.local."
            ],
            "image": "gcr.io/google_containers/skydns:2015-10-13-8c72f8c",
            "livenessProbe": {
              "httpGet": {
                "path": "/healthz",
                "port": 8080,
                "scheme": "HTTP"
              },
              "initialDelaySeconds": 30,
              "timeoutSeconds": 5
            },
            "name": "skydns",
            "ports": [
              {
                "containerPort": 53,
                "name": "dns",
                "protocol": "UDP"
              },
              {
                "containerPort": 53,
                "name": "dns-tcp",
                "protocol": "TCP"
              }
            ],
            "readinessProbe": {
              "httpGet": {
                "path": "/healthz",
                "port": 8080,
                "scheme": "HTTP"
              },
              "initialDelaySeconds": 1,
              "timeoutSeconds": 5
            },
            "resources": {
              "limits": {
                "cpu": "2000m",
                "memory": "1000Mi"
              }
            }
          },
          {
            "args": [
              "-cmd=nslookup kubernetes.default.svc.cluster.local. localhost",
              "-port=8080"
            ],
            "image": "gcr.io/google_containers/exechealthz:1.0",
            "name": "healthz",
            "ports": [
              {
                "containerPort": 8080,
                "protocol": "TCP"
              }
            ],
            "resources": {
              "limits": {
                "cpu": "10m",
                "memory": "20Mi"
              }
            }
          }
        ],
        "dnsPolicy": "Default",
        "volumes": [
          {
            "emptyDir": {},
            "name": "etcd-storage"
          }
        ]
      }
    }
  }
}

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 49 (22 by maintainers)

Most upvoted comments

Stumbled across this. I think ndots really should be configurable: https://github.com/kubernetes/kubernetes/issues/33554#issuecomment-251755257