kubernetes: SkyDNS throws I/O errors on a large cluster; slow query performance.
A prod cluster of 40 nodes and ~1100 pods seems to be bottlenecked by DNS queries. Looking at the skydns container logs, I see lots of I/O errors talking to etcd:
2016/01/14 02:09:45 skydns: error from backend: 501: All the given peers are not reachable (failed to propose on members [http://localhost:4001] twice [last error: Get http://localhost:4001/v2/keys/skydns/local/cluster/svc/default/com/foo/bar?quorum=false&recursive=true&sorted=false: dial tcp 127.0.0.1:4001: i/o timeout]) [0]
2016/01/14 02:09:45 skydns: error from backend: 501: All the given peers are not reachable (failed to propose on members [http://localhost:4001] twice [last error: Get http://localhost:4001/v2/keys/skydns/local/cluster/svc/default/com/foo/bar?quorum=false&recursive=true&sorted=false: dial tcp 127.0.0.1:4001: i/o timeout]) [0]
2016/01/14 02:09:45 skydns: error from backend: 501: All the given peers are not reachable (failed to propose on members [http://localhost:4001] twice [last error: Get http://localhost:4001/v2/keys/skydns/local/cluster/svc/default/com/foo/bar?quorum=false&recursive=true&sorted=false: dial tcp 127.0.0.1:4001: i/o timeout]) [0]
2016/01/14 02:31:14 skydns: failure to return reply "write udp: invalid argument"
2016/01/14 02:31:31 skydns: failure to return reply "write udp: invalid argument"
2016/01/14 02:31:45 skydns: failure to return reply "write udp: invalid argument"
2016/01/14 02:33:16 skydns: failure to return reply "write udp: invalid argument"
2016/01/14 02:33:21 skydns: failure to return reply "write udp: invalid argument"
We see that pods are being given `ndots:5` in /etc/resolv.conf, which would, if I'm not mistaken, generate five queries per external lookup (one for each entry in the search path, plus a final absolute query that is recursed upstream). The large-cluster docs indicate that scaling up the kube-dns pod is the fix here. I've also confirmed that scaling out to multiple kube-dns pods (each with its own kube2sky) causes high load on the docker daemon on every node in the cluster. However, the errors above still occur even with a scaled-up pod (see below). Is there a misconfiguration here?
# sample pod /etc/resolv.conf
nameserver 10.100.100.100
nameserver 10.99.0.2
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5
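If I'm reading the glibc resolver rules right, `ndots:5` means any name with fewer than five dots is tried against every entry in the search path before being queried as-is. A quick simulation of that expansion (my own sketch, not actual resolver code; the function name is made up):

```python
# Sketch of glibc-style search-path expansion under "options ndots:N".
# A name with fewer than N dots is tried with each search suffix first,
# then as an absolute name; a trailing dot skips expansion entirely.

def candidate_queries(name, search, ndots=5):
    if name.endswith("."):           # already fully qualified: one query
        return [name]
    if name.count(".") >= ndots:     # "enough" dots: absolute name first
        return [name + "."] + [f"{name}.{s}." for s in search]
    return [f"{name}.{s}." for s in search] + [name + "."]

# The search path from the sample resolv.conf above:
search = ["default.svc.cluster.local", "svc.cluster.local",
          "cluster.local", "ec2.internal"]

# An external two-dot name fans out into five queries:
print(candidate_queries("foo.bar.com", search))
```

So a lookup of `foo.bar.com` costs five round trips, four of which can only ever return NXDOMAIN; applications that append a trailing dot skip the expansion and issue a single query.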
{
  "apiVersion": "v1",
  "kind": "ReplicationController",
  "metadata": {
    "labels": {
      "k8s-app": "kube-dns",
      "kubernetes.io/cluster-service": "true",
      "version": "v9"
    },
    "name": "kube-dns-v9",
    "namespace": "kube-system"
  },
  "spec": {
    "replicas": 1,
    "selector": {
      "k8s-app": "kube-dns",
      "version": "v9"
    },
    "template": {
      "metadata": {
        "labels": {
          "k8s-app": "kube-dns",
          "kubernetes.io/cluster-service": "true",
          "version": "v9"
        }
      },
      "spec": {
        "containers": [
          {
            "command": [
              "/usr/local/bin/etcd",
              "-data-dir",
              "/var/etcd/data",
              "-listen-client-urls",
              "http://127.0.0.1:2379,http://127.0.0.1:4001",
              "-advertise-client-urls",
              "http://127.0.0.1:2379,http://127.0.0.1:4001",
              "-initial-cluster-token",
              "skydns-etcd"
            ],
            "image": "gcr.io/google_containers/etcd:2.0.9",
            "name": "etcd",
            "resources": {
              "limits": {
                "cpu": "1000m",
                "memory": "500Mi"
              }
            },
            "volumeMounts": [
              {
                "mountPath": "/var/etcd/data",
                "name": "etcd-storage"
              }
            ]
          },
          {
            "args": [
              "-domain=cluster.local"
            ],
            "image": "gcr.io/google_containers/kube2sky:1.11",
            "name": "kube2sky",
            "resources": {
              "limits": {
                "cpu": "200m",
                "memory": "100Mi"
              }
            }
          },
          {
            "args": [
              "-machines=http://localhost:4001",
              "-addr=0.0.0.0:53",
              "-ns-rotate=false",
              "-domain=cluster.local."
            ],
            "image": "gcr.io/google_containers/skydns:2015-10-13-8c72f8c",
            "livenessProbe": {
              "httpGet": {
                "path": "/healthz",
                "port": 8080,
                "scheme": "HTTP"
              },
              "initialDelaySeconds": 30,
              "timeoutSeconds": 5
            },
            "name": "skydns",
            "ports": [
              {
                "containerPort": 53,
                "name": "dns",
                "protocol": "UDP"
              },
              {
                "containerPort": 53,
                "name": "dns-tcp",
                "protocol": "TCP"
              }
            ],
            "readinessProbe": {
              "httpGet": {
                "path": "/healthz",
                "port": 8080,
                "scheme": "HTTP"
              },
              "initialDelaySeconds": 1,
              "timeoutSeconds": 5
            },
            "resources": {
              "limits": {
                "cpu": "2000m",
                "memory": "1000Mi"
              }
            }
          },
          {
            "args": [
              "-cmd=nslookup kubernetes.default.svc.cluster.local. localhost",
              "-port=8080"
            ],
            "image": "gcr.io/google_containers/exechealthz:1.0",
            "name": "healthz",
            "ports": [
              {
                "containerPort": 8080,
                "protocol": "TCP"
              }
            ],
            "resources": {
              "limits": {
                "cpu": "10m",
                "memory": "20Mi"
              }
            }
          }
        ],
        "dnsPolicy": "Default",
        "volumes": [
          {
            "emptyDir": {},
            "name": "etcd-storage"
          }
        ]
      }
    }
  }
}
About this issue
- State: closed
- Created 8 years ago
- Comments: 49 (22 by maintainers)
Stumbled across this. I think `ndots` really should be configurable: https://github.com/kubernetes/kubernetes/issues/33554#issuecomment-251755257
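For anyone finding this issue later: per-pod overrides did eventually land in Kubernetes, so a pod spec can lower `ndots` itself via the `dnsConfig` field. A fragment along these lines (the value `2` is just an example, not a recommendation from this thread):

```json
{
  "spec": {
    "dnsConfig": {
      "options": [
        { "name": "ndots", "value": "2" }
      ]
    }
  }
}
```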