kubernetes: Pods in StatefulSets cannot resolve each other by expected hostnames

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.): Yes

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): StatefulSets, DNS (Found some issues, but all with resolutions I have tried)


Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG

Kubernetes version (use kubectl version):

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"08e099554f3c31f6e6f07b448ab3ed78d0520507", GitTreeState:"clean", BuildDate:"2017-01-12T04:57:25Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T20:22:08Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
  • Kernel (e.g. uname -a):
$ uname -a
Linux motest-7 3.10.0-514.6.2.el7.x86_64 #1 SMP Thu Feb 23 03:04:39 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux 
  • Install tools:
  • Others:

What happened: Created a StatefulSet with 2 replicas, and a headless Service to allow access to them. The Pods have the expected hostnames, but they are unable to resolve each other's hostnames.

Here are the StatefulSet and Service definitions:

apiVersion: v1
kind: Service
metadata:
  name: oklog-store
  labels:
    app: oklog-store
spec:
  ports:
    - name: store-api
      port: 12050
      targetPort: 12050
      protocol: TCP
    - name: store-cluster
      port: 12059
      targetPort: 12059
      protocol: TCP
  clusterIP: None
  selector:
    app: oklog-store
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: oklog-store
spec:
  serviceName: "oklog-store"
  replicas: 2
  template:
    metadata:
      labels:
        app: oklog-store
    spec:
      containers:
        - name: oklog-store
          image: docker.registry.lohs.geneity/dmiddlec/oklog-dmiddlec:test
          imagePullPolicy: Always
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          command:
            - "/bin/sh"
            - "-c"
            - |
              exec oklog store -debug -cluster tcp://$POD_IP:12059 \
                -api tcp://0.0.0.0:12050 \
                -peer oklog-ingest-0.oklog-ingest.$POD_NAMESPACE.svc.cluster.local:12059 \
                -peer oklog-ingest-1.oklog-ingest.$POD_NAMESPACE.svc.cluster.local:12059 \
                -peer oklog-ingest-2.oklog-ingest.$POD_NAMESPACE.svc.cluster.local:12059 \
                -peer oklog-ingest-3.oklog-ingest.$POD_NAMESPACE.svc.cluster.local:12059 \
                -peer oklog-store-0.oklog-store.$POD_NAMESPACE.svc.cluster.local:12059 \
                -peer oklog-store-1.oklog-store.$POD_NAMESPACE.svc.cluster.local:12059
          ports:
            - name: store-api
              containerPort: 12050
            - name: store-cluster
              containerPort: 12059
      terminationGracePeriodSeconds: 10
      nodeSelector:
        oklog: not-active

Note: clusterIP: None is set (this was the resolution in all the other issues I found).

Inside oklog-store-0

root@oklog-store-0:/# hostname -f
oklog-store-0.oklog-store.ci.svc.cluster.local
root@oklog-store-0:/# nslookup oklog-store.ci.svc.cluster.local
Server:         192.168.0.10
Address:        192.168.0.10#53

Name:   oklog-store.ci.svc.cluster.local
Address: 172.16.81.9
Name:   oklog-store.ci.svc.cluster.local
Address: 172.16.29.13
root@oklog-store-0:/# nslookup oklog-store-1.oklog-store.ci.svc.cluster.local
Server:         192.168.0.10
Address:        192.168.0.10#53

** server can't find oklog-store-1.oklog-store.ci.svc.cluster.local: NXDOMAIN

root@oklog-store-0:/# nslookup oklog-store-1
Server:         192.168.0.10
Address:        192.168.0.10#53

** server can't find oklog-store-1: SERVFAIL
root@oklog-store-0:/# nslookup -q=SRV oklog-store.ci.svc.cluster.local
Server:         192.168.0.10
Address:        192.168.0.10#53

oklog-store.ci.svc.cluster.local        service = 10 50 0 12c78c01.oklog-store.ci.svc.cluster.local.
oklog-store.ci.svc.cluster.local        service = 10 50 0 4749ebd6.oklog-store.ci.svc.cluster.local.
root@oklog-store-0:/# nslookup 12c78c01.oklog-store.ci.svc.cluster.local
Server:         192.168.0.10
Address:        192.168.0.10#53

Name:   12c78c01.oklog-store.ci.svc.cluster.local
Address: 172.16.29.13

root@oklog-store-0:/# nslookup 4749ebd6.oklog-store.ci.svc.cluster.local
Server:         192.168.0.10
Address:        192.168.0.10#53

Name:   4749ebd6.oklog-store.ci.svc.cluster.local
Address: 172.16.81.9

Note: nslookup of the Service domain returns both records as expected.

Note: The Pods are available, just with unexpected (hash-based) hostnames.

What you expected to happen:

nslookup oklog-store-1.oklog-store.ci.svc.cluster.local to resolve to the correct IP address.

How to reproduce it (as minimally and precisely as possible): Use the given versions, and apply the .yaml above to create the StatefulSet and headless Service.

Anything else we need to know:

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 36 (2 by maintainers)

Most upvoted comments

SOLVED: The spec.serviceName field in the statefulSet manifest was wrong. It should match the metadata.name field in the headless service definition.
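A minimal sketch of the two fields that must match (the names here are illustrative, not taken from the issue):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-headless            # <-- this name ...
spec:
  clusterIP: None              # headless: required for per-pod DNS records
  selector:
    app: my-app
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: my-app
spec:
  serviceName: "my-headless"   # <-- ... must match serviceName here
  replicas: 2
```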

For anyone interested on debugging kube-dns; https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/

For me, redis-cluster-1.redis-cluster works, but not redis-cluster-1.

@thedodd @kuznero @alahijani

I had the exact same issue with clusterIP: None set. It turned out my Service selector was wrong, i.e.

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    name: nginx

should have been

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx

Might want to check that.

Solved

Turns out I had the FQDN pattern incorrect. (Not only can you get the Service and the Service selector wrong; you can also get this wrong.)

I was using: hidden-secondary-mongo-hidden-secondary-0.hidden-secondary.svc.cluster.local

Which is the pattern: <pod>-<N>.<namespace>.svc.cluster.local

When I should have used:

hidden-secondary-mongo-hidden-secondary-0.hidden-secondary-mongo-hidden-secondary.hidden-secondary.svc.cluster.local

Which is the pattern:

<pod>-<N>.<statefulSetName>.<namespace>.svc.cluster.local

The table found here can be tremendously helpful.

Suffice it to say, some names will become shorter.
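The pattern above can be sketched in shell with this issue's names (the helper variables are hypothetical, not part of the original report):

```shell
# Sketch: how a StatefulSet pod's stable DNS name is composed, using the
# names from this issue (StatefulSet oklog-store, namespace ci).
# Note: spec.serviceName and the headless Service's metadata.name must
# both equal $SERVICE for these records to exist.
STATEFULSET=oklog-store
SERVICE=oklog-store
NAMESPACE=ci
ORDINAL=1
FQDN="${STATEFULSET}-${ORDINAL}.${SERVICE}.${NAMESPACE}.svc.cluster.local"
echo "$FQDN"   # prints oklog-store-1.oklog-store.ci.svc.cluster.local
```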

I'm having similar issues with a StatefulSet and headless Service (clusterIP: None) on GKE 1.10.5-gke.3. I can get the pods to resolve at pod-0.service.namespace.svc.cluster.local, but the short name pod-0 only resolves from inside pod-0 itself, so I can't use the short hostnames except within the pods themselves.

You can resolve this issue by specifying a dnsConfig with a search entry in your Pod spec: rabbit-service.your_namespace.svc.cluster.local
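As a sketch, that dnsConfig would sit in the Pod template like this (rabbit-service and your_namespace are the commenter's placeholders; dnsConfig requires a Kubernetes version that supports Pod DNS config):

```yaml
spec:
  template:
    spec:
      dnsConfig:
        searches:
          - rabbit-service.your_namespace.svc.cluster.local
```

With that search domain in place, the short name rabbit-service-0 expands to the full per-pod record.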

@thedodd @kuznero

The service has to be headless (clusterIP: None) for the pod DNS entries to work. Check https://github.com/kubernetes/kubernetes/issues/39197

Hi, the environment variable HOSTNAME is the pod name, e.g. nifi-0,

which makes it impossible for pods of the same StatefulSet to resolve each other with just the HOSTNAME, e.g. nifi-1 can't access the nifi-0 pod by its HOSTNAME nifi-0.

It would be more helpful to make the HOSTNAME nifi-0.nifi-headless and nifi-1.nifi-headless,

where nifi-headless is their common headless Service name.

What version of Kubernetes fixed this? I used 1.11.5, and it works correctly with redis-cluster-1.redis-cluster.default.svc.cluster.local, but with just redis-cluster-1 it doesn't work.


The official example in this doc just does not work.

@MPV The metadata.name of the headless service also needs to match the spec.serviceName of the StatefulSet.

edit: apparently it does, based on the template I just read.

Just FYI an upgrade from 1.10.1 to 1.10.6 did not help.