edgemesh: ERROR: Lb err: [pod list is empty] when accessing a service hosted on a different edge node

OS: Ubuntu 20.04.4 LTS
KubeEdge version: v1.7.1
EdgeMesh version: v1.7.0
K8s version: v1.19.12

I have an infrastructure composed of two edge nodes as shown in the picture below:

(architecture diagram: two edge VMs, each an edge node hosting one of the two services)

All nodes are healthy:

frapedge00001    Ready    agent,edge   53d   v1.19.3-kubeedge-v1.7.1
fraporion00001   Ready    master       53d   v1.19.12
fraprigel00001   Ready    agent,edge   47d   v1.19.3-kubeedge-v1.7.1

For both services, I can resolve the ClusterIP using dig from both VMs. This means the edgemesh DNS servers are doing their job properly and resolving the queries.

dig pedestrian-tracking-l2-edge-svc.default
;; Warning: Message parser reports malformed message packet.

; <<>> DiG 9.16.1-Ubuntu <<>> pedestrian-tracking-l2-edge-svc.default
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58818
;; flags: qr rd ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;pedestrian-tracking-l2-edge-svc.default. IN A

;; ANSWER SECTION:
pedestrian-tracking-l2-edge-svc.default. 64 IN A 10.96.42.43

;; Query time: 0 msec
;; SERVER: 172.17.0.1#53(172.17.0.1)
;; WHEN: Tue Aug 17 22:20:46 CST 2021
;; MSG SIZE  rcvd: 73
dig pedestrian-tracking-l1-edge-svc.default
;; Warning: Message parser reports malformed message packet.

; <<>> DiG 9.16.1-Ubuntu <<>> pedestrian-tracking-l1-edge-svc.default
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 30930
;; flags: qr rd ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;pedestrian-tracking-l1-edge-svc.default. IN A

;; ANSWER SECTION:
pedestrian-tracking-l1-edge-svc.default. 64 IN A 10.107.107.57

;; Query time: 0 msec
;; SERVER: 172.17.0.1#53(172.17.0.1)
;; WHEN: Tue Aug 17 22:21:16 CST 2021
;; MSG SIZE  rcvd: 73

The problem arises when I try to reach the applications hosted on the respective pods by using simple curl calls. For example:

  • If I use curl from VM2 -> Service L2 or from VM1 -> Service L1, the call succeeds: the pod logs show the incoming request (please ignore the 405 Method Not Allowed error; it's irrelevant here). For example:
 curl --noproxy "*" http://pedestrian-tracking-l2-edge-svc.default:6000/sedna/feature_extraction -vvv
*   Trying 10.96.42.43:6000...
* TCP_NODELAY set
* Connected to pedestrian-tracking-l2-edge-svc.default (10.96.42.43) port 6000 (#0)
> GET /sedna/feature_extraction HTTP/1.1
> Host: pedestrian-tracking-l2-edge-svc.default:6000
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 405 Method Not Allowed
< Content-Length: 31
< Content-Type: application/json
< Date: Tue, 17 Aug 2021 12:47:59 GMT
< Server: uvicorn
<
* Connection #0 to host pedestrian-tracking-l2-edge-svc.default left intact
INFO:     7.182.8.34:57472 - "GET /sedna/feature_extraction HTTP/1.1" 405 Method Not Allowed
  • If I use curl from VM2 -> Service L1 or from VM1 -> Service L2, the call fails and the pod receives no request. The edgemesh gateway also logs the following error:
I0817 12:48:35.426515       1 dns.go:298] [EdgeMesh] dns server parse pedestrian-tracking-l2-edge-svc.default ip 10.96.42.43
I0817 12:48:35.426860       1 serviceproxy.go:53] ip: 10.96.42.43 port: 6000
W0817 12:48:35.426996       1 strategy.go:21] failed to find default.pedestrian-tracking-l2-edge-svc DestinationRule, use default strategy RoundRobin from config file
I0817 12:48:35.427049       1 log.go:181] DEBUG: add [2] handlers for chain [http]
I0817 12:48:35.427227       1 log.go:181] ERROR: Lb err: [pod list is empty]
W0817 12:48:35.427481       1 http.go:61] read http request EOF

As you can see, it looks as if no pods were associated with the service, but that is not true: the same call works when curl is executed on the host that runs the service's pod (i.e., not cross-VM).
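Incidentally, the error itself comes from the load-balancer strategy being handed an empty endpoint list for the service on the remote node. A hypothetical Python sketch (EdgeMesh is written in Go; the class and error message below are illustrative only, not EdgeMesh's actual code) of how a RoundRobin strategy fails this way:

```python
# Hypothetical sketch of a round-robin balancer over service endpoints.
# It only illustrates why an empty endpoint list produces an
# "Lb err: [pod list is empty]"-style failure before any connection is made.
from itertools import cycle

class RoundRobinBalancer:
    def __init__(self, endpoints):
        # No endpoints means there is nothing to balance over: fail fast.
        if not endpoints:
            raise RuntimeError("Lb err: [pod list is empty]")
        self._cycle = cycle(endpoints)

    def pick(self):
        # Return endpoints in strict rotation.
        return next(self._cycle)

balancer = RoundRobinBalancer(["10.0.0.1:6000", "10.0.0.2:6000"])
print(balancer.pick())  # 10.0.0.1:6000
print(balancer.pick())  # 10.0.0.2:6000
```

In other words, the failure is not in DNS (which resolves fine) but in the endpoint list the balancer sees on the remote node.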

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 17 (11 by maintainers)

Most upvoted comments

After some digging with @ZBoIsHere, we found the solution to the problem (this applies only to the master branch of edgemesh, not release-1.7).

For edgemesh to work properly, the kubeconfig file needs to be loaded at startup by both the server and the agent. However, adding the following:

    kubeAPIConfig:
      burst: 200
      contentType: application/vnd.kubernetes.protobuf
      kubeConfig: "/path/to/.kube/config"
      #master: "http://127.0.0.1:10550"
      qps: 100

to the edgemesh-server (kubectl edit cm edgemesh-server -n kubeedge) and edgemesh-agent-cfg (kubectl edit cm edgemesh-agent-cfg -n kubeedge) ConfigMaps is not enough, because the pods have no way to read that file from the host filesystem (no mount is in place). To make it work, the deployment files for the server and the agent need to be changed so that the kubeconfig is mounted into the pod. The final YAML (for the server) looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: kubeedge
    kubeedge: edgemesh-server
  name: edgemesh-server
  namespace: kubeedge
spec:
  selector:
    matchLabels:
      k8s-app: kubeedge
      kubeedge: edgemesh-server
  template:
    metadata:
      labels:
        k8s-app: kubeedge
        kubeedge: edgemesh-server
    spec:
      hostNetwork: true
#     alternatively, use a node selector (label) instead of nodeName
      nodeName: fraporion00001
      containers:
      - name: edgemesh-server
        image: kubeedge/edgemesh-server:v1.7.0-21-ge18ebf1
        env:
          - name: MY_NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
        ports:
        - containerPort: 10005
          name: relay
          protocol: TCP
        resources:
          limits:
            cpu: 200m
            memory: 1Gi
          requests:
            cpu: 100m
            memory: 512Mi
        securityContext:
          privileged: true
          procMount: Default
        volumeMounts:
          - name: config
            mountPath: /path/to/.kube/config
            readOnly: true
          - name: conf
            mountPath: /etc/kubeedge/config
          - name: edgemesh
            mountPath: /etc/kubeedge/edgemesh
      restartPolicy: Always
      serviceAccountName: edgemesh-server
      volumes:
        - name: config
          hostPath:
            path: "/path/to/.kube/config"
            type: File
        - name: conf
          configMap:
            name: edgemesh-server
        - name: edgemesh
          hostPath:
            path: /etc/kubeedge/edgemesh
            type: DirectoryOrCreate

The change relative to the stock manifest is the extra volume and volumeMount (a single file in this case) that expose the host's kubeconfig inside the pod.
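For completeness, the edgemesh-agent needs the analogous change. A sketch of the relevant fragment of its DaemonSet spec, assuming the volume name and paths mirror the server deployment above (verify against your actual manifest):

```yaml
# Fragment of the edgemesh-agent DaemonSet spec (names and paths assumed
# to mirror the server deployment; adjust to your manifest).
        volumeMounts:
          - name: config
            mountPath: /path/to/.kube/config
            readOnly: true
      volumes:
        - name: config
          hostPath:
            path: "/path/to/.kube/config"
            type: File
```

The mountPath must match the kubeConfig path set in the edgemesh-agent-cfg ConfigMap, otherwise the agent still cannot find the file.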

Finally, restart the server and the agent(s), and everything should work. In my case, the tunnel is now established correctly:

I0819 09:55:47.794682       1 dns.go:73] dns server parse pedestrian-tracking-l2-edge-svc.default. ip 10.97.159.121
I0819 09:55:47.795044       1 proxy.go:37] clusterIP: 10.97.159.121, servicePort: 6000
W0819 09:55:47.795129       1 util.go:21] DestinationRule "default.pedestrian-tracking-l2-edge-svc" not found, use default strategy [RoundRobin] from config file
I0819 09:55:47.795155       1 log.go:181] DEBUG: add [2] handlers for chain [tcp]
I0819 09:55:47.795368       1 handler.go:59] l4 proxy get tcpserver address: frapedge00001:7.182.8.34:6000
I0819 09:55:47.800347       1 tcpproxy.go:166] frapedge00001 dial 7.182.8.34:6000 success
I0819 09:55:47.800362       1 handler.go:106] l4 proxy start proxy data between tcpserver frapedge00001:7.182.8.34:6000
I0819 09:55:47.803776       1 handler.go:112] Success proxy to frapedge00001:7.182.8.34:6000