edgemesh: ERROR: Lb err: [pod list is empty] when accessing a service hosted on a different edge node
OS: Ubuntu 20.04.4 LTS; KubeEdge version: v1.7.1; EdgeMesh version: v1.7.0; K8s version: v1.19 (master at v1.19.12, see node list below)
I have an infrastructure composed of two edge nodes, as shown in the picture below (VM1 hosts the pod behind Service L1, VM2 the pod behind Service L2):
All nodes are healthy:
frapedge00001 Ready agent,edge 53d v1.19.3-kubeedge-v1.7.1
fraporion00001 Ready master 53d v1.19.12
fraprigel00001 Ready agent,edge 47d v1.19.3-kubeedge-v1.7.1
For both services, I can resolve the IP using dig
from both VMs. This means that the edgemesh gateways are doing their job properly and can resolve the DNS query.
dig pedestrian-tracking-l2-edge-svc.default
;; Warning: Message parser reports malformed message packet.
; <<>> DiG 9.16.1-Ubuntu <<>> pedestrian-tracking-l2-edge-svc.default
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58818
;; flags: qr rd ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;pedestrian-tracking-l2-edge-svc.default. IN A
;; ANSWER SECTION:
pedestrian-tracking-l2-edge-svc.default. 64 IN A 10.96.42.43
;; Query time: 0 msec
;; SERVER: 172.17.0.1#53(172.17.0.1)
;; WHEN: Tue Aug 17 22:20:46 CST 2021
;; MSG SIZE rcvd: 73
dig pedestrian-tracking-l1-edge-svc.default
;; Warning: Message parser reports malformed message packet.
; <<>> DiG 9.16.1-Ubuntu <<>> pedestrian-tracking-l1-edge-svc.default
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 30930
;; flags: qr rd ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;pedestrian-tracking-l1-edge-svc.default. IN A
;; ANSWER SECTION:
pedestrian-tracking-l1-edge-svc.default. 64 IN A 10.107.107.57
;; Query time: 0 msec
;; SERVER: 172.17.0.1#53(172.17.0.1)
;; WHEN: Tue Aug 17 22:21:16 CST 2021
;; MSG SIZE rcvd: 73
The problem arises when I try to reach the applications hosted on the respective pods by using simple curl calls. For example:
- If I use curl from VM2 -> Service L2 or from VM1 -> Service L1, the call is successful: I can see from the logs of the pods that they receive a request (please ignore the 405 Method Not Allowed error, it's irrelevant). For example:
curl --noproxy "*" http://pedestrian-tracking-l2-edge-svc.default:6000/sedna/feature_extraction -vvv
* Trying 10.96.42.43:6000...
* TCP_NODELAY set
* Connected to pedestrian-tracking-l2-edge-svc.default (10.96.42.43) port 6000 (#0)
> GET /sedna/feature_extraction HTTP/1.1
> Host: pedestrian-tracking-l2-edge-svc.default:6000
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 405 Method Not Allowed
< Content-Length: 31
< Content-Type: application/json
< Date: Tue, 17 Aug 2021 12:47:59 GMT
< Server: uvicorn
<
* Connection #0 to host pedestrian-tracking-l2-edge-svc.default left intact
INFO: 7.182.8.34:57472 - "GET /sedna/feature_extraction HTTP/1.1" 405 Method Not Allowed
- If I use curl from VM2 -> Service L1 or from VM1 -> Service L2, it doesn't work and the pod doesn't receive any request. I can also see the following error message in the edgemesh gateway:
I0817 12:48:35.426515 1 dns.go:298] [EdgeMesh] dns server parse pedestrian-tracking-l2-edge-svc.default ip 10.96.42.43
I0817 12:48:35.426860 1 serviceproxy.go:53] ip: 10.96.42.43 port: 6000
W0817 12:48:35.426996 1 strategy.go:21] failed to find default.pedestrian-tracking-l2-edge-svc DestinationRule, use default strategy RoundRobin from config file
I0817 12:48:35.427049 1 log.go:181] DEBUG: add [2] handlers for chain [http]
I0817 12:48:35.427227 1 log.go:181] ERROR: Lb err: [pod list is empty]
W0817 12:48:35.427481 1 http.go:61] read http request EOF
As you can see, it seems like no pods are associated with the service, but this is not true: the same curl call works when it is executed from the host hosting the service's pod (i.e. not cross-VM).
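The gateway log above says the default RoundRobin strategy was used and then failed with "pod list is empty". The sketch below is a hypothetical illustration of that failure mode, not EdgeMesh's actual strategy.go code: a round-robin picker over a service's backend pods errors out exactly when the gateway's view of the pod list is empty, which is why the request never reaches the (actually running) pod.

```python
from itertools import count


class RoundRobin:
    """Pick backend pods in rotation; fail fast on an empty pod list.

    Hypothetical sketch of the load-balancing step, NOT EdgeMesh's code.
    """

    def __init__(self, pods):
        self._pods = list(pods)
        self._counter = count()

    def pick(self):
        if not self._pods:
            # Corresponds to the gateway log: ERROR: Lb err: [pod list is empty]
            raise RuntimeError("Lb err: [pod list is empty]")
        return self._pods[next(self._counter) % len(self._pods)]


# Same-node case: the gateway sees the local pod and the call succeeds.
local = RoundRobin(["pedestrian-tracking-l2-pod"])
print(local.pick())  # pedestrian-tracking-l2-pod

# Cross-node case: the gateway's pod list is empty, so the call fails
# before any request is forwarded.
try:
    RoundRobin([]).pick()
except RuntimeError as e:
    print(e)  # Lb err: [pod list is empty]
```

This matches the symptom: DNS resolution succeeds (the service's ClusterIP is known), but load balancing aborts because the gateway never learned about the pods backing the service on the other node.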
About this issue
- State: closed
- Created 3 years ago
- Comments: 17 (11 by maintainers)
After some digging with @ZBoIsHere, we found the solution to the problem (this applies only to the master branch of edgemesh, not release-1.7).
For edgemesh to work properly, the kubeconfig file needs to be loaded during the bootup phase of both the server and the agent. However, adding the following:
to the edgemesh-server (kubectl edit cm edgemesh-server -n kubeedge) and edgemesh-agent-cfg (kubectl edit cm edgemesh-agent-cfg -n kubeedge) YAML is not enough, because the pods do not have privileges to access the host filesystem (no access policies are in place). To make it work, the deployment files for the server and agent need to be changed to mount the config file into the pod. The final YAML (for the server) looks like this:
What was changed is the path to mount the volume (the file in this case) with the k8s cluster config.
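The snippet and the final deployment YAML are not reproduced above. As a sketch of the described change (the host path /root/.kube/config and the volume name below are assumptions, not copied from the issue — use your cluster's actual kubeconfig path), the relevant fragment of the edgemesh-server Deployment pod spec would mount the host's kubeconfig into the container at the same path the ConfigMap points to:

```yaml
# Fragment of the edgemesh-server Deployment pod spec (sketch only;
# /root/.kube/config is an assumed kubeconfig location).
containers:
  - name: edgemesh-server
    volumeMounts:
      - name: kubeconfig
        mountPath: /root/.kube/config   # path the ConfigMap's kubeConfig entry references
volumes:
  - name: kubeconfig
    hostPath:
      path: /root/.kube/config          # kubeconfig file on the node
      type: File
```

The analogous volumeMounts/volumes change goes into the edgemesh-agent deployment; without it the path configured in the ConfigMap exists only on the host, not inside the pod, which is why editing the ConfigMaps alone was not enough.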
Finally, restart the server and agent(s), and everything should work fine. In my case, the tunnel is established correctly: