akri: [Extensibility] Agent hostNetwork setting breaks cluster DNS lookup
Describe the bug
Using hostNetwork: true for a Pod (as Akri agent does) breaks DNS name resolution for K8s services. See https://github.com/kubernetes/dns/issues/316. Adding dnsPolicy: ClusterFirstWithHostNet fixes the problem.
Details: I’m developing an end-to-end example using HTTP (see #85).
The discovery handler consistently fails to discover the URL referenced by its handler’s discovery_endpoint value.
I don’t want to distract you with my noob issues but, if you’ve any insight into what I’m doing wrong, I’d appreciate it.
Per @bfjelds’ revised Nessie example, I’m also using reqwest, and the agent generates the following error:
[http:discover] Entered
[http:discover] url: http://discovery:9999
[http:discover] Response: Err(reqwest::Error { kind: Request, url: "http://discovery:9999/", source: hyper::Error(Connect, ConnectError("dns error", Custom { kind: Other, error: "failed to lookup address information: Temporary failure in name resolution" })) })
[http:discover] Spoofed results
In the above, I’ve taken the get out of the control flow and am spoofing the results (see below):
async fn discover(&self) -> Result<Vec<DiscoveryResult>, failure::Error> {
    println!("[http:discover] Entered");
    let url = self.discovery_handler_config.discovery_endpoint.clone();
    println!("[http:discover] url: {}", &url);
    let resp = get(&url).await;
    ..
}
When the discover function returned directly from matching on the get, the error was slightly more informative:
async fn discover(&self) -> Result<Vec<DiscoveryResult>, failure::Error> {
    println!("[http:discover] Entered");
    let url = self.discovery_handler_config.discovery_endpoint.clone();
    println!("[http:discover] url: {}", &url);
    match get(&url).await {
        Ok(resp) => {
            // The endpoint returns one device address per line.
            let device_list = &resp.text().await?;
            let result: Vec<DiscoveryResult> = device_list.lines().map(...).collect();
            Ok(result)
        }
        Err(err) => {
            Err(format_err!("unable to parse discovery endpoint results: {:?}", err))
        }
    }
}
Yields:
[http:discover] Entered
[http:discover] url: http://discovery:9999
[http:discover] Failed to connect to discovery endpoint: http://discovery:9999
[http:discover] Error: error sending request for url (http://discovery:9999/): error trying to connect: dns error: failed to lookup address information: Temporary failure in name resolution
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: ErrorMessage { msg: "unable to parse discovery endpoint results: reqwest::Error { kind: Request, url: \"http://discovery:9999/\", source: hyper::Error(Connect, ConnectError(\"dns error\", Custom { kind: Other, error: \"failed to lookup address information: Temporary failure in name resolution\" })) }" }', agent/src/util/config_action.rs:146:64
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
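The failure above is a name lookup failure rather than an HTTP error, so one way to narrow it down is to resolve the Service name directly from inside the agent's Pod. The following is a minimal sketch using only the Rust standard library; the `resolve` helper is hypothetical (not part of Akri), and `discovery:9999` is simply the Service name and port from this issue. A Pod with hostNetwork: true and the default dnsPolicy falls back to the node's resolver, so the first lookup would be expected to fail there and succeed in a Pod using cluster DNS.

```rust
use std::net::{SocketAddr, ToSocketAddrs};

/// Hypothetical helper: resolve a `host:port` string with the system resolver
/// and return every address found, or a readable error message.
fn resolve(host_port: &str) -> Result<Vec<SocketAddr>, String> {
    host_port
        .to_socket_addrs()
        .map(|addrs| addrs.collect())
        .map_err(|err| format!("failed to resolve {}: {}", host_port, err))
}

fn main() {
    // "discovery:9999" needs cluster DNS; "127.0.0.1:9999" is a literal
    // address and resolves without any DNS lookup.
    for target in ["discovery:9999", "127.0.0.1:9999"].iter() {
        match resolve(target) {
            Ok(addrs) => println!("{} -> {:?}", target, addrs),
            Err(err) => println!("{}", err),
        }
    }
}
```

Running this once in the agent's Pod and once in an ordinary Pod (such as the curl Pod) should show whether the two see different resolvers.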
I’m relatively (!) confident that this URL is correct and the agent should be able to GET it.
If I run a curl pod in the cluster’s default namespace, the endpoint returns 200 and the correct response:
kubectl run curl --image=radial/busyboxplus:curl --stdin --tty --rm
[ root@curl:/ ]$ curl http://discovery:9999/
0.0.0.0:8000
0.0.0.0:8001
0.0.0.0:8002
0.0.0.0:8003
0.0.0.0:8004
0.0.0.0:8005
0.0.0.0:8006
0.0.0.0:8007
0.0.0.0:8008
0.0.0.0:8009
I’m at a loss to understand why this error arises but it does so consistently and reliably (I’ve tried … but will try using the Cluster IP).
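For reference, the mapping step elided as `map(...)` in the second snippet could look roughly like the sketch below, which turns the one-address-per-line body returned by the curl request into a list of results. The `Endpoint` struct and `parse_device_list` function are hypothetical stand-ins for illustration only; Akri's own DiscoveryResult type has a different shape and is not reproduced here.

```rust
/// Hypothetical stand-in for a discovery result; not Akri's DiscoveryResult.
#[derive(Debug, PartialEq)]
struct Endpoint {
    address: String,
}

/// Parse a response body containing one device address per line,
/// skipping blank lines and surrounding whitespace.
fn parse_device_list(body: &str) -> Vec<Endpoint> {
    body.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(|line| Endpoint {
            address: line.to_string(),
        })
        .collect()
}

fn main() {
    // A shortened copy of the body returned by the discovery endpoint.
    let body = "0.0.0.0:8000\n0.0.0.0:8001\n\n";
    let devices = parse_device_list(body);
    println!("{:?}", devices);
}
```

Keeping the parsing in a plain function like this means it can be exercised without a running discovery service.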
I’m able to spoof the correct result by manually creating Vec<DiscoveryResult> with the values provided by the response and then the agent and broker work correctly:
kubectl logs pod/akri-http-dbb47e-pod
[http:main] Entered
[http:main] Device: http://device-8000:8000
[http:main] get_discovery_data
[http:get_discovery_data] Entered
[http:main] Environment:
PATH:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME:akri-http-dbb47e-pod
....
[http:main] Starting gRPC server
[http:serve] Entered
[http:serve] Starting gRPC server: 0.0.0.0:8084
[http:main:loop] Sleep
[http:main:loop] read_sensor(http://device-8000:8000)
[http:read_sensor] Entered
[main:read_sensor] Response status: 200
[main:read_sensor] Response body: Ok("0.2854040649574679")
[http:main:loop] Sleep
[http:main:loop] read_sensor(http://device-8000:8000)
[http:read_sensor] Entered
[main:read_sensor] Response status: 200
[main:read_sensor] Response body: Ok("0.4158989841983801")
[http:main:loop] Sleep
[http:main:loop] read_sensor(http://device-8000:8000)
[http:read_sensor] Entered
[main:read_sensor] Response status: 200
[main:read_sensor] Response body: Ok("0.30926792372133194")
[http:main:loop] Sleep
So, beside this issue, I’m almost (still need to come up with a solution for device DNS naming…) at a solution.
Output of kubectl get pods,akrii,akric -o wide
See above
Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]
MicroK8s.
About this issue
- State: closed
- Created 4 years ago
- Comments: 23 (13 by maintainers)
Commits related to this issue
- Fixes: https://github.com/deislabs/akri/issues/102 (#116) — committed to project-akri/akri by DazWilkin 4 years ago
- [Extensibility] HTTP protocol (branch: http-extensibility) (#135) * Initial commit * Working * Correct errors & revise for Device|Discovery v2 * Working * Should (!) work * Typo * E... — committed to project-akri/akri by DazWilkin 4 years ago
- [Extensibility] HTTP protocol (branch: http-extensibility) (#135) * Initial commit * Working * Correct errors & revise for Device|Discovery v2 * Working * Should (!) work * Typo * E... — committed to kate-goldenring/akri by DazWilkin 4 years ago
- HTTP Extensibility Branch Rebase and Update (#320) * test latest k8s versions [SAME VERSION] (#188) * update changelog for 0.1.5 release (#189) * add more description and simplify commands in e... — committed to project-akri/akri by kate-goldenring 3 years ago
- HTTP Extensibility Branch Rebase and Update (#320) * test latest k8s versions [SAME VERSION] (#188) * update changelog for 0.1.5 release (#189) * add more description and simplify commands in e... — committed to kate-goldenring/akri by kate-goldenring 3 years ago
what would happen if you took hostNetwork: true out of the agent template (or added it to your test Pod)? i don’t remember what all we needed that for, but i’m thinking udev was the primary reason. maybe that is causing the issue? (your documentation is awesome, by the way)