linkerd2: Unexpected behavior change on ingress injected pods in v2.13.x
What is the issue?
After upgrading from v2.12.x to v2.13.x, Contour HTTPProxies configured without explicitly setting the `l5d-dst-override` header started to fail with 503s generated by linkerd-proxy. These failures are reported by Contour's Envoy pods injected with the new version of linkerd-proxy in ingress mode.
How can it be reproduced?
Create a local kind cluster and apply the following configuration:
kind create cluster --name linkerd-contour
# First we install the 2.12.5 version of Linkerd
linkerd-stable-2.12.5 install --crds | kubectl apply -f -
linkerd-stable-2.12.5 install | kubectl apply -f -
# Install Contour
kubectl apply -f https://projectcontour.io/quickstart/contour.yaml
kubectl patch daemonset -n projectcontour envoy -p '{"spec":{"template":{"metadata":{"annotations":{"linkerd.io/inject": "ingress"}}}}}'
kubectl patch service -n projectcontour envoy -p '{"spec":{"type": "ClusterIP"}}'
# Install example workload
kubectl patch ns default -p '{"metadata":{"annotations":{"linkerd.io/inject": "enabled"}}}'
kubectl apply -k https://github.com/stefanprodan/podinfo//kustomize
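Before creating the HTTPProxy, it can be worth confirming that the mesh and Contour came up cleanly. Optional sanity checks along these lines (using the same versioned CLI binary as above) should all report healthy/ready:
# Optional sanity checks before continuing
linkerd-stable-2.12.5 check
kubectl get pods -n projectcontour
kubectl get pods -n default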
Create HTTPProxy
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: podinfo
  namespace: default
spec:
  routes:
    - services:
        - name: podinfo
          port: 9898
  virtualhost:
    fqdn: podinfo.localtest.me
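To apply the manifest, save it (e.g. as httpproxy-podinfo.yaml, an arbitrary filename) and run something like:
kubectl apply -f httpproxy-podinfo.yaml
# Contour should report the proxy as valid
kubectl get httpproxy -n default podinfo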
If we port-forward to Envoy:
kubectl port-forward -n projectcontour svc/envoy 8080:80
This should work just fine: http://podinfo.localtest.me:8080/
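The same check can be done from the terminal; with the port-forward above still running, a plain curl should come back with a 200 from podinfo (localtest.me resolves to 127.0.0.1, so the request goes through the local port-forward to Envoy):
curl -i http://podinfo.localtest.me:8080/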
Now upgrade Linkerd to v2.13.3:
linkerd-stable-2.13.3 upgrade --crds | kubectl apply -f -
linkerd-stable-2.13.3 upgrade | kubectl apply -f -
# Restart Contour envoy to re-mesh with new version of Linkerd
kubectl rollout restart daemonset -n projectcontour envoy
kubectl port-forward -n projectcontour svc/envoy 8080:80
kubectl logs -f -n projectcontour -l app=envoy
Now, if you visit the endpoint it will fail: http://podinfo.localtest.me:8080/
upstream connect error or disconnect/reset before headers. reset reason: connection termination
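The failure is also easy to observe from the terminal, which makes the before/after comparison quick; with the new port-forward running, the same curl as before should now return a 503 with the body shown above:
curl -i http://podinfo.localtest.me:8080/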
Example output from envoy linkerd-proxy logs:
[ 151.573999s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=error from user's Service: buffered service failed: status: NotFound, message: "No such service", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 16 May 2023 11:31:08 GMT", "content-length": "0"} } error.sources=[buffered service failed: status: NotFound, message: "No such service", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 16 May 2023 11:31:08 GMT", "content-length": "0"} }, status: NotFound, message: "No such service", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 16 May 2023 11:31:08 GMT", "content-length": "0"} }] client.addr=10.244.0.24:60326
[ 151.688059s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=error from user's Service: buffered service failed: status: NotFound, message: "No such service", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 16 May 2023 11:31:08 GMT", "content-length": "0"} } error.sources=[buffered service failed: status: NotFound, message: "No such service", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 16 May 2023 11:31:08 GMT", "content-length": "0"} }, status: NotFound, message: "No such service", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 16 May 2023 11:31:08 GMT", "content-length": "0"} }] client.addr=10.244.0.24:33550
Example output from envoy logs:
[2023-05-16T11:31:08.177Z] "GET / HTTP/1.1" 503 UC 0 88 2 - "10.244.0.24" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.42" "9e61d3ff-9442-4eb9-a45b-389e30dd0b32" "podinfo.localtest.me" "10.244.0.15:9898"
[2023-05-16T11:31:08.290Z] "GET /favicon.ico HTTP/1.1" 503 UC 0 88 3 - "10.244.0.24" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.42" "f10ec7f6-523d-47e5-ab06-01d881ba7667" "podinfo.localtest.me" "10.244.0.16:9898"
The fix is to add the `l5d-dst-override` header, either in the HTTPProxy or globally in the Contour configuration:
kubectl patch httpproxies.projectcontour.io -n default podinfo --type merge -p '{"spec":{"routes":[{"services": [{"name":"podinfo", "port": 9898, "requestHeadersPolicy": {"set": [{"name":"l5d-dst-override", "value": "podinfo.default.svc.cluster.local:9898"}]}}]}]}}'
Visiting the endpoint again should now succeed: http://podinfo.localtest.me:8080/
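For reference, the same fix can be written declaratively in the HTTPProxy manifest itself; this is simply the patch above expressed as a full spec:
kubectl apply -f - <<'EOF'
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: podinfo
  namespace: default
spec:
  routes:
    - services:
        - name: podinfo
          port: 9898
          requestHeadersPolicy:
            set:
              - name: l5d-dst-override
                value: podinfo.default.svc.cluster.local:9898
  virtualhost:
    fqdn: podinfo.localtest.me
EOF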
Logs, error output, etc
Example messages from Contour’s envoy:
"GET /x HTTP/2" 503 UC ...
output of linkerd check -o short
Status check results are √
Environment
- Kubernetes v1.26.2-eks-a59e1f0
- EKS
- Bottlerocket OS 1.13.4
- Linkerd v2.13.3
Possible solution
This might just be a configuration detail that is missing from, or not clearly covered by, the documentation. The main problem, however, is the behavior change between the two minor versions, which should probably be called out in the upgrade guide.
Additional context
No response
Would you like to work on fixing this bug?
maybe
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 2
- Comments: 20 (11 by maintainers)
Commits related to this issue
- Adds notes about l5d-dst-override requirement 2.13 changed behavior for ingress mode proxies, failling requests without l5d-dst-override. This add notes about this behaviour change to the docs Fixe... — committed to argyle-engineering/website by chicocvenancio a year ago
- outbound: handle `NotFound` client policies in ingress mode (#2431) When the outbound proxy resolves an outbound policy from the policy controller's `OutboundPolicies` API, the policy controller may... — committed to linkerd/linkerd2-proxy by hawkw a year ago
- outbound: handle `NotFound` client policies in ingress mode (#2431) When the outbound proxy resolves an outbound policy from the policy controller's `OutboundPolicies` API, the policy controller may... — committed to linkerd/linkerd2-proxy by hawkw a year ago
- outbound: handle `NotFound` client policies in ingress mode (#2431) (#2435) When the outbound proxy resolves an outbound policy from the policy controller's `OutboundPolicies` API, the policy contro... — committed to linkerd/linkerd2-proxy by adleong a year ago
Hi folks,
Taking a look at the proxy code, this is definitely a bug that was introduced in 2.13.x. The issue is that the `NotFound` status is returned by the policy controller when the proxy looks up a traffic target that is not a ClusterIP Service (such as a pod IP): https://github.com/linkerd/linkerd2/blob/40dc561759b71ad50ee9db6a6458d16ea26b365a/policy-controller/grpc/src/outbound.rs#L69
In the normal outbound proxy (not ingress mode), we handle this error by falling back to a ServiceProfile looked up from the policy controller: https://github.com/linkerd/linkerd2-proxy/blob/864a5dbc97538262f49f73e432b3a7ab071104c7/linkerd/app/outbound/src/sidecar.rs#L70 (note that we push the `self.resolver(..)` resolver function, which performs ServiceProfile fallback on `NotFound` gRPC responses from the policy controller: https://github.com/linkerd/linkerd2-proxy/blob/864a5dbc97538262f49f73e432b3a7ab071104c7/linkerd/app/outbound/src/discover.rs#L63).
In the ingress mode proxy, on the other hand, we don't use this logic: https://github.com/linkerd/linkerd2-proxy/blob/864a5dbc97538262f49f73e432b3a7ab071104c7/linkerd/app/outbound/src/ingress.rs#L82-L95 Instead, we fail discovery if either the policy controller or the destination controller returns a gRPC error from its discovery API. This is an oversight which we should address, as we should instead perform the same synthesis behavior here.
PR linkerd/linkerd2-proxy#2431 should fix this. I’ve verified the change using the Contour and podinfo repro steps described in the first post, and the ingress mode proxy once again appears to route correctly after applying that change.
@erkerb4 I think `ingress.kubernetes.io/custom-request-headers` is not supported in Traefik 2, so the `l5d-dst-override` header is not being added. You could use `traefik.ingress.kubernetes.io/router.middlewares` and add a Middleware for each service to keep using ingress.
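For anyone trying that route, a rough sketch of the Middleware approach could look like the following; the resource names are illustrative and it assumes Traefik 2 with the kubernetescrd provider enabled, so treat it as a starting point rather than a verified configuration:
# Sketch only: a Traefik v2 Middleware that sets l5d-dst-override for podinfo
kubectl apply -f - <<'EOF'
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: l5d-podinfo   # illustrative name
  namespace: default
spec:
  headers:
    customRequestHeaders:
      l5d-dst-override: podinfo.default.svc.cluster.local:9898
EOF
# Reference it from the Ingress as <namespace>-<middleware-name>@kubernetescrd
# ("podinfo" here stands in for whatever the actual Ingress is called)
kubectl annotate ingress podinfo -n default \
  traefik.ingress.kubernetes.io/router.middlewares=default-l5d-podinfo@kubernetescrd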