linkerd2: Unexpected behavior change on ingress injected pods in v2.13.x

What is the issue?

After upgrading from v2.12.x to v2.13.x, I noticed that Contour HTTPProxies configured without explicitly setting the l5d-dst-override header started failing with 503s generated by linkerd-proxy. These failures are reported by Contour’s Envoy pods injected with the new version of linkerd-proxy in ingress mode.

How can it be reproduced?

Create a local kind cluster and run the following commands:

kind create cluster --name linkerd-contour
# First we install the 2.12.5 version of Linkerd
linkerd-stable-2.12.5 install --crds | kubectl apply -f -
linkerd-stable-2.12.5 install | kubectl apply -f -

# Install Contour
kubectl apply -f https://projectcontour.io/quickstart/contour.yaml
kubectl patch daemonset -n projectcontour envoy -p '{"spec":{"template":{"metadata":{"annotations":{"linkerd.io/inject": "ingress"}}}}}'
kubectl patch service -n projectcontour envoy -p '{"spec":{"type": "ClusterIP"}}'

# Install example workload
kubectl patch ns default -p '{"metadata":{"annotations":{"linkerd.io/inject": "enabled"}}}'
kubectl apply -k https://github.com/stefanprodan/podinfo//kustomize

Create an HTTPProxy:

apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: podinfo
  namespace: default
spec:
  routes:
  - services:
    - name: podinfo
      port: 9898
  virtualhost:
    fqdn: podinfo.localtest.me

If we port-forward to Envoy:

kubectl port-forward -n projectcontour svc/envoy 8080:80

This should work just fine: http://podinfo.localtest.me:8080/

Now upgrade Linkerd to v2.13.3:

linkerd-stable-2.13.3 upgrade --crds | kubectl apply -f -
linkerd-stable-2.13.3 upgrade | kubectl apply -f -

# Restart Contour envoy to re-mesh with new version of Linkerd
kubectl rollout restart daemonset -n projectcontour envoy
kubectl port-forward -n projectcontour svc/envoy 8080:80
kubectl logs -f -n projectcontour -l app=envoy

Now, visiting the endpoint fails: http://podinfo.localtest.me:8080/

upstream connect error or disconnect/reset before headers. reset reason: connection termination

Example output from the Envoy pod’s linkerd-proxy logs:

[   151.573999s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=error from user's Service: buffered service failed: status: NotFound, message: "No such service", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 16 May 2023 11:31:08 GMT", "content-length": "0"} } error.sources=[buffered service failed: status: NotFound, message: "No such service", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 16 May 2023 11:31:08 GMT", "content-length": "0"} }, status: NotFound, message: "No such service", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 16 May 2023 11:31:08 GMT", "content-length": "0"} }] client.addr=10.244.0.24:60326
[   151.688059s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=error from user's Service: buffered service failed: status: NotFound, message: "No such service", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 16 May 2023 11:31:08 GMT", "content-length": "0"} } error.sources=[buffered service failed: status: NotFound, message: "No such service", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 16 May 2023 11:31:08 GMT", "content-length": "0"} }, status: NotFound, message: "No such service", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 16 May 2023 11:31:08 GMT", "content-length": "0"} }] client.addr=10.244.0.24:33550

Example output from the Envoy logs:

[2023-05-16T11:31:08.177Z] "GET / HTTP/1.1" 503 UC 0 88 2 - "10.244.0.24" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.42" "9e61d3ff-9442-4eb9-a45b-389e30dd0b32" "podinfo.localtest.me" "10.244.0.15:9898"
[2023-05-16T11:31:08.290Z] "GET /favicon.ico HTTP/1.1" 503 UC 0 88 3 - "10.244.0.24" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.42" "f10ec7f6-523d-47e5-ab06-01d881ba7667" "podinfo.localtest.me" "10.244.0.16:9898"

The workaround is to add the l5d-dst-override header, either per route in the HTTPProxy or globally in the Contour configuration:

kubectl patch httpproxies.projectcontour.io -n default podinfo --type merge -p '{"spec":{"routes":[{"services": [{"name":"podinfo", "port": 9898, "requestHeadersPolicy": {"set": [{"name":"l5d-dst-override", "value": "podinfo.default.svc.cluster.local:9898"}]}}]}]}}'
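Equivalently, the patch above corresponds to the following HTTPProxy manifest (same service name, namespace, and port as in the repro):

```yaml
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: podinfo
  namespace: default
spec:
  virtualhost:
    fqdn: podinfo.localtest.me
  routes:
  - services:
    - name: podinfo
      port: 9898
      # Explicitly tell the ingress-mode linkerd-proxy which Service
      # the request is destined for, so discovery succeeds.
      requestHeadersPolicy:
        set:
        - name: l5d-dst-override
          value: podinfo.default.svc.cluster.local:9898
```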

Visiting the endpoint again should now succeed: http://podinfo.localtest.me:8080/

Logs, error output, etc

Example messages from Contour’s envoy:

"GET /x HTTP/2" 503 UC ...

Output of linkerd check -o short:

Status check results are √

Environment

  • Kubernetes v1.26.2-eks-a59e1f0
  • EKS
  • Bottlerocket OS 1.13.4
  • Linkerd v2.13.3

Possible solution

This might simply be a configuration detail that is missing from the documentation or incorrectly documented. However, the main problem is the behavior change between the two minor versions, which should probably be mentioned in the upgrade guide.

Additional context

No response

Would you like to work on fixing this bug?

maybe

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 20 (11 by maintainers)

Most upvoted comments

Hi folks,

Taking a look at the proxy code, this is definitely a bug that was introduced in 2.13.x. The issue is that the NotFound status is returned by the policy controller when the proxy looks up a traffic target that is not a ClusterIP Service (such as a pod IP): https://github.com/linkerd/linkerd2/blob/40dc561759b71ad50ee9db6a6458d16ea26b365a/policy-controller/grpc/src/outbound.rs#L69

In the normal outbound proxy (not ingress mode), we handle this error by falling back to a ServiceProfile looked up from the policy controller: https://github.com/linkerd/linkerd2-proxy/blob/864a5dbc97538262f49f73e432b3a7ab071104c7/linkerd/app/outbound/src/sidecar.rs#L70 (note that we push the self.resolver(..) resolver function which performs ServiceProfile fallback on NotFound gRPC responses from the policy controller: https://github.com/linkerd/linkerd2-proxy/blob/864a5dbc97538262f49f73e432b3a7ab071104c7/linkerd/app/outbound/src/discover.rs#L63).

In the ingress mode proxy, on the other hand, we don’t use this logic: https://github.com/linkerd/linkerd2-proxy/blob/864a5dbc97538262f49f73e432b3a7ab071104c7/linkerd/app/outbound/src/ingress.rs#L82-L95. Instead, we fail discovery if either the policy controller or the destination controller returns a gRPC error from its discovery API. This is an oversight we should address: the ingress mode proxy should perform the same fallback synthesis here.
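To make the difference concrete, here is a minimal sketch of the two discovery behaviors described above. All type and function names are invented for illustration; the real logic lives in the sidecar.rs and ingress.rs files linked above.

```rust
// Illustrative sketch only -- not the real linkerd2-proxy API.

#[derive(Debug, PartialEq)]
enum Discovery {
    Policy(&'static str),  // outbound policy from the policy controller
    Profile(&'static str), // ServiceProfile fallback
}

#[derive(Debug, PartialEq)]
enum GrpcStatus {
    NotFound,
    Internal,
}

// Regular (sidecar) outbound behavior: a NotFound from the policy API
// is not fatal -- fall back to a ServiceProfile lookup.
fn sidecar_discover(
    policy: Result<&'static str, GrpcStatus>,
    profile_lookup: impl Fn() -> Result<&'static str, GrpcStatus>,
) -> Result<Discovery, GrpcStatus> {
    match policy {
        Ok(p) => Ok(Discovery::Policy(p)),
        Err(GrpcStatus::NotFound) => profile_lookup().map(Discovery::Profile),
        Err(e) => Err(e),
    }
}

// Pre-fix ingress-mode behavior: any gRPC error fails discovery
// outright, which is what produced the 503s for pod-IP targets.
fn ingress_discover(
    policy: Result<&'static str, GrpcStatus>,
) -> Result<Discovery, GrpcStatus> {
    policy.map(Discovery::Policy)
}

fn main() {
    let profile = || Ok("podinfo.default.svc.cluster.local");
    // The sidecar proxy falls back to the ServiceProfile on NotFound...
    assert_eq!(
        sidecar_discover(Err(GrpcStatus::NotFound), profile),
        Ok(Discovery::Profile("podinfo.default.svc.cluster.local"))
    );
    // ...while the 2.13 ingress-mode proxy surfaced the error (-> 503).
    assert!(ingress_discover(Err(GrpcStatus::NotFound)).is_err());
    println!("ok");
}
```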

PR linkerd/linkerd2-proxy#2431 should fix this. I’ve verified the change using the Contour and podinfo repro steps described in the first post, and the ingress mode proxy once again appears to route correctly after applying that change.

@erkerb4 I think ingress.kubernetes.io/custom-request-headers is not supported in Traefik 2, so the l5d-dst-override header is not being added. You could use the traefik.ingress.kubernetes.io/router.middlewares annotation instead and add a middleware for each service if you want to keep using Ingress resources.
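For example, a Traefik 2 headers middleware could set the header like this (the resource names and the podinfo target are illustrative, following the repro above):

```yaml
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: l5d-podinfo
  namespace: default
spec:
  headers:
    # Add the header Linkerd's ingress-mode proxy uses for routing.
    customRequestHeaders:
      l5d-dst-override: podinfo.default.svc.cluster.local:9898
```

The Ingress would then reference it via the annotation `traefik.ingress.kubernetes.io/router.middlewares: default-l5d-podinfo@kubernetescrd` (Traefik expects the `<namespace>-<name>@kubernetescrd` form when the middleware comes from the Kubernetes CRD provider).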