linkerd2: proxy error: client requires absolute-form URIs

Bug Report

What is the issue?

~15 hours after upgrading our staging cluster from edge-20.7.5 to stable-2.9.0, we observed connectivity errors between multiple pods in the same namespace that normally communicate over k8s services. From the client’s perspective, all requests to http://server/foo returned instantly with a 502 error, while the linkerd-proxy logs on both the client and server reported, e.g.:

[  8109.623316s]  WARN ThreadId(01) inbound:accept{peer.addr=10.4.2.144:50078 target.addr=10.4.2.205:3000}: linkerd2_app_core::errors: Failed to proxy request: client requires absolute-form URIs

Note that the pods were not restarted after the linkerd update, so while the control plane was running 2.9.0, the pods were still running the older version of the proxy.

Interestingly, if I manually curl http://server.namespace/foo from the client pod rather than http://server/foo, the request succeeds; it’s only the unqualified version of the URL that fails.
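
To illustrate, a minimal sketch of the two checks described above (the pod, service, and namespace names here are placeholders):

# From a shell in the client pod: the unqualified name fails, the namespace-qualified one succeeds
curl -sS -o /dev/null -w '%{http_code}\n' http://server/foo            # 502
curl -sS -o /dev/null -w '%{http_code}\n' http://server.namespace/foo  # 200

# From outside the cluster: confirm which proxy image a non-restarted pod is still running
kubectl -n namespace get pod client-pod-xxxxx \
  -o jsonpath='{.spec.containers[?(@.name=="linkerd-proxy")].image}'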

How can it be reproduced?

At least for us, the repro path is to update from the old edge release to the new stable release and then wait 12-24 hours; it’s happened twice so far. That said I do not know what level of traffic (if any) might be necessary to trigger it and there is no obvious reason why the issue starts when it does.
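
For reference, a sketch of the upgrade path described (the install script URL and upgrade command here are assumptions based on the standard Linkerd upgrade flow, not necessarily the exact commands we ran):

# Fetch the stable CLI, then upgrade the control plane in place
curl -sL https://run.linkerd.io/install | sh -
linkerd upgrade | kubectl apply -f -
# Existing meshed pods keep running the edge-20.7.5 proxy until they are restarted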

linkerd check output

Pre-update:

$ linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust roots are using supported crypto algorithm
√ trust roots are within their validity period
√ trust roots are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust root

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    is running version 2.7.1 but the latest stable version is 2.9.0
    see https://linkerd.io/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 20.7.5 but the latest edge version is 20.11.5
    see https://linkerd.io/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running edge-20.7.5 but cli running stable-2.7.1
    see https://linkerd.io/checks/#l5d-version-control for hints

Status check results are √

Environment

  • Kubernetes Version: 1.16.15-gke.4300
  • Cluster Environment: GKE
  • Host OS: Google Container-Optimized OS (cos)
  • Linkerd version: edge-20.7.5; stable-2.9.0


About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 24 (11 by maintainers)

Most upvoted comments

So, it turns out that, while we did fix this in the proxy and backport the fix onto the 2.9.x version of the proxy, this fixed version isn’t what was actually tagged for release:

commit ee3fa1483a51a3e3414a23e51d6ef0b02c30098c (tag: release/v2.124.1, tag: release/v2.124.0)

The proxy tag release/v2.124.0 is what was used for 2.9.0, and the release/v2.124.1 tag points to the same commit 😞

We’ll create a v2.124.2 tag on the proper commit (d1766c00) and confirm the fix for a stable-2.9.4.

Again, my apologies for the confusion.
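
For anyone who wants to double-check which commit each proxy release tag points to, something like the following should work (assuming the linkerd/linkerd2-proxy repository on GitHub):

git ls-remote https://github.com/linkerd/linkerd2-proxy 'refs/tags/release/v2.124.*'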

stable-2.9.4 has been released and is available via

:; curl -sL https://run.linkerd.io/install |sh -

I’ve confirmed that the fix is included as follows:

# Create a k3d cluster with a version that supports Linkerd 2.8.1
:; k3d cluster create --image rancher/k3s:v1.18.9-k3s1

# Install 2.8.1
:; ~/.linkerd2/bin/linkerd-stable-2.8.1 install | k apply -f -

# Run emojivoto
:; curl -sL https://run.linkerd.io/emojivoto.yml | ~/.linkerd2/bin/linkerd-stable-2.8.1 inject - | k apply -f -

# Confirm the CLI is now 2.9.4 while the control plane is still 2.8.1, before upgrading
:; linkerd version
Client version: stable-2.9.4
Server version: stable-2.8.1

# Upgrade the control plane
:; linkerd upgrade | k apply -f -

# ... wait for the control plane upgrade to complete ...

# Restart only emojivoto's `web` deployment
:; k rollout restart -n emojivoto deploy/web

# Confirm traffic is flowing through the restarted pod
:; linkerd stat -n emojivoto deploy
NAME       MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   TCP_CONN
emoji         1/1   100.00%   2.0rps           1ms           1ms           1ms          2
vote-bot      1/1         -        -             -             -             -          -
voting        1/1    86.67%   1.0rps           1ms           1ms           1ms          2
web           1/1    94.38%   1.5rps           2ms           3ms           3ms          2

# Ensure no errors in the proxy logs
:; k logs -n emojivoto deploy/web -c linkerd-proxy
time="2021-02-24T00:54:50Z" level=info msg="running version stable-2.9.4"
[     0.000578s]  INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[     0.000949s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[     0.000957s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[     0.000959s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[     0.000960s]  INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[     0.000962s]  INFO ThreadId(01) linkerd2_proxy: Local identity is web.emojivoto.serviceaccount.identity.linkerd.cluster.local
[     0.000966s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.000968s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.001111s]  INFO ThreadId(01) outbound: linkerd2_app: listen.addr=127.0.0.1:4140 ingress_mode=false
[     0.001151s]  INFO ThreadId(01) inbound: linkerd2_app: listen.addr=0.0.0.0:4143
[     0.011597s]  INFO ThreadId(02) daemon:identity: linkerd2_app: Certified identity: web.emojivoto.serviceaccount.identity.linkerd.cluster.local

Please let us know if you see anything unexpected with stable-2.9.4!

Hello, once again thank you for the release of linkerd 2.9.3. We recently tried to update our cluster again, from version 2.8.1 to 2.9.3. However, same as before, we noticed the same problem: services still running the 2.8.1 proxy failed to establish HTTP connections to services running the 2.9.3 proxy. This happened in our UAT EKS cluster on k8s 1.18.

To further test the issue and ensure it was not due to the underlying cluster setup, I tried to replicate it on my local machine using k8s (1.16.6) on Docker Desktop, and sadly saw the same behaviour.

Below are the steps to replicate.

  • Install linkerd 2.8.1 from scratch (linkerd install | kubectl apply -f - )
  • Deploy test services (nginx or httpd). (app1, app2, app3 in my case)
  • Expose each with their corresponding service (app1.default.svc.cluster.local, app2.default.svc.cluster.local, app3.default.svc.cluster.local)
  • At this stage, when tested, there was no issue as expected, since everything was running the same proxy (2.8.1)
  • Upgrade linkerd to 2.9.3 (linkerd upgrade | kubectl apply -f - )
  • Rolling restart app2 and app3 (kubectl rollout restart deployment/app{2,3})
  • Queries from app1 via curl against both app2 and app3 yield a 502

I understand the flow is similar to what @olix0r described, but I am not sure what I am missing here. Is there another step that needs to be considered when upgrading from 2.8.1 -> 2.9.3?
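
Roughly, the steps above as commands (a sketch; the image, ports, container name, and curl flags are illustrative rather than the exact ones used):

# Install 2.8.1 and deploy three meshed test services
linkerd install | kubectl apply -f -
for app in app1 app2 app3; do
  kubectl create deployment "$app" --image=nginx
  kubectl expose deployment "$app" --port=80
done
kubectl get deploy app1 app2 app3 -o yaml | linkerd inject - | kubectl apply -f -

# Upgrade the control plane to 2.9.3, then restart only app2 and app3
linkerd upgrade | kubectl apply -f -
kubectl rollout restart deployment/app2 deployment/app3

# From app1 (still on the 2.8.1 proxy), query the restarted services; this is where the 502s show up
# (assumes curl is available in the app1 container)
kubectl exec deploy/app1 -c nginx -- curl -sS -o /dev/null -w '%{http_code}\n' http://app2
kubectl exec deploy/app1 -c nginx -- curl -sS -o /dev/null -w '%{http_code}\n' http://app3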

Linkerd proxy check

linkerd-data-plane
------------------
√ data plane namespace exists
√ data plane proxies are ready
√ data plane proxy metrics are present in Prometheus
‼ data plane is up-to-date
    Some data plane pods are not running the current version:
	* default/app1-6fcb445b46-pgzsc (stable-2.8.1)
    see https://linkerd.io/checks/#l5d-data-plane-version for hints
‼ data plane and cli versions match
    default/app1-6fcb445b46-pgzsc running stable-2.8.1 but cli running stable-2.9.3
    see https://linkerd.io/checks/#l5d-data-plane-cli-version for hints

I have attached the logs of the linkerd-proxy of the 3 apps for reference.

app1-trace.log app2-trace.log app3-trace.log

This is fixed in the latest edge release and we’ll be releasing a backported fix as stable-2.9.3 next week.

What’s the approach to avoid downtime/5xx errors while upgrading from 2.8.1 to 2.9.1/2? We just ran into the same issue in our staging environment. Should data planes not be backward compatible by at least one version? Did anyone find a workaround?

By the way, I can confirm that this is now working. I tried it locally and it worked (using linkerd 2.9.4). Thanks again!

@adinhodovic Appreciate the offer. Given @olix0r’s repro (woot!) we’re probably in good shape but I’ll let him chime in if he wants more logs.