linkerd2: postgres dumps hang when routed through linkerd
Bug Report
What is the issue?
I’ve got a postgres 12.1 database and a python (django 3.1.3) web application running in two separate pods in my GKE cluster. Normal (mostly very quick) queries between the two pods are fine when meshed through linkerd. However, when I try to dump the database (with `pg_dump`, after exec’ing into the web app pod), the dump hangs partway through whenever the traffic is proxied by linkerd. I’ve observed this behavior on stable-2.9.1 with port 5432 not skipped. I also tried the edge-21.1.1 release with port 5432 marked opaque, and saw the same behavior there. In both releases, adding port 5432 to the skip ports allows database dumps to complete smoothly.
The amount of data that gets dumped before the hang is inconsistent: sometimes I get 100 MB out of `pg_dump` before it stalls, other times only a megabyte or two.
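For reference, the opaque-port experiment on edge-21.1.1 was set up along these lines. This is a sketch rather than my exact manifest, and the deployment name is a placeholder; the annotation goes on the pod template so the injected proxy treats 5432 as an opaque TCP port:

```sh
# Placeholder deployment name; patching the pod template rolls the pods so the
# proxy picks up the opaque-ports setting.
kubectl patch deployment <postgres-deployment> --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/opaque-ports":"5432"}}}}}'
```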
How can it be reproduced?
I can reproduce the behavior in my environment, with this sample database: https://github.com/anthonydb/practical-sql/tree/master/Chapter_11
If I load that CSV into postgres using the steps in the Chapter_11.sql file, then exec into another pod and try to dump the data, I see the hang. In my case, I created a database called linkerd_debug to hold the sample table, and dump it like this: `/usr/bin/pg_dump -U postgres --host postgres linkerd_debug` (`postgres` is the name of the k8s service where my postgres database pod lives).
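Condensed into commands, the repro looks roughly like this (pod names are placeholders from my environment, and the table-loading statements come from Chapter_11.sql in the practical-sql repo):

```sh
# Create a scratch database and load the Chapter_11 sample table into it,
# following the CREATE TABLE / \copy steps from Chapter_11.sql.
kubectl exec -it <postgres-pod> -- psql -U postgres -c 'CREATE DATABASE linkerd_debug;'

# From another meshed pod, dump the database over the k8s service; with the
# linkerd proxy in the path this hangs partway through.
kubectl exec <webapp-pod> -- \
  /usr/bin/pg_dump -U postgres --host postgres linkerd_debug > /tmp/linkerd_debug.sql
```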
Logs, error output, etc
I’ve uploaded some tcpdump output (filtered on port 5432) of a dump that’s hanging: https://gist.github.com/sjahl/52ae9e3fff78213b2697e96e382e2b6b
In the gist, you can see that packets abruptly stop at 09:29:37.235702, at which point I waited about a minute and then ctrl-C’d the dump.
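If anyone wants to collect a similar trace, something along these lines should work. This is an illustrative capture command rather than my exact invocation; linkerd’s debug sidecar is one way to get tcpdump into a meshed pod:

```sh
# Assumes the pod was injected with `linkerd inject --enable-debug-sidecar`,
# which adds a `linkerd-debug` container with tcpdump available.
kubectl exec -it <webapp-pod> -c linkerd-debug -- tcpdump -i any -nn port 5432
```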
`linkerd check` output
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running
linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
is running version 2.9.1 but the latest stable version is 2.9.2
see https://linkerd.io/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 2.9.1 but the latest stable version is 2.9.2
see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match
linkerd-prometheus
------------------
√ prometheus add-on service account exists
√ prometheus add-on config map exists
√ prometheus pod is running
linkerd-grafana
---------------
√ grafana add-on service account exists
√ grafana add-on config map exists
√ grafana pod is running
Status check results are √
Environment
- Kubernetes Version:
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.15-gke.6000", GitCommit:"b02f5ea6726390a4b19d06fa9022981750af2bbc", GitTreeState:"clean", BuildDate:"2020-11-18T09:16:22Z", GoVersion:"go1.13.15b4", Compiler:"gc", Platform:"linux/amd64"}
- Cluster Environment: GKE
- Host OS: Google Container-Optimized OS with Docker (cos)
- Linkerd version: stable-2.9.1 and edge-21.1.1
Possible solution
Adding port 5432 to linkerd’s skip-inbound/skip-outbound ports works around the issue for now.
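Concretely, the workaround looks roughly like this (deployment names are placeholders for the actual web app and postgres workloads; the annotations go on the pod templates so the proxy-injector picks them up):

```sh
# Client side: bypass the proxy for outbound connections to 5432.
kubectl patch deployment <webapp> --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/skip-outbound-ports":"5432"}}}}}'

# Server side: bypass the proxy for inbound connections on 5432.
kubectl patch deployment <postgres> --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/skip-inbound-ports":"5432"}}}}}'
```

Patching the pod template triggers a rollout, so the skip rules take effect once the pods restart.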
Additional context
Slack discussion here: https://linkerd.slack.com/archives/C89RTCWJF/p1610480917116800
One more anecdote: I’m able to dump very small databases (and I suspect this is why normal queries in my app are working). Anything more than a couple of megabytes seems to trigger this hang.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 17 (9 by maintainers)
👍 Thanks for looking into this, I really appreciate it. No worries on the 2.10 release, I think keeping port 5432 skipped is reasonable for our project in the short/medium term.
I happened to have some Azure credits, so I ran this test there too, using their default ‘Kubenet’ network plugin, and saw some notably different behavior: the initial load.sh succeeds! But if I run it multiple times, or try other larger queries after the data is loaded, it stalls in a similar fashion to the clusters on GKE (for example, after a successful load.sh, both a pg_dump and a `select * from nyc_yellow_taxi_trips_2016_06_01 limit 5000;` seem to get hung up for me).
Logs from the successful load and a subsequent stalled `select * from nyc_yellow_taxi_trips_2016_06_01 limit 5000;` query are at: https://gist.github.com/sjahl/a26f07202efb8e026b62a15bee6e78ed
Happy hunting! 😃
This is fixed in edge-21.3.2. Closing for now, but please let us know if you see anything unexpected. Again, thanks for the help tracking this down.
OK, I set up a testing cluster and used the repro scripts from the gist, and it looks like I’m still seeing hangs with edge-21.2.2 😕
Behavior I observed:
- `config.linkerd.io/opaque-ports: "5432"`: hangs
- `config.linkerd.io/skip-inbound-ports: "5432"` and `config.linkerd.io/skip-outbound-ports: "5432"`: pg_dump completes successfully
For these tests I was running pg_dump to stdout, but running the `psql -U linkerd -h postgres < /tmp/data.sql` load as described results in the same hangs.
Linkerd debug logs for each case are at: https://gist.github.com/sjahl/1f664f2c8566b9485ebe2119eecd8479
I’ll see if I can find some time today to try the repro scripts on edge-21.2.2!