linkerd2: postgres dumps hang when routed through linkerd
Bug Report
What is the issue?
I’ve got a postgres 12.1 database and a python (django 3.1.3) web application running in two separate pods in my GKE cluster. Normal (mostly very quick) queries between the two pods are fine when meshed through linkerd. However, when I try to dump the database (with `pg_dump`, after exec’ing into the web app pod), the dump hangs partway through whenever the traffic is proxied by linkerd. I’ve observed this behavior on stable-2.9.1 with port 5432 not skipped. I also tried the edge-21.1.1 release with port 5432 marked opaque, and saw the same behavior there. In both releases, adding port 5432 to the skip ports allows database dumps to complete smoothly.
The amount of data that gets dumped before the hang is inconsistent: sometimes I get 100 MB out of `pg_dump` before it stalls, other times only a megabyte or two.
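For reference, the opaque-port experiment on edge-21.1.1 was set up along these lines. This is a sketch rather than my exact manifest, and the deployment name is a placeholder; the annotation goes on the pod template so the injected proxy treats 5432 as an opaque TCP port:

```sh
# Placeholder deployment name; patching the pod template rolls the pods so the
# proxy picks up the opaque-ports setting.
kubectl patch deployment <postgres-deployment> --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/opaque-ports":"5432"}}}}}'
```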
How can it be reproduced?
I can reproduce the behavior in my environment, with this sample database: https://github.com/anthonydb/practical-sql/tree/master/Chapter_11
If I load that CSV into postgres using the steps in the Chapter_11.sql file, then exec into another pod and try to dump the data, I see the hang. In my case, I created a database called linkerd_debug to hold the sample table, and dump it like this: `/usr/bin/pg_dump -U postgres --host postgres linkerd_debug` (`postgres` is the name of the k8s service where my postgres database pod lives).
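Condensed into commands, the repro looks roughly like this (pod names are placeholders from my environment, and the table-loading statements come from Chapter_11.sql in the practical-sql repo):

```sh
# Create a scratch database and load the Chapter_11 sample table into it,
# following the CREATE TABLE / \copy steps from Chapter_11.sql.
kubectl exec -it <postgres-pod> -- psql -U postgres -c 'CREATE DATABASE linkerd_debug;'

# From another meshed pod, dump the database over the k8s service; with the
# linkerd proxy in the path this hangs partway through.
kubectl exec <webapp-pod> -- \
  /usr/bin/pg_dump -U postgres --host postgres linkerd_debug > /tmp/linkerd_debug.sql
```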
Logs, error output, etc
I’ve uploaded some tcpdump output (filtered on port 5432) of a dump that’s hanging: https://gist.github.com/sjahl/52ae9e3fff78213b2697e96e382e2b6b
In the gist, you can see that packets abruptly stop at 09:29:37.235702, at which point I waited about a minute and then ctrl-C’d the dump.
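If anyone wants to collect a similar trace, something along these lines should work. This is an illustrative capture command rather than my exact invocation; linkerd’s debug sidecar is one way to get tcpdump into a meshed pod:

```sh
# Assumes the pod was injected with `linkerd inject --enable-debug-sidecar`,
# which adds a `linkerd-debug` container with tcpdump available.
kubectl exec -it <webapp-pod> -c linkerd-debug -- tcpdump -i any -nn port 5432
```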
`linkerd check` output
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running
linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
is running version 2.9.1 but the latest stable version is 2.9.2
see https://linkerd.io/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 2.9.1 but the latest stable version is 2.9.2
see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match
linkerd-prometheus
------------------
√ prometheus add-on service account exists
√ prometheus add-on config map exists
√ prometheus pod is running
linkerd-grafana
---------------
√ grafana add-on service account exists
√ grafana add-on config map exists
√ grafana pod is running
Status check results are √
Environment
- Kubernetes Version:
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.15-gke.6000", GitCommit:"b02f5ea6726390a4b19d06fa9022981750af2bbc", GitTreeState:"clean", BuildDate:"2020-11-18T09:16:22Z", GoVersion:"go1.13.15b4", Compiler:"gc", Platform:"linux/amd64"}
- Cluster Environment: GKE
- Host OS: Google Container-Optimized OS with Docker (cos)
- Linkerd version: stable-2.9.1 and edge-21.1.1
Possible solution
Adding port 5432 to linkerd’s skip-inbound/skip-outbound ports works around the issue for now.
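Concretely, the workaround looks roughly like this (deployment names are placeholders for the actual web app and postgres workloads; the annotations go on the pod templates so the proxy-injector picks them up):

```sh
# Client side: bypass the proxy for outbound connections to 5432.
kubectl patch deployment <webapp> --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/skip-outbound-ports":"5432"}}}}}'

# Server side: bypass the proxy for inbound connections on 5432.
kubectl patch deployment <postgres> --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/skip-inbound-ports":"5432"}}}}}'
```

Patching the pod template triggers a rollout, so the skip rules take effect once the pods restart.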
Additional context
Slack discussion here: https://linkerd.slack.com/archives/C89RTCWJF/p1610480917116800
One more anecdote: I’m able to dump very small databases (and I suspect this is why normal queries in my app are working). Anything more than a couple of megabytes seems to trigger this hang.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 17 (9 by maintainers)
👍 Thanks for looking into this, I really appreciate it. No worries on the 2.10 release, I think keeping port 5432 skipped is reasonable for our project in the short/medium term.
I happened to have some Azure credits, so I ran this test there too, using their default ‘Kubenet’ network plugin, and saw some notably different behavior: the initial load.sh succeeds! But if I run it multiple times, or try other larger queries after the data is loaded, it stalls in a similar fashion to the clusters on GKE (for example, after a successful load.sh, both a pg_dump and a `select * from nyc_yellow_taxi_trips_2016_06_01 limit 5000;` seem to get hung up for me).
Logs from the successful load and a subsequent stalled `select * from nyc_yellow_taxi_trips_2016_06_01 limit 5000;` query are at: https://gist.github.com/sjahl/a26f07202efb8e026b62a15bee6e78ed
Happy hunting! 😃
This is fixed in edge-21.3.2. Closing for now, but please let us know if you see anything unexpected. Again, thanks for the help tracking this down.
OK, I set up a testing cluster and used the repro scripts from the gist, and it looks like I’m still seeing hangs with edge-21.2.2 😕
Behavior I observed:
- `config.linkerd.io/opaque-ports: "5432"`: hangs
- `config.linkerd.io/skip-inbound-ports: "5432"` and `config.linkerd.io/skip-outbound-ports: "5432"`: pg_dump completes successfully
For these tests I was running pg_dump to stdout, but running the `psql -U linkerd -h postgres < /tmp/data.sql` load as described results in the same hangs.
Linkerd debug logs for each case are at: https://gist.github.com/sjahl/1f664f2c8566b9485ebe2119eecd8479
I’ll see if I can find some time today to try the repro scripts on edge-21.2.2!