node-postgres: "timeout exceeded when trying to connect" spike after upgrading to 8.2.1
Yesterday we upgraded from pg@8.0.3 (with pg-pool@3.1.1) to pg@8.2.1 (with pg-pool@3.2.1). This is the relevant entry from our package-lock.json:
"pg": {
"version": "8.2.1",
"resolved": "https://registry.npmjs.org/pg/-/pg-8.2.1.tgz",
"integrity": "sha512-DKzffhpkWRr9jx7vKxA+ur79KG+SKw+PdjMb1IRhMiKI9zqYUGczwFprqy+5Veh/DCcFs1Y6V8lRLN5I1DlleQ==",
"requires": {
"buffer-writer": "2.0.0",
"packet-reader": "1.0.0",
"pg-connection-string": "^2.2.3",
"pg-pool": "^3.2.1",
"pg-protocol": "^1.2.4",
"pg-types": "^2.1.0",
"pgpass": "1.x",
"semver": "4.3.2"
},
"dependencies": {
"pg-connection-string": {
"version": "2.2.3",
"resolved": "https://registry.npmjs.org/pg-connection-string/-/pg-connection-string-2.2.3.tgz",
"integrity": "sha512-I/KCSQGmOrZx6sMHXkOs2MjddrYcqpza3Dtsy0AjIgBr/bZiPJRK9WhABXN1Uy1UDazRbi9gZEzO2sAhL5EqiQ=="
},
"semver": {
"version": "4.3.2",
"resolved": "https://registry.npmjs.org/semver/-/semver-4.3.2.tgz",
"integrity": "sha1-x6BxWKgL7dBSNVt3DYLWZA+AO+c="
}
}
}
We started seeing a spike in "timeout exceeded when trying to connect" errors with this stack trace:
Error: timeout exceeded when trying to connect
at Timeout._onTimeout (/usr/src/app/node_modules/pg-pool/index.js:188:27)
at listOnTimeout (internal/timers.js:549:17)
at processTimers (internal/timers.js:492:7)
This is a pretty basic Express app, running on Node 12, with a Postgres 12 backend.
We report metrics on the connection pool max/total/idle/waiting counts, and there is an obvious spike in the waiting count starting when the 8.2.1 upgrade was deployed (around 9am CT yesterday), followed by a drop when we reverted that change (about 6am CT today).
That corresponds with our API request/response/error rates (again, just a simple Express app over a Postgres DB).
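For reference, the pool gauges are reported periodically along these lines (a rough sketch; `metrics.gauge` stands in for the actual stats client, and `pool` is the shared Pool instance constructed as shown further down):

```js
// Rough sketch of the pool gauge reporting; `metrics` is a placeholder for
// the real stats client and the 10s interval is illustrative.
const MAX_POOL_SIZE = 150;

setInterval(() => {
  metrics.gauge('pg.pool.max', MAX_POOL_SIZE);
  metrics.gauge('pg.pool.total', pool.totalCount);     // idle + checked-out clients
  metrics.gauge('pg.pool.idle', pool.idleCount);       // clients sitting idle in the pool
  metrics.gauge('pg.pool.waiting', pool.waitingCount); // callers queued for a client
}, 10000);
```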
We’re not sure how to debug this. These are the relevant Pool config values we’re using (a sketch of the pool construction follows the list):
- connectionTimeoutMillis = 60000
- idleTimeoutMillis = 60000
- max = 150
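That sketch, with the connection-string env var as a placeholder rather than our actual config:

```js
const { Pool } = require('pg');

// Sketch only; the env var name is a placeholder, the numbers mirror the list above.
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  connectionTimeoutMillis: 60000, // "timeout exceeded when trying to connect" fires after this
  idleTimeoutMillis: 60000,       // idle clients are closed after 60s
  max: 150,                       // per-process cap on open connections
});

module.exports = pool;
```

As far as we can tell, the timeout in the stack trace above fires whenever the pool can't hand out a client within that 60-second connectionTimeoutMillis window.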
We have a staging environment where this showed up as well, but we didn’t have an alert set up for it (we do now). If there is something we can do to help debug this and report information back, we can probably do that in our staging environment.
About this issue
- State: open
- Created 4 years ago
- Reactions: 6
- Comments: 43 (23 by maintainers)
Having the same issue on 8.8.
I would have to dig into this. In our production cluster we have the app running in 20 replicas, and each has a pool configured for 150 connections. Our readiness probe is set to hit an API which runs a `select now();` query to make sure that pod’s connection pool is OK, because if it’s full we want that pod to go out of rotation for traffic until it can drain its connection requests. The pod will only crash and restart automatically if an uncaught error slips through.
I think I can say that when it hits we get a bunch of timeouts, which would probably explain why the waiting count per pod (in the graph in the issue description) spikes: presumably something is blocking in the pool, so other requests wait until a timeout occurs.
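Roughly, that readiness endpoint looks like the following (simplified sketch; the route path and response bodies are illustrative, with `app` being the Express app and `pool` the shared pg Pool):

```js
// Simplified readiness probe: if the pool can't serve a trivial query
// (e.g. it's saturated and the checkout times out), report not-ready so
// the pod is pulled out of rotation.
app.get('/ready', async (req, res) => {
  try {
    await pool.query('select now();');
    res.status(200).send('ok');
  } catch (err) {
    res.status(503).send('not ready');
  }
});
```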
I think this issue is definitely still a thing; we’re seeing the occasional "timeout exceeded when trying to connect" error under load in production when the pool is seemingly not under much stress and still pretty far from its max limit. Here’s an example from our tracing system where we’re counting the number of times we’re connecting with `pool.connect`, and each time we connect we log the current pool stats (`idleCount`, `totalCount`, `waitingCount`) at the same time. Our `max` is 75, so the pool should definitely be able to allocate a new connection.
My first suspicion for a strange timeout like this would be Node event loop blocking, where the process is stuck doing something for the 5s, but the event loop utilization never goes above 0.3.
So, I am inclined to think there’s some very subtle bug in the pool logic that causes the timeout to be hit under load despite the pool not being full. It works almost all the time for us, but under load the issue occasionally shows up.
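For reference, the instrumentation around `pool.connect` described above is roughly the following (a sketch; in the real code these values are attached to a trace span rather than logged):

```js
// Sketch of the per-connect instrumentation; in the real app these numbers
// go to the tracing system instead of console.log.
async function connectWithStats(pool) {
  console.log('pg pool stats at connect time', {
    idleCount: pool.idleCount,       // clients sitting idle in the pool
    totalCount: pool.totalCount,     // idle + checked-out clients
    waitingCount: pool.waitingCount, // callers queued for a client
  });
  const client = await pool.connect(); // occasionally rejects with the timeout error even when totalCount < max
  return client; // caller must call client.release() when done
}
```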
Hi @mriedem, I hit the same problem today. Did you find a solution?