pgx: handling dead connections and broken pipes
Based on our experience with this package, we have come up with what I consider a workaround for handling certain errors it returns. In our apps we use the ConnPool and therefore have to handle dead connections. More recently, we found that the underlying connection can report a broken pipe or EOF. We think these errors are recoverable by the app by re-acquiring a new connection.
The function below is what I currently use in all Dao functions, followed by an example.
func doWithConnection(pool PgxConnPool, maxRetries int, work func(conn *pgx.Conn) error) error {
	conn, err := pool.Acquire()
	if err == pgx.ErrDeadConn {
		if maxRetries > 0 {
			glog.Debugf("%v, try acquiring a non-dead connection", err)
			return doWithConnection(pool, maxRetries-1, work)
		}
	}
	// abort if err is still set
	if err != nil {
		return err
	}
	defer pool.Release(conn)
	// use the connection
	err = work(conn)
	// all ok?
	if err == nil {
		return nil
	}
	// if not, can we retry?
	canRetry := strings.HasSuffix(err.Error(), "broken pipe") ||
		strings.HasSuffix(err.Error(), "EOF")
	if canRetry && maxRetries > 0 {
		glog.Debugf("failed to do work with connection because:[%v], retry:%d", err, maxRetries)
		return doWithConnection(pool, maxRetries-1, work)
	}
	// give up
	return err
}
Example:
func (d PgPropertyDao) DeleteKey(del PropertyAccess) error {
	return doWithConnection(d.pool, 10, func(conn *pgx.Conn) error {
		_, err := conn.Exec("select * from delete_key_api($1,$2,$3,$4,$5);",
			del.Label,
			del.Version,
			del.Key,
			del.Who,
			del.Why)
		if err != nil {
			glog.Errorf("DeleteKey -> unable to delete_key_api(%#v):%v", del, err)
		}
		return err
	})
}
The question is whether the pgx package should handle this internally and, if not, whether this is the correct way to do so.
Thanks.
About this issue
- Original URL
- State: closed
- Created 9 years ago
- Comments: 19 (14 by maintainers)
Commits related to this issue
- Fix Timestamptz.DecodeText with too short text fixes #74 — committed to jackc/pgx by jackc 4 years ago
I just wanted to add that I ran into dead connections today within a Lambda function connecting to RDS. It’s the first time it’s happened in months of operation. This is definitely not a problem with this library, just something we need to make sure we handle on our end.
This is only a problem if you connect outside of the lambda handler, if you connect and disconnect within each run you wouldn’t run into this. I’m also just connecting to the database directly, rather than acquiring a connection from the pool. This might be a mistake but it seemed pointless to do pooling inside lambdas.
The way this surfaced itself was in the form of 2 errors:
If anyone knows the best way to handle this situation please let me know!
No. The only retry logic in pgx is Begin() through a connection pool, as that is always safe.
Would it be reasonable to change the behaviour of Acquire so that it will try to find a live connection?
Currently, if you have a connection pool containing a number of connections which are then disconnected (say, because the Postgres server went away), they will remain in the pool and be returned to users via Acquire. Users then have to check conn.IsAlive, call pool.Release (which de-allocates the dead connection), and continue acquiring connections this way until a new connection is established with the backend.
It seems to me it would be both safe and reasonable for the connection pool to encapsulate this behaviour: attempt to exhaust the pool for a live connection before trying to establish a new one, and return an error if that fails. The pool would thus only ever return connections that were live at the time of acquisition.
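The proposed Acquire behaviour can be sketched as a loop that drains dead connections until it finds a live one. The code below is a self-contained illustration with stub Pool and Conn types standing in for pgx's real *pgx.ConnPool and *pgx.Conn; acquireLive is a hypothetical helper name, not part of pgx.

```go
package main

import (
	"errors"
	"fmt"
)

// Stub connection; the real pgx.Conn exposes IsAlive the same way.
type Conn struct{ alive bool }

func (c *Conn) IsAlive() bool { return c.alive }

// Stub pool: Acquire hands out queued connections, live or dead.
type Pool struct{ conns []*Conn }

func (p *Pool) Acquire() (*Conn, error) {
	if len(p.conns) == 0 {
		return nil, errors.New("pool exhausted")
	}
	c := p.conns[0]
	p.conns = p.conns[1:]
	return c, nil
}

// Release discards the connection; in real pgx a dead connection
// released back to the pool is de-allocated.
func (p *Pool) Release(c *Conn) {}

// acquireLive drains dead connections from the pool until it finds a
// live one, as proposed above, instead of making every caller do it.
func acquireLive(p *Pool) (*Conn, error) {
	for {
		c, err := p.Acquire()
		if err != nil {
			return nil, err
		}
		if c.IsAlive() {
			return c, nil
		}
		p.Release(c) // dead: discard and keep looking
	}
}

func main() {
	p := &Pool{conns: []*Conn{{alive: false}, {alive: false}, {alive: true}}}
	c, err := acquireLive(p)
	fmt.Println(err == nil, c.IsAlive()) // true true
}
```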
I might put together a pull request to better illustrate the behaviour.
This occurrence is very common if you connect to GCE Cloud SQL: Cloud SQL will terminate idle connections after a few minutes. In other SQL drivers you can set the idle connection count to 0 — not a great solution. I am wondering whether pgx has this idle-connection parameter for connection pooling as well.
Are you connecting over a very unstable network? Lost connections should be a very rare occurrence under normal circumstances.
Especially this code below:
pool.Acquire should only return that error if it needed to establish a new connection and it failed – which assuming it previously was able to connect would likely indicate a network outage. If it’s not giving you a good connection for any other reason then that is potentially an area for improvement in pgx.
As far as the actual retry logic goes, it absolutely needs to wrap the call to work in a transaction. Otherwise, if work failed in the middle of executing multiple commands, a retry would repeat already-executed commands.
I don’t think pgx should handle retry logic internally. There are many potential error cases that need careful consideration. A generalized retry framework that worked as expected in the majority of situations would be difficult to get right. For most users, lost connections that can immediately be recovered from by reconnecting are rare enough that it is better to have the operation fail than to attempt an automatic redo and potentially cause unexpected side effects.