go-dqlite: sql.DB instances hang forever when database is unreachable

Juju has occasion to change the bind address of the initial Dqlite node upon entering HA.

When this happens, existing connections keep trying to connect, and our logs show repeated output like this:

2023-03-09 08:57:14 WARNING juju.worker.dbaccessor connector.go:77 attempt 2507: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
2023-03-09 08:57:14 WARNING juju.worker.dbaccessor connector.go:77 attempt 2507: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
2023-03-09 08:57:15 WARNING juju.worker.dbaccessor connector.go:77 attempt 2508: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
2023-03-09 08:57:15 WARNING juju.worker.dbaccessor connector.go:77 attempt 2508: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused

An existing call to Exec is locked up at this point:

#       0x46a8d4        time.Sleep+0x134                                                                /usr/local/go/src/runtime/time.go:195
#       0x1c14f28       github.com/canonical/go-dqlite/internal/protocol.makeRetryStrategies.func1+0x48 /home/joseph/go/pkg/mod/github.com/canonical/go-dqlite@v1.11.5/internal/protocol/connector.go:321
#       0x1c12799       github.com/Rican7/retry.shouldAttempt+0x39                                      /home/joseph/go/pkg/mod/github.com/!rican7/retry@v0.3.1/retry.go:32
#       0x1c126f6       github.com/Rican7/retry.Retry+0x96                                              /home/joseph/go/pkg/mod/github.com/!rican7/retry@v0.3.1/retry.go:19
#       0x1c131c4       github.com/canonical/go-dqlite/internal/protocol.(*Connector).Connect+0xa4      /home/joseph/go/pkg/mod/github.com/canonical/go-dqlite@v1.11.5/internal/protocol/connector.go:74
#       0x1c42628       github.com/canonical/go-dqlite/driver.(*Connector).Connect+0x268                /home/joseph/go/pkg/mod/github.com/canonical/go-dqlite@v1.11.5/driver/driver.go:261
#       0x1c05d22       database/sql.(*DB).conn+0x762                                                   /usr/local/go/src/database/sql/sql.go:1393
#       0x1c0ce52       database/sql.(*Stmt).connStmt+0x252                                             /usr/local/go/src/database/sql/sql.go:2724
#       0x1c0c364       database/sql.(*Stmt).ExecContext+0x184                                          /usr/local/go/src/database/sql/sql.go:2628
#       0x46e7204       database/sql.(*Stmt).Exec+0x44                                                  /usr/local/go/src/database/sql/sql.go:2651

This is problematic given Juju’s architecture, because our worker loops can be blocked from actually shutting down.

Particularly bad is that this happens with db.Ping, which we use for health checks.

Most upvoted comments

First thing to check on the go-dqlite side would be whether cancelling a context actually interrupts that loop, even if indirectly. Roughly something like this (a minimal sketch follows the list):

  • Run some successful queries against an in-memory cluster of dqlite.Node instances
  • Stop the dqlite cluster by calling dqlite.Node.Close() against all nodes, but leave the db object open
  • Spawn a goroutine that calls db.ExecContext(ctx, "...")
  • Cancel the ctx and check that this stops the connect loop that the database/sql code should have triggered by calling protocol.Connector.Connect() in order to get a working connection
  • The goroutine should terminate immediately
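
A minimal sketch of that check at the database/sql level; the setup helpers below are hypothetical placeholders, not go-dqlite API:

package dqlitetest

import (
	"context"
	"testing"
	"time"
)

// TestCancelUnblocksExec checks that cancelling the context unblocks an
// ExecContext call that is stuck in the driver's connect/retry loop.
func TestCancelUnblocksExec(t *testing.T) {
	db := openClusterDB(t) // hypothetical helper: in-memory dqlite cluster plus an open *sql.DB
	stopAllNodes(t)        // hypothetical helper: dqlite.Node.Close() on every node

	ctx, cancel := context.WithCancel(context.Background())

	done := make(chan error, 1)
	go func() {
		// With no node reachable, this should block inside the
		// protocol.Connector.Connect retry loop.
		_, err := db.ExecContext(ctx, "INSERT INTO t (n) VALUES (1)")
		done <- err
	}()

	// Give the goroutine time to enter the retry loop, then cancel.
	time.Sleep(100 * time.Millisecond)
	cancel()

	select {
	case err := <-done:
		if err == nil {
			t.Fatal("expected an error after cancellation")
		}
		// The exact error wrapping depends on the driver; the key
		// assertion is that the call returned promptly at all.
	case <-time.After(5 * time.Second):
		t.Fatal("ExecContext did not return after the context was cancelled")
	}
}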

> That way folks won’t find themselves blocked indefinitely, but also the driver is never overriding an externally requested timeout period.

Sounds like a plan to me.

Maybe the non-tx-related functions, such as plain Exec, Ping, etc., could have a default timeout added, and functions that accept a context, like ExecContext, BeginTx, etc., could add one if the context supplied doesn’t have a deadline?

That way folks won’t find themselves blocked indefinitely, but also the driver is never overriding an externally requested timeout period.

@tomponline The problem is that dqlite hangs forever when the database is unreachable. If this happens during an Exec or ExecContext, then it will block. The problem I’m showing is that this is still a possibility, because the driver itself issues a COMMIT (and there are many more such statements) with a context that isn’t bound to a timeout, so it can hang.

You can do something like this to set a default timeout if not specified by the passed context:
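
Roughly, that could be a small helper along these lines (a sketch; withDefaultTimeout and the 30-second value are illustrative, not anything in the driver today):

package dqlitedriver

import (
	"context"
	"time"
)

// withDefaultTimeout bounds ctx with a fallback timeout when the caller
// supplied no deadline of its own; a context that already carries a deadline
// is returned unchanged, so an externally requested timeout is never overridden.
func withDefaultTimeout(ctx context.Context, fallback time.Duration) (context.Context, context.CancelFunc) {
	if _, ok := ctx.Deadline(); ok {
		return ctx, func() {}
	}
	return context.WithTimeout(ctx, fallback)
}

A method that receives the caller’s context could then wrap it before entering the connect loop, e.g. ctx, cancel := withDefaultTimeout(ctx, 30*time.Second); defer cancel().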

Except the context isn’t available via the *Tx, and while the conn does have a context, it’s deprecated, so we shouldn’t attempt to use it.

FWIW LXD sets a default timeout on the transaction rather than each query:

We do the same thing in Juju.

https://github.com/juju/juju/blob/09db847e363a757089dcb569158d7ea0dc3a3a7a/database/txn/transaction.go#L172-L173
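
For reference, the shape of that pattern is roughly the following (a sketch; runTxn and the 30-second default are illustrative, not the LXD or Juju code):

package txn

import (
	"context"
	"database/sql"
	"time"
)

// runTxn bounds the whole transaction, rather than each statement, with a
// default timeout, so a hung connect/retry loop cannot block the caller forever.
func runTxn(ctx context.Context, db *sql.DB, fn func(context.Context, *sql.Tx) error) error {
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second) // illustrative default
	defer cancel()

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	if err := fn(ctx, tx); err != nil {
		_ = tx.Rollback()
		return err
	}
	return tx.Commit()
}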