go-dqlite: sql.DB instances hang forever when database is unreachable
Juju has occasion to change the bind address of the initial Dqlite node upon entering HA.
When this happens, existing connections keep trying to connect, with our logs repeatedly showing output like this:
2023-03-09 08:57:14 WARNING juju.worker.dbaccessor connector.go:77 attempt 2507: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
2023-03-09 08:57:14 WARNING juju.worker.dbaccessor connector.go:77 attempt 2507: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
2023-03-09 08:57:15 WARNING juju.worker.dbaccessor connector.go:77 attempt 2508: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
2023-03-09 08:57:15 WARNING juju.worker.dbaccessor connector.go:77 attempt 2508: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
An existing call to Exec is locked up at this point:
# 0x46a8d4 time.Sleep+0x134 /usr/local/go/src/runtime/time.go:195
# 0x1c14f28 github.com/canonical/go-dqlite/internal/protocol.makeRetryStrategies.func1+0x48 /home/joseph/go/pkg/mod/github.com/canonical/go-dqlite@v1.11.5/internal/protocol/connector.go:321
# 0x1c12799 github.com/Rican7/retry.shouldAttempt+0x39 /home/joseph/go/pkg/mod/github.com/!rican7/retry@v0.3.1/retry.go:32
# 0x1c126f6 github.com/Rican7/retry.Retry+0x96 /home/joseph/go/pkg/mod/github.com/!rican7/retry@v0.3.1/retry.go:19
# 0x1c131c4 github.com/canonical/go-dqlite/internal/protocol.(*Connector).Connect+0xa4 /home/joseph/go/pkg/mod/github.com/canonical/go-dqlite@v1.11.5/internal/protocol/connector.go:74
# 0x1c42628 github.com/canonical/go-dqlite/driver.(*Connector).Connect+0x268 /home/joseph/go/pkg/mod/github.com/canonical/go-dqlite@v1.11.5/driver/driver.go:261
# 0x1c05d22 database/sql.(*DB).conn+0x762 /usr/local/go/src/database/sql/sql.go:1393
# 0x1c0ce52 database/sql.(*Stmt).connStmt+0x252 /usr/local/go/src/database/sql/sql.go:2724
# 0x1c0c364 database/sql.(*Stmt).ExecContext+0x184 /usr/local/go/src/database/sql/sql.go:2628
# 0x46e7204 database/sql.(*Stmt).Exec+0x44 /usr/local/go/src/database/sql/sql.go:2651
This is problematic given Juju’s architecture, because it can block our worker loops from ever shutting down.
Particularly bad is that the same thing happens with `db.Ping`, which we use for health checks.
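For context, the only lever callers have today is to put a deadline on the call themselves; a minimal sketch (the function name and the 5-second timeout are illustrative, not what Juju ships), and it only helps if the driver's connect loop actually honours context cancellation, which is what the discussion below sets out to verify:

```go
package health

import (
	"context"
	"database/sql"
	"time"
)

// pingWithTimeout bounds the health check with a deadline so a worker loop
// can still shut down if Dqlite is unreachable. The 5-second timeout is
// illustrative only.
func pingWithTimeout(db *sql.DB) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return db.PingContext(ctx)
}
```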
About this issue
- State: open
- Created a year ago
- Comments: 20 (19 by maintainers)
First thing to check on the go-dqlite side would be whether cancelling a context actually (indirectly) interrupts that loop. Roughly something like (see the sketch after this list):
- start some `dqlite.Node`s
- call `dqlite.Node.Close()` against all nodes, but leave the `db` object open
- call `db.ExecContext("...", ctx)`
- cancel `ctx` and check that it stops the connect loop that the `go/sql` code should have triggered by calling `protocol.Connector.Connect()` in order to get a working connection

Sounds like a plan to me.
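A rough shape for that check, written as a test helper; it assumes the surrounding test has already done the node setup and close steps from the list above, and the helper name and SQL statement are placeholders:

```go
package dqlitetest

import (
	"context"
	"database/sql"
	"testing"
	"time"
)

// checkCancelInterruptsConnectLoop assumes the caller has already opened db
// through the dqlite driver and then called dqlite.Node.Close() on every
// node while leaving db open, so the next statement forces the driver into
// protocol.Connector.Connect's retry loop. It verifies that cancelling the
// context unblocks the in-flight ExecContext rather than leaving it stuck.
func checkCancelInterruptsConnectLoop(t *testing.T, db *sql.DB) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	done := make(chan error, 1)
	go func() {
		_, err := db.ExecContext(ctx, "CREATE TABLE t (n INT)")
		done <- err
	}()

	// Give the driver time to enter the connect retry loop, then cancel.
	time.Sleep(time.Second)
	cancel()

	select {
	case err := <-done:
		t.Logf("ExecContext returned after cancel: %v", err)
	case <-time.After(10 * time.Second):
		t.Fatal("ExecContext was not interrupted by context cancellation")
	}
}
```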
Maybe the non-tx-related functions, such as plain `Exec`, `Ping` etc., could have a default timeout added, and functions that accept a context, like `ExecContext`, `BeginTx` etc., could add one if the context supplied doesn’t have a deadline? That way folks won’t find themselves blocked indefinitely, but the driver also never overrides an externally requested timeout period.
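A sketch of what that could look like inside the driver's context-accepting entry points; `defaultTimeout` and `withDefaultDeadline` are hypothetical names, not existing go-dqlite API:

```go
package driver

import (
	"context"
	"time"
)

// defaultTimeout is a hypothetical driver-wide fallback, applied only when
// the caller's context carries no deadline of its own.
const defaultTimeout = 30 * time.Second

// withDefaultDeadline returns ctx unchanged if the caller already set a
// deadline, otherwise a child context bound to defaultTimeout. The returned
// cancel func must always be called so the timer is released.
func withDefaultDeadline(ctx context.Context) (context.Context, context.CancelFunc) {
	if _, ok := ctx.Deadline(); ok {
		return ctx, func() {}
	}
	return context.WithTimeout(ctx, defaultTimeout)
}
```

Plain `Exec`/`Ping` could then route through the same wrapper with `context.Background()`.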
@tomponline The problem is that dqlite hangs forever when the database is unreachable. If this happens during an `Exec` or `ExecContext` then it will block. The problem I’m showing is that this is still a possibility because the driver does a `COMMIT` (there are many more) with a context that isn’t bound to a timeout, so it can hang.

Except the context isn’t available via the `*Tx`, and the conn does have a context, except it’s deprecated, so we shouldn’t attempt to use it.

We do the same thing in Juju.
https://github.com/juju/juju/blob/09db847e363a757089dcb569158d7ea0dc3a3a7a/database/txn/transaction.go#L172-L173
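A sketch of that pattern, binding the transaction's context to a timeout before it ever reaches the driver; the names and the 30-second value are illustrative, not the linked Juju code verbatim:

```go
package txn

import (
	"context"
	"database/sql"
	"time"
)

// runTxn binds the supplied context to a timeout before starting the
// transaction, so that a dqlite connect loop entered while the transaction
// is being driven cannot block the caller indefinitely.
func runTxn(ctx context.Context, db *sql.DB, fn func(context.Context, *sql.Tx) error) error {
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	if err := fn(ctx, tx); err != nil {
		_ = tx.Rollback()
		return err
	}
	return tx.Commit()
}
```

The caveat from the comment above still applies: the `COMMIT` issued inside the driver is not itself bound to this context, so this only narrows the window rather than closing it.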