sqlx: 0.6.0: Pool::close does not completely close when awaited

I have a somewhat hackish setup that creates a test database for every integration test: on struct creation it creates the database along with a connection pool, and on drop it closes the pool before deleting the database. (Side note: if someone has a better way of doing this, I am definitely interested.)

Since upgrading to v0.6.0, dropping the database fails because there are still open connections:

Could not drop test db because of error returned from database: database "slicendice-integration-test-b61901b5-b86a-4e9b-820b-a8c81bf3cc8" is being accessed by other users

It only seems to happen on macOS (I have not tested locally on Linux, but here is a CI run demonstrating the problem).

I have not been able to create a small example that demonstrates the problem; the following is my attempt, which does not exhibit it.

Simple example

use std::env;

use anyhow::{Context, Result};
use sqlx::Connection;
use sqlx::{PgConnection, PgPool};
use tide::http::Url;
use uuid::Uuid;

async fn use_db(pool: PgPool) -> Result<()> {
    sqlx::query(
        "CREATE TABLE things (
          id SERIAL PRIMARY KEY,
          name TEXT NOT NULL
        )",
    )
    .execute(&pool)
    .await?;

    sqlx::query(
        "INSERT INTO things ( name )
         VALUES ( 'A name' )",
    )
    .execute(&pool)
    .await?;

    Ok(())
}

#[async_std::main]
async fn main() -> Result<()> {
    let db_url_string =
        env::var("DATABASE_URL").context("DATABASE_URL env variable needs to be set")?;

    // Create DB
    let mut db_url = Url::parse(&db_url_string)?;
    let id: Uuid = Uuid::new_v4();
    let mut db_name = String::from("slicendice-integration-test-");
    db_name.push_str(&id.to_string());

    db_url.set_path("/");
    let mut conn = PgConnection::connect(&db_url.to_string()).await?;

    sqlx::query(&format!(r#"CREATE DATABASE "{}""#, &db_name))
        .execute(&mut conn)
        .await?;

    // Use DB
    let mut db_path = String::from("/");
    db_path.push_str(&db_name);
    db_url.set_path(&db_path);
    let pool = PgPool::connect(&db_url.to_string()).await?;
    use_db(pool.clone()).await?;

    // Drop DB
    pool.close().await;

    dbg!(pool.size());
    dbg!(pool.num_idle());

    let mut db_url = db_url.clone();
    db_url.set_path("/");

    let mut conn = PgConnection::connect(&db_url.to_string()).await?;
    // Using format here because bind doesn't work for DROP DATABASE
    sqlx::query(&format!(r#"DROP DATABASE "{}""#, &db_name))
        .execute(&mut conn)
        .await?;

    Ok(())
}

My guess would be that because I am passing the pool to tide’s context, it is being held longer than in my simple example: when I create a test that only builds the scaffold and makes a call to the database without creating a tide application, the problem does not occur.

I added a call to Pool::size after an awaited Pool::close: in v0.5 it is consistently 0, while in v0.6 it is always 1.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 16

Most upvoted comments

But this got me curious and the internal impl of PoolConnection makes use of the sqlx-rt crate to fetch a statically stored common runtime for AsyncDrop purposes.

No, it does not. When the runtime-tokio-* feature is enabled, it references the Tokio runtime set for the current thread the code is executing on (sqlx_rt::Handle is a re-export of tokio::runtime::Handle). That static RUNTIME is for use in sqlx-macros only, as the comment immediately above it implies. It doesn’t really need to live in sqlx-rt for that reason, which is why I’ve moved it in my current refactoring effort (#2039).

I guess this means that pool connections are designed to be transferable across runtimes from the same library and this is indeed a bug.

They are not, and Tokio actually does not like it when you move TcpStreams from one runtime to another. They continue to work, but only until the TcpStream’s original runtime exits, at which point any attempt to use the stream panics.

async-std only ever has one runtime per process so that’s different.

The builtin test runner also has this as a “feature” for debugging’s sake.

That is a deliberate feature, yes. If you have a test failure, it’s useful to go back and look at the contents of the database to help figure out what went wrong. They won’t pile up forever, though: old test databases are cleaned up on startup of any binary using #[sqlx::test] (as stated in the paragraph following the one that describes this feature), so at most you’ll have as many as the number of failed tests in the latest run.

I had intended to add commands to sqlx-cli to manage test databases but I wanted to focus on getting the core #[sqlx::test] functionality working and merged.


All that aside, I’ve realized that there is likely a TOCTOU (time-of-check-to-time-of-use) bug in Pool::close(): it thinks all connections have been released and returns, but some may still be in flight.

Specifically, here: https://github.com/launchbadge/sqlx/blob/main/sqlx-core/src/pool/inner.rs#L257

If .close() is called on another thread between .acquire_permit() and .try_increment_size() while the size is zero, .close() will exit because it sees the size is zero and (incorrectly) assumes that no more connections can ever be added to the pool.

It used to acquire self.options.max_connections permits right away, which precluded any concurrent connection attempts, but that had to change for the #[sqlx::test] functionality: it allows pools to have a parent/child relationship where the child steals semaphore permits from the parent as needed, in order to enforce a global connection limit across all running tests. Simply put, there may no longer be max_connections permits on the semaphore, and stealing them from the parent pool just so the child pool can close would be wasteful.

I think the fix here would be to change .try_increment_size() to immediately return Err(permit) if the pool is closed, and then the logic in acquire() will loop back to the top and then return Err(Error::PoolClosed) on the next call to .acquire_permit().

That should eliminate the possibility of size being zero and thus .close() returning while connection attempts can still be pending.