cloud-sql-proxy: pgx query cancel not working with proxy

Bug Description

When a query is canceled, the cancel request from pgx is sent but not properly received/processed in the Cloud SQL server.

  • This happens when I use pgx driver with this proxy, not sure about other drivers
  • Without the proxy (e.g using IP), then the cancel request works normally

Example code (or command)

I wrote a piece of code using pgconn (the low level api in pgx), but the same issue happened with pgxpool.Pool.

Demo code
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"os/signal"
	"time"

	"github.com/jackc/pgconn"
	"github.com/pkg/errors"
)

var (
	dbuser     = "postgres"
	dbpassword = "postgres"
	dbhost     = "localhost"
	dbport     = "5432"
	dbname     = "tuananh"
)

func init() {
	if os.Getenv("PGUSER") != "" {
		dbuser = os.Getenv("PGUSER")
	}
	if os.Getenv("PGPASSWORD") != "" {
		dbpassword = os.Getenv("PGPASSWORD")
	}
	if os.Getenv("PGHOST") != "" {
		dbhost = os.Getenv("PGHOST")
	}
	if os.Getenv("PGPORT") != "" {
		dbport = os.Getenv("PGPORT")
	}
	if os.Getenv("PGDATABASE") != "" {
		dbname = os.Getenv("PGDATABASE")
	}

	log.Printf("db connection string: postgres://%s:%s@%s:%s/%s?sslmode=disable", dbuser, dbpassword, dbhost, dbport, dbname)
}

func main() {
	conn, err := connectDB()
	if err != nil {
		panic(err)
	}

	ctx, cancel := signal.NotifyContext(context.Background(), os.Interrupt)
	defer cancel()

	ctx1, cancel1 := context.WithTimeout(ctx, time.Second*10)
	defer cancel1()
	go func() {
		defer cancel1()
		query(ctx1, conn)
	}()

	time.Sleep(time.Second * 2) // sleep for 2s to prevent cancel running before query
	ctx2, cancel2 := context.WithTimeout(ctx, time.Second*5)
	defer cancel2()
	go func() {
		defer cancel2()
		cancelquery(ctx2, conn)
	}()

	<-ctx1.Done()
	<-ctx2.Done()
	log.Printf("all processes finished. Ctrl+C to exit")
	<-ctx.Done()
}

func connectDB() (*pgconn.PgConn, error) {
	connStr := fmt.Sprintf("postgres://%s:%s@%s:%s/%s?sslmode=disable", dbuser, dbpassword, dbhost, dbport, dbname)
	ctx, cancel := context.WithTimeout(context.Background(), time.Second*5)
	defer cancel()
	conn, err := pgconn.Connect(ctx, connStr)
	if err != nil {
		return nil, errors.Wrap(err, "pgxpool.ConnectConfig")
	}
	return conn, nil
}

// query runs some innocuous SELECT statement
func query(ctx context.Context, conn *pgconn.PgConn) {
	log.Println("[1] start sleeping")
	mrr := conn.Exec(ctx, "SELECT pg_sleep(60);")
	for mrr.NextResult() {
		_, _ = mrr.ResultReader().Close()
	}
	if err := mrr.Close(); err != nil { // mrr.Close() gets the first error while reading result
		log.Printf("[1] query failed: %s", err)
	}

	log.Println("[1] query finished")
}

func cancelquery(ctx context.Context, conn *pgconn.PgConn) {
	log.Println("[2] canceling query")
	err := conn.CancelRequest(ctx)
	if err != nil {
		log.Printf("[2] failed to cancel request: %s", err)
	}
	log.Println("[2] done canceling query")
}

Output

$ PGUSER=myIAMUser PGPORT=5432 PGDATABASE=mydb go run requestcancel/main.go 
2022/10/31 10:33:40 db connection string: postgres://myIAMUser:postgres@localhost:5432/mydb?sslmode=disable
2022/10/31 10:33:41 [1] start sleeping
2022/10/31 10:33:43 [2] canceling query
2022/10/31 10:33:43 [2] done canceling query
2022/10/31 10:33:51 [1] query failed: timeout: context deadline exceeded
2022/10/31 10:33:51 [1] query finished
2022/10/31 10:33:51 all processes finished. Ctrl+C to exit
^C

# but when checking inside the server, the connection is still running
(psql)
mydb=> select pid, state, query from pg_stat_activity where datname='mydb' and usename='myIAMUser';
   pid   | state  |        query         
---------+--------+----------------------
 1582790 | active | SELECT pg_sleep(60);
(1 row)

How to reproduce

  1. Run cloud_sql_proxy
  2. Run the script above to connect & use the proxy (with your appropriate PG* env vars)

Environment

  1. OS type and version: Linux anhpngt-manabie 5.15.0-52-generic #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  2. Cloud SQL Proxy version (./cloud-sql-proxy --version): Cloud SQL Auth proxy: 1.33.0+linux.amd64

I also had a quick test with cloud-sql-go-connector, but it had the same problem.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 4
  • Comments: 25 (16 by maintainers)

Most upvoted comments

Haven’t forgotten about this. The backend team will be fixing this. I’ll update here when there’s an ETA.

Sorry this slipped off my radar. Yes, the issue has been fixed. Thanks for the reminder.

A fix has been merged and will be rolling out soon. I’ll update here when I have more information.

After doing some investigation, here’s what I found:

I wrote a minimal reproduction (below) that starts a long running query, waits for a SIGTERM, and then cancels the query. This is what I did:

  1. I start the minimal reproduction
  2. Confirm the long running query is running in the database with select pid, state, query from pg_stat_activity;
  3. Send a SIGTERM to the minimal reproduction
  4. Check if the query is still running.

For both v1 and v2:

  1. Using standard DB AuthN, the call to cancel the query succeeds.
  2. Using Auto IAM AuthN, the call to cancel the query fails.

Minimal Reproduction:

package main

import (
        "context"
        "fmt"
        "log"
        "os"
        "os/signal"

        "github.com/jackc/pgconn"
)

func main() {
        var (
                dbhost = "localhost"
                dbport = "5432"

                dbuser     = "postgres"
                dbpassword = "postgres"
                dbname     = "postgres"
        )

        // Make a channel listening for SIGINT
        c := make(chan os.Signal, 1)
        signal.Notify(c, os.Interrupt)

        ctx := context.Background()

        dsn := fmt.Sprintf(
                "postgres://%s:%s@%s:%s/%s?sslmode=disable",
                dbuser, dbpassword, dbhost, dbport, dbname,
        )
        conn, err := pgconn.Connect(ctx, dsn)
        if err != nil {
                log.Fatal(err)
        }

        log.Println("starting long running query")

        // Start a long running query
        conn.Exec(ctx, "SELECT pg_sleep(60);")

        <-c
        log.Println("SIGINT received; canceling query")

        // Now cancel that query
        err = conn.CancelRequest(ctx)
        if err != nil {
                log.Fatal(err)
        }
        log.Println("query canceled")
}

Nice find @danmcgee-soda. I’m going to escalate this to see if we can get a fix shipped sooner.

Wanted to add bit more information that we discovered after investigating this issue today, and stumbling into this bug report.

  • The workaround of password-based auth also works with v1.27.0
  • This is fully reproducible with psql, eliminating pgx or other libraries completely

Using cloud_sql_proxy -dir=/tmp -instances <instance>, canceling works fine:

% psql -h /tmp/your-socket-path -U default merchant
Password for user default: 
psql (14.7 (Homebrew), server 14.4)
merchant=> select pg_sleep(15);
^CCancel request sent
ERROR:  canceling statement due to user request

But adding the -enable_iam_login flag, e.g. cloud_sql_proxy -dir=/tmp -instances <instance> -enable_iam_login breaks it:

% psql -h /tmp/our-socket-path -U default merchant
Password for user default: 
psql (14.7 (Homebrew), server 14.4)
merchant=> select pg_sleep(15);
^CCancel request sent
^CCancel request sent
^CCancel request sent

I tried cancelling three times this second case, and none were successful at getting through.

Another important bit - I actually used username/password auth in both cases- the determining factor is not actually which type of auth you USE, but only whether automatic IAM authentication is enabled.

Quick update: I’ve shared this with the backend team and they’re investigating.

This would be it. 🤓

When I connect to my Cloud SQL instance using psql and the Proxy with Auto IAM AuthN enabled, and issue a select pg_cancel_backend(<PID>); the cancellation works.

EDIT: I see that pg_cancel_backend uses a wholly separate path than CancelRequest, though. So this is likely a case where the Proxy server isn’t honoring the Postgres protocol. I’ll file a bug with the backend team and get them to look at it.

Thanks @anhpngt. I’ll do some digging on my end. I’m curious to see if the proxy client -> proxy server connection is somehow not propagating the backend PID properly. I’ll report back here.