cloud-sql-proxy: failed to connect to instance: Dial error: failed to dial

Bug Description

Since upgrading from 1.*.* to 2.*.* we have been seeing a lot of “failed to connect to instance: Dial error: failed to dial (connection name = “******”): dial tcp **.***.***.**:3307: i/o timeout” errors in the SQL Proxy container. We have been trying to debug this, but have not found a clear explanation. It happens across multiple (PHP) applications using different frameworks. Downgrading to 1.*.* seems to resolve the issue, which (I think) rules out a network cause. The database is not under heavy load and we’re only using ~5% of our connection limit.

Example code (or command)

...
      - command:
        - /cloud-sql-proxy
        - --health-check
        - --http-address=0.0.0.0
        - --credentials-file=/secrets/cloudsql/cloudsqlproxy-credentials.json
        - --max-sigterm-delay=60s
        - --structured-logs
        - --quiet
        - toppy-***:europe-west4:toppy-***-database
        image: eu.gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.2.0
        imagePullPolicy: IfNotPresent
        name: cloudsqlproxy-container
        ports:
        - containerPort: 3306
          protocol: TCP
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        securityContext:
          allowPrivilegeEscalation: true
          privileged: false
          readOnlyRootFilesystem: false
          runAsGroup: 65532
          runAsNonRoot: true
          runAsUser: 65532
        startupProbe:
          failureThreshold: 20
          httpGet:
            path: /startup
            port: 9090
            scheme: HTTP
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 5
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /secrets/cloudsql
          mountPropagation: None
          name: cloudsqlproxy-service-account
          readOnly: true
...

Stacktrace

n/a

Steps to reproduce?

Not sure; I have included the sidecar YAML above.

Environment

OS type and version: GKE 1.25.8-gke.500, Container-Optimized OS with containerd (cos_containerd)
Cloud SQL Proxy version (./cloud-sql-proxy --version): 2.2.0
Proxy invocation command (for example, ./cloud-sql-proxy --port 5432 INSTANCE_CONNECTION_NAME): /cloud-sql-proxy --health-check --http-address=0.0.0.0 --credentials-file=/secrets/cloudsql/cloudsqlproxy-credentials.json --max-sigterm-delay=60s --structured-logs --quiet toppy-***:europe-west4:toppy-***-database

Additional Details

Connection is made over a public IP

Edit: updated yaml

About this issue

State: closed
Created a year ago
Comments: 26 (11 by maintainers)

Most upvoted comments

Got it. It’s unclear why public IP would cause this problem, but in any case private IP is a better path for both performance and security.

enocom on Oct 12, 2023

Yep, our instance had a public IP only when we saw the issues (private was disabled). When we switched, we enabled the private IP on the instance, then added the --private-ip flag to our cloud-sql-proxy command.
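
A minimal sketch of that change, applied to the sidecar arguments from the issue description. It assumes the instance already has a private IP enabled and that the cluster can reach it over the VPC; everything except the added --private-ip line is copied from the original command, including the masked instance name.

```yaml
# Sketch only: the sidecar command from the issue with --private-ip added.
# Assumption: the Cloud SQL instance has a private IP and the GKE cluster
# shares (or is peered with) the instance's VPC network.
- command:
  - /cloud-sql-proxy
  - --health-check
  - --http-address=0.0.0.0
  - --credentials-file=/secrets/cloudsql/cloudsqlproxy-credentials.json
  - --max-sigterm-delay=60s
  - --structured-logs
  - --quiet
  - --private-ip  # connect to the instance's private IP instead of the public one
  - toppy-***:europe-west4:toppy-***-database
```

With this flag the proxy dials the instance's private address, so traffic no longer passes through the egress NAT path that the public IP connection uses.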

jault3 on Oct 12, 2023

@enocom since the problems seemed somehow ‘connection’ related, I was looking into the GKE DNS settings, which now allow you to use ‘Cloud DNS’ instead of ‘kube-dns’. I’m not sure whether it was the private IP option (not going through egress/ingress NAT) or DNS that resolved the issue in the end, since both were applied at the same time.

Swahjak on Jul 31, 2023

@Swahjak Thanks for raising an issue on the Cloud SQL Proxy 😄

Handing this over to @hessjcg, our GKE specialist. He should be able to shed some light on this and hopefully help you get to the bottom of the additional errors you are seeing.

jackwotherspoon on Jun 14, 2023