cloud-sql-proxy: failed to connect to instance: Dial error: failed to dial

Bug Description

Since upgrading from 1.*.* to 2.*.*, we have been seeing ‘a lot’ of “failed to connect to instance: Dial error: failed to dial (connection name = “******”): dial tcp **.***.***.**:3307: i/o timeout” errors in the SQL Proxy container. We have been trying to debug this, but there doesn’t seem to be a clear explanation for it. It happens across multiple (PHP) applications using different frameworks. Downgrading to 1.*.* seems to resolve the issue, which (I think) rules out a network-related cause. The database is not under (very) heavy load, and we’re only using ~5% of our connection limit.

Example code (or command)

...
      - command:
        - /cloud-sql-proxy
        - --health-check
        - --http-address=0.0.0.0
        - --credentials-file=/secrets/cloudsql/cloudsqlproxy-credentials.json
        - --max-sigterm-delay=60s
        - --structured-logs
        - --quiet
        - toppy-***:europe-west4:toppy-***-database
        image: eu.gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.2.0
        imagePullPolicy: IfNotPresent
        name: cloudsqlproxy-container
        ports:
        - containerPort: 3306
          protocol: TCP
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        securityContext:
          allowPrivilegeEscalation: true
          privileged: false
          readOnlyRootFilesystem: false
          runAsGroup: 65532
          runAsNonRoot: true
          runAsUser: 65532
        startupProbe:
          failureThreshold: 20
          httpGet:
            path: /startup
            port: 9090
            scheme: HTTP
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 5
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /secrets/cloudsql
          mountPropagation: None
          name: cloudsqlproxy-service-account
          readOnly: true
...

Stacktrace

n/a

Steps to reproduce?

  1. Not sure, included the sidecar yaml

Environment

  1. OS type and version: GKE 1.25.8-gke.500, Container-Optimized OS with containerd (cos_containerd)
  2. Cloud SQL Proxy version (./cloud-sql-proxy --version): 2.2.0
  3. Proxy invocation command (for example, ./cloud-sql-proxy --port 5432 INSTANCE_CONNECTION_NAME): /cloud-sql-proxy --health-check --http-address=0.0.0.0 --credentials-file=/secrets/cloudsql/cloudsqlproxy-credentials.json --max-sigterm-delay=60s --structured-logs --quiet toppy-***:europe-west4:toppy-***-database

Additional Details

  • Connection is made over a public IP

Edit: updated yaml

About this issue

  • State: closed
  • Created a year ago
  • Comments: 26 (11 by maintainers)

Most upvoted comments

Got it. It’s unclear why public IP would cause this problem, but in any case private IP is a better path for both performance and security.

Yep, our instance had a public IP only when we saw the issues (private was disabled). When we switched, we enabled the private IP on the instance, then added the --private-ip flag to our cloud-sql-proxy command.
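
For readers following along, the change described above might look roughly like the fragment below. It is only a sketch: it reuses the sidecar command from the original report and assumes the only difference is the added --private-ip flag. It also assumes the instance already has a private IP enabled and that the cluster can reach the instance's VPC network. The commenter's actual manifest is not shown in this thread.

```yaml
# Sketch only: the sidecar command from the report with --private-ip added.
# Assumes private IP is already enabled on the Cloud SQL instance and that
# the pod can reach the instance's VPC network.
- command:
  - /cloud-sql-proxy
  - --health-check
  - --http-address=0.0.0.0
  - --credentials-file=/secrets/cloudsql/cloudsqlproxy-credentials.json
  - --max-sigterm-delay=60s
  - --structured-logs
  - --quiet
  - --private-ip  # dial the instance's private IP instead of its public IP
  - toppy-***:europe-west4:toppy-***-database
```

All other fields of the container spec (image, probes, volume mounts) would stay as in the original YAML.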

@enocom Since the problems seemed somehow ‘connection’ related, I was looking into the GKE DNS settings, which now allow you to use ‘Cloud DNS’ instead of ‘kube-dns’. I’m not sure whether it was the private IP option (not going through egress/ingress NAT) or the DNS change that resolved the issue in the end, since both were applied at the same time.

@Swahjak Thanks for raising an issue on the Cloud SQL Proxy 😄

Handing this over to @hessjcg, who is our GKE specialist. He should be able to shed some light on this and hopefully help you get to the bottom of the additional errors being seen.