cloud-sql-proxy: failed to connect to instance: Dial error: failed to dial
Bug Description
Since upgrading from 1.*.* to 2.*.* we have been noticing a lot of "failed to connect to instance: Dial error: failed to dial (connection name = "******"): dial tcp **.***.***.**:3307: i/o timeout" errors in the SQL Proxy container. We have been trying to debug this issue, but there doesn't seem to be a clear explanation for it. It is happening across multiple (PHP) applications using different frameworks. Downgrading back to 1.*.* seems to resolve the issue, which (I think) rules out a network-related cause. The database is not under (very) heavy load and we're only using ~5% of our connection limit.
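For context on the upgrade, the invocation style changed between the two major versions. A rough sketch of the difference (instance connection name masked as in the report; listener address/port shown explicitly as an assumption, since both versions have defaults):

```shell
# v1 (binary named cloud_sql_proxy): instance and local port are bound via -instances
./cloud_sql_proxy -instances=toppy-***:europe-west4:toppy-***-database=tcp:0.0.0.0:3306

# v2 (binary named cloud-sql-proxy): the instance connection name is a positional argument
./cloud-sql-proxy --address 0.0.0.0 --port 3306 toppy-***:europe-west4:toppy-***-database
```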
Example code (or command)
...
- command:
    - /cloud-sql-proxy
    - --health-check
    - --http-address=0.0.0.0
    - --credentials-file=/secrets/cloudsql/cloudsqlproxy-credentials.json
    - --max-sigterm-delay=60s
    - --structured-logs
    - --quiet
    - toppy-***:europe-west4:toppy-***-database
  image: eu.gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.2.0
  imagePullPolicy: IfNotPresent
  name: cloudsqlproxy-container
  ports:
    - containerPort: 3306
      protocol: TCP
  resources:
    requests:
      cpu: 100m
      memory: 100Mi
  securityContext:
    allowPrivilegeEscalation: true
    privileged: false
    readOnlyRootFilesystem: false
    runAsGroup: 65532
    runAsNonRoot: true
    runAsUser: 65532
  startupProbe:
    failureThreshold: 20
    httpGet:
      path: /startup
      port: 9090
      scheme: HTTP
    periodSeconds: 1
    successThreshold: 1
    timeoutSeconds: 5
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
  volumeMounts:
    - mountPath: /secrets/cloudsql
      mountPropagation: None
      name: cloudsqlproxy-service-account
      readOnly: true
...
Stacktrace
n/a
Steps to reproduce?
- Not sure; the sidecar YAML is included above
Environment
- OS type and version: GKE 1.25.8-gke.500, Container-Optimized OS with containerd (cos_containerd)
- Cloud SQL Proxy version (./cloud-sql-proxy --version): 2.2.0
- Proxy invocation command (for example, ./cloud-sql-proxy --port 5432 INSTANCE_CONNECTION_NAME): /cloud-sql-proxy --health-check --http-address=0.0.0.0 --credentials-file=/secrets/cloudsql/cloudsqlproxy-credentials.json --max-sigterm-delay=60s --structured-logs --quiet toppy-***:europe-west4:toppy-***-database
Additional Details
- Connection is made over a public IP
Edit: updated yaml
About this issue
- State: closed
- Created a year ago
- Comments: 26 (11 by maintainers)
Got it. It’s unclear why public IP would cause this problem, but in any case private IP is a better path for both performance and security.
Yep, our instance had only a public IP when we saw the issues (private IP was disabled). When we switched, we enabled the private IP on the instance, then added the --private-ip flag to our cloud-sql-proxy command.
@enocom Since the problems seemed somehow 'connection' related, I was also looking into the GKE DNS settings, which now allow you to use 'Cloud DNS' instead of 'kube-dns'. I'm not sure whether it was the private IP option (not going through egress/ingress NAT) or DNS that ultimately resolved the issue, since both changes were applied at the same time.
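In terms of the sidecar spec from the report, the private-IP change amounts to one extra argument. A minimal sketch of the relevant part of the container command (instance connection name masked as in the report; assumes the Cloud SQL instance has a private IP enabled and the GKE cluster can reach it):

```yaml
- command:
    - /cloud-sql-proxy
    - --private-ip          # dial the instance's private IP instead of its public IP
    - --health-check
    - --http-address=0.0.0.0
    - toppy-***:europe-west4:toppy-***-database
```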
@Swahjak Thanks for raising an issue on the Cloud SQL Proxy 😄
Handing this over to @hessjcg, who is our GKE specialist. He should be able to shed some light on this and hopefully help you get to the bottom of the additional errors being seen.