cloud-sql-proxy: GCR image missing shell in v1.16

It seems that ash is gone from the latest container version.

We used it in Cloud Build like this:

  - id: 'db-proxy'
    name: eu.gcr.io/cloudsql-docker/gce-proxy
    waitFor: ['extract-api-build']
    entrypoint: ash
    args:
      - '-c'
      - |
        /cloud_sql_proxy -dir=/cloudsql -instances=$(cat ./_CLOUD_SQL_INSTANCE_) & \
        while [ ! -f /cloudsql/stop ]; do
          sleep 2;
        done
        kill $!
    volumes:
      - name: db
        path: /cloudsql

This trick allowed us to keep the connection open until a touch /cloudsql/stop command was issued from another step.
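The stop signal came from a later build step along these lines (a hedged sketch; the step id, builder image, and waitFor value are placeholders, not taken from our actual config):

  # Hypothetical companion step: once the work that needed the database is done,
  # create the stop file so the proxy step's while-loop exits and kills the proxy.
  - id: 'stop-db-proxy'
    name: gcr.io/cloud-builders/gcloud
    waitFor: ['run-migrations']
    entrypoint: bash
    args: ['-c', 'touch /cloudsql/stop']
    volumes:
      - name: db
        path: /cloudsql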

Now there is no ash or bash in the container, and it has become impossible to use it in Cloud Build in a simple manner.

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 31
  • Comments: 30 (10 by maintainers)

Most upvoted comments

I have the same issue. I don’t really mind that there is no shell per se, but using 1.16 currently breaks the deployment of this image in our GKE Kubernetes clusters, as we used nc in our liveness probes, which now also can’t be found.
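For reference, the kind of probe that breaks is an exec probe that shells out to nc, roughly like this (a sketch only; the address, port, and timings are illustrative, not taken from our actual deployment):

  # Illustrative exec probe: checks that the proxy accepts TCP connections on the
  # forwarded port. With the distroless 1.16 image there is no nc binary in the
  # container, so the kubelet cannot run this command and the probe fails.
  livenessProbe:
    exec:
      command: ["nc", "-z", "127.0.0.1", "5432"]
    initialDelaySeconds: 5
    periodSeconds: 10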

Switching to distroless is quite a radical change and should be tagged as a new major release, version 2.0.0, IMHO. This would make it clear to everyone that fundamental changes have been made that might very well break builds.

Hey folks, apologies if this broke any builds.

This was done intentionally: we switched from alpine to distroless as the base container to improve security. (You can see the new build in our Dockerfile.) I think we’ll need to evaluate whether we want to offer a second image for folks who need a shell.

I’m not sure of the best way to introduce a shell into a distroless container, but it looks like you can use the above Dockerfile and update the distroless base image with the debug tag to add the busybox shell in the meantime.
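One way to do that without rebuilding from source is to copy the published proxy binary onto a distroless debug base, along these lines (a hedged sketch; the distroless variant and :debug tag are assumptions based on the distroless docs, so check the official Dockerfile for the exact base it uses):

# Hedged sketch: reuse the published proxy binary and place it on the distroless
# "debug" variant, which ships a busybox shell. Base image and tag are assumptions.
FROM gcr.io/cloudsql-docker/gce-proxy:1.16 AS proxy
FROM gcr.io/distroless/static:debug
COPY --from=proxy /cloud_sql_proxy /cloud_sql_proxy
ENTRYPOINT ["/cloud_sql_proxy"]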

Any chance the release notes could mention the fact that you changed to distroless? (https://github.com/GoogleCloudPlatform/cloudsql-proxy/releases/tag/1.16)

That seems to be a big change worth mentioning there.

Is version 1.15 built somewhere else/differently? The commit referenced seems to also be in 1.15? I don’t know if it is “officially” documented anywhere, but I’m betting that most people who use the container in Cloud Build are using it the way @istarkov mentioned, probably taken from/inspired by https://stackoverflow.com/a/52366671/2339622

I was using a similar solution (running a shell script in the gce-proxy container), and it broke for me as well. Here is the workaround I have come up with: use the docker builder to send a TERM signal to the gce-proxy container.

steps:
  # DB Proxy
  # Launch Cloud SQL proxy and keep it open
  # until the db migration step is finished
  - name: "gcr.io/cloudsql-docker/gce-proxy:1.16"
    id: proxy
    waitFor: ["-"]
    volumes:
      - name: db
        path: "/cloudsql"
    args:
      - "/cloud_sql_proxy"
      - "-dir=/cloudsql"
      - "-instances=$_DB_CONNECTION_NAME"

  # Run Migrations.
  - name: "gcr.io/$PROJECT_ID/my-api"
    id: dbmigrate
    waitFor: ["-"]
    volumes:
      - name: db
        path: "/cloudsql"
    args:
      - "sh"
      - "-c"
      - "sleep 5; echo hello world; ls -lh /cloudsql"

  - name: "gcr.io/cloud-builders/docker"
    id: killproxy
    waitFor: ["dbmigrate"]
    entrypoint: "sh"
    args:
      - "-c"
      - 'docker kill -s TERM $(docker ps -q --filter "volume=db")'

Apologies for any inconvenience this caused. I would strongly recommend avoiding the use of latest in the future, and instead following container best practices by pinning to a specific version (but updating it often).

We’re not using latest, but we do have automated container updates using Flux CD, based on semantic versioning, and had 1.* on the whitelist. As said before, I see such a fundamental change to the underlying container as a major update (1.15 > 2.0) rather than a minor one (1.15 > 1.16), so I would not have expected such a backwards-incompatible change. For now we’ve pinned it to 1.16 after fixing the issues it caused.

Yeah, this broke my CD pipeline. It’s absolutely my fault for copy/pasting the above Stack Overflow answer and for using :latest as an image version in Cloud Build, but it’s still a bit too much of a change not to warrant a major bump. The current fix for me is to specify a version for the step (excuse the JSON; this will not work in a cloudbuild.yaml):

 {
      id: "cloud-sql-proxy",
      name: "gcr.io/cloudsql-docker/gce-proxy:1.15",
      waitFor: ["other_step"],
      entrypoint: "sh",
      args: [
        "-c",
        `/cloud_sql_proxy -dir=/cloudsql -instances=${cloudsqlconnection} -credential_file=<vcredentialfilepath> & while [ ! -f /cloudsql/stop ]; do sleep 2; done`
      ],

      volumes: [
        {
          name: "db",
          path: "/cloudsql"
        }
      ]
    },

Changing to gcr.io/cloudsql-docker/gce-proxy:1.15 in YAML should work the same.
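A rough cloudbuild.yaml equivalent of the JSON step above (the waitFor value and credential file placeholder are carried over; the substitution name is itself a placeholder, not a value from the original config):

  # Rough YAML translation of the JSON step above; pinning to :1.15 keeps the
  # alpine-based image, so sh is still available.
  - id: 'cloud-sql-proxy'
    name: 'gcr.io/cloudsql-docker/gce-proxy:1.15'
    waitFor: ['other_step']
    entrypoint: sh
    args:
      - '-c'
      - |
        /cloud_sql_proxy -dir=/cloudsql -instances=$_DB_CONNECTION_NAME -credential_file=<vcredentialfilepath> &
        while [ ! -f /cloudsql/stop ]; do sleep 2; done
    volumes:
      - name: db
        path: /cloudsql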

  1. The default base image will remain based on distroless, which doesn’t provide a shell. This is to remain more secure by default.
  2. We will be adding alternative images (likely vX.X-buster and vX.X-alpine) that will use different base images. AFAIK, both buster (debian) and alpine images include shells by default.

Hey folks,

Apologies for any inconvenience this caused. I would strongly recommend avoiding the use of latest in the future, and instead following container best practices by pinning to a specific version (but updating it often).

We’re not planning on guaranteeing the existence of any shells or other tools in the image other than the proxy binary. If we were to define an API for this image, it would only contain the path where the proxy is located (/cloud_sql_proxy). I apologize that this wasn’t clearer previously, and will make sure going forward that it is clearly stated in our docs.

For using the proxy with other tools, I would recommend downloading a release binary into an environment that has the correct versions of those other tools. We have versioned release links on the releases page that can be used.

For Cloud Build, I believe the correct way to connect to Cloud SQL would be to download the proxy to /workspace and execute it as part of the same step. I’ll work on verifying this and getting a more concrete example for folks to take a look at in the next few days.
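In the meantime, something along these lines illustrates the idea (a hedged sketch, not the official example: the curl builder image, the versioned download URL pattern, the app image, and the migration command are assumptions and should be checked against the releases page and your own build):

  # Hedged sketch: fetch a pinned proxy release into /workspace (which persists
  # between steps), then start it inside the same step as the work that needs it.
  - id: 'fetch-proxy'
    name: 'gcr.io/cloud-builders/curl'
    args: ['-o', '/workspace/cloud_sql_proxy', 'https://storage.googleapis.com/cloudsql-proxy/v1.16/cloud_sql_proxy.linux.amd64']

  - id: 'migrate'
    name: 'gcr.io/$PROJECT_ID/my-api'   # placeholder app image that includes sh
    entrypoint: sh
    args:
      - '-c'
      - |
        chmod +x /workspace/cloud_sql_proxy
        /workspace/cloud_sql_proxy -dir=/cloudsql -instances=$_DB_CONNECTION_NAME &
        sleep 2              # crude wait for the proxy to come up
        ./run-migrations.sh  # placeholder for the actual work that needs the DB
        kill $!
    volumes:
      - name: db
        path: /cloudsql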

Oh my goodness, what a brain fart! Sorry about that!

I would suggest this issue could be closed, since the -alpine and -buster images exist now.

@kurtisvg can you give some context on the outcome of this issue? There you link an alpine-based image and an ubuntu-based image. Which one is it based on now?

I came here looking for a way to implement a healthcheck like this:

 database:
    image: gcr.io/cloudsql-docker/gce-proxy:1.17
    # ...
    healthcheck:
      test: pg_isready --dbname=${DB_NAME} --host=${DB_HOST} --port=${DB_PORT} --username=${DB_USER}

But I’m not sure of the outcome of the discussion. Can you please give me some context on how to access this shell? I suppose that Ubuntu does have a shell while alpine does not. So which one is it built against now?

Thanks!

Closing this in favor of #330 and #371

For the cloudsql proxy authors: what do you think about adding a command to the proxy itself so it can do the healthcheck, i.e. cloud_sql_proxy healthz, which could be invoked as a healthcheck?

I would consider this a bug - IIRC the intended behavior was to continue accepting new connections until the term_timeout duration expired.

Reproduction isn’t necessary to disprove this; the code in Shutdown() is clear enough:

// Shutdown waits up to a given amount of time for all active connections to
// close. Returns an error if there are still active connections after waiting
// for the whole length of the timeout.
func (c *Client) Shutdown(termTimeout time.Duration) error {
	termTime := time.Now().Add(termTimeout)
	for termTime.After(time.Now()) && atomic.LoadUint64(&c.ConnectionsCounter) > 0 {
		time.Sleep(1)
	}

	active := atomic.LoadUint64(&c.ConnectionsCounter)
	if active == 0 {
		return nil
	}
	return fmt.Errorf("%d active connections still exist after waiting for %v", active, termTimeout)
}

The sleeping for-loop will terminate if either the specified term_timeout value is reached or the number of connections drops to zero. As I described before, this can result in the proxy container exiting while the pod is still receiving incoming requests via the ingress.

Should I make a new issue to request that this be changed to the following?

func (c *Client) Shutdown(termTimeout time.Duration) error {
	time.Sleep(termTimeout)
	active := atomic.LoadUint64(&c.ConnectionsCounter)
	if active == 0 {
		return nil
	}
	return fmt.Errorf("%d active connections still exist after waiting for %v", active, termTimeout)
}
