concourse: Resources failing in Kubernetes with Google Container-Optimized OS after upgrade to 3.1.0
Bug Report
- Concourse version: 3.1.0
- Deployment type (Docker):
- Infrastructure/IaaS: Kubernetes
After upgrade to 3.1.0 all git and time resources are failing checks with:
runc create: exit status 1: container_linux.go:264: starting container process caused "process_linux.go:339: container init caused \"rootfs_linux.go:57: mounting \\\"/worker-state/3.1.0/assets/bin/init\\\" to rootfs \\\"/worker-state/volumes/live/26e7c69d-69fc-4f0f-507d-4b30c461a78f/volume\\\" at \\\"/worker-state/volumes/live/26e7c69d-69fc-4f0f-507d-4b30c461a78f/volume/tmp/garden-init\\\" caused \\\"open /worker-state/volumes/live/26e7c69d-69fc-4f0f-507d-4b30c461a78f/volume/tmp/garden-init: permission denied\\\"\""
Other resources seem to check fine.
About this issue
- State: closed
- Created 7 years ago
- Reactions: 19
- Comments: 70 (27 by maintainers)
Commits related to this issue
- Make the baggageclaim FS driver configurable. Default to naive for compatibility reasons, but a warning message will be shown on install or upgrade when naive is used. Also set the default persisten... — committed to autonomic-ai/charts by deleted user 7 years ago
- Make the baggageclaim FS driver configurable. (#1528) Default to naive for compatibility reasons, but a warning message will be shown on install or upgrade when naive is used. Also set the default... — committed to helm/charts by deleted user 7 years ago
- bump etree tsa handlers mux clockwork goxmldsig genproto grpc Submodule src/github.com/beevik/etree 90dafc1e..4cd0dd97 (rewind): < add attribute sort support. < Release v1.0.1 < Update path doc... — committed to concourse/concourse by vito 6 years ago
This is still an issue for me with 3.2.1 - any plans to fix this?
I was able to avoid it if I started the worker with --baggageclaim-driver=naive set as an environment variable.
Kubernetes 1.6.4 (on GKE), Concourse 3.3.0 (from Helm concourse-0.1.3)
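A minimal sketch of that workaround, assuming the worker is started from a shell wrapper and that the concourse binary is on PATH. CONCOURSE_BAGGAGECLAIM_DRIVER is taken here to be the environment-variable counterpart of the --baggageclaim-driver flag; the start_worker helper is hypothetical, and other required worker settings (work dir, TSA host, keys) are assumed to be supplied elsewhere.

```shell
# Select the naive baggageclaim driver through the environment instead of
# passing --baggageclaim-driver=naive on the command line.
export CONCOURSE_BAGGAGECLAIM_DRIVER=naive

# Hypothetical wrapper: forwards any extra arguments to the worker.
# It is only defined here, not invoked.
start_worker() {
  concourse worker "$@"
}
```

On Kubernetes the same variable would normally be set in the worker container's env section rather than in a shell script.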
I can confirm that my Concourse 3.3.0 deployment to a GKE k8s 1.6.4 cluster, with workers running Linux kernel 4.4.35+, has the issue.
From kubectl get nodes
It seems to be related to some kernel parameter.
I can reproduce it with kernels 4.4.35+ (Container-Optimized OS, Google Cloud) and 4.4.65-k8s (Debian, Kubernetes via kops).
It works fine on 4.4.0 (Ubuntu Xenial) and 4.9.24 (CoreOS).
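To compare a node against the kernels listed above, something like the following could be run on the worker host. It is only a lookup against the four versions reported in this thread, not a diagnosis; the function name kernel_report_status is made up for illustration.

```shell
# Classify a kernel release string (as printed by `uname -r`) against the
# versions reported in this thread. Anything else is "unknown".
kernel_report_status() {
  case "$1" in
    4.4.35+*|4.4.65-k8s*) echo "reported-failing" ;;
    4.4.0*|4.9.24*)       echo "reported-working" ;;
    *)                    echo "unknown" ;;
  esac
}

# Check the kernel of the machine this runs on.
kernel_report_status "$(uname -r)"
```

An "unknown" result says nothing either way; the thread only covers these four kernels.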
Pipeline to test
Unfortunately we too experience the issue with Concourse 3.1.1 on AWS (running on Kubernetes using the helm chart). OS: Debian Jessie. Baggageclaim driver: overlay.
The problem can be reproduced with help of the following pipeline:
Concourse successfully pulls the docker image, but stumbles on running the task in a non-privileged container:
If I set the task’s privileged attribute to “true” it starts working.
I tinkered with the configuration of the worker a bit and found out that the issue disappears when I switch the worker to the “naive” baggageclaim driver (start the worker with the --baggageclaim-driver=naive command line flag). I presume the issue has something to do with running non-privileged containers using runc from a volume backed by the overlay fs driver.
Thanks for the information everyone, we’ve confirmed that this is a support issue with Concourse v3.1.0+ running on Google’s Container-Optimized OS with kernel version 4.4.35+.
Reproduced this using a GCE cluster and the latest concourse/concourse Docker image. Digging into whether this is a kernel specific issue, or something complicated by Docker’s filesystem mounts. Running workers across all the distros! Stay tuned!
I can also confirm I’m getting a similar error on 3.3.0 deployed to k8s 1.6.4 on GKE via the helm chart:
Example output from a git resource failure
I’m seeing a similar error for a docker build step after upgrading to 3.1.1:
Is this the same issue or a new one? I’m using the pinned v1.12.6 version for the ECR auth workaround:
I’ve recreated workers.
@vito Recreating the workers totally from scratch (volumes and everything) didn’t help. We finally downgraded to 3.0.1.
I had a similar problem with k8s on GKE 1.8.7 deploying Concourse using Helm (also using cos).
I tried to use version 3.9.0 of Concourse with btrfs without success. The deploy worked, but when I tried to execute a build, it showed “No workers” in red. After deleting the Helm installation (with helm del --purge concourse) and reinstalling with the naive option, it worked.
@viglesiasce I’m confused by that, because isn’t the file command in the container that the worker is running in, not the host? I think the actual problem is that the COS image kernel doesn’t have BTRFS.
Edit: The PR I have on the helm chart works on kops 1.6.2 (k8s 1.6.7)
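One way to test the guess that the COS kernel lacks btrfs is to look at /proc/filesystems on the node. The sketch below is Linux-only and the helper name fs_registered is made up for illustration. It only shows filesystems that are currently registered, so a driver that is built as a module but not yet loaded would also show up as missing.

```shell
# Report whether a filesystem type is currently registered with the kernel.
# Usage: fs_registered NAME [FILE]; FILE defaults to /proc/filesystems.
# The last field of each line in that file is the filesystem name.
fs_registered() {
  awk -v fs="$1" '$NF == fs { found = 1 } END { exit !found }' "${2:-/proc/filesystems}"
}

# Check the two drivers discussed in this thread besides naive.
for fs in btrfs overlay; do
  if fs_registered "$fs"; then
    echo "$fs: registered"
  else
    echo "$fs: not registered (may still be loadable as a module)"
  fi
done
```

Because /proc/filesystems reflects the running kernel, the result is the same whether this runs on the host or inside the worker container on that host.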