concourse: fly execute sometimes returns 500 error when loading an input directory containing a terraform state file
Bug Report
We were trying to run fly execute -i terraform-state-file=$TERRAFORM_PATH and received 500 errors around 80% of the time because the input directory failed to upload. The other 20% of the time, the input directory uploaded successfully.
Steps to Reproduce
- Run Concourse on Kubernetes.
- Deploy Ops Manager on AWS using terraform by following the instructions here: https://docs.pivotal.io/pivotalcf/2-6/om/aws/prepare-env-terraform.html
- After deploying Ops Manager, you should end up with a terraform.tfstate file.
- Put this terraform.tfstate file in an empty directory.
- Use this directory as an input into a Concourse task (for example, name it terraform-state-file).
- Run fly execute -i terraform-state-file=$PATH_TO_DIRECTORY_ABOVE.
Expected Results
Input directory should upload successfully and say “Done”.
Actual Results
Uploading fails around 80% of the time with a 500 error.
Additional Context
This does not happen with BOSH-deployed Concourse; the error only occurs with Kubernetes-deployed Concourse.
We found a workaround for this issue. By creating a copy of the input directory inside the input directory, it uploads successfully 100% of the time. More specifically:
This doesn’t work:
terraform-state-file/
|-- terraform.tfstate
This does work:
terraform-state-file/
|-- terraform.tfstate
|-- terraform-state-file/
    |-- terraform.tfstate
Version Info
- Concourse version: 5.6.0
- Deployment type (BOSH/Docker/binary): Kubernetes (Helm)
- Infrastructure/IaaS: GCP
- Browser (if applicable): Chrome
- Did this used to work? Not sure.
About this issue
- State: closed
- Created 5 years ago
- Comments: 17 (13 by maintainers)
There's a race condition in the upload code for fly execute: https://github.com/concourse/concourse/blob/master/fly/commands/internal/executehelpers/uploads.go#L16-L26
This function spawns a goroutine to write the zstd/tar-encoded payload to a pipe. However, the function may return before everything is actually written to the pipe. I added a simple 10s sleep and fly exec started working again with the repro given by @antonyoneill.
cc @concourse/runtime
I really haven’t been able to repro the case where sleeping actually makes fly execute work on another machine. It only happened on that single machine…
The new library seems to work fine with this case. As there is no way to validate every edge case, why don’t we just leave this issue open for a while after the release and see if everyone’s previously failing uploads work again?
@antonyoneill thanks for the reproducible steps, I can confirm it’s payload related. I haven’t quite figured out if it’s some magical byte pattern that causes this to fail or some file attributes. I notice that as soon as I add a file with actual content (not filled with zero bytes), fly execute starts working again. I’ll try to dig more and follow up on this issue with the info I come across.
Hey guys, I’ve just fallen foul of this same issue.
Initially I saw it while uploading a git repo as an input to a job and have since managed to narrow it down to a reproducible example:
Output:
Output using a 100kb file instead:
The only interesting log entry from the atc during the large file upload:
In my case, there was a large file in the .git folder that it was uploading.
Hope this helps to reproduce and resolve.
Edit: We run concourse in a k8s cluster.