concourse: fly execute sometimes returns 500 error when loading an input directory containing a terraform state file

Bug Report

We were trying to run fly execute -i terraform-state-file=$TERRAFORM_PATH and we receive 500 errors around 80% of the time because it fails to upload the input directory. The other 20% of the time, the input directory uploads successfully.

Steps to Reproduce

  1. Run Concourse on Kubernetes.

  2. Deploy Ops Manager on AWS using terraform by following the instructions here: https://docs.pivotal.io/pivotalcf/2-6/om/aws/prepare-env-terraform.html

  3. After deploying Ops Manager, you should end up with a terraform.tfstate file.

  4. Put this terraform.tfstate file in an empty directory.

  5. Use this directory as an input into a Concourse task (for example, name it terraform-state-file).

  6. Run fly execute -i terraform-state-file=$PATH_TO_DIRECTORY_ABOVE.

Expected Results

Input directory should upload successfully and say “Done”.

Actual Results

Uploading fails 80% of the time with 500 error.

Additional Context

Does not happen with Bosh-deployed Concourse, error only occurs with Kubernetes-deployed Concourse.

We found a workaround for this issue. By creating a copy of the input directory inside the input directory, it uploads successfully 100% of the time. More specifically:

This doesn’t work:

terraform-state-file/
|-- terraform.tfstate

This does work:

terraform-state-file/
|-- terraform.tfstate
|-- terraform-state-file/
    |-- terraform.tfstate

Version Info

  • Concourse version: 5.6.0
  • Deployment type (BOSH/Docker/binary): Kubernetes (Helm)
  • Infrastructure/IaaS: GCP
  • Browser (if applicable): Chrome
  • Did this used to work? Not sure.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 17 (13 by maintainers)

Commits related to this issue

Most upvoted comments

Theres a race condition in the upload code for fly execute. https://github.com/concourse/concourse/blob/master/fly/commands/internal/executehelpers/uploads.go#L16-L26

This function spawns a go routine to write zstd/tar encoded payload to a pipe. However, the function may return before everything is actually written to the pipe. I added a simple sleep for 10s and fly exec started working again with the the repro given by @antonyoneill

func Upload(bar *mpb.Bar, team concourse.Team, path string, includeIgnored bool, platform string) (atc.WorkerArtifact, error) {
	files := getFiles(path, includeIgnored)

	archiveStream, archiveWriter := io.Pipe()

	go func() {
		err := archiveWriter.CloseWithError(tarfs.Compress(zstd.NewWriter(archiveWriter), path, files...))
	}()

	time.Sleep(10*time.Second)
	return team.CreateArtifact(bar.ProxyReader(archiveStream), platform)
}

cc @concourse/runtime

i really haven’t been able to repro the case where sleeping will actually make fly execute work on another machine. It only happend on that single machine…

The new library seems to work fine with this case. As there is no possible way for validate every edge case, why don’t we just leave this issue open for a while after release and see if everyones previously failing uploads work again.

@antonyoneill thanks for the reproducible steps, I can confirm its payload related. I haven’t quite figured out if its some magical byte pattern that causes this to fail or some file attributes. I notice that as soon as I add a file with actual content (non zero byte filled), the fly execute will start working again. I’ll try to dig more and follow up with this issue on the info I come across.

Hey guys, I’ve just fallen foul of this same issue.

Initially I saw it while uploading a git repo as an input to a job and have since managed to narrow it down to a reproducible example:

#!/usr/bin/env bash

# Navigate to a temporary dir
mkdir tmp-4621
(
cd tmp-4621

# Write dummy task file
cat << EOF > task.yaml
---
platform: linux

image_resource:
  type: docker-image
  source:
    repository: cypress/base
    tag: "10"

inputs:
- name: source

run:
  path: bash
  args:
    - -c
    - |-
      echo "Hello World"
EOF

# Generate large file
truncate -s 128k 128k_file

# Fly execute
fly --target ci execute --config ./task.yaml --input source=.
)

rm -rf ./tmp-4621

Output:

uploading source ⠸ (697b/s)
error: 'uploading source' failed: Unexpected Response
Status: 500 Internal Server Error
Body:

Output using a 100kb file instead:

uploading source done
executing build 1 at https://my-concourse/builds/1
initializing
running bash -c echo "Hello World"
Hello World
succeeded

The only interesting log entry from the atc during the large file upload:

{"timestamp":"2019-11-21T12:23:46.072574964Z","level":"error","source":"atc","message":"atc.create-artifact.failed-to-stream-volume-contents","data":{"error":"failed to stream in to volume","session":"48445"}}

In my case, there was a large file in the .git folder that it was uploading.

Hope this helps to reproduce and resolve


Edit: We run concourse in a k8s cluster.

$ fly --version
5.7.0