concourse: Jobs getting stuck at preparing build
Bug Report
For the last couple of days we’ve been having a strange issue. Sporadically, Concourse seems to get stuck at the “preparing build” step. All the prerequisites show a ✔️ next to them (so nothing is actually pending as far as we can see), yet the job stays stuck in that state forever.
When this starts happening, it affects all jobs in all teams. Fly watch of the pending jobs also hangs and shows no output. I haven’t been able to narrow down exactly what fixes it. One time it was retiring and resetting one of the workers. Another time that wasn’t enough and I had to delete and recreate the pipeline with the stuck job.
The only error messages I could see in the logs were these (logged when one of the workers was reset):
{"timestamp":"1517591651.263898134","source":"atc","message":"atc.create-job-build.trigger-immediately.try-start-next-pending-build.scan.failed-to-create-container-in-db","log_level":2,"data":{"build-id":8663,"build-name":"70","error":"pq: insert or update on table \"worker_resource_config_check_sessions\" violates foreign key constraint \"worker_resource_config_check__resource_config_check_sessio_fkey\"","input":"npm-debian-image","job_name":"test-mailer-master","resource":"npm-debian-image","session":"48972.1.1.2"}}
{"timestamp":"1517591651.263942957","source":"atc","message":"atc.create-job-build.trigger-immediately.try-start-next-pending-build.scan.failed-to-initialize-new-container","log_level":2,"data":{"build-id":8663,"build-name":"70","error":"pq: insert or update on table \"worker_resource_config_check_sessions\" violates foreign key constraint \"worker_resource_config_check__resource_config_check_sessio_fkey\"","input":"npm-debian-image","job_name":"test-mailer-master","resource":"npm-debian-image","session":"48972.1.1.2"}}
{"timestamp":"1517591651.265183210","source":"atc","message":"atc.create-job-build.trigger-immediately.failed-to-start-next-pending-build-for-job","log_level":2,"data":{"error":"pq: insert or update on table \"worker_resource_config_check_sessions\" violates foreign key constraint \"worker_resource_config_check__resource_config_check_sessio_fkey\"","job-name":"test-mailer-master","job_name":"test-mailer-master","session":"48972.1"}}
(the above was printed for multiple jobs, above are just the logs for one of the jobs)
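When the same error is printed for many jobs, it can help to collapse the ATC log into one row per job and error. The following is a minimal sketch, not part of Concourse: the function name `summarize_atc_errors` is hypothetical, the sample entries are trimmed copies of the lines above, and it assumes `log_level` 2 or higher marks an error, as in those excerpts.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone

# Error text copied from the log lines above.
FK_ERROR = (
    'pq: insert or update on table "worker_resource_config_check_sessions" '
    'violates foreign key constraint '
    '"worker_resource_config_check__resource_config_check_sessio_fkey"'
)

# Trimmed copies of two of the entries above (most "data" fields dropped).
SAMPLE_LINES = [
    json.dumps({
        "timestamp": "1517591651.263898134", "source": "atc", "log_level": 2,
        "message": "atc.create-job-build.trigger-immediately."
                   "try-start-next-pending-build.scan.failed-to-create-container-in-db",
        "data": {"error": FK_ERROR, "job_name": "test-mailer-master"},
    }),
    json.dumps({
        "timestamp": "1517591651.265183210", "source": "atc", "log_level": 2,
        "message": "atc.create-job-build.trigger-immediately."
                   "failed-to-start-next-pending-build-for-job",
        "data": {"error": FK_ERROR, "job-name": "test-mailer-master"},
    }),
]


def summarize_atc_errors(lines):
    """Group error-level JSON log entries by (job name, error text)."""
    summary = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON log entry
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if entry.get("log_level", 0) < 2:  # assumption: 2 and above are errors
            continue
        data = entry.get("data", {})
        # The excerpts use both "job_name" and "job-name" as keys.
        job = data.get("job_name") or data.get("job-name") or "(no job)"
        when = datetime.fromtimestamp(float(entry["timestamp"]), tz=timezone.utc)
        summary[(job, data.get("error", ""))].append((when, entry["message"]))
    return summary


if __name__ == "__main__":
    for (job, error), hits in summarize_atc_errors(SAMPLE_LINES).items():
        print(f"{job}: {len(hits)} entries, first at {hits[0][0].isoformat()}")
        print(f"  {error}")
```

To run it against a real log, pass an open file object instead of `SAMPLE_LINES`; the function only needs an iterable of lines.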
We also see the messages below quite frequently (but since they also show up when everything is running normally, I’m not sure they are relevant):
{"timestamp":"1518004152.986703157","source":"atc","message":"atc.volume-collector.run.orphaned-volumes.mark-created-as-destroying.failed-to-transition","log_level":2,"data":{"error":"volume cannot be destroyed as children are present","session":"62.13706.1.43","volume":"69b59425-83cd-4511-4a7f-f0a011c0b697","worker":"ccn4"}}
The following can also be handy:
- Concourse version: 3.8.0
- Deployment type: binary
- Infrastructure/IaaS: Running on VMs (Debian 8.10 OS with 4.9.0-0.bpo.3-amd64 kernel)
About this issue
- State: closed
- Created 6 years ago
- Reactions: 8
- Comments: 26 (10 by maintainers)
This issue was fixed along with #1598. The fix was to remove the worker resource config check sessions table and merge it into the resource config check sessions table; the fix commit has more detail. A related issue, #2911, was also found, where the resource configs were being garbage-collected before they were used. That issue was fixed with a commit that creates and uses the resource configs within one transaction.
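To illustrate why creating and using a row within one transaction closes that window, here is a minimal sketch using Python's `sqlite3`. It is not Concourse's code (which is Go against PostgreSQL): the two-table schema and the function names `racy_create`, `collect_unused`, and `transactional_create` are hypothetical simplifications, and the collector is called by hand between the two steps to simulate the race. In this sketch the failure surfaces as a foreign key violation.

```python
import sqlite3


def open_db():
    # Autocommit mode, so transactions are controlled explicitly below.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("PRAGMA foreign_keys = ON")
    # Hypothetical, simplified schema: a parent table and a child that references it.
    conn.execute("CREATE TABLE resource_configs (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE check_sessions ("
        " id INTEGER PRIMARY KEY,"
        " resource_config_id INTEGER NOT NULL REFERENCES resource_configs(id))"
    )
    return conn


def collect_unused(conn):
    """Stand-in for a garbage collector: drop configs that nothing references."""
    conn.execute(
        "DELETE FROM resource_configs"
        " WHERE id NOT IN (SELECT resource_config_id FROM check_sessions)"
    )


def racy_create(conn, gc):
    """Create the config and its session as two separately committed steps."""
    config_id = conn.execute(
        "INSERT INTO resource_configs DEFAULT VALUES"
    ).lastrowid
    gc(conn)  # simulated collector run landing between the two steps
    # The parent row is gone, so this insert violates the foreign key.
    conn.execute(
        "INSERT INTO check_sessions (resource_config_id) VALUES (?)",
        (config_id,),
    )


def transactional_create(conn):
    """Create the config and its session inside a single transaction."""
    conn.execute("BEGIN")
    try:
        config_id = conn.execute(
            "INSERT INTO resource_configs DEFAULT VALUES"
        ).lastrowid
        conn.execute(
            "INSERT INTO check_sessions (resource_config_id) VALUES (?)",
            (config_id,),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return config_id


if __name__ == "__main__":
    conn = open_db()
    try:
        racy_create(conn, collect_unused)
    except sqlite3.IntegrityError as exc:
        print("two-step create failed:", exc)
    transactional_create(conn)
    collect_unused(conn)  # the config is referenced by now, so it survives
    print(conn.execute("SELECT COUNT(*) FROM resource_configs").fetchone()[0])
```

With a single in-memory connection the interleaving has to be simulated, so this only shows the ordering problem. In a multi-connection database the transactional version relies on the uncommitted config row not being visible to the collector until the referencing row is committed with it.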
@vito @clarafu I am still seeing this in 5.0.0, 5.0.1 and 5.1.0.
Here is my stackdump: https://gist.github.com/tanner-bruce/e410118ea9710bd9670f9e920221b27d
We’ve run into the same issue, I believe.
Hung at the `checking job is not paused`/`preparing build` stage. This is the first time it’s happened to us, and it was right after we had a major git outage, which caused all of our git resources to be unusable for hours. 3-4 git resources per pipeline, over 400 pipelines… after git came back online, now we’re starting to see these issues.
Worth noting that we also get the `pq: insert or update on table [...]` error when our git resources get backed up due to downed external services, but I believe this was resolved in another issue/commit. We also see `volume cannot be destroyed as children are present [...]` frequently, much like the original issue.
Restarting atc/tsa doesn’t seem to make any difference, still hung for us. Recreating the pipeline doesn’t make a difference. Aborting and starting the run over again doesn’t make a difference either.
Concourse v3.13.0, bosh deployed
Though, I believe there may be some relation in my case to https://github.com/concourse/concourse/issues/520 or https://github.com/concourse/concourse/issues/536 – even though in our example, we only have one job in the given pipeline.
I was able to resolve the issue by unpausing… an unpaused pipeline, hah. It was not paused, but I ran the `unpause-job` subcommand anyway, and after that it no longer had any issue.