gitness: Build status not updated, but build is complete

Hello,

I’ve got a fun issue with drone 0.7. All jobs of a build completed successfully, but the build was still shown as running. I clicked the “Cancel” button on the build page, and the build info page then showed that the build was killed (the information about the build was displayed in red). However, in the list of builds on the left and in the output of “drone build info”, the build is still shown as running.

$ drone build info repo/project 20
Number: 20
Status: running
Event: push
...

And no new builds can start (they are waiting in the pending state).

I see the following logs in the agent:

pipeline: finish uploading logs: 540: step smoke-test
pipeline: ping queue: 540
pipeline: execution complete: 540
pipeline: ping queue: 540
rpc: error making call: jsonrpc2: code 0 message: queue: task cancelled
pipeline: cancel signal received: 540: jsonrpc2: code 0 message: queue: task cancelled
pipeline: cancel ping loop: 540

I tried restarting the drone server and agent. The pending build then started running, but the build that was stuck is still shown as “running”.

Thank you.

About this issue

State: closed
Created 7 years ago
Comments: 20

Most upvoted comments

I wanted to provide a quick update, since it looks like there were multiple root causes to this issue and we have at least two solutions now. The comment below was copied from discourse; you can visit the original thread here.

I just merged a pull request that fixes an issue where large log output causes the upload to return an error due to exceeding the maximum grpc payload size. The agent will continue to retry the upload indefinitely because the error will always be the same, thus causing the build to get stuck.

Thanks to @tboerger for pinpointing the exact error:

err: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (7399047 vs. 4194304)

This fix will limit the size of the logs (per step) to ensure they do not exceed the grpc limits. A more permanent solution will be to implement grpc streaming, which in the long term is definitely how this should be implemented anyway.

So in conclusion I believe there were at least two different root causes for builds getting stuck that we have discovered:

docker bug where logs stuck after killing build moby/moby#30135
drone infinitely retrying when payload size error received drone/drone#2208

I therefore believe that both upgrading docker and getting the drone/agent:latest image with the patch limiting log size will resolve this issue for most, if not all, people.

bradrydzewski on Sep 12, 2017

I think one option would be to run docker:dind in the same pod as your drone agent instead of using the host machine docker daemon. This would allow you to use newer versions of docker with drone, with an added benefit of a bit more host machine isolation.

Unfortunately, most of the streaming code sits inside the docker library, as opposed to drone code, which limits our options for addressing the issue (assuming we can verify this is a docker issue).

bradrydzewski on Jul 17, 2017