azure-pipelines-agent: Self-hosted agents intermittently do not pick up new jobs

Having issue with YAML?

No

Having issue with Tasks?

No

Having issue with software on Hosted Agent?

No

Having generic issue with Azure-Pipelines/VSTS/TFS?

No

Have you tried troubleshooting?

Yes

Agent Version and Platform

Version of your agent? 2.175.2

OS of the machine running the agent? CentOS 7

Azure DevOps Type and Version

dev.azure.com

If dev.azure.com, what is your organization name? https://dev.azure.com/ (will provide this privately if necessary)

What’s not working?

We have a series of pipelines that all behave the same way:

First Stage

Second Stage

  • Wait for the self-hosted agent to pick up the work, based on a custom “demand” that looks for the unique agent name
  • Run some custom code on the self-hosted agent
  • Finish running custom code
  • Delete GCP VM

Agent and Worker’s Diagnostic Logs

See the following log files for examples of a successful run and an unsuccessful run.

self-hosted-agent-log-failure.log self-hosted-agent-log-successful.log

Key differences that I’ve noticed:

The Linux kernel version printed at the top of each log is different. I’m not sure how or why, or whether it matters for this particular issue:

successful 
[2020-10-28 05:30:47Z INFO AgentProcess] RuntimeInformation: Linux 4.19.150+ #1 SMP Sat Oct 24 07:57:26 PDT 2020.

failure
[2020-10-21 23:01:22Z INFO AgentProcess] RuntimeInformation: Linux 5.4.49+ #1 SMP Sun Oct 18 19:43:35 PDT 2020.

In the failure log, the agent starts listening for jobs and then reports that no message arrived within 30 minutes. In the success log, the agent receives the job about 40 seconds after it starts listening:

successful
[2020-10-28 05:30:49Z INFO MessageListener] Session created.
[2020-10-28 05:30:49Z INFO Terminal] WRITE LINE: 2020-10-28 05:30:49Z: Listening for Jobs
[2020-10-28 05:30:49Z INFO JobDispatcher] Set agent/worker IPC timeout to 30 seconds.
[2020-10-28 05:31:28Z INFO RSAFileKeyManager] Loading RSA key parameters from file /azp/agent/.credentials_rsaparams
[2020-10-28 05:31:28Z INFO MessageListener] Message '1' received from session 'b2dbac0f-1ab5-45ec-ae40-c811b5d35d0d'.
[2020-10-28 05:31:28Z INFO JobDispatcher] Job request 2037 for plan e2905f74-12be-4282-8fb2-215cd5c5d3f3 job fc308004-fcdd-5de5-2151-99c66bc3b9d8 received.
[2020-10-28 05:31:28Z INFO Terminal] WRITE LINE: 2020-10-28 05:31:28Z: Running job: Build container


failure
[2020-10-21 23:01:23Z INFO RSAFileKeyManager] Loading RSA key parameters from file /azp/agent/.credentials_rsaparams
[2020-10-21 23:01:23Z INFO VisualStudioServices] AAD Correlation ID for this token request: Unknown
[2020-10-21 23:01:23Z INFO MessageListener] Session created.
[2020-10-21 23:01:23Z INFO Terminal] WRITE LINE: 2020-10-21 23:01:23Z: Listening for Jobs
[2020-10-21 23:01:23Z INFO JobDispatcher] Set agent/worker IPC timeout to 30 seconds.
[2020-10-21 23:31:24Z INFO MessageListener] No message retrieved from session 'dc2f77ad-6fdb-4a0f-b539-f0eefaef1c8d' within last 30 minutes.
[2020-10-21 23:56:26Z WARN VisualStudioServices] Authentication failed with status code 401.
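
The gap between "Listening for Jobs" and the next MessageListener entry can be measured directly from the agent log instead of by eye. The following is a minimal sketch, assuming the log prefix format seen in the excerpts above; the function names `parse` and `wait_after_listening` are hypothetical helpers, not part of the agent.

```python
import re
from datetime import datetime

# Matches the diagnostic log prefix seen above, e.g.
# [2020-10-28 05:30:49Z INFO MessageListener] Session created.
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})Z (\w+) (\w+)\] (.*)$")


def parse(lines):
    """Yield (timestamp, level, component, message) for each recognizable line."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            yield ts, m.group(2), m.group(3), m.group(4)


def wait_after_listening(lines):
    """Return (seconds, message) for the first MessageListener entry
    logged after the agent prints 'Listening for Jobs', or None."""
    started = None
    for ts, _level, component, message in parse(lines):
        if component == "Terminal" and "Listening for Jobs" in message:
            started = ts
        elif started is not None and component == "MessageListener":
            return (ts - started).total_seconds(), message
    return None


# Lines copied from the excerpts above.
SUCCESS = [
    "[2020-10-28 05:30:49Z INFO MessageListener] Session created.",
    "[2020-10-28 05:30:49Z INFO Terminal] WRITE LINE: 2020-10-28 05:30:49Z: Listening for Jobs",
    "[2020-10-28 05:31:28Z INFO MessageListener] Message '1' received from session 'b2dbac0f-1ab5-45ec-ae40-c811b5d35d0d'.",
]
FAILURE = [
    "[2020-10-21 23:01:23Z INFO MessageListener] Session created.",
    "[2020-10-21 23:01:23Z INFO Terminal] WRITE LINE: 2020-10-21 23:01:23Z: Listening for Jobs",
    "[2020-10-21 23:31:24Z INFO MessageListener] No message retrieved from session 'dc2f77ad-6fdb-4a0f-b539-f0eefaef1c8d' within last 30 minutes.",
]

print(wait_after_listening(SUCCESS)[0])  # 39.0
print(wait_after_listening(FAILURE)[0])  # 1801.0
```

Run against full log files, this would make it easy to compare many agent startups and see which ones never received a job message.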

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

@dvmorris @KrylixZA @mk-AVA I’m closing this for now due to inactivity. Please let us know if it’s still relevant for you and provide more details so that we can investigate further.

@KrylixZA Do you know the timestamp of the hung job, that is, when it should have started but didn’t?

Relative to the logs I added above, it is between these two logged outputs:

[2021-01-29 13:11:23Z INFO JobDispatcher] Send job request message to worker for job 3bf2df57-857a-5250-2f8f-945c718af65b (30 KB).
[2021-01-29 13:11:53Z INFO JobDispatcher] Job request message sending for job 3bf2df57-857a-5250-2f8f-945c718af65b been cancelled after waiting for 30 seconds, kill running worker.

More specifically, it is exactly at 13:11:23Z, when the job request message is sent to the worker.
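
To find every occurrence of this pattern in a longer agent log, the send and cancel entries can be paired by job ID. This is a rough sketch that assumes the two JobDispatcher message formats quoted above; `stalled_sends` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Patterns based on the two JobDispatcher entries quoted above.
SEND_RE = re.compile(
    r"^\[(.+?)Z INFO JobDispatcher\] Send job request message to worker for job (\S+)"
)
CANCEL_RE = re.compile(
    r"^\[(.+?)Z INFO JobDispatcher\] Job request message sending for job (\S+) been cancelled"
)
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def stalled_sends(lines):
    """Map job ID -> seconds between the send attempt and its cancellation."""
    sent, stalled = {}, {}
    for line in lines:
        line = line.strip()
        if m := SEND_RE.match(line):
            sent[m.group(2)] = datetime.strptime(m.group(1), TS_FORMAT)
        elif (m := CANCEL_RE.match(line)) and m.group(2) in sent:
            cancelled = datetime.strptime(m.group(1), TS_FORMAT)
            stalled[m.group(2)] = (cancelled - sent[m.group(2)]).total_seconds()
    return stalled


# The two entries from the comment above.
LOG = [
    "[2021-01-29 13:11:23Z INFO JobDispatcher] Send job request message to worker for job 3bf2df57-857a-5250-2f8f-945c718af65b (30 KB).",
    "[2021-01-29 13:11:53Z INFO JobDispatcher] Job request message sending for job 3bf2df57-857a-5250-2f8f-945c718af65b been cancelled after waiting for 30 seconds, kill running worker.",
]

print(stalled_sends(LOG))  # {'3bf2df57-857a-5250-2f8f-945c718af65b': 30.0}
```

The 30-second gap matches the "Set agent/worker IPC timeout to 30 seconds" entry that appears in the earlier logs.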