azure-pipelines-agent: Forced Azure DevOps updates of the azure-pipelines-agent are incompatible with `--once`

Description of the Issue:

We have a requirement for single-use on-premises agents, so we pass the `--once` flag to guarantee that every build runs on a ‘clean’, freshly provisioned agent. We recently migrated to Azure DevOps from on-premises TFS 2017.

This is having a very serious impact on our development workflow as described below:

  • Agents receive a forced push update to the latest Azure DevOps pre-release agent (at an unpredictable moment, with no notification)

  • Because the agents run in `--once` mode, each agent interprets the update as its single job and self-terminates once the update completes.

  • We have an internal scheduler that spawns replacement agents (pinned to a specific agent version), but these are also force-updated, so the self-termination repeats indefinitely

Net effect: the forced updates take down our entire agent pool, rendering the CI system unusable.

Our internal response is as follows:

  • Scramble to update our internal images to the latest pre-release azure-pipelines-agent version (once we notice that no builds are running)

  • Scramble to redeploy the affected agent pools

  • This process takes up to 2 hours

Proposed Solutions:

Proposal 1:

I would like to propose that the behaviour of the `--once` flag be modified as follows:

  • Modify the logic of `--once` agents so that forced agent updates are not treated as normal jobs but are flagged as a special kind of job

  • These special jobs do not trigger self-termination of the agent; instead they trigger a restart of the agent

  • This exception should apply only to the special case of forced azure-pipelines-agent updates (a rough sketch of this behaviour follows the list)
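To make the intent concrete, here is a minimal sketch of Proposal 1 written against a hypothetical listener loop; the message kinds, function names, and restart mechanism below are illustrative assumptions, not the agent's actual code:

```python
import subprocess
import sys

def listener_loop(get_next_message, run_job, apply_update, run_once: bool):
    """Hypothetical single-use listener illustrating Proposal 1.

    Update messages are handled as a special case: they apply the update and
    restart the listener rather than consuming the agent's single job slot.
    """
    while True:
        message = get_next_message()

        if message.kind == "agent_update":
            apply_update(message)
            # Restart the listener process instead of terminating, so the
            # agent is still available to run its one real job.
            subprocess.Popen([sys.executable] + sys.argv)
            return

        run_job(message)
        if run_once:
            # Only a real job counts as the agent's single use.
            return
```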

Proposal 2:

  • Allow Azure DevOps users to run any azure-pipelines-agent version within a range of eligible versions

  • Give Azure DevOps users a window of time to update their azure-pipelines-agent version (some form of warning notification would be ideal, similar to code deprecation warnings); a rough sketch of such an eligibility check follows this list
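As a rough illustration of Proposal 2, the sketch below checks an agent version against a minimum eligible version and a grace-period window; the version values, grace period, and function names are assumptions for illustration, not anything the service actually implements:

```python
from datetime import date, timedelta

# Hypothetical policy values; the real service would define these.
MINIMUM_VERSION = (2, 170, 0)
GRACE_PERIOD = timedelta(days=30)

def parse_version(version: str) -> tuple:
    return tuple(int(part) for part in version.split("."))

def update_decision(agent_version: str, minimum_announced_on: date, today: date) -> str:
    """Decide what the service should do with an agent of the given version."""
    if parse_version(agent_version) >= MINIMUM_VERSION:
        return "accept"        # version is within the eligible range
    if today < minimum_announced_on + GRACE_PERIOD:
        return "warn"          # still inside the update window: notify only
    return "force_update"      # window expired: fall back to a forced update

print(update_decision("2.165.2", minimum_announced_on=date(2020, 3, 1), today=date(2020, 3, 10)))
# -> "warn"
```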

Your feedback and support on this issue would be greatly appreciated.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

@stephenmoloney - Yes, this was something we fixed on our side to prevent auto updates of the agent unless you are using certain pipeline features that require a newer version of the agent. I believe the issue you are seeing is a result of some updates that allow us to downgrade an agent if we detect a problem while we are rolling it out. If you are managing your own agent, there is a way to disable that.

@damccorm - Do we have that documented anywhere?

@giddyelysium13 - I think this issue is related to #2475. I am working to get this issue prioritized soon.

@giddyelysium13 - Thank you for sending me the logs. Your assessment seems to be accurate. From what I can piece together, the first attempt to do the update fails with the “Access is denied” error. The subsequent update request succeeds.

Has this setup worked previously without issue?

I have seen the occasional “Access is denied” issue when attempting to move the bin/ directory on Windows and have an open issue to add a retry to that operation (a rough sketch of such a retry follows below), but I have not heard of it happening this often.

Is there something about your setup that has some process using the bin/ directory as the working directory?
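For reference, a retry of the kind mentioned above could look roughly like the following; the attempt count, back-off, and function name are assumptions for illustration, not the agent's actual implementation:

```python
import shutil
import time

def move_with_retry(src: str, dst: str, attempts: int = 5, delay_seconds: float = 2.0) -> None:
    """Move a directory, retrying on transient PermissionError ("Access is denied")."""
    for attempt in range(1, attempts + 1):
        try:
            shutil.move(src, dst)
            return
        except PermissionError:
            if attempt == attempts:
                raise
            # Another process may still hold a handle on the directory; back off and retry.
            time.sleep(delay_seconds)

# Example: relocating the old bin/ directory during an update.
# move_with_retry("bin", "bin.old")
```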

@giddyelysium13 - If you could send me the _diag logs, I can dig in and try to figure out what might have happened.

As for the second issue, this is a known issue; we are working to roll out a fix, which requires an update to both the service and the agent.

@giddyelysium13 - Ok, I wanted to confirm that you were starting the agent via the run.cmd script. That script is supposed to restart the agent if it updates in --once mode. If you were running the Agent.Listener executable directly, that might explain the problem.
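As a rough illustration of what such a wrapper provides, the sketch below relaunches the listener when it exits because of a self-update; the exit-code value and the listener path are assumptions for illustration, not taken from run.cmd itself:

```python
import subprocess
import sys

# Hypothetical exit code meaning "the listener exited because it updated itself";
# the real value is defined by the agent, not by this sketch.
EXIT_AGENT_UPDATING = 3

def run_agent_once(listener_cmd: list) -> int:
    """Mimic what a wrapper like run.cmd does for a --once agent."""
    while True:
        exit_code = subprocess.call(listener_cmd)
        if exit_code == EXIT_AGENT_UPDATING:
            # The update replaced the agent binaries; relaunch the (new) listener
            # so the agent can still pick up its single job.
            continue
        return exit_code

if __name__ == "__main__":
    sys.exit(run_agent_once(["./bin/Agent.Listener", "run", "--once"]))
```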

You can email me at tommy.petty@microsoft.com with the logs and I will take a look to see if I can figure out what might be causing this issue.

As a heads up, we published a new agent version on Friday and have begun to roll it out.

@giddyelysium13 - Also, how are you starting up the agent?

@giddyelysium13 - I apologize for the inconvenience this event caused you and your team.

The intended behavior of the agent is that update requests do not count as the one job the agent is supposed to run.

Do you happen to have any of the logs from the agents that updated and then did not wait for a job?

Is it possible there was something about the way the agent was configured that caused the update process to fail?

@alex-peck - Can you verify that we have not introduced a bug into the agent with respect to this scenario?