amazon-ecs-plugin: Agents fail to provision after restart

After Jenkins has been restarted, agents fail to provision with the following messages in the logs:

May 22, 2020 6:38:55 AM FINE com.cloudbees.jenkins.plugins.amazonecs.ECSLauncher
ECS: Launching agent
May 22, 2020 6:38:55 AM FINE com.cloudbees.jenkins.plugins.amazonecs.ECSLauncher
[ecs-cloud-ecs-main-fmcpc]: Creating Task in cluster null
May 22, 2020 6:38:55 AM WARNING com.cloudbees.jenkins.plugins.amazonecs.ECSLauncher launch
[ecs-cloud-ecs-main-fmcpc]: Error in provisioning; agent=com.cloudbees.jenkins.plugins.amazonecs.ECSSlave[ecs-cloud-ecs-main-fmcpc]
java.lang.NullPointerException
	at com.cloudbees.jenkins.plugins.amazonecs.ECSService.registerTemplate(ECSService.java:150)
	at com.cloudbees.jenkins.plugins.amazonecs.ECSLauncher.getTaskDefinition(ECSLauncher.java:205)
	at com.cloudbees.jenkins.plugins.amazonecs.ECSLauncher.launch(ECSLauncher.java:107)
	at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:292)
	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

May 22, 2020 6:38:55 AM FINER com.cloudbees.jenkins.plugins.amazonecs.ECSLauncher
[ecs-cloud-ecs-main-fmcpc]: Removing Jenkins node

All builds using ECS agents fail the same way. For context, we use Fargate agents in declarative pipelines, some with overrides on memory, CPU or image.

Modifying and saving the agent config resolves the issue temporarily, but it returns as soon as Jenkins is restarted.

  • Jenkins v2.222.3
  • amazon-ecs-plugin v1.34

~The bit that caught my attention in the logs was "Creating Task in cluster null" - presumably that’s not a good sign? Any ideas why the cluster would be null after a restart?~ (This appears to be unrelated; even successful provisioning logs the same line.)

This only seems to have begun occurring after we upgraded from v1.26 of the plugin.

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 10
  • Comments: 26 (5 by maintainers)

Most upvoted comments

Fix that worked for me: go to https://<jenkins>/configureClouds/ and click Save, then delete the nodes which were being created.

I think I have worked out the fix for at least one variation of this (the NPE on registerTemplate).

It was difficult to debug: I could never get the debugger to pause in the ECSCloud constructor, so I concluded that the ECSCloud class may be serialized and restored without the constructor running. In my debugging, ECSService always ended up with its Supplier set to null after a restart. Thankfully, ECSService is already lazy-loaded via a call to ECSCloud.getEcsService(), so it doesn’t actually need to be preserved with ECSCloud. I switched that field to transient and have since had several successful restarts without encountering this error.

I’ve created PR #216 if anybody would like to test it and verify that they see the same success.
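
For anyone unfamiliar with the pattern, here is a minimal sketch of the "transient field plus lazy getter" idea described above. It is not the plugin's code: DemoCloud and DemoService are hypothetical stand-ins for ECSCloud and ECSService, and plain Java serialization is used as a stand-in for however Jenkins actually persists and restores the cloud configuration. The only point is that a transient field comes back as null after a round trip and the getter rebuilds it on first use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for ECSService: runtime-only state, not serializable.
class DemoService {
    String describe() {
        return "service ready";
    }
}

// Hypothetical stand-in for ECSCloud: the persisted configuration object.
class DemoCloud implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String name;

    // transient: skipped when the object is written, so it is null after a restore.
    private transient DemoService service;

    DemoCloud(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    // Lazy getter: rebuilds the service whenever the field is missing.
    synchronized DemoService getService() {
        if (service == null) {
            service = new DemoService();
        }
        return service;
    }

    // Exposed only so the demo can observe the field without triggering the lazy load.
    synchronized boolean serviceInitialized() {
        return service != null;
    }
}

class LazyTransientDemo {
    // Serialize and deserialize in memory, imitating a save followed by a restart.
    static DemoCloud roundTrip(DemoCloud cloud) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(cloud);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (DemoCloud) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        DemoCloud original = new DemoCloud("ecs-main");
        original.getService(); // initialize the service before "restart"

        DemoCloud restored = roundTrip(original);
        System.out.println("initialized after restore: " + restored.serviceInitialized());
        System.out.println("lazy getter result: " + restored.getService().describe());
    }
}
```

Because the getter never assumes the field survived the restore, callers that always go through getService() cannot hit the null that a stale, half-restored field would produce.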

Some additional detail on a curious behaviour: if the first build after a restart does not override the agent settings, then everything seems to be fine thereafter:

agent { label 'ecs-main' }

However, if the first build after a restart uses an agent with overridden settings (e.g. image, cpu, memory), then this error occurs until the Cloud configuration is re-saved:

agent { 
    ecs {
        inheritFrom 'ecs-main'
        cpu 256
        memory 512
    }
}

Any ideas what might cause this, or whether we are doing something unintended?