pulumi-awsx: Fargate service updating failing with timeout

I am using the following code to deploy a container (certificate and Route 53 details dropped to simplify):

const nlb = new awsx.elasticloadbalancingv2.NetworkLoadBalancer("some-nlb", { external: true });
const tg = nlb.createTargetGroup("some-tg", { port: 3000 });
const listener = tg.createListener("some-http-listener", { port: 80 });
const listener2 = tg.createListener("https-listener", {
    certificateArn: certificate.arn,
    loadBalancerArn: nlb.arn,
    port: 443,
    protocol: "TLS",
    sslPolicy: "ELBSecurityPolicy-2016-08",
});

const service = new awsx.ecs.FargateService("some-app", {
    desiredCount: 1,
    taskDefinitionArgs: {
        containers: {
            myapp: {
                image: awsx.ecs.Image.fromPath("app", "../"),
                memory: 512,
                portMappings: [listener],
            },
        },
    },
});

It deploys correctly the first time, but when I run it again it gets stuck updating the task definition and throws a timeout error:

    ├─ awsx:x:ecs:FargateTaskDefinition  
+-  │  └─ aws:ecs:TaskDefinition         replaced                [diff: ~containerDefinitions]
    └─ awsx:x:ecs:FargateService         
~      └─ aws:ecs:Service                **updating failed**     [diff: ~taskDefinition]; 1 error


error: Plan apply failed: 1 error occurred:
       * updating urn:pulumi:dev::drjones::awsx:x:ecs:FargateService$aws:ecs/service:Service::service: timeout while waiting for state to become 'true' (last state: 'false', timeout: 10m0s)

The failure always happens while updating the task definition. Afterwards I have one extra running task instead of the existing task being replaced. The changes I made to the app code are applied successfully (I can see them in the new container), but the script fails and I am left with two tasks instead of one.

After some time this second task is stopped on AWS, but on the next pulumi up the script tries to delete the task from the previous deployment.

It is a weird situation that breaks consistency and makes it impossible to update reliably. In any case, it looks like something keeps processing for up to 10 minutes and then fails.
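For reference, the target group in the snippet above keeps the default NLB draining and health check settings. Below is a minimal sketch of tightening them, assuming slow target deregistration contributes to the 10-minute wait; the field names come from the underlying aws.lb.TargetGroup and the values are illustrative, so verify that awsx passes them through.

// Sketch: replace the createTargetGroup call above with tighter draining and
// health check settings so a rolling ECS update can finish sooner. This is an
// assumption about a contributing factor, not a confirmed fix for this issue.
const tg = nlb.createTargetGroup("some-tg", {
    port: 3000,
    deregistrationDelay: 30,   // default is 300 seconds of connection draining
    healthCheck: {
        interval: 10,          // NLB health checks only allow 10 or 30 seconds
        healthyThreshold: 2,
        unhealthyThreshold: 2, // NLB has historically required these to match
    },
});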

About this issue

  • State: open
  • Created 5 years ago
  • Reactions: 6
  • Comments: 30 (12 by maintainers)

Most upvoted comments

Been having a similar issue lately:

Updating (ef):

     Type                                 Name      Status                  Info
     pulumi:pulumi:Stack                  infra-ef  **failed**              1 error
     ├─ awsx:x:ecs:FargateTaskDefinition  app-svc2                          1 warning
     └─ awsx:x:ecs:FargateService         app-svc2
 ~      └─ aws:ecs:Service                app-svc2  **updating failed**     1 error

Diagnostics:
  awsx:x:ecs:FargateTaskDefinition (app-svc2):

  aws:ecs:Service (app-svc2):
    error: Plan apply failed: 1 error occurred:
        * updating urn:pulumi:ef::infra::awsx:x:ecs:FargateService$aws:ecs/service:Service::app-svc2: timeout while waiting for state to become 'true' (last state: 'false', timeout: 10m0s)

  pulumi:pulumi:Stack (infra-ef):
    error: update failed

Resources:
    37 unchanged

Duration: 10m45s

Is there a solution to this issue? Running it again doesn't always work.

I totally missed that there are WithContext methods, so I think we may want to consider using those in order to update the timeout.

I think ideally, we’d prefer a way to have this work cleanly instead of reverting the upstream waiter logic. I’ve filed https://github.com/aws/aws-sdk-go/issues/3844 to help with that. Separately, I think we could consider updating some of the defaults in awsx itself to make this less likely to happen.

@duro I assume this is a case where the Fargate Service does ultimately succeed in its update (even though Pulumi times out)? Could you share a log from your ECS console of the events that happened during the update? More than 15 minutes for an ECS Service update is surprising under any conditions I’m aware of, so I would love to better understand this so we can identify the right default here.

This definitely needs to be addressed. I am getting this failure 4 out of 5 times. Every time, the cluster does actually reach the correct state; only Pulumi fails. I even tried adding this to the resource options:

customTimeouts: { update: '20m' }

But it has no effect.
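One possible explanation (my assumption, not something confirmed by the maintainers here) is that customTimeouts is a per-resource option, so setting it on the awsx.ecs.FargateService component never reaches the child aws:ecs:Service that is actually timing out. Below is a sketch of forwarding it to that child via a resource transformation, reusing the service definition from the original snippet:

import * as pulumi from "@pulumi/pulumi";

// Sketch (assumption, not a confirmed fix): forward a longer update timeout to
// the child aws:ecs:Service, since customTimeouts set on the component itself
// does not propagate to the resources it creates.
const service = new awsx.ecs.FargateService("some-app", {
    desiredCount: 1,
    taskDefinitionArgs: {
        containers: {
            myapp: {
                image: awsx.ecs.Image.fromPath("app", "../"),
                memory: 512,
                portMappings: [listener],
            },
        },
    },
}, {
    transformations: [args => {
        // Only touch the raw ECS service; leave every other child resource alone.
        if (args.type === "aws:ecs/service:Service") {
            const opts: pulumi.CustomResourceOptions = {
                ...args.opts,
                customTimeouts: { update: "20m" },
            };
            return { props: args.props, opts };
        }
        return undefined;
    }],
});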