cloud-custodian: aws - error handling & concurrency pattern in actions/filters revisit

When specifying a policy on an EBS volume with actions set to delete then notify, delete failures result in the notification still firing. I’m concerned that if I wanted to ensure a snapshot is created before delete, snapshot creation failures might not keep the delete from being attempted.

@kapilt looked and said this module should get cleaned up to properly report errors

About this issue

  • Original URL
  • State: open
  • Created 5 years ago
  • Comments: 30 (18 by maintainers)

Commits related to this issue

Most upvoted comments

i was also thinking about adding functionality regarding exception handling. For example, the max-resource configuration is an exception so errors out when triggered, but i might want to trigger a notify action instead. so similar to @tjstansell

 max-resource:50
 resource: ebs
 actions:
  - type: delete
 on_error:
  - exception: ResourceLimitExceeded
    type: notify 

Perhaps we should let each policy define its own error handling behavior. It seems that each policy has it’s own dependencies in that regard. If I have a policy like I’ve suggested with actions of:

  resource: ebs
  actions:
  - type: snapshot
  - type: delete
  - type: notify

What I’d like is to have the ability for the notify action to always fire and be able to report on all resources, but show which actions were successful and which were failures. I would also assume that only resources that were successfully snapshotted would get deleted. So maybe the resource list passed to each action should only include the successfully processed resources from the previous action, but the return status from all previous actions and all previous resources could be available for the action to reference?

Or maybe it could support some sort of on_error setting where you can define a single(?) action to perform on any resources that failed at a particular action (where successful resources would keep progressing through the list of actions):

  resource: ebs
  actions:
  - type: snapshot
    on_error:
      type: notify
      msg: These resources failed to snapshot.
  - type: delete
    on_error:
      type: notify
      msg: These resources were snapshot, but failed to delete.
  - type: notify
      msg: These resources were deleted after being snapshotted.

I was also brainstorming the idea of supporting something like a try, except, finally syntax to define actions:

  resource: ebs
  actions:
    - try:
      - type: snapshot
      - type: delete
    - finally:
      - type: notify

but I’m not sure there’s really a case for an except block … but i do like the explicit nature of how that shows the notify action will always be attempted regardless of how many try actions worked. Again, I think the interesting part here, is what the payload looks like that’s being passed to the finally action. Somehow, it would be great if the status if all of the resources actions could be referenced, but I honestly don’t know how complicated that might get … both to track and pass down as well as for each action to parse.

I guess in my head the resources listed for each action would represent which resources that action should take action on. This would be the set that have passed all previous actions. There could also be a results section somewhere that would show the results of all previous actions.

Anyway, that’s a long ramble, but as a consumer of this I wanted to throw my two cents in.

Thanks!

I started thinking about what kind of filtering would be most common for notify actions … or any other, really … and came up with this proposal:

Use a new with_result parameter to determine which resources to include in the action with these possible values:

  • ok, skip, error – look for this specific result from the last action
  • success – resources that have not had any errors from any previous actions (this would be the default)
  • failure – resources that have had errors in any previous actions
  • any – all resources are included … no filtering

This new with_result parameter would only be available to the following actions:

  • notify
  • invoke-lambda
  • webhook

Here’s an example:

actions:
  - snapshot
  - type: notify
    with_result: error
    action_desc: These resources failed to get snapshotted before being deleted.
  - delete
  - type: notify
    with_result: error
    action_desc: These resources had a snapshot created, but could not be deleted.
  - type: notify
     action_desc: These resources have had a snapshot created and have been deleted.

or alternatively:

actions:
  - snapshot
  - delete
  - type: notify
    with_result: failure
    action_desc: These resources have failed to be snapshotted or deleted.
  - type: notify
    with_result: success
    action_desc: These resources have had a snapshot created and have been deleted.
  - type: invoke-lambda
    function: my-function
    with_result: any

which would send a failure notification, success notification, then invoke a lambda with all the resources and the lambda could look at the c7n:Actions annotation to see if it failed or succeeded or whatever and do something else.

What about the case where you have multiple actions … there should be a way to at least notify folks of the actions that successfully happened to resources, even if not all actions were successful. Are you proposing that we need some way to monitor the metrics to detect when something failed, then somehow determine which resources failed, which relies on digging through cloudwatch logs from lambdas? … and create a notification from that? I’m not even sure how someone would do that.

I mean, moving to a place where we know a resource will move on to the next step only if it succeeds is great… (which is basically how I think it’s supposed to be working today) but I think many of us are hoping to have more control than that. I was really hoping that at the end of each action there was a way to trigger a specific action (probably just notify) for anything that failed. Or, at the end of the entire policy run, allow a way to call notify and include a payload where we can determine which step each resource successfully completed.

I’m guessing that most of us just want a rich notification that accurately represents what happened rather than wanting to try to do any other specific actions upon failure.

Perhaps chaining policies is one way to achieve this, but the current limitations of how caching works would prevent you from doing both actions in a single run. For instance: a policy to take a backup of a dynamodb table, then a second policy that deletes that table, but filtered only on tables who’ve had a backup within X hours (not that that filter exists yet, afaik). You’d have to run custodian with just the first policy, then again with the second to keep the original resource state that’s cached from being used in the second policy.