cloud-custodian: aws - error handling & concurrency pattern in actions/filters revisit
When specifying a policy on an EBS volume with actions set to delete then notify, delete failures result in the notification still firing. I’m concerned that if I wanted to ensure a snapshot is created before delete, snapshot creation failures might not keep the delete from being attempted.
@kapilt looked and said this module should get cleaned up to properly report errors
About this issue
- Original URL
- State: open
- Created 5 years ago
- Comments: 30 (18 by maintainers)
i was also thinking about adding functionality regarding exception handling. For example, the max-resource configuration is an exception so errors out when triggered, but i might want to trigger a notify action instead. so similar to @tjstansell
Perhaps we should let each policy define its own error handling behavior. It seems that each policy has it’s own dependencies in that regard. If I have a policy like I’ve suggested with actions of:
What I’d like is to have the ability for the
notifyaction to always fire and be able to report on all resources, but show which actions were successful and which were failures. I would also assume that only resources that were successfully snapshotted would get deleted. So maybe the resource list passed to each action should only include the successfully processed resources from the previous action, but the return status from all previous actions and all previous resources could be available for the action to reference?Or maybe it could support some sort of
on_errorsetting where you can define a single(?) action to perform on any resources that failed at a particular action (where successful resources would keep progressing through the list of actions):I was also brainstorming the idea of supporting something like a try, except, finally syntax to define actions:
but I’m not sure there’s really a case for an
exceptblock … but i do like the explicit nature of how that shows thenotifyaction will always be attempted regardless of how manytryactions worked. Again, I think the interesting part here, is what the payload looks like that’s being passed to thefinallyaction. Somehow, it would be great if the status if all of the resources actions could be referenced, but I honestly don’t know how complicated that might get … both to track and pass down as well as for each action to parse.I guess in my head the
resourceslisted for each action would represent which resources that action should take action on. This would be the set that have passed all previous actions. There could also be aresultssection somewhere that would show the results of all previous actions.Anyway, that’s a long ramble, but as a consumer of this I wanted to throw my two cents in.
Thanks!
I started thinking about what kind of filtering would be most common for
notifyactions … or any other, really … and came up with this proposal:Use a new
with_resultparameter to determine which resources to include in the action with these possible values:ok,skip,error– look for this specific result from the last actionsuccess– resources that have not had any errors from any previous actions (this would be the default)failure– resources that have had errors in any previous actionsany– all resources are included … no filteringThis new
with_resultparameter would only be available to the following actions:notifyinvoke-lambdawebhookHere’s an example:
or alternatively:
which would send a failure notification, success notification, then invoke a lambda with all the resources and the lambda could look at the
c7n:Actionsannotation to see if it failed or succeeded or whatever and do something else.What about the case where you have multiple actions … there should be a way to at least notify folks of the actions that successfully happened to resources, even if not all actions were successful. Are you proposing that we need some way to monitor the metrics to detect when something failed, then somehow determine which resources failed, which relies on digging through cloudwatch logs from lambdas? … and create a notification from that? I’m not even sure how someone would do that.
I mean, moving to a place where we know a resource will move on to the next step only if it succeeds is great… (which is basically how I think it’s supposed to be working today) but I think many of us are hoping to have more control than that. I was really hoping that at the end of each action there was a way to trigger a specific action (probably just
notify) for anything that failed. Or, at the end of the entire policy run, allow a way to callnotifyand include a payload where we can determine which step each resource successfully completed.I’m guessing that most of us just want a rich notification that accurately represents what happened rather than wanting to try to do any other specific actions upon failure.
Perhaps chaining policies is one way to achieve this, but the current limitations of how caching works would prevent you from doing both actions in a single run. For instance: a policy to take a backup of a dynamodb table, then a second policy that deletes that table, but filtered only on tables who’ve had a backup within X hours (not that that filter exists yet, afaik). You’d have to run custodian with just the first policy, then again with the second to keep the original resource state that’s cached from being used in the second policy.