kubernetes: Notification of failed CronJobs
Chronos can notify users of failures of scheduled jobs via email. In a discussion with a potential K8s user, the question came up whether K8s will have an email-notification feature.
About the Chronos Feature:
- Chronos flags documentation
- option 1: HTTP notification on failure
  - uses these flags on the Chronos server: `--http_notification_credentials` and `--http_notification_url`
- option 2: mail notification
  - uses these flags: `--mail_from`, `--mail_password`, `--mail_server`, `--mail_ssl`, `--mail_user`
Questions for Kubernetes:
- Should we do this at all, or let the community build their own solutions?
- If we do build something, is it a one-size-fits-all solution that we own, or an example implementation with ways to plug in alternatives?
- I think the answer is the latter, since alternative impls may have more features. For example, Google Cloud Monitoring provides alerting "via email, SMS, PagerDuty, HipChat and more".
- Similar to the way Google Cloud Logging can be used in place of Elasticsearch logging?
- Setting up reasonable defaults here is the tricky part.
- Should we make email/HTTP failure notification tightly coupled with ScheduledJob, like Chronos does, or create a generic mechanism for notifying about arbitrary events from controllers?
- I think the answer is generic, but care needs to be taken that the mechanism is not so flexible that configuring it is too hard. Reasonable defaults are needed.
- Is it triggered off of events, or off of the conditions of objects? (A sketch of the event-driven approach follows this list.)
- Events are nice because you don’t have to discover new object types.
- Conditions are considered a formal part of the API, while we have been fuzzy about events.
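To make the events option concrete, here is a minimal sketch using client-go that streams cluster events and picks out failure-looking ones. The `reason=FailedJob` field selector is an assumption for illustration; the actual reason string a ScheduledJob controller would emit is not settled by this issue.

```go
// Minimal sketch of the event-driven option, using client-go: stream
// cluster events and pick out failure-looking ones. The "FailedJob"
// reason string is an assumption, not a settled API constant.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch events in all namespaces, narrowed by a field selector on reason.
	w, err := client.CoreV1().Events("").Watch(context.TODO(), metav1.ListOptions{
		FieldSelector: "reason=FailedJob", // assumed reason string
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		e, ok := ev.Object.(*corev1.Event)
		if !ok {
			continue
		}
		// A real notifier would hand this off to email/HTTP/etc.
		fmt.Printf("%s %s/%s: %s\n", e.Reason, e.Namespace, e.InvolvedObject.Name, e.Message)
	}
}
```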
Possible implementations:
- nothing: do nothing by default. Assume users will use commercial solutions (Sysdig, Datadog, etc.) and/or find examples of how to configure popular OSS to alert.
- De-novo: write a new Go controller that has its own configuration object, watches events (or conditions), and sends emails about matching events (or conditions). (See the first sketch after this list.)
- drawback: users have to learn new monitoring configuration language
- drawback: something else to support.
- advantage: if we own the configuration language, we can set good defaults more easily when kube-up-ing?
- ELK: shove some/all K8s events through the ELK (Elasticsearch/Logstash/Kibana) pipeline. Include an alert line in the example Logstash config to alert on ScheduledJob failures.
- advantage: builds on our existing ELK setup.
- advantage: users of competing stacks will probably know how to mimic our example, as they will be used to mimicking ELK.
- advantage: possibility of showing failure event in context with log output from the job that failed.
- HIG: write a Go program that watches events (or maybe directly watches objects) and exports the condition of objects, such as ScheduledJobs, as a time-series metric (e.g. failure-condition-count). Have those stats be collected by the Heapster-InfluxDB-Grafana pipeline or a Prometheus pipeline, and users can write alerts on changes in the time-series metric. (See the second sketch after this list.)
- disadvantage: Prometheus supports email alerting, but we don't have a standard Prometheus setup
- disadvantage: HIG does not appear to support email alerting
- disadvantage: not clear if time-series is the best representation for a “failure event”.
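As a rough illustration of the de-novo option, here is a hedged sketch of a notifier with its own configuration object that matches events by reason and emails on match. `NotifierConfig`, its fields, and the addresses are hypothetical, not an existing Kubernetes API type; the SMTP details stand in for what Chronos configures with `--mail_server`, `--mail_from`, etc.

```go
// Sketch of a de-novo notifier. NotifierConfig is a hypothetical
// configuration object, not an existing Kubernetes type.
package main

import (
	"fmt"
	"net/smtp"
	"strings"
)

type NotifierConfig struct {
	MatchReasons []string // event reasons to alert on, e.g. "FailedJob"
	MailServer   string   // SMTP host:port, cf. Chronos --mail_server
	MailFrom     string   // cf. Chronos --mail_from
	MailTo       []string
}

// Event is a stripped-down stand-in for a Kubernetes event.
type Event struct {
	Reason  string
	Message string
	Object  string // namespace/name of the involved object
}

// maybeNotify emails about the event if its reason is in MatchReasons.
func (c *NotifierConfig) maybeNotify(ev Event) error {
	matched := false
	for _, r := range c.MatchReasons {
		if ev.Reason == r {
			matched = true
			break
		}
	}
	if !matched {
		return nil
	}
	msg := fmt.Sprintf("To: %s\r\nSubject: [k8s] %s on %s\r\n\r\n%s\r\n",
		strings.Join(c.MailTo, ", "), ev.Reason, ev.Object, ev.Message)
	// Unauthenticated SMTP for brevity; a real controller would take
	// credentials comparable to Chronos's --mail_user/--mail_password.
	return smtp.SendMail(c.MailServer, nil, c.MailFrom, c.MailTo, []byte(msg))
}

func main() {
	cfg := &NotifierConfig{
		MatchReasons: []string{"FailedJob"},
		MailServer:   "smtp.example.com:25",
		MailFrom:     "k8s-alerts@example.com",
		MailTo:       []string{"oncall@example.com"},
	}
	ev := Event{Reason: "FailedJob", Object: "default/nightly-backup", Message: "job has failed"}
	if err := cfg.maybeNotify(ev); err != nil {
		fmt.Println("notify failed:", err)
	}
}
```

And for the HIG option, a sketch using the Prometheus Go client to expose a failure-condition count as a scrapeable gauge. The metric name and label are assumptions, and a real exporter would drive the gauge from a watch rather than setting it statically.

```go
// Sketch of the HIG option: export a failure-condition count as a gauge
// for a Heapster/InfluxDB/Grafana or Prometheus pipeline to collect.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var failureConditions = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "scheduledjob_failure_condition_count", // hypothetical name
		Help: "Number of ScheduledJobs currently reporting a failure condition.",
	},
	[]string{"namespace"},
)

func main() {
	prometheus.MustRegister(failureConditions)

	// In a real exporter this would be driven by a watch on ScheduledJob
	// objects; set a value here only so the endpoint serves something.
	failureConditions.WithLabelValues("default").Set(0)

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9090", nil)
}
```

Users could then write a Prometheus alerting rule on changes in this series, which gets back the email delivery that HIG itself appears to lack.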
I think I like options 1 (nothing) and 3 (ELK).
About this issue
- State: closed
- Created 8 years ago
- Comments: 17 (12 by maintainers)
I need this feature, but I don't think this should be in k8s. Instead, as @bgrant0607 said, k8s should expose such events via the API. @sstarcher's work polls job status periodically, but ideally there should be a way to subscribe to any job failure (not polling). Not sure k8s currently has that, though. For example, a lifecycle hook for CronJob would be another way…
I'm not 100% sure that the CronJob API currently returns any type of information wrt job failures. First thing, we don't allow job failures yet; see https://github.com/kubernetes/community/pull/583 for a solution to this problem. Once we have this in, it's worth working on returning such information to users, either via conditions or events. I doubt we do right now. But we should, and that's the direction I'd like this issue to go.
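To illustrate what the conditions route could look like for a consumer once Jobs report failure (i.e. subscribing rather than polling, as asked above), here is a hedged client-go sketch that watches Jobs and reacts when a Failed condition becomes true; the notification hand-off itself is left out.

```go
// Sketch of consuming failures via conditions: watch Jobs and react when
// the JobFailed condition becomes true, instead of polling job status.
package main

import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	w, err := client.BatchV1().Jobs("").Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		job, ok := ev.Object.(*batchv1.Job)
		if !ok {
			continue
		}
		for _, c := range job.Status.Conditions {
			if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue {
				// Hand off to whatever notification adapter is configured.
				fmt.Printf("job %s/%s failed: %s\n", job.Namespace, job.Name, c.Message)
			}
		}
	}
}
```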
The only forms of notification that should happen in the main repo are via the API: conditions and events.
Adapter components may be built in other repos to export status and events to other systems (monitoring/alerting, logging, databases, etc.).