airflow: Unreachable Secrets Backend Causes Web Server Crash
Apache Airflow version: 1.10.12
Kubernetes version (if you are using kubernetes) (use kubectl version): n/a
Environment:
- Cloud provider or hardware configuration: Amazon MWAA
- OS (e.g. from /etc/os-release): Amazon Linux (latest)
- Kernel (e.g. uname -a): n/a
- Install tools: n/a
What happened:
If an unreachable secrets backend is specified in airflow.cfg, the web server crashes.
What you expected to happen:
An invalid secrets backend should be ignored with a warning, and the system should fall back to the metadatabase secrets.
How to reproduce it:
In an environment without access to AWS Secrets Manager, add the following to your airflow.cfg:
[secrets]
backend = airflow.contrib.secrets.aws_secrets_manager.SecretsManagerBackend
or in an environment without access to SSM, specify:
[secrets]
backend = airflow.contrib.secrets.aws_systems_manager.SystemsManagerParameterStoreBackend
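The crash happens because the configured backend path is imported and instantiated eagerly at startup. The following is a minimal sketch of that resolution pattern (hypothetical helper names, not Airflow's actual API) showing why an error in the configured backend propagates instead of falling back to the metastore:

```python
# Sketch (hypothetical names) of eager secrets-backend resolution:
# the dotted path from [secrets] backend is imported and instantiated
# with no error handling, so an unreachable service or bad path
# crashes the caller rather than degrading to the metastore.
from importlib import import_module


def resolve_backend(dotted_path):
    """Import 'package.module.ClassName' and return the class object."""
    module_path, _, class_name = dotted_path.rpartition(".")
    return getattr(import_module(module_path), class_name)


def load_secrets_backends(configured_path=None):
    """Return the ordered search list of backends, configured one first."""
    backends = []
    if configured_path:
        # No try/except here -- mirroring the reported behavior, any
        # import or connection failure propagates and kills the process.
        backends.append(resolve_backend(configured_path)())
    backends.append(object())  # stand-in for the default metastore backend
    return backends
```

Calling `load_secrets_backends("no.such.Backend")` raises immediately, which is the behavior this issue reports for an unreachable AWS backend.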
About this issue
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 17 (17 by maintainers)
As of now it seems the expected behavior is not what is happening, and the behavior is inconsistent across different secrets backends.
I have tried to reproduce this issue with Airflow 2.0 (main branch) and am not able to do so for any AWS secrets backends. I was only able to reproduce a crashing webserver for GCP Secret Manager and not any other secrets backend.
The GCP Secret Manager error seems more to do with the function to get the credentials and not the actual connection.
I used the airflow.providers.* secrets packages for each. I noticed that the original post on the issue uses the contrib package and Airflow 1.10.12. Here are my findings:
I believe we should evaluate what the expected behavior should be as compared to what is actually happening.
Also, after discussing with @kaxil, there may be a middle ground for a failover implementation that could make sense here.
Agree we have a consistency issue here - Interestingly, the AWS Secrets Manager backend crashed originally for @subashcanapathy and @john-jac but did not crash for you, @fhoda. Not sure what the reason is for that (maybe the 1.10 vs 2.* behavioral difference)?
I really like the idea of different behavior for different types of access. I think it answers my concerns perfectly, and what it really boils down to is “who” the “client” is - whether it is “airflow” or the “DAG/task writer”.
I think the main difference between configuration vs. variables and connections is that Airflow has default values for most of the configuration options, and when they are not found, it will fall back to the default values - which might alter the behavior of Airflow. So a missing secrets backend, when one is configured and configuration is retrieved from it, is very dangerous. And since it is accessed under the hood by Airflow, without the “dag” or “task” using it, it’s Airflow that is the “client” and it’s Airflow that should handle it (and crashing is the only reasonable behavior IMHO). Simply put, the “dag writer” is not in control to make any decision here.
This is (as you rightfully noticed) far less of a concern for connections and variables - the “clients” for those are the “dag writers”. Whoever uses them should be prepared for what happens when the secrets backend is missing. Either the “writers” will prepare fallback values for those in the DB or they will have to handle the “missing” value somehow (and it is up to the ‘user’ what to do in that case). But they are in full control, and there is no need to crash Airflow (yet! - as long as configuration is not accessed by Airflow itself).
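The distinction above can be sketched as a lookup that fails hard for configuration (where Airflow itself is the client) but degrades to the metastore for connections and variables (where the DAG writer is the client). All names here are hypothetical, not Airflow's actual API:

```python
import logging

log = logging.getLogger(__name__)


def get_secret(kind, key, backend, metastore):
    """Look up `key` of a given kind: 'config', 'connection', or 'variable'.

    Configuration lookups re-raise backend errors: silently applying a
    default could change Airflow's behavior, so crashing is the only
    safe signal to the operator. Connection/variable lookups log a
    warning and fall back to the metastore, leaving the decision to
    the DAG writer.
    """
    try:
        return backend.get(key)
    except Exception:
        if kind == "config":
            raise  # Airflow is the client here; fail loudly
        log.warning("Secrets backend unreachable; falling back to metastore for %s %r", kind, key)
        return metastore.get(key)
```

The design choice is that the failure mode follows the party who can act on it: the operator can fix a broken backend, so configuration errors surface to them; the DAG writer can provide fallback values, so their lookups degrade gracefully.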
Reopening it as it might actually be an actionable item to do 😃
@subashcanapathy , @john-jac - would that be a reasonable approach for you as well ?
A web application that is stateless (or intended to be, in the long term) should fail gracefully in such situations. A secrets provider is at best a plugin and not a core feature. I would understand if the meta-DB connection failed or the IDP provider connection failed - then it makes sense to prevent startup. If the customer made a mistake in configuring the secrets backend, booting up the web UI will make it obvious to them, as the task failures and logs are viewable. Without this we are just assuming the user has access to logs on a box to even understand where this went wrong.
The Airflow webserver should be a stateless representation of the current state of things in the environment. I request to consider re-opening this so that we can add a configuration control like
webserver.failsafe_secrets_backend = true
@potiuk @kaxil
Airflow is a distributed system; there is never a ‘single’ operation. Even if your operation does not need it, there can be many more tasks and jobs that might. They sometimes share configuration and processes.
Really, trying to pass the bad-configuration problem from the operator, who should configure Airflow well, onto Airflow itself is a very bad idea. Crashing is the best signal Airflow can give to the human operating it: ‘please fix it’.