airflow: Unreachable Secrets Backend Causes Web Server Crash

Apache Airflow version:

1.10.12

Kubernetes version (if you are using kubernetes) (use kubectl version):

n/a

Environment:

Cloud provider or hardware configuration: Amazon MWAA
OS (e.g. from /etc/os-release): Amazon Linux (latest)
Kernel (e.g. uname -a): n/a
Install tools: n/a

What happened:

If an unreachable secrets.backend is specified in airflow.cfg the web server crashes

What you expected to happen:

An invalid secrets backend should be ignored with a warning, and the system should default back to the metadatabase secrets

How to reproduce it:

In an environment without access to AWS Secrets Manager, add the following to your airflow.cfg:

[secrets]
backend = airflow.contrib.secrets.aws_secrets_manager.SecretsManagerBackend

or an environment without access to SSM specifiy:

[secrets]
backend = airflow.contrib.secrets.aws_systems_manager.SystemsManagerParameterStoreBackend

Reference: https://airflow.apache.org/docs/apache-airflow/1.10.12/howto/use-alternative-secrets-backend.html#aws-ssm-parameter-store-secrets-backend

About this issue

Original URL
State: closed
Created 3 years ago
Reactions: 1
Comments: 17 (17 by maintainers)

Commits related to this issue

Secrets backend failover (#16404) Currently Airflow does not check the default secrets backends (env and metastore db) if there is any sort of connection related error to an Alternative Backend, caus... — committed to apache/airflow by fhoda 3 years ago

Most upvoted comments

As of now it seems the expected behavior is not what is happening and is inconsistent across different secret backends.

I have tried to reproduce this issue with Airflow 2.0 (main branch) and am not able to do so for any AWS secrets backends. I was only able to reproduce a crashing webserver for GCP Secret Manager and not any other secrets backend.

The GCP Secret Manager error seems more to do with the function to get the credentials and not the actual connection.

google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authenticati│google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentica
on/getting-started

I used the airflow.providers.* secrets packages for each. I noticed that the original post on the issue uses the contrib package and Airflow 1.10.12.

#export AIRFLOW__SECRETS__BACKEND=airflow.providers.hashicorp.secrets.vault.VaultBackend
#export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_path": "airflow/connections", "variables_path": "airflow/variables", "config_path": "airflow/config", "url": "http://127.0.0.1:8200", "token": "$VAULT_TOKEN"}'


export AIRFLOW__SECRETS__BACKEND=airflow.providers.google.cloud.secrets.secret_manager.CloudSecretManagerBackend
export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_prefix": "airflow-connections", "variables_prefix": "airflow-variables", "gcp_keyfile_dict": $GCP_SECRET_MANAGER_SA_KEY}'


#export AIRFLOW__SECRETS__BACKEND=airflow.providers.microsoft.azure.secrets.azure_key_vault.AzureKeyVaultBackend
#export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_prefix": "airflow-connections", "variables_prefix": null, "vault_url": "https://example-akv-resource-name.vault.azure.net/"}'


#export AIRFLOW__SECRETS__BACKEND=airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
#export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables", "profile_name": "default"}'



#export AIRFLOW__SECRETS__BACKEND=airflow.providers.amazon.aws.secrets.systems_manager.SystemsManagerParameterStoreBackend
#export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_prefix": "/airflow/connections", "variables_prefix": "/airflow/variables", "profile_name": "default"}'

Here are my findings:

AWS Secret Manager - No Crash
AWS SSM - No Crash
Vault - No Crash
Azure Key Vault - No Crash
GCP Secret Manager - Crash

I believe we should evaluate what the expected behavior should be as compared to what is actually happening.

Also after discussing with @kaxil there may be a middle ground for fail over implementation that could make sense here.

If configs are being retrieved through the secrets backend then a failure makes sense.
If connections and/or variables are not able to be retrieved, then fail over could be a strategy used by users to ensure DAG/task success and predictable execution.

fhoda on Jun 7, 2021

Agree we have consistency issue here - Interestingly, the AWS secret manager crashed originally for @subashcanapathy and @john-jac but did not crash for you @fhoda. Not sure what the reason is for that (maybe the 1.10 vs 2.* behavioral difference)?

I really like the idea of different behavior for different type of access. I think it answers my concerns perfectly and what it really boils down to is “who” is the “client” - whether it is “airflow” or the “DAG/task writer”.

I think the main difference of configuration vs. variables and connections is that Airflow has default values for most of the configurations and when they are not found, they will fall-back to the default values - which might alter behavior of airflow. So lack of secrets backend when it is configured and configuration is retrieved is very dangerous. And since it is accessed under-the hood by Airflow, without the “dag” or “task” using it, it’s airflow that is the “client” and it’s airflow that should handle it (and crashing is the only reasonable behavior IMHO). Simply “dag writer” is not in a control to make any decision here.

This is (as you rightfully noticed), far less of a concern for connections and variables - “clients” for those are “dag writers”. Whoever uses them should be prepared for what happens when the secret backend is missing. Either the “writers” will prepare fallback values for those in the DB or they will have to handle “missing” value somehow (and this is up to the ‘user’ what to do in this case). But they are in full control, there is no need to crash Airflow (yet! - until configuration is not accessed by Airflow itself).

Reopening it as it might actually be an actionable item to do 😃

@subashcanapathy , @john-jac - would that be a reasonable approach for you as well ?

potiuk on Jun 7, 2021

A web application that is stateless (or intended to be that in long term) should fail gracefully on such situations. A secrets provider is at best a plugin and not a core feature. I would understand if the meta-DB connection failed and or the IDP provider connection failed - then it makes sense to prevent startup. If the customer made a mistake on configuring the secrets backend, booting up the web UI will make it obvious to them as the task failures and logs are viewable. Without this we are just assuming the user has access to logs on a box to be able to even understand where this went wrong.

Airflow webserver should be stateless representation of current state of things in the environment. I request to consider re-opening this so that we can make a configuration control like webserver.failsafe_secrets_backend = true

@potiuk @kaxil

subashcanapathy on Jun 2, 2021

@potiuk Thank you very much for the detailed reasoning process! The concern is that if the an operation doesn’t even need secrets backend (no need to use any connection or variable), should it crash or continue working? But generally it is a good idea to crash due to a bad configuration when starting an APP.

Airflow is a distributed system there is never ‘single’ operation. Even if your operation does not need it, there can be many more tasks and jobs that might. They share co figuration and processes sometimes.

Really trying to pass on the bad configuration problem form the operator who should configure Airflow well to Airflow itself is very bad idea. Really crashing is the best signal Airflow can give to the human operating it ‘please fix it’.

potiuk on Jun 2, 2021