airflow: Scheduler livenessProbe errors on new helm chart

Official Helm Chart version

1.6.0 (latest released)

Apache Airflow version

v2.1.2

Kubernetes Version

v1.22.10 (GKE version v1.22.10-gke.600)

Helm Chart configuration

The only livenessProbe config set, both before and during the issue:

# Airflow scheduler settings
scheduler:
  livenessProbe:
    initialDelaySeconds: 10
    timeoutSeconds: 15
    failureThreshold: 10
    periodSeconds: 60
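
For context, with these values chart 1.6.0 renders an exec probe on the scheduler container roughly like the following (a sketch of the rendered Kubernetes spec, not verbatim chart output; the probe command is the chart's new default, quoted verbatim under "What happened" below):

livenessProbe:
  initialDelaySeconds: 10
  timeoutSeconds: 15
  failureThreshold: 10
  periodSeconds: 60
  exec:
    command:
      - sh
      - -c
      - |
        CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
        airflow jobs check --job-type SchedulerJob --hostname $(hostname)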

Docker Image customisations

Here’s the image we use based on the apache airflow image:

### Main official airflow image
FROM apache/airflow:2.1.2-python3.8

USER root

RUN apt update

RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections

### Add OS packages here
## GCC compiler in case it's needed for installing python packages
RUN apt install -y -q build-essential

USER airflow

### Changing the default SSL / TLS mode for mysql client to work properly
## https://askubuntu.com/questions/1233186/ubuntu-20-04-how-to-set-lower-ssl-security-level
## https://bugs.launchpad.net/ubuntu/+source/mysql-8.0/+bug/1872541
## https://stackoverflow.com/questions/61649764/mysql-error-2026-ssl-connection-error-ubuntu-20-04
RUN echo $'openssl_conf = default_conf\n\
[default_conf]\n\
ssl_conf = ssl_sect\n\
[ssl_sect]\n\
system_default = ssl_default_sect\n\
[ssl_default_sect]\n\
MinProtocol = TLSv1\n\
CipherString = DEFAULT:@SECLEVEL=1' >> /home/airflow/.openssl.cnf
## OS env var to point to the new openssl.cnf file
ENV OPENSSL_CONF=/home/airflow/.openssl.cnf

### Add airflow providers
RUN pip install apache-airflow-providers-apache-beam
### End airflow providers

### Add extra python packages
RUN pip install python-slugify==3.0.3
### End extra python packages
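
To sanity-check that the lowered SECLEVEL is actually picked up at runtime, something like this can be run against the built image (a sketch: the image tag and MySQL host are placeholders, and it assumes the image ships OpenSSL 1.1.1+, which supports -starttls mysql):

# Hypothetical image tag and MySQL host, for illustration only
docker run --rm --entrypoint bash my-custom-airflow:2.1.2 -c '
  echo "OPENSSL_CONF=$OPENSSL_CONF"
  # With CipherString = DEFAULT:@SECLEVEL=1, a TLSv1 handshake should now be permitted
  openssl s_client -connect mysql.example.com:3306 -starttls mysql -tls1 </dev/null
'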

What happened

After the upgrade to helm chart 1.6.0, the scheduler pod kept restarting because the livenessProbe was failing.

Command from the new livenessProbe in helm chart 1.6.0, tested directly on our scheduler pod:

airflow@yc-data-airflow-scheduler-0:/opt/airflow$ CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
> airflow jobs check --job-type SchedulerJob --hostname $(hostname)
No alive jobs found.

Removing the --hostname argument works and a live job is found:

airflow@yc-data-airflow-scheduler-0:/opt/airflow$ CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
> airflow jobs check --job-type SchedulerJob
Found one alive job.
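
One way to see why the --hostname filter matches nothing is to compare the names the pod can report with the hostname the scheduler actually registered in the metadata database (a sketch; job is the standard Airflow metadata table that airflow jobs check queries, and airflow db shell assumes database access from inside the pod):

# From inside the scheduler pod: the names the pod can report
echo "hostname: $(hostname)  fqdn: $(hostname -f)  ip: $(hostname -i)"
# vs. the hostname recorded for the live SchedulerJob; "airflow db shell"
# opens a client against the configured metadata database, then run e.g.:
#   SELECT id, hostname, state, latest_heartbeat FROM job
#   WHERE job_type = 'SchedulerJob' ORDER BY latest_heartbeat DESC LIMIT 5;
airflow db shell

If the recorded hostname differs from every name above, the probe's --hostname $(hostname) filter can never match.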

What you think should happen instead

The livenessProbe should succeed while the scheduler is healthy, rather than failing and restarting the pod.

How to reproduce

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 19 (5 by maintainers)

Most upvoted comments

I have the same issue with airflow 2.4.1 when checking liveness on the dag-processor pod. The default is not working for me:

dag-processor-ccb9f9949-7zdtv:/opt/airflow$ sh -c "CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --hostname $(hostname)"
No alive jobs found.

as well as with $(hostname -i):

dag-processor-ccb9f9949-7zdtv:/opt/airflow$ sh -c "CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --hostname $(hostname -i)"
No alive jobs found.

It only works after removing the --hostname flag:

dag-processor-ccb9f9949-7zdtv:/opt/airflow$ sh -c "CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check"
Found one alive job.

hey @fernhtls, I think the issue might not be directly related to AIRFLOW__CORE__HOSTNAME_CALLABLE. This may be obvious, but per your findings it is related to the pod hostname not being resolved. So as a workaround, $(hostname -i) instead of $(hostname) may be used in the liveness probe command, as in the sketch below.
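
For reference, that suggestion as a full probe command (a sketch of the chart's default command with $(hostname -i) substituted; note the dag-processor report above found this variant failing too, so it may depend on the setup):

sh -c "CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --job-type SchedulerJob --hostname $(hostname -i)"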

hi @ilyadinaburg, I haven't had time to look into the whole issue properly; in my case I'm pushing a custom livenessProbe to the helm chart without the --hostname argument, as below:

scheduler:
  livenessProbe:
    initialDelaySeconds: 10
    timeoutSeconds: 15
    failureThreshold: 10
    periodSeconds: 60
    command:
      - sh
      - -c
      - |
        CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
        airflow jobs check --job-type SchedulerJob

I didn't check the AIRFLOW__CORE__HOSTNAME_CALLABLE env var, as I had to push a quick fix: the scheduler was restarting and breaking the execution of DAGs / tasks.

Good that the issue seems to be reproducible, since you're using the same latest helm chart version as I am (1.6.0).

I have the same issue with chart 1.9.0 and airflow 2.5.3. What did I miss?

@Subhashini2610 have you upgraded the chart to 1.7.0? The issue is solved for me using chart 1.7.0 and airflow 2.5.0.

@eduardchai

use airflow 2.5.0; the liveness cmd arg --local is supported starting in 2.5.0
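
For reference, with Airflow 2.5.0+ the probe can drop the explicit hostname entirely and use the new flag (a sketch of the check command; --local matches jobs running on the local host):

CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --job-type SchedulerJob --local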

@VladimirYushkevich I believe the Standalone DAG Processor issue is totally different. I have documented my theory in the following issue: https://github.com/apache/airflow/issues/27140

$(hostname -f) worked for us