airflow: Scheduler livenessProbe errors on new helm chart
Official Helm Chart version
1.6.0 (latest released)
Apache Airflow version
v2.1.2
Kubernetes Version
v1.22.10 (GKE version v1.22.10-gke.600)
Helm Chart configuration
Only the `livenessProbe` config (unchanged before and during the issue):
```yaml
# Airflow scheduler settings
scheduler:
  livenessProbe:
    initialDelaySeconds: 10
    timeoutSeconds: 15
    failureThreshold: 10
    periodSeconds: 60
```
Docker Image customisations
Here’s the image we use based on the apache airflow image:
```dockerfile
### Main official airflow image
FROM apache/airflow:2.1.2-python3.8

USER root
RUN apt update
RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections

### Add OS packages here
## GCC compiler in case it's needed for installing python packages
RUN apt install -y -q build-essential

USER airflow

### Changing the default SSL / TLS mode for mysql client to work properly
## https://askubuntu.com/questions/1233186/ubuntu-20-04-how-to-set-lower-ssl-security-level
## https://bugs.launchpad.net/ubuntu/+source/mysql-8.0/+bug/1872541
## https://stackoverflow.com/questions/61649764/mysql-error-2026-ssl-connection-error-ubuntu-20-04
RUN echo $'openssl_conf = default_conf\n\
[default_conf]\n\
ssl_conf = ssl_sect\n\
[ssl_sect]\n\
system_default = ssl_default_sect\n\
[ssl_default_sect]\n\
MinProtocol = TLSv1\n\
CipherString = DEFAULT:@SECLEVEL=1' >> /home/airflow/.openssl.cnf

## OS env var to point to the new openssl.cnf file
ENV OPENSSL_CONF=/home/airflow/.openssl.cnf

### Add airflow providers
RUN pip install apache-airflow-providers-apache-beam
### End airflow providers

### Add extra python packages
RUN pip install python-slugify==3.0.3
### End extra python packages
```
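For completeness, pointing the chart at a custom image like this is done through the chart's image values, roughly like the sketch below (the repository and tag shown are hypothetical placeholders, not our real values):

```yaml
# values.yaml sketch: use the custom image with the official chart
# (repository/tag below are hypothetical placeholders)
images:
  airflow:
    repository: gcr.io/our-project/custom-airflow
    tag: 2.1.2-custom
```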
What happened
After the upgrade to helm chart `1.6.0`, the scheduler POD was restarting as the `livenessProbe` was failing.

The command for the new `livenessProbe` from helm chart `1.6.0`, tested directly on our scheduler POD:
```
airflow@yc-data-airflow-scheduler-0:/opt/airflow$ CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
> airflow jobs check --job-type SchedulerJob --hostname $(hostname)
No alive jobs found.
```

Removing the `--hostname` argument works and a live job is found:

```
airflow@yc-data-airflow-scheduler-0:/opt/airflow$ CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
> airflow jobs check --job-type SchedulerJob
Found one alive job.
```
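For context, chart `1.6.0` wires this check into the scheduler container as an exec probe; combined with our values it corresponds to roughly the following pod spec (a reconstruction from the command above, not the chart's exact templating):

```yaml
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - |
        CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
          airflow jobs check --job-type SchedulerJob --hostname $(hostname)
  initialDelaySeconds: 10
  timeoutSeconds: 15
  failureThreshold: 10
  periodSeconds: 60
```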
What you think should happen instead
The `livenessProbe` should not error.
How to reproduce
No response
Anything else
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- State: closed
- Created 2 years ago
- Reactions: 2
- Comments: 19 (5 by maintainers)
---

I have the same issue with airflow `2.4.1` for the liveness check on the `dag-processor`'s pod. The default command is not working for me, and neither is `$(hostname -i)`; only after removing the `--hostname` flag does it work.

---

hey @fernhtls, I think the issue might not be directly related to `AIRFLOW__CORE__HOSTNAME_CALLABLE`. This might be obvious, but still, as per your findings it is related to the pod hostname not being resolved. So as a workaround, `$(hostname -i)` instead of `$(hostname)` may be used in the liveness probe command as well.
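In chart-values terms, that swap would look roughly like the sketch below (assuming a chart version that exposes a `command` override under `scheduler.livenessProbe`; check your chart's values schema before relying on it):

```yaml
scheduler:
  livenessProbe:
    # Sketch: match the alive job by IP instead of by hostname
    command:
      - sh
      - -c
      - |
        CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
          airflow jobs check --job-type SchedulerJob --hostname $(hostname -i)
```

Whether the IP actually matches what the scheduler recorded depends on the configured hostname callable (e.g. `airflow.utils.net.get_host_ip_address`), which is an assumption here.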
---

hi @ilyadinaburg, I didn't have the time to properly check the whole issue itself; in my case I'm pushing a custom `livenessProbe` to the helm chart without the `--hostname` argument, as below:
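(The original snippet was lost in formatting; a minimal sketch of such an override, with `--hostname` dropped, might look like this:)

```yaml
scheduler:
  livenessProbe:
    # Sketch: drop --hostname so any alive SchedulerJob passes the check
    command:
      - sh
      - -c
      - |
        CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
          airflow jobs check --job-type SchedulerJob
```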
I didn't check the env var `AIRFLOW__CORE__HOSTNAME_CALLABLE`, as I had to push a quick fix because the scheduler was restarting and breaking the execution of DAGs / tasks. Good that the issue seems to be reproducible, as you're using the same new latest helm chart version as I am (`1.6.0`).

---

I have the same issue with chart 1.9.0 and airflow 2.5.3. What did I miss?
---

@Subhashini2610 have you upgraded the chart to `1.7.0`? The issue is solved for me using chart 1.7.0 and airflow 2.5.0.

---

@eduardchai use airflow 2.5.0; the liveness command supports the `--local` argument starting in 2.5.0.
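With Airflow 2.5.0+, the check can be scoped to the local host without passing a hostname at all, roughly like this sketch (per the comments above, newer chart versions template this for you):

```yaml
scheduler:
  livenessProbe:
    # Sketch: --local (Airflow 2.5.0+) checks only jobs running on this host
    command:
      - sh
      - -c
      - |
        CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
          airflow jobs check --job-type SchedulerJob --local
```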
---

@VladimirYushkevich I believe the Standalone DAG Processor issue is totally different. I have documented my theory in the following issue: https://github.com/apache/airflow/issues/27140

---

`$(hostname -f)` worked for us.