airflow: Airflow Elasticsearch configuration log output does not contain required elements
Context
What I'm going to describe here is more or less related to #7141, but the context is broader than described there, which is why I'm opening a separate bug report. I'm struggling with this issue, too. I have tried a lot to get the log entries to show up in the web GUI. Maybe I'm wrong, but in my view this whole Elasticsearch process cannot be used at all. I followed the entire discussion in #7141, but it seems some of the points are still not clear. I'll try to explain the issues at a higher level instead of talking Python.
My configuration:
[elasticsearch]
host = http://elasticsearch_de:9200
log_id_template = {{dag_id}}-{{task_id}}-{{execution_date}}-{{try_number}}
end_of_log_mark = end_of_log
write_stdout = True
json_format = True
json_fields = execution_date, asctime, filename, lineno, levelname, message
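For reference, lines like the ones shown below can be produced by a json_fields-driven formatter along these lines; this is a rough imitation for illustration, not Airflow's actual formatter:

```python
import json
import logging

# Rough imitation of a json_fields-driven formatter (illustrative assumption,
# not Airflow's actual code): serialize the configured fields from each
# LogRecord and append the task context.
class JsonFieldsFormatter(logging.Formatter):
    def __init__(self, fields, context):
        super().__init__()
        self.fields = fields
        self.context = context

    def format(self, record):
        record.message = record.getMessage()
        # Attributes like asctime only exist if a formatter computes them,
        # which would explain the "asctime": null values in the output below.
        doc = {f: getattr(record, f, None) for f in self.fields}
        doc.update(self.context)
        return json.dumps(doc)

handler = logging.StreamHandler()
handler.setFormatter(JsonFieldsFormatter(
    ["asctime", "filename", "lineno", "levelname", "message"],
    {"dag_id": "stg_cam_asm_load", "task_id": "asm_stage", "try_number": "1"},
))
logger = logging.getLogger("sketch")
logger.addHandler(handler)
logger.warning("Dependencies all met")
```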
As an example, here is the raw (snipped) console log output after a DAG has been executed:
{"execution_date": "2020_09_21T14_00_00_000000", "asctime": null, "filename": "taskinstance.py", "lineno": 670, "levelname": "INFO", "message": "Dependencies all met for <TaskInstance: stg_cam_asm_load.asm_stage 2020-09-21T14:00:00+00:00 [queued]>", "dag_id": "stg_cam_asm_load", "task_id": "asm_stage", "try_number": "1"}
...
{"execution_date": "2020_09_21T14_00_00_000000", "asctime": null, "filename": "taskinstance.py", "lineno": 1187, "levelname": "INFO", "message": "Marking task as FAILED. dag_id=stg_cam_asm_load, task_id=asm_stage, execution_date=20200921T140000, start_date=20200921T150124, end_date=20200921T150125", "dag_id": "stg_cam_asm_load", "task_id": "asm_stage", "try_number": "1"}
{"execution_date": "2020_09_21T14_00_00_000000", "asctime": null, "filename": "local_task_job.py", "lineno": 102, "levelname": "INFO", "message": "Task exited with return code 1", "dag_id": "stg_cam_asm_load", "task_id": "mail_on_failure_production_support", "try_number": "1"}
end_of_log[2020-09-21 17:01:30,274: INFO/ForkPoolWorker-15] Task airflow.executors.celery_executor.execute_command[1cd84830-e24b-459a-9fc9-46bb4102709b] succeeded in 8.639222555000742s: None
I sent all of these entries manually to Elasticsearch (as this is exactly the output I get):
curl -X POST -H 'Content-Type: application/json' -i 'http://elasticsearch_de:9200/airflow/_doc' --data '{"execution_date": "2020_09_21T14_00_00_000000", "asctime": null, "filename": "taskinstance.py", "lineno": 670, "levelname": "INFO", "message": "Dependencies all met for <TaskInstance: stg_cam_asm_load.mail_on_failure_production_support 2020-09-21T14:00:00+00:00 [queued]>", "dag_id": "stg_cam_asm_load", "task_id": "mail_on_failure_production_support", "try_number": "1"}'
curl -X POST -H 'Content-Type: application/json' -i 'http://elasticsearch_de:9200/airflow/_doc' --data '{"execution_date": "2020_09_21T14_00_00_000000", "asctime": null, "filename": "taskinstance.py", "lineno": 1187, "levelname": "INFO", "message": "Marking task as FAILED. dag_id=stg_cam_asm_load, task_id=asm_stage, execution_date=20200921T140000, start_date=20200921T150124, end_date=20200921T150125", "dag_id": "stg_cam_asm_load", "task_id": "asm_stage", "try_number": "1"}'
curl -X POST -H 'Content-Type: application/json' -i 'http://elasticsearch_de:9200/airflow/_doc' --data '{"execution_date": "2020_09_21T14_00_00_000000", "asctime": null, "filename": "local_task_job.py", "lineno": 102, "levelname": "INFO", "message": "Task exited with return code 1", "dag_id": "stg_cam_asm_load", "task_id": "mail_on_failure_production_support", "try_number": "1"}'
curl -X POST -H 'Content-Type: application/json' -i 'http://elasticsearch_de:9200/airflow/_doc' --data 'end_of_log[2020-09-21 17:01:30,274: INFO/ForkPoolWorker-15] Task airflow.executors.celery_executor.execute_command[1cd84830-e24b-459a-9fc9-46bb4102709b] succeeded in 8.639222555000742s: None'
Problem Description
Based on these entries, the web server cannot find ANY entry, as there are several issues:
1. Most obviously, there is no linefeed after "end_of_log", so the marker and the trailing Celery output end up on one line and are sent to Elasticsearch together.
   - Even if I add a "\n" to the configuration, I see a literal "end_of_log\n" in the output.
   - The marker line is plain text, not JSON.
2. The "log_id", which is defined by the "log_id_template" configuration, is missing from every log line.
3. The "offset" is missing. The offset defines the order of the log entries; without it, nothing is displayed.
4. On top of that, the "execution_date" format in the log output does not match the execution date format the web GUI uses when querying Elasticsearch:
[2020-09-21 16:00:30,922] {base.py:81} INFO - GET http://elasticsearch_de:9200/_count [status:200 request:0.003s]
192.168.105.1 - - [21/Sep/2020:16:00:30 +0000] "GET /admin/airflow/get_logs_with_metadata?dag_id=stg_cam_asm_load&task_id=asm_stage&execution_date=2020-09-21T14%3A00%3A00%2B00%3A00&try_number=1&metadata=%7B%22end_of_log%22%3Afalse%2C%22last_log_timestamp%22%3A%222020-09-21T15%3A55%3A31.254697%2B00%3A00%22%2C%22offset%22%3A%220%22%7D HTTP/1.1" 200 116 "http://192.168.105.100:8080/admin/airflow/log?task_id=asm_stage&dag_id=stg_cam_asm_load&execution_date=2020-09-21T14%3A00%3A00%2B00%3A00&format=json" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0"
As you can see, 2020_09_21T14_00_00_000000 != 2020-09-21T14:00:00+00:00 (URL-encoded as 2020-09-21T14%3A00%3A00%2B00%3A00).
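To make points 2 and 3 concrete: the webserver looks up log lines by an exact log_id and pages through them ordered by offset, so documents lacking both fields can never be found. Below is a minimal sketch of that kind of query, assuming suitable field mappings (keyword log_id, numeric offset); the exact query Airflow builds may differ.

```python
import requests

# Filter on the exact log_id and page through entries ordered by offset.
# The index name "airflow" matches the curl examples above.
query = {
    "query": {"bool": {"filter": [
        {"term": {"log_id": "stg_cam_asm_load-asm_stage-2020-09-21T14:00:00+00:00-1"}},
        {"range": {"offset": {"gt": 0}}},
    ]}},
    "sort": [{"offset": {"order": "asc"}}],
}
resp = requests.post("http://elasticsearch_de:9200/airflow/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["message"])
```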
Expected Behavior
Only if I fix everything described in points 1-4, i.e. add "log_id" and "offset", adjust the "end_of_log" output, and adjust the "execution_date" format:
curl -X POST -H 'Content-Type: application/json' -i 'http://elasticsearch_de:9200/airflow/_doc' --data '{"log_id": "stg_cam_asm_load-asm_stage-2020-09-21T14:00:00+00:00-1", "execution_date": "2020_09_21T14_00_00_000000", "asctime": null, "filename": "taskinstance.py", "lineno": 670, "levelname": "INFO", "message": "Dependencies all met for <TaskInstance: stg_cam_asm_load.asm_stage 2020-09-21T14:00:00+00:00 [queued]>", "dag_id": "stg_cam_asm_load", "task_id": "asm_stage", "try_number": "1", "offset": "1"}'
curl -X POST -H 'Content-Type: application/json' -i 'http://elasticsearch_de:9200/airflow/_doc' --data '{"log_id": "stg_cam_asm_load-asm_stage-2020-09-21T14:00:00+00:00-1", "execution_date": "2020_09_21T14_00_00_000000", "asctime": null, "filename": "taskinstance.py", "lineno": 1187, "levelname": "INFO", "message": "Marking task as FAILED. dag_id=stg_cam_asm_load, task_id=asm_stage, execution_date=20200921T140000, start_date=20200921T150124, end_date=20200921T150125", "dag_id": "stg_cam_asm_load", "task_id": "asm_stage", "try_number": "1", "offset": "2"}'
curl -X POST -H 'Content-Type: application/json' -i 'http://elasticsearch_de:9200/airflow/_doc' --data '{"log_id": "stg_cam_asm_load-asm_stage-2020-09-21T14:00:00+00:00-1", "execution_date": "2020_09_21T14_00_00_000000", "asctime": null, "filename": "local_task_job.py", "lineno": 102, "levelname": "INFO", "message": "Task exited with return code 1", "dag_id": "stg_cam_asm_load", "task_id": "asm_stage", "try_number": "1", "offset": "3"}'
curl -X POST -H 'Content-Type: application/json' -i 'http://elasticsearch_de:9200/airflow/_doc' --data '{"log_id": "stg_cam_asm_load-asm_stage-2020-09-21T14:00:00+00:00-1", "message": "end_of_log", "offset": "4"}'
…all log entries are displayed in the web GUI.
What would be a helpful bugfix:
- Ensure that every log line is enriched with log_id and offset.
- Log the end_of_log_mark as a valid JSON record (or find a different way to signal the end of the log); see the sketch after this list.
- Make the execution_date in the log output match the format the web GUI uses when querying Elasticsearch.
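For illustration, the end-of-log marker could be wrapped in the same JSON envelope as the ordinary log lines, so the same ingestor can index it. Field names below follow the manually indexed documents above; this is a sketch, not Airflow's actual implementation.

```python
import json

# Hypothetical helper: emit the end_of_log marker as a JSON record with the
# log_id and offset fields the webserver needs to find it.
def end_of_log_record(log_id: str, offset: int) -> str:
    return json.dumps({
        "log_id": log_id,
        "message": "end_of_log",
        "offset": offset,
    })

print(end_of_log_record(
    "stg_cam_asm_load-asm_stage-2020-09-21T14:00:00+00:00-1", 4))
```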
Note
Maybe my understanding of the whole process is wrong. Any solution that works without a bugfix is highly welcome.
Apache Airflow version: 1.10.12
Environment:
- Cloud provider or hardware configuration: Docker Image in Oracle VirtualBox
- OS (e.g. from /etc/os-release): RHEL8
- Others: Python 3.8
About this issue
- State: closed
- Created 4 years ago
- Comments: 15 (4 by maintainers)
I think @melle2 did a great job of describing this issue. #6159 effectively broke end-of-task-log detection for ELS logging.
We're using the json_format = True flag, and our ingestor forwards each correctly formatted JSON fragment to ELS as-is. The problem with #6159 is that it simply dumps the end_of_log marker to the output stream without JSON decoration or any of the fields needed for indexing. The ingestor correctly ignores the end_of_log marker because it is not JSON, so the marker is never forwarded to ELS, and thus no document with the end_of_log marker is ever indexed. Because such a document never exists, airflow-webserver always returns end_of_log: false, and the browser goes into an XHR request/response loop indefinitely. When a properly formed document with the end_of_log marker is manually indexed, as @melle2 did in their Feb 11 comment, the XHR request returns end_of_log: true and all is well. In short, the end_of_log marker should be emitted as a log record.
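Roughly, the browser-side behaviour described above amounts to the following loop (a sketch of the described polling behaviour; get_chunk is a hypothetical stand-in for the get_logs_with_metadata XHR, not actual webserver or frontend code):

```python
import time

def poll_logs(get_chunk):
    # get_chunk returns the next log chunk plus response metadata. If no
    # end_of_log document is ever indexed, metadata["end_of_log"] never
    # flips to True and the loop never terminates.
    metadata = {"end_of_log": False, "offset": 0}
    while not metadata["end_of_log"]:
        chunk, metadata = get_chunk(metadata)
        print(chunk, end="")
        time.sleep(1)
```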
I have tested this behavior in 2.1.0 and 2.1.2, and I’m happy to open a PR.
I've just tested with the Docker image 2.4.2-python3.10. No more endless loop, and I can see my end_of_log with a log_id in Elasticsearch!
So for me the problem is resolved 😃 I will update to the latest version first thing next time : )
@melle2 also the log_id seems to be missing the dag_id & task_id:
--2021-02-11T16:24:40.317864+00:00-1
Edit: using single braces seems to be a workaround:
AIRFLOW__ELASTICSEARCH__LOG_ID_TEMPLATE = {dag_id}-{task_id}-{execution_date}-{try_number}
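A possible explanation, sketched below: if the template is rendered with Jinja2 and dag_id/task_id happen to be undefined in the rendering context, Jinja2's default Undefined renders them as empty strings, which reproduces exactly the id above. This is an illustration of that assumption, not the actual Airflow rendering code.

```python
from jinja2 import Template

# With Jinja2's default Undefined, missing variables render as empty strings.
tmpl = Template("{{dag_id}}-{{task_id}}-{{execution_date}}-{{try_number}}")
print(tmpl.render(execution_date="2021-02-11T16:24:40.317864+00:00", try_number=1))
# --2021-02-11T16:24:40.317864+00:00-1
```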
Either way, this is a documentation issue or unexpected behaviour of some kind.

I have now verified the scenario above with released Airflow version 2.0.0. It seems that everything is fixed and logs are displayed out of the box in Airflow. The only thing still missing is the output of
end_of_log_mark = end_of_log
Without that, the "Log view" for a task keeps waiting for further log entries. When I manually add
curl -X POST -H 'Content-Type: application/json' -i 'http://elasticsearch_de:9200/airflow/_doc' --data '{"asctime": "2021-02-11 17:24:44,381", "filename": "local_task_job.py", "lineno": 118, "levelname": "INFO", "message": "end_of_log", "dag_id": "stg_cam_asm_load", "task_id": "asm_stage", "execution_date": "2021_02_11T16_24_40_317864", "try_number": "1", "log_id": "--2021-02-11T16:24:40.317864+00:00-1", "offset": 1613060683880414720}'
everything is good. This still needs a fix.