Airflow
1. Introduction
Airflow is a workflow management platform written in Python and open-sourced by Airbnb. In a previous article we described how to manage data pipelines with crontab, but its drawbacks are obvious. Compared with crontab, the flexible and extensible Airflow offers the following features:
- Visualization of workflow dependencies;
- Log tracking;
- Easy to extend with Python scripts.
In contrast to the Java-based Oozie, Airflow follows the "Configuration as Code" philosophy: workflows, trigger conditions, and so on are all described in Python, so writing a workflow feels much like writing an ordinary script. Workflows can be debugged with the test and backfill commands, which makes it easier to spot errors and to roll out new functionality quickly. Airflow makes the most of Python's dexterity and lightness, compared with Oozie's clumsiness (not that I am bashing Java~~). "What makes Airflow great?" introduces more of Airflow's excellent features; other documentation covers installation and gives an introduction, here and here.
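To make "Configuration as Code" concrete, here is a minimal sketch of a DAG defined as an ordinary Python file; the dag_id, schedule, and echo command are illustrative placeholders, not part of the pipeline described later:

```python
# Minimal illustrative DAG: a workflow is just a Python script in the DAG folder.
# dag_id, schedule_interval and the bash command are placeholder values.
from datetime import datetime
from airflow.models import DAG
from airflow.operators import BashOperator

dag = DAG(
    dag_id='hello_airflow',
    start_date=datetime(2016, 12, 1),
    schedule_interval='@daily')

say_hello = BashOperator(
    task_id='say_hello',
    bash_command='echo "hello airflow"',
    dag=dag)

# A single task can be exercised from the command line without the scheduler,
# e.g. `airflow test hello_airflow say_hello 2016-12-01`.
```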
The following table compares Airflow (based on version 1.7) and Oozie (based on version 4.0):
| Function | Airflow | Oozie |
| --- | --- | --- |
| Workflow description | Python | XML |
| Data triggering | Sensor | Datasets, input-events |
| Workflow node | Operator | Action |
| Full workflow | DAG | Workflow |
| Regular scheduling | DAG `schedule_interval` | Coordinator frequency |
| Task dependencies | `>>`, `<<` | `<ok to>` |
| Built-in functions and variables | Template macros | EL functions, EL constants |
As mentioned earlier, Oozie cannot express complex DAGs, because an Oozie node can only specify its downstream dependencies, not its upstream ones. Airflow, in contrast, can represent complex DAGs. Airflow also does not split a workflow into a workflow plus a coordinator the way Oozie does; instead, trigger conditions and workflow nodes are all expressed as operators, and the operators together form a DAG, as sketched below.
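For illustration, here is a hedged sketch of a diamond-shaped DAG; the task names and bash commands are hypothetical, but they show how `>>` and `<<` declare downstream and upstream dependencies:

```python
# Hypothetical diamond-shaped DAG: one extract task fans out to two
# transforms, which both fan back in to a single load task.
from datetime import datetime
from airflow.models import DAG
from airflow.operators import BashOperator

dag = DAG(dag_id='diamond_example', start_date=datetime(2016, 12, 1))

extract = BashOperator(task_id='extract', bash_command='echo extract', dag=dag)
transform_a = BashOperator(task_id='transform_a', bash_command='echo a', dag=dag)
transform_b = BashOperator(task_id='transform_b', bash_command='echo b', dag=dag)
load = BashOperator(task_id='load', bash_command='echo load', dag=dag)

# ">>" declares a downstream dependency, "<<" an upstream one;
# a node may have several parents and several children.
extract >> transform_a >> load
extract >> transform_b
load << transform_b   # equivalent to: transform_b >> load
```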
2. In Practice
The following shows how to complete a data pipeline task with Airflow.
First, a brief background: a weekly scheduled task checks whether a new partition of a Hive table has been generated; if it has, it triggers Hive jobs that write to Elasticsearch, and once those jobs complete, a Python script queries Elasticsearch and sends a report. However, Airflow has problems supporting Python 3 (its Hive dependency packages are written for Python 2), so we have to write our own HivePartitionSensor:
```python
# -*- coding: utf-8 -*-
# @Time: 2016/11/29
# @Author: rain
from airflow.operators import BaseSensorOperator
from airflow.utils.decorators import apply_defaults
from impala.dbapi import connect
import logging


class HivePartitionSensor(BaseSensorOperator):
    """
    Waits for a partition to show up in Hive.

    :param conn_host, conn_port: the host and port of HiveServer2
    :param table: the name of the table to wait for, supports the dot
        notation (my_database.my_table)
    :type table: string
    :param partition: the partition clause to wait for. This is passed as-is
        to the metastore Thrift client, and apparently supports SQL-like
        notation as in "ds='2016-12-01'".
    :type partition: string
    """
    template_fields = ('table', 'partition',)
    ui_color = '#2b2d42'

    @apply_defaults
    def __init__(self,
                 conn_host, conn_port,
                 table, partition="ds='{{ ds }}'",
                 poke_interval=60 * 3,
                 *args, **kwargs):
        super(HivePartitionSensor, self).__init__(
            poke_interval=poke_interval, *args, **kwargs)
        if not partition:
            partition = "ds='{{ ds }}'"
        self.table = table
        self.partition = partition
        self.conn_host = conn_host
        self.conn_port = conn_port
        # connect to HiveServer2 via impyla
        self.conn = connect(host=self.conn_host, port=self.conn_port,
                            auth_mechanism='PLAIN')

    def poke(self, context):
        logging.info('Poking for table {self.table}, '
                     'partition {self.partition}'.format(**locals()))
        cursor = self.conn.cursor()
        cursor.execute("show partitions {}".format(self.table))
        partitions = cursor.fetchall()
        partitions = [i[0] for i in partitions]
        if self.partition in partitions:
            return True
        else:
            return False
```
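Note that `poke()` follows the `BaseSensorOperator` contract: Airflow calls it repeatedly, sleeping `poke_interval` seconds between attempts (three minutes here), and the sensor task only succeeds once `poke()` returns `True`.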
Under Python 3 we connect to HiveServer2 with the impyla module, and HivePartitionSensor is used to determine whether a given partition of a Hive table exists. Writing a custom operator is a bit like writing a Hive or Pig UDF; the custom operator has to be placed in a directory under ~/airflow/dags so that the DAG can import it. With that in place, the complete workflow DAG is as follows:
```python
# Tag cover analysis, based on Airflow v1.7.1.3
from airflow.operators import BashOperator
from operatorUD.HivePartitionSensor import HivePartitionSensor
from airflow.models import DAG

from datetime import datetime, timedelta
from impala.dbapi import connect

conn = connect(host='192.168.72.18', port=10000, auth_mechanism='PLAIN')


def latest_hive_partition(table):
    cursor = conn.cursor()
    cursor.execute("show partitions {}".format(table))
    partitions = cursor.fetchall()
    partitions = [i[0] for i in partitions]
    return partitions[-1].split("=")[1]


log_partition_value = """{{ macros.ds_add(ds, -2) }}"""
tag_partition_value = latest_hive_partition('tag.dmp')

args = {
    'owner': 'jyzheng',
    'depends_on_past': False,
    'start_date': datetime.strptime('2016-12-06', '%Y-%m-%d')
}

# execute every Tuesday
dag = DAG(
    dag_id='tag_cover', default_args=args,
    schedule_interval='@weekly',
    dagrun_timeout=timedelta(minutes=60))  # exact timeout lost in the original; 60 is a placeholder

ad_sensor = HivePartitionSensor(
    task_id='ad_sensor', conn_host='192.168.72.18', conn_port=10000,
    table='ad.ad_log',
    partition="day_time={}".format(log_partition_value),
    dag=dag)

ad_hive_task = BashOperator(
    task_id='ad_hive_task',
    bash_command='hive -f /path/to/cron/cover/ad_tag.hql --hivevar log_partition={} '
                 '--hivevar tag_partition={}'.format(log_partition_value, tag_partition_value),
    dag=dag)

ad2_hive_task = BashOperator(
    task_id='ad2_hive_task',
    bash_command='hive -f /path/to/cron/cover/ad2_tag.hql --hivevar log_partition={} '
                 '--hivevar tag_partition={}'.format(log_partition_value, tag_partition_value),
    dag=dag)

report_task = BashOperator(
    task_id='report_task',
    bash_command='sleep 5m; python3 /path/to/cron/report/tag_cover.py {}'.format(log_partition_value),
    dag=dag)

ad_sensor >> ad_hive_task >> report_task
ad_sensor >> ad2_hive_task >> report_task
```
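One way to verify this DAG before relying on the schedule (the dates below are only illustrative): `airflow test tag_cover ad_sensor 2016-12-06` runs a single task for one execution date without recording state, and `airflow backfill tag_cover -s 2016-12-06 -e 2016-12-13` replays the whole DAG over a date range. This is the test/backfill debugging workflow mentioned in the introduction.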