Workflow Management Platform Airflow

1. Introduction

Airflow is a Python-based workflow management platform open-sourced by Airbnb. In a previous article, we described how to manage data flows with crontab, whose drawbacks are obvious. Addressing those shortcomings, the flexible and extensible Airflow offers the following features:

    • Visualization of workflow dependencies;
    • Log tracking;
    • Easy extension via Python scripts.

In contrast to the Java-based Oozie, Airflow follows the "Configuration as Code" philosophy: workflows, trigger conditions, and so on are described in Python, so you can write a workflow the way you would write a script. You can debug a workflow (with the test and backfill commands), spot errors more easily, and roll out new functionality faster. Airflow makes the most of Python's dexterity and lightness, compared with Oozie's clumsiness (not that I'm bashing Java~~). "What makes Airflow great?" introduces more of Airflow's excellent features; other documentation covers installation, with an introduction here and here.
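As a minimal sketch of what "Configuration as Code" looks like in practice (a hypothetical hello-world DAG, not part of the pipeline discussed later, using Airflow 1.7-style imports), a workflow is just a Python file:

# hello_airflow.py -- hypothetical minimal DAG to illustrate "Configuration as Code"
from datetime import datetime

from airflow.models import DAG
from airflow.operators import BashOperator

dag = DAG(dag_id='hello_airflow',
          start_date=datetime(2016, 12, 1),
          schedule_interval='@daily')

# a single task that simply echoes a greeting
hello = BashOperator(task_id='say_hello',
                     bash_command='echo "hello from airflow"',
                     dag=dag)

Debugging then stays on the command line: airflow test hello_airflow say_hello 2016-12-01 runs one task for one date without the scheduler, and airflow backfill hello_airflow -s 2016-12-01 -e 2016-12-07 re-runs a date range.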

The following table compares Airflow (based on version 1.7) with Oozie (based on version 4.0):

Function                      | Airflow                 | Oozie
Workflow description          | Python                  | XML
Data triggering               | Sensor                  | Datasets, input-events
Workflow node                 | Operator                | Action
Entire workflow               | DAG                     | Workflow
Periodic scheduling           | DAG schedule_interval   | Coordinator frequency
Task dependencies             | >>, <<                  | <ok to>
Built-in functions, variables | Template macros         | EL functions, EL constants

As mentioned earlier, Oozie cannot express complex DAGs, because it can only declare downstream dependencies and cannot declare upstream dependencies. In contrast, Airflow can represent complex DAGs. Airflow also does not split a job into a workflow and a coordinator the way Oozie does; instead, trigger conditions and workflow nodes alike are modeled as operators, and operators are composed into a DAG.
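To make the dependency point concrete, here is a small sketch (hypothetical task names, Airflow 1.7-style DummyOperator) showing how >> declares a downstream edge and << an upstream edge, so fan-out and fan-in are expressed directly in Python:

from datetime import datetime

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(dag_id='dependency_demo',
          start_date=datetime(2016, 12, 1),
          schedule_interval='@daily')

extract = DummyOperator(task_id='extract', dag=dag)
clean = DummyOperator(task_id='clean', dag=dag)
enrich = DummyOperator(task_id='enrich', dag=dag)
load = DummyOperator(task_id='load', dag=dag)

# downstream edges: extract fans out to two parallel tasks
extract >> clean
extract >> enrich

# upstream edges: load declares that it depends on both branches (fan-in)
load << clean
load << enrich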

2. Hands-On Example

The following shows how to complete a data pipeline task with Airflow.

First, a brief introduction to the background: a scheduled (weekly) task checks whether a partition of a Hive table has been generated; if it has, a Hive task that writes to Elasticsearch is triggered, and once that Hive task completes, a Python script queries Elasticsearch and sends a report. However, Airflow has problems with Python 3 support (some of its dependency packages are written for Python 2), so you have to write the HivePartitionSensor yourself:

# -*- coding: utf-8 -*-
# @Time: 2016/11/29
# @Author: rain
from airflow.operators import BaseSensorOperator
from airflow.utils.decorators import apply_defaults
from impala.dbapi import connect
import logging


class HivePartitionSensor(BaseSensorOperator):
    """
    Waits for a partition to show up in Hive.

    :param conn_host, conn_port: the host and port of HiveServer2
    :param table: the name of the table to wait for, supports the dot
        notation (my_database.my_table)
    :type table: string
    :param partition: the partition clause to wait for. This is passed as-is
        to the metastore Thrift client, and apparently supports SQL-like
        notation as in "ds='2016-12-01'".
    :type partition: string
    """
    template_fields = ('table', 'partition',)
    ui_color = '#2b2d42'

    @apply_defaults
    def __init__(self, conn_host, conn_port, table,
                 partition="ds='{{ ds }}'",
                 poke_interval=60 * 3,
                 *args, **kwargs):
        super(HivePartitionSensor, self).__init__(
            poke_interval=poke_interval, *args, **kwargs)
        if not partition:
            partition = "ds='{{ ds }}'"
        self.table = table
        self.partition = partition
        self.conn_host = conn_host
        self.conn_port = conn_port
        self.conn = connect(host=self.conn_host, port=self.conn_port,
                            auth_mechanism='PLAIN')

    def poke(self, context):
        logging.info('Poking for table {self.table}, '
                     'partition {self.partition}'.format(**locals()))
        cursor = self.conn.cursor()
        cursor.execute("show partitions {}".format(self.table))
        partitions = cursor.fetchall()
        partitions = [i[0] for i in partitions]
        if self.partition in partitions:
            return True
        else:
            return False

Under Python 3, HivePartitionSensor connects to HiveServer2 with the impyla module and uses it to determine whether the partition of the Hive table exists. Writing a custom operator is a bit like writing a Hive or Pig UDF; the operator file needs to be placed under ~/airflow/dags so the DAG can import it. The complete workflow DAG is then as follows:

# tag cover analysis, based on Airflow v1.7.1.3
from datetime import datetime, timedelta

from airflow.models import DAG
from airflow.operators import BashOperator
# the custom sensor defined above, placed under ~/airflow/dags
from operatorUD.HivePartitionSensor import HivePartitionSensor

from impala.dbapi import connect

conn = connect(host='192.168.72.18', port=10000, auth_mechanism='PLAIN')


def latest_hive_partition(table):
    # return the value of the most recent partition of the given table
    cursor = conn.cursor()
    cursor.execute("show partitions {}".format(table))
    partitions = cursor.fetchall()
    partitions = [i[0] for i in partitions]
    return partitions[-1].split("=")[1]


log_partition_value = """{{ macros.ds_add(ds, -2) }}"""
tag_partition_value = latest_hive_partition('tag.dmp')

args = {
    'owner': 'jyzheng',
    'depends_on_past': False,
    'start_date': datetime.strptime('2016-12-06', '%Y-%m-%d')
}

# execute every Tuesday
dag = DAG(dag_id='tag_cover', default_args=args,
          schedule_interval='@weekly',
          # the timeout value was lost in the original post; 60 is a placeholder
          dagrun_timeout=timedelta(minutes=60))

ad_sensor = HivePartitionSensor(
    task_id='ad_sensor', conn_host='192.168.72.18', conn_port=10000,
    table='ad.ad_log',
    partition="day_time={}".format(log_partition_value),
    dag=dag)

ad_hive_task = BashOperator(
    task_id='ad_hive_task',
    bash_command='hive -f /path/to/cron/cover/ad_tag.hql'
                 ' --hivevar log_partition={}'
                 ' --hivevar tag_partition={}'.format(log_partition_value,
                                                      tag_partition_value),
    dag=dag)

ad2_hive_task = BashOperator(
    task_id='ad2_hive_task',
    bash_command='hive -f /path/to/cron/cover/ad2_tag.hql'
                 ' --hivevar log_partition={}'
                 ' --hivevar tag_partition={}'.format(log_partition_value,
                                                      tag_partition_value),
    dag=dag)

report_task = BashOperator(
    task_id='report_task',
    bash_command='sleep 5m; python3 /path/to/cron/report/tag_cover.py {}'.format(log_partition_value),
    dag=dag)

ad_sensor >> ad_hive_task >> report_task
ad_sensor >> ad2_hive_task >> report_task
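The report script itself is not shown in the original post. As a rough, hypothetical sketch only (the index name, field names, Elasticsearch address, and query are all assumptions, not the author's actual code), /path/to/cron/report/tag_cover.py could query Elasticsearch with the elasticsearch-py client and print a coverage figure for the given day:

#!/usr/bin/env python3
# tag_cover.py -- hypothetical sketch of the report step, not the original script
import sys

from elasticsearch import Elasticsearch  # pip install elasticsearch


def main():
    day = sys.argv[1]  # partition value passed in by report_task in the DAG
    es = Elasticsearch(['192.168.72.18:9200'])  # assumed ES address

    # total documents for the day (assumed index and field names)
    total = es.count(index='ad_log',
                     body={'query': {'term': {'day_time': day}}})['count']

    # documents for the day that carry a tag field
    tagged = es.count(index='ad_log',
                      body={'query': {'bool': {'must': [
                          {'term': {'day_time': day}},
                          {'exists': {'field': 'tag'}}]}}})['count']

    coverage = tagged / total if total else 0.0
    print('tag coverage for {}: {:.2%}'.format(day, coverage))
    # delivering the report (mail, IM, ...) is omitted here


if __name__ == '__main__':
    main()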
