An open-source job scheduling tool for batch-scheduling jobs from open-source ETL tools such as DataX, Sqoop, and Kettle


1. Alibaba open-source software: DataX

DataX is an offline synchronization tool for heterogeneous data sources, dedicated to achieving stable and efficient data synchronization between heterogeneous sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, FTP, and more. (Excerpt from Wikipedia)
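A DataX job is described by a JSON file that pairs a reader plug-in with a writer plug-in. As a rough sketch of what the mysql2odps.json file used later in this article might contain (all connection details, credentials, and column names below are placeholders, and the exact ODPS writer parameter names should be double-checked against the DataX plug-in documentation):

{
    "job": {
        "setting": {
            "speed": { "channel": 1 }
        },
        "content": [{
            "reader": {
                "name": "mysqlreader",
                "parameter": {
                    "username": "your_user",
                    "password": "your_password",
                    "column": ["id", "name"],
                    "connection": [{
                        "table": ["your_table"],
                        "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/your_db"]
                    }]
                }
            },
            "writer": {
                "name": "odpswriter",
                "parameter": {
                    "project": "your_project",
                    "table": "your_table",
                    "column": ["id", "name"],
                    "accessId": "your_access_id",
                    "accessKey": "your_access_key",
                    "odpsServer": "http://service.odps.aliyun.com/api"
                }
            }
        }]
    }
}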

2. Apache open-source software: Sqoop

Sqoop (pronounced "skoop") is an open-source tool used primarily to transfer data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, etc.). It can import data from a relational database (such as MySQL, Oracle, or Postgres) into HDFS in Hadoop, or export data from HDFS back into a relational database. (Excerpt from Wikipedia)
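To make the "transfer in both directions" concrete, here is what typical Sqoop invocations look like; the connection strings, credentials, and paths are placeholders:

# Import a MySQL table into HDFS
sqoop import \
    --connect jdbc:mysql://127.0.0.1:3306/your_db \
    --username your_user \
    --password your_password \
    --table your_table \
    --target-dir /user/hive/warehouse/your_table \
    -m 1

# Export HDFS data back into a MySQL table
sqoop export \
    --connect jdbc:mysql://127.0.0.1:3306/your_db \
    --username your_user \
    --password your_password \
    --table your_table \
    --export-dir /user/hive/warehouse/your_table \
    -m 1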

3. Open-source software: Kettle

Kettle is an open-source ETL tool written in Java. It runs on Windows, Linux, and Unix, and its data extraction is efficient and stable. (Excerpt from Wikipedia)
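For scheduling purposes, what matters is that Kettle ships with command-line runners: Pan executes transformations (.ktr files) and Kitchen executes jobs (.kjb files), and these commands are what an external scheduler ends up calling. The file paths below are placeholders:

# Run a transformation with Pan
./pan.sh -file=/path/to/your_trans.ktr -level=Basic

# Run a job with Kitchen
./kitchen.sh -file=/path/to/your_job.kjb -level=Basic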

The introductions of the three open-source ETL tools above are excerpted from encyclopedia entries. Personally, I have used Kettle the most and the other two relatively little. In fact, both open-source and commercial ETL tools come with their own job scheduling, but in terms of flexibility and ease of use they fall short of third-party tools dedicated to batch job scheduling. Since these are all tools meant to make our work easier, why not use a better one to lighten the workload, so that we can devote more effort to the business itself? Here I would like to share a third-party open-source batch job automation tool, TASKCTL (open-source community page: https://www.oschina.net/p/taskctl), and show how it makes it easy to batch-schedule jobs from open-source ETL tools such as DataX, Sqoop, and Kettle. Without further ado, let's get straight to it.

TASKCTL uses a task plug-in driver mechanism, so it can support all kinds of stored procedures and scripts, as well as tasks from ETL tools such as DataStage, Informatica, and Kettle. It provides the core scheduling capabilities: serial and parallel execution, dependencies, mutual exclusion, execution plans, timing, fault tolerance, loops, conditional branching, remote execution, load balancing, and custom conditions.
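To give a feel for how serial and parallel scheduling might be expressed, here is a hypothetical flow fragment in the same XML style as the job snippets later in this article. The <serial> and <parallel> container tags are my own illustration, not confirmed TASKCTL syntax, so check the TASKCTL documentation for the exact flow code:

<!-- Illustrative only: the container tag names are assumptions -->
<serial>
    <python>
        <name>extract_job</name>
        <progname>datax.py</progname>
        <para>./mysql2odps.json</para>
    </python>
    <!-- The two jobs below run side by side once extract_job succeeds -->
    <parallel>
        <python>
            <name>load_job_a</name>
            <progname>datax.py</progname>
            <para>./odps2report_a.json</para>
        </python>
        <python>
            <name>load_job_b</name>
            <progname>datax.py</progname>
            <para>./odps2report_b.json</para>
        </python>
    </parallel>
</serial>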

Let's take scheduling the DataX job type as an example:

$ cd {your_datax_dir_bin}
$ python datax.py ./mysql2odps.json

As you can see, invoking DataX really just means invoking its Python script, datax.py, with a job JSON file as the argument.

So we can configure the job's XML fragment directly in TASKCTL as follows:

<python>
    <name>datax_job</name>
    <!-- You may first need to cd to {your_datax_dir_bin} -->
    <progname>datax.py</progname>
    <para>./mysql2odps.json</para>
</python>

Of course, if you want the DataX job type to look more native (that is, to appear as a plug-in in its own right), we can also configure a dedicated DataX task plug-in. The steps are as follows:

1. Write the script that calls DataX, cprundataxjob.sh:

#!/bin/bash

if [ $# -ne 3 ]
then
    echo "Param Error!"
    echo "Usage: $0 progname para expara"
    exit 126
fi

#------------------------------------------------------------------------------
# Step 1: receive the parameters
#------------------------------------------------------------------------------
progname=$1
para=$2
exppara=$3

#------------------------------------------------------------------------------
# Step 2: run the job and wait for the result
#------------------------------------------------------------------------------
# cd {your_datax_dir_bin}  -- equivalent to the exppara environment parameter in TASKCTL
cd ${exppara}

# python datax.py ./mysql2odps.json
python datax.py ${progname}

#------------------------------------------------------------------------------
# Step 3: collect the execution result of datax.py
#------------------------------------------------------------------------------
retinfo=$?

#------------------------------------------------------------------------------
# Step 4: plug-in return
#------------------------------------------------------------------------------
# Report back to TASKCTL based on retinfo
if [ ${retinfo} -eq 0 ]
then
    echo ""
    echo "Run job success!"
else
    echo ""
    echo "Run job failed!"
fi

exit ${retinfo}
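With the parameter order above (progname, para, exppara), the plug-in script can be sanity-checked from a shell before it is wired into TASKCTL; the DataX installation path here is a placeholder:

# $1 = job JSON, $2 = extra parameter (unused above), $3 = DataX bin directory
sh cprundataxjob.sh ./mysql2odps.json "" /opt/datax/bin
echo $?   # 0 means datax.py reported success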

After the script is written, place cprundataxjob.sh in the $taskctldir/src/plugin/dataxjob/shell/ directory on the TASKCTL server.

2. Configure the plug-in in Admin, the TASKCTL desktop administration tool.

3. Write the module code in the designer as follows:

<dataxjob>
    <name>MainModul_JobNode0</name>
    <progname>./mysql2odps.json</progname>
    <exppara>[your datax installation path]</exppara>
</dataxjob>
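Assuming the plug-in driver passes the three tag values positionally, which is what the script's argument check expects, this node boils down to a call like the following (this node sets no <para>, so $2 is shown empty):

# <progname> -> $1, <para> -> $2, <exppara> -> $3
sh cprundataxjob.sh ./mysql2odps.json "" "[your datax installation path]"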

4. Once the module code is complete, run it: the DataX job is now scheduled and executed by TASKCTL.
