1. Alibaba open source software: DataX
DataX is an offline synchronization tool for heterogeneous data sources, dedicated to achieving stable and efficient data synchronization between heterogeneous sources including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, FTP, and more. (Excerpt from Wikipedia)
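As a quick sanity check, the DataX package ships with a small self-test job; assuming a standard DataX directory layout, it can be run like this:

$ cd {your_datax_home}/bin
$ python datax.py ../job/job.json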
2. Apache open source software: Sqoop
Sqoop (pronounced: skup) is an open source tool used primarily to transfer data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, ...). It can import data from a relational database (such as MySQL, Oracle, Postgres, etc.) into HDFS in Hadoop, or export data from HDFS back into a relational database. (Excerpt from Wikipedia)
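To get a feel for how Sqoop moves data in both directions, here is a minimal sketch; the host, database, credentials, table name, and HDFS paths are all placeholders:

# Import a MySQL table into HDFS
$ sqoop import --connect jdbc:mysql://dbhost:3306/testdb \
      --username dbuser --password dbpass \
      --table orders --target-dir /user/hadoop/orders

# Export the HDFS data back into a MySQL table
$ sqoop export --connect jdbc:mysql://dbhost:3306/testdb \
      --username dbuser --password dbpass \
      --table orders --export-dir /user/hadoop/orders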
3. Open source software: Kettle
Kettle is an open source ETL tool written in Java that runs on Windows, Linux, and UNIX; its data extraction is efficient and stable. (Excerpt from Wikipedia)
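Kettle transformations and jobs are usually driven from the command line through its runners Pan (for transformations) and Kitchen (for jobs); a minimal sketch with placeholder file paths:

# Run a transformation (.ktr) with Pan
$ ./pan.sh -file=/path/to/trans.ktr -level=Basic

# Run a job (.kjb) with Kitchen
$ ./kitchen.sh -file=/path/to/job.kjb -level=Basic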
The introductions of the three open source ETL tools above are taken from encyclopedia entries. Personally, I have used Kettle the most and the other two relatively little. In fact, both open source and commercial ETL tools come with their own job scheduling, but in terms of flexibility and ease of use they fall short of third-party tools that specialize in batch job scheduling. Since these are all tools meant to make people's work easier, why not use a better tool to lighten the workload, so that we can devote more effort to the business itself? Here I would like to share a third-party open source batch job automation tool, TASKCTL (open source community address: https://www.oschina.net/p/taskctl), and show how TASKCTL makes it easy to schedule batch jobs for open source ETL tools such as DataX, Sqoop, and Kettle. Enough talk, let's get straight to the practical part.
TASKCTL uses a task plug-in driver mechanism, so it can support all kinds of stored procedures and scripts, as well as tasks from various ETL tools such as DataStage, Informatica, and Kettle, and it provides the core scheduling functions: serial, parallel, dependency, mutual exclusion, execution plans, timing, fault tolerance, loops, conditional branching, remote execution, load balancing, custom conditions, and more.
The following is an example of scheduling the DataX job type:
$ cd {your_datax_dir_bin}
$ python datax.py ./mysql2odps.json
We can see that calling DataX is really just invoking a Python script.
So we can configure the job's XML fragment directly in TASKCTL as follows:
<python>
<name>datax_job</name>
<progname>datax.py</progname> <!-- you may need to cd to {your_datax_dir_bin} first -->
<para>./mysql2odps.json</para>
</python>
Of course, if you want the DataX job type to look more personalized (packaged as a proper plugin), we can also configure a dedicated DataX task plugin, with the following steps:
1. Write the script file cprundataxjob.sh that calls DataX:
#!/bin/bash
if [ $# -ne 3 ]
then
    echo "Param Error!"
    echo "Usage: $0 progname para expara"
    exit 126
fi

#------------------------------------------------------------------------------
# Step 1: receive the parameters
#------------------------------------------------------------------------------
progname=$1
para=$2
exppara=$3

#------------------------------------------------------------------------------
# Step 2: run the job and wait for the result
#------------------------------------------------------------------------------
# cd {your_datax_dir_bin} -- equivalent to the exppara environment parameter in TASKCTL
cd ${exppara}
# python datax.py ./mysql2odps.json
python datax.py ${progname}

#------------------------------------------------------------------------------
# Step 3: collect the execution result of datax.py
#------------------------------------------------------------------------------
retinfo=$?

#------------------------------------------------------------------------------
# Step 4: plugin return
#------------------------------------------------------------------------------
# Return to TASKCTL according to the information in retinfo
if [ ${retinfo} -eq 0 ]
then
    echo ""
    echo "Run job success!"
else
    echo ""
    echo "Run job failed!"
fi
exit ${retinfo}
After it is written, place cprundataxjob.sh in the $TASKCTLDIR/src/plugin/dataxjob/shell/ directory on the TASKCTL server.
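Assuming $TASKCTLDIR points at the TASKCTL server installation directory, the deployment boils down to:

$ mkdir -p $TASKCTLDIR/src/plugin/dataxjob/shell
$ cp cprundataxjob.sh $TASKCTLDIR/src/plugin/dataxjob/shell/
$ chmod +x $TASKCTLDIR/src/plugin/dataxjob/shell/cprundataxjob.sh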
2. Configure the plugin in the Admin module of the TASKCTL desktop software, as shown below:
3. Write the module code in the designer as follows:
<dataxjob>
<name>MainModul_JobNode0</name>
<progname>./mysql2odps.json</progname>
<exppara>[your datax installation path]</exppara>
</dataxjob>
4. After the module code is written, run it; the result is as follows: