ETL (extract-transform-load abbreviation, that is, data extraction, transformation, loading process), for enterprise or industry applications, we often encounter a variety of data processing, conversion, migration, so understand and master the use of an ETL tool, essential, Here I introduce a I used in the work of 3 years of ETL tools kettle, the spirit of good things not only enjoy the idea of sharing the collision with everyone to exchange a bit! In use I feel this tool is really powerful, support graphical GUI design interface, and then can flow in the form of workflow, in doing some simple or complex data extraction, quality testing, data cleansing, data conversion, data filtering and other aspects have a relatively stable performance, the most important we use it skillfully, It has reduced a lot of research and development work and improved our productivity, but the only regret for my. NET developers is that the tool is written in Java.
1. Kettle Concept
Kettle is a foreign open source ETL tool, written in Java, can run on Windows, Linux, UNIX, green without installation, data extraction efficient and stable.
Kettle Chinese name is the kettle, the project's main programmer, Matt, wants to put all kinds of data into a pot and then flow out in a specified format.
Kettle This ETL toolset, which allows you to manage data from different databases by providing a graphical user environment to describe what you want to do, rather than what you want to do.
There are two script files in Kettle, transformation and job,transformation complete the fundamental transformation of the data, and the job completes the control of the entire workflow.
2. Download and deploy
? Kettle can be downloaded from the http://kettle.pentaho.org/website?? Download the kettle package, as the kettle is green software, unzip to any local path can
3, Kettle Environment configuration (with the Java environment directly ignore this section)
3, 1 installing Java JDK
1) first to the official Web download corresponding JDK package, JDK1.5 or above version on the line;
2) Install the JDK;
3) Configure the environment variables, with configuration method:
After the installation is complete, it is necessary to configure it to use, first set some environment variables, for Java, the most necessary environment variables are the system path variable path.
(1) To open the Settings window for the environment variable. Right-click My Computer and select the Properties option from the popup shortcut menu to go to the System Properties dialog box. Select the "Advanced" tab, go to the "Advanced" tab, click the "Environment variables" button, and go to the "Environment variables" dialog box:
(2) in the "Administrator User variables" list box, select the variable path, and then click the Edit button when the row becomes highlighted.
(3) in the "Edit System Variables" dialog box that pops up, set the bin directory path under the JDK installation path to the PATH variable.
After editing, click the "OK" button to save, the environment variable path setting is formally completed.
Note: Setting the path of the path variable must be the bin directory in the JDK installation directory, and sometimes the installation directory of the JRE will be available on the same level as the JDK installation directory, so carefully select the relevant path and avoid setting the path to the bin directory under the JRE directory.
3, 2 Test the JDK configuration is successful
Once you've set up your environment variables, you can test the variables you just set up and see if Java can run.
(1) Click the Start button, select the Run option, and enter the cmd command in the Run dialog box.
(2) After clicking the "OK" button, open the Command Line window.
(3) At the cursor input: Javac command, press the ENTER key to execute, you can see the test results
3, 3 running Kettle
into the kettle directory, if Kettle is deployed in a Windows environment, double-clicking the run Spoon.bat or Kettle.exe file, the following interface appears: Configure the environment this piece is basically done.
4. Kettle Usage and Components introduction
? 4.1 Kettle UseKettle provides a way to integrate all of the work, but because of the resource pool portability is not convenient, so we choose no resource pool; 1) Create a new transformation, click Save to Local path, for example, save to D:/etltest, Save file named Etltesttrans,kettle default transformation file after saving suffix named ktr;2) create a new job, click Save to Local path, for example save to D:/etltest, save file named Etltestjob, Kettle The default job file is saved with the suffix named KJB;
4.2 Introduction to the component Tree
|
The main tree menu lists the basic properties of a transformation that can be viewed through individual nodes. DB connection: Displays the database connection in the current transformation, and each transformation database connection needs to be configured separately. Steps: A list of links that are applied to a transformation Hops: A list of node connections that are applied to a transformation |
|
The Core Objects menu lists the links that can be called in transformation, which can be added by dragging the mouse. Input: Import link Output: Export link Lookup: Query link Transform: Conversion link Joins: Connection link Scripting: Script Link |
4.3 Transformation Conversion Introduction? Each link can be dragged by the mouse to add links to the main window.? and can be shift+ by dragging the mouse to achieve the link between links.
Conversion Common Links Introduction
Category |
Link name |
Function description |
Input |
Text file input |
Entering data from a local text file |
Table input |
Entering data from a database table |
Get System Information |
Read System Information input data |
Output |
Text file output |
Output processing results to a text file |
Table Output |
Output processing results to a database table |
Insert/Update |
Inserts an update on the database table model based on the results of the processing, or an update if no related records exist in the database. Will be judged based on the fields in the query criteria |
Update |
Update the database according to the processing result, if the data that need to be updated is not recorded in the database table, the error will stop. |
Delete |
Database records are deleted based on processing results, and if data that needs to be deleted is not recorded in the database table, an error is stopped |
Lookup |
Database queries |
Query the target table according to the set query criteria and return the required result fields |
Stream query |
Reads the target table into memory and queries the in-memory dataset with query criteria |
Call DB Stored Procedure |
Calling a database stored procedure |
Transform |
Field Selection |
Select the fields you want, filter out the fields you want, or do database fields |
Filter Records |
Classify records according to criteria |
Sort records |
Sort the data according to a condition |
Empty operation |
No action |
Add constant |
Increase the required constant fields |
Scripting |
Modified Java Script Value |
Extend functionality, write JavaScript scripts, and process data accordingly |
Mapping |
Mapping (sub-transformations) |
Data mapping |
Job |
Sat Variables |
Setting environment variables |
Get Variables |
Get Environment variables |
4.3 Job Task Description
|
The main tree menu lists the basic properties of a job that can be viewed through individual nodes. DB connection: Displays the database connection in the current job, and the database connection for each job needs to be configured separately. Job Entries: A list of links referenced in a job |
|
The Job Entries menu lists the list of links that can be called in the job, which can be added by dragging the mouse. Each link can be dragged through the mouse to add links to the main window. And through the shift+ mouse drag, to achieve the link between links. |
? Introduction of common links
Category |
Link name |
Function description |
Job entries |
START |
Begin |
DUMMY |
End |
Transformation |
Referencing the transformation process |
Job |
Reference Job Process |
Shell |
Invoking the shell script |
Sql |
Execute SQL statement |
Ftp |
Download via FTP |
Table exists |
Checks if the target table exists, returns a Boolean value |
File exists |
Checks if a file exists, returns a Boolean value |
Javascript |
Executing JavaScript scripts |
Create file |
Create a file |
Delete file |
deleting files |
Wait for file |
Wait for the file, after the file appears, continue to the next link |
File Compare |
File comparison, returns a Boolean value |
Wait for |
Wait time, set a period of time, kettle process is waiting state |
Zip file |
Compress files as Zip packages |
Transfer from: http://www.cnblogs.com/limengqiang/archive/2013/01/16/KettleApply1.html
ETL Learning Series 1--etl Tool installation