Contents:
Part 1. ETL Introduction
Part 2. Kettle Introduction
Part 3. Invoking Kettle from Java
Part 1. ETL Introduction
1. What is ETL?
1) ETL is an acronym for Extract, Transform, Load: the process of extracting, transforming, and loading data. In everyday usage it is often referred to simply as data extraction.
ETL comprises three phases:
Extract: read data out of the various source business systems; this is the prerequisite for everything that follows. The extraction phase generally connects to several different data sources to supply data to the subsequent steps. This part looks simple and trivial, but in practice it is a major obstacle to implementing an ETL solution successfully.
Transform: convert the extracted data according to predefined rules so that the originally heterogeneous data ends up in a unified format. Any processing applied to the data is a transformation. Typical transformations include, but are not limited to:
Validating data against rules
Modifying the content and structure of the data
Integrating data from multiple data sources
Calculating derived and aggregate values from the processed data
Load: import the transformed data into the data warehouse, incrementally or on a schedule; in other words, write the data to the target system.
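The three phases can be sketched in plain Java. This is a minimal, self-contained illustration of the Extract/Transform/Load flow; the record layout (name, age) and the validation rule are invented for the example and are not taken from any real system:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class EtlSketch {
    // Extract -> Transform -> Load over in-memory lists.
    public static List<String> runEtl(List<String> source) {
        List<String> warehouse = new ArrayList<>();   // the "target" table
        for (String raw : source) {
            // Transform: normalize structure (trim, split) and validate.
            String[] parts = raw.trim().split(",");
            int age = Integer.parseInt(parts[1].trim());
            if (age <= 0) continue;                   // validation rule: age must be positive
            String name = parts[0].trim();
            name = name.substring(0, 1).toUpperCase(Locale.ROOT) + name.substring(1);
            warehouse.add(name + "," + age);          // Load the cleaned row into the target
        }
        return warehouse;
    }

    public static void main(String[] args) {
        // Extract: raw records from a "source system" (here, a hard-coded list).
        List<String> source = List.of(" Alice,30 ", "Bob,-5", " carol,41 ");
        System.out.println(runEtl(source));           // prints [Alice,30, Carol,41]
    }
}
```

Real ETL tools do exactly this at scale, with connectors instead of hard-coded lists and a warehouse instead of an in-memory list.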
2) ETL is the process of data extraction (extract), cleaning (cleansing), transformation (transform), and loading (load). It is a key part of building a data warehouse: the user extracts the required data from the data sources, cleans it, and finally loads it into the data warehouse according to a predefined data warehouse model.
3) ETL is the core and soul of BI/DW (Business Intelligence / Data Warehouse). It integrates and raises the value of data according to unified rules, and it carries the data from the data sources to the target data warehouse, an essential step in implementing a data warehouse.
DW (Data Warehouse): the concept was introduced by Dr. William H. Inmon, known as the "father of the data warehouse": a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data used to support management decision making.
Subject-oriented: organized around the key subjects that users care about when making decisions with the warehouse;
Integrated: the information in the warehouse is the result of a series of processing, cleaning, and summarization steps;
Time-variant: it records information from points in the past up to the present;
Non-volatile: once data enters the warehouse it is rarely modified; the dominant operation is querying.
BI (Business Intelligence) is a complete set of solutions for effectively integrating the data that already exists in an enterprise, producing reports quickly and accurately, and supporting informed decisions.
2. The central role of ETL in BI/DW
A complete BI/DW system consists of three parts: data integration, the data warehouse and data marts, and multidimensional data analysis. The information systems that business intelligence has to draw on are typically legacy systems: a complex collection of incompatible data sources, databases, and applications whose parts cannot talk to each other. Seen from this angle, the operational systems currently in place are systems the user has spent enormous effort and resources building and cannot replace, and that goes especially for their data. The purpose of a new business intelligence system is to help users make decisions through data analysis, but the differing sources and formats of the data make system implementation and data integration difficult. What is needed is a comprehensive solution to the user's dilemma of data consistency and integration, so that data can be taken from the existing legacy environments and platforms and converted efficiently by a single tool. That solution is ETL.
ETL is the core and soul of BI/DW: it integrates and raises the value of data according to unified rules and performs the transformation of data from the sources to the target data warehouse, an essential step in implementing the warehouse. If the data warehouse model design is the architectural blueprint and the data is the bricks and mortar, then ETL is the process of constructing the building. The hardest part of the whole project is requirements analysis and model design, but the design and implementation of the ETL rules carries the largest workload, typically 60%-80% of the entire project; this is the general consensus that experts at home and abroad have reached from practice.
The user's data sources are distributed across many subsystems and nodes. ETL collects the data of each subsystem, uploads it by automated FTP or manual control to a Unix or NT server, performs extraction, cleaning, and transformation, and then loads it into the data warehouse. Because the business data sources already exist, the keys for ETL processing are to ensure data consistency, truly understand the business meaning of the data, integrate data across multiple platforms and systems, improve data quality as far as possible, and accommodate constantly changing business requirements.
3. Implementing data cleaning in the ETL process
First, unify table and field naming based on an understanding of the source data. To resolve synonyms and ambiguous identical names in the source data, the metadata management subsystem redefines the names of the various source tables and fields in the data mining library according to their meanings while the source data is being analyzed, and stores the mappings as conversion rules in the metadata database. During data integration the system automatically converts field names in the source data to the newly defined names according to these rules, so that synonymous fields end up under a single name in the data mining library.
Second, reduce the data volume substantially through data reduction. Because the source data is large and time-consuming to process, data reduction can be performed first to improve the efficiency of subsequent processing and analysis.
Finally, make data cleaning and data conversion visual by defining visual data-processing function nodes in advance. For data reduction and integration, the preprocessing subsystem provides a variety of composable data-processing function nodes, so that cleaning and conversion can be completed quickly and efficiently in a visual way.
4. ETL Tool Introduction
What an ETL tool must provide: flexible transformation operations on the extracted data, such as calculation, merging, and splitting.
Typical ETL tools today include:
Commercial software: Informatica, IBM DataStage, Oracle ODI, Microsoft SSIS, ...
Open-source software: Kettle, Talend, CloverETL, KETL, Octopus, ...
Part 2. Kettle Introduction
1. The history of Kettle
Kettle is the former name of PDI; PDI stands for Pentaho Data Integration. The word "kettle" suggests a pot, conveying the idea of data flowing through it.
Kettle's principal author, Matt, started the project in 2003; the earliest dates in the PDI code go back to April 2003. Starting with version 2.2, the Kettle project entered the open-source world under the LGPL license.
In 2006 Kettle joined Pentaho, the open-source BI (Business Intelligence) organization, and was officially named PDI. After joining Pentaho, Kettle's development accelerated and more and more people began to pay attention to it.
2. What is Kettle?
Kettle is an open-source ETL tool written in pure Java. It runs on Windows, Linux, and Unix, requires no installation, and performs data extraction efficiently and stably.
Kettle means "a pot": the project's lead programmer, Matt, wanted to put all kinds of data into one pot and have it flow out in a specified format.
As an ETL tool set, Kettle lets you manage data from different databases by providing a graphical environment in which you describe what you want to do rather than how to do it.
There are two kinds of script files in Kettle: transformations and jobs. A transformation performs the basic conversions on the data, while a job controls the entire workflow.
As an important part of Pentaho, Kettle is gradually appearing in more and more domestic projects.
Kettle is an acronym for "Kettle E.T.T.L. Environment", which means it was designed to help you with your ETTL needs: the Extraction, Transformation, Transportation, and Loading of data.
3. Installing and running Kettle
1>. Download and installation
You can download the latest version of Kettle from http://kettle.pentaho.org. Kettle is green software: just download it and extract it to any directory. Because Kettle is written in Java, it needs a local JVM to run. The version used here is Kettle 5.1. Once extracted, click Kettle.exe or Spoon.bat in the installation directory to start Kettle. On startup a dialog box pops up asking the user to choose a repository.
Repository: used to save transformations and jobs; it records our operation steps and related logs, transformations, jobs, and other information. Transformations created through the graphical interface can be saved in the repository. The repository can be any of several common databases, accessed with a username/password; the default is admin/admin. A repository is not required: without one, the user can also save transformations and jobs as XML files.
2>. Configure the Kettle environment variables (this assumes the Java environment variables are already configured, since Kettle is written in Java and needs a local JVM). Add a KETTLE_HOME variable to the system environment variables, pointing at the Kettle installation directory, e.g. D:\kettle\data-integration (adjust to your actual path).
New system variable: KETTLE_HOME
Variable value: D:\kettle\data-integration (the directory Kettle was extracted to, i.e. the one containing Kettle.exe)
Then add it to the PATH environment variable:
Variable name: PATH
Variable value: %KETTLE_HOME%;
4. Basic concepts of Kettle
1. Job: organizes transformations together to accomplish a piece of work. Usually we break a large task down into several logically isolated jobs; when all of these jobs have completed, the task is done.
1>. Job entry: a job entry is one part of a job; it performs some specific piece of work.
2>. Hop: in a job, a hop represents the link between two job entries; the originating job entry can be configured to run the next entry unconditionally, only when it succeeds, or only when it fails.
3>. Note: an additional text annotation attached to a job.
2. Transformation: defines a container for operations on data. Data processing is a flow from input to output, and a transformation can be understood as a container of finer granularity than a job: we break a task down into jobs, and then break each job down into one or more transformations, each of which completes only part of the work.
1>. Value: a value is part of a row and carries data of one of the following types: string, floating-point number, unlimited-precision BigNumber, integer, date, or Boolean.
2>. Row: a row contains zero or more values.
3>. Output stream: an output stream is the stream of rows leaving a step.
4>. Input stream: an input stream is the stream of rows entering a step.
5>. Step: one operation within a transformation; a step reads rows from its input stream, processes them, and writes rows to its output stream.
6>. Hop: in a transformation, a hop represents one or more data streams between two steps; a hop always connects the output stream of one step to the input stream of another.
7>. Note: an additional text annotation attached to a transformation.
3. Kettle comprises four component programs:
Chef: job design tool (GUI)
Kitchen: job executor (command line)
Spoon: transformation design tool (GUI)
Pan: transformation executor (command line)
Current versions no longer ship Chef as a separate tool, because job design has been folded into Spoon; it seems Kettle is integrating all of the design tools into Spoon. Spoon can also execute jobs and transformations directly.
Spoon lets you design ETL transformations and jobs through a graphical interface.
Spoon is the graphical user interface for running transformations and jobs; under the hood, transformations are run with the Pan tool and jobs with Kitchen. Pan is a data transformation engine that performs many functions, such as reading, manipulating, and writing data from different data sources. Kitchen runs jobs described by XML files or stored in a repository. Typically, jobs are run automatically in batch mode at fixed intervals.
Pan lets you run, in batch, transformations designed in Spoon (for example from a scheduler); Pan and Kitchen are both background programs with no graphical interface.
5. Basic use of Kettle
1>. The Kettle launch scripts and what they do
Spoon.bat: starts the graphical designer for jobs and transformations.
Pan.bat: executes a transformation from the command line.
Kitchen.bat: executes a job from the command line.
2>. Transformations and jobs
The Kettle Spoon designer is used to design transformations (transformation) and jobs (job).
• Transformations do the actual data processing; a transformation can contain multiple steps (step).
• A job is a higher level of processing than a transformation; a job contains multiple job entries (job entry), each job entry represents one unit of work, and running a transformation is itself one kind of job entry.
3>. Input steps
Input steps obtain data from outside the transformation. Sources include text files (TXT, CSV, XML, JSON), databases, desktop files such as Excel, custom data, and so on. Custom input plug-ins can be written for special data sources and application requirements. Example: the generate-random-number step.
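An input step has no incoming stream; it simply emits rows for the steps downstream. Here is a hedged plain-Java imitation of a random-value input step; the method name, row count, and seed are invented for the illustration and are not Kettle API calls:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomInputStep {
    // Emits `count` rows, each carrying one random value in [0, 1).
    public static List<Double> generateRows(int count, long seed) {
        Random rng = new Random(seed);          // fixed seed -> reproducible rows
        List<Double> rows = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            rows.add(rng.nextDouble());         // one random value per emitted row
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(generateRows(3, 42L).size()); // prints 3
    }
}
```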
4>. Transformation steps
Transformation steps convert data from one form into another.
5>. Flow-control steps
Flow-control steps direct the flow of rows. They generally do not operate on the data itself; they only control where the rows go. Example: the filter step.
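The behaviour of a filter step can be sketched in plain Java. This is a hedged illustration, not the Kettle API: rows are modelled as bare integers and the threshold condition is invented for the example:

```java
import java.util.List;
import java.util.stream.Collectors;

public class FilterStep {
    // A flow-control step: it does not modify rows, it only decides which
    // rows continue downstream, based on a condition.
    public static List<Integer> filterRows(List<Integer> input, int threshold) {
        return input.stream()
                    .filter(v -> v >= threshold)   // condition: value >= threshold
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(filterRows(List.of(1, 7, 3, 9), 5)); // prints [7, 9]
    }
}
```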
6>. Join steps
Join steps connect different data sets together. Example: the Cartesian product.
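A Cartesian product pairs every row of one input stream with every row of the other. A minimal plain-Java sketch (rows are modelled as strings; the separator is invented for the example):

```java
import java.util.ArrayList;
import java.util.List;

public class CartesianJoin {
    // Combines two input streams: every left row is paired with every right row,
    // so the output has left.size() * right.size() rows.
    public static List<String> cartesian(List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                out.add(l + "-" + r);   // one combined row per pair
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 2 rows x 2 rows -> 4 combined rows
        System.out.println(cartesian(List.of("A", "B"), List.of("1", "2")));
        // prints [A-1, A-2, B-1, B-2]
    }
}
```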
7>. Output steps
Output steps write the data out. Common outputs include text file output and table output, and other output forms can be developed as the application requires.
Example: table output.
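A table output step typically issues one INSERT per incoming row (a real step writes through JDBC, often batched). As a hedged sketch, here we only build the SQL text; the table and column names are invented for the example:

```java
import java.util.List;

public class TableOutputStep {
    // Renders one incoming row as an INSERT statement for the target table.
    // (No escaping/parameterization here; real code would use PreparedStatement.)
    public static String toInsert(String table, List<String> columns, List<String> row) {
        return "INSERT INTO " + table
             + " (" + String.join(", ", columns) + ") VALUES ('"
             + String.join("', '", row) + "');";
    }

    public static void main(String[] args) {
        System.out.println(toInsert("users", List.of("name", "age"), List.of("Alice", "30")));
        // prints INSERT INTO users (name, age) VALUES ('Alice', '30');
    }
}
```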
8>. How Kettle stores metadata:
• Repository: there are file repositories and database repositories; since Kettle 4.0, repository types can be extended through plug-ins.
• XML files: the root node of a .ktr transformation file must be <transformation>; the root node of a .kjb job file must be <job>.
Database repository:
• Kettle metadata is serialized into database tables; for example, the R_TRANSFORMATION table stores the names, descriptions, and other attributes of transformations.
• The database repository can be created and upgraded from within Spoon.
File repository:
• Implements the org.pentaho.di.repository.Repository interface on top of the file system; this repository type was added in Kettle 4.0.
No repository: save directly as .ktr or .kjb files.
How to choose a repository type:
Drawbacks of the database repository:
• Cannot store multiple versions of a transformation or job.
• Relies heavily on the database's locking mechanism to prevent loss of work.
• Team development was not considered: developers cannot lock a job and work on it on their own.
Drawbacks of the file repository:
• Associations between objects (transformations, jobs, database connections, and so on) are hard to handle, so deleting and renaming can be tricky.
• No version history.
• Team development is difficult.
No repository: use SVN (or another version control system) for file versioning.
9>. Running Kettle files from the command line
Parameters come in two formats: /name:value or -name=value. The first format is recommended. For example (the exact file names and paths below are illustrative):
1. Run test.ktr with the log saved to D:\log.txt at the default Basic log level: Pan.bat /file:test.ktr /logfile:D:\log.txt
2. Run test.ktr with the log saved to D:\log.txt at the Rowlevel log level: Pan.bat /file:test.ktr /logfile:D:\log.txt /level:Rowlevel
3. Export a job file together with the transformations and other resources it depends on: Kitchen.bat /file:test.kjb /export:export.zip
4. Run a job directly from an exported zip file: Kitchen.bat /file:"zip:file:///D:/export.zip!test.kjb"
6. Common errors and fixes
1>. Database connection failure
The usual cause is that the JDBC driver jar for the database has not been added to Kettle. Obtain the driver jar (for example from an existing web project), copy it (e.g. mysql-connector-java-5.1.18.jar) into D:\kettle5.1\data-integration\lib, and restart Kettle once; the connection test should then succeed.
2>. Garbled Chinese characters when data is transferred
In the database connection's Options tab, add the named parameter characterEncoding with the value utf8.
Also make sure the character set was set to UTF-8 when the database was created.
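Part 3. Invoking Kettle from Java
The contents list at the top promises a part on invoking Kettle from Java. Below is a minimal sketch using the standard Kettle 5 API; it will not compile or run without the pentaho-kettle jars on the classpath, and the .ktr path is an example, not a real file:

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunTransformation {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();                             // initialize the Kettle runtime
        TransMeta meta = new TransMeta("D:/kettle/test.ktr"); // load the transformation definition
        Trans trans = new Trans(meta);
        trans.execute(null);                                  // no extra command-line arguments
        trans.waitUntilFinished();                            // block until all steps complete
        if (trans.getErrors() > 0) {
            throw new RuntimeException("The transformation finished with errors.");
        }
    }
}
```

Jobs are run analogously with JobMeta and Job. In both cases the same engine that Pan and Kitchen use is being driven from your own code.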