Data-tier exchange and high-performance concurrency processing (Kettle, an open-source ETL and big data governance tool: usage and secondary development)

Source: Internet
Author: User

What is ETL? Why use ETL? What is Kettle? Why learn Kettle?

ETL (Extract, Transform, Load) is the process of extracting data, cleaning and transforming it, and loading it into a data warehouse for big data analysis. Two approaches to loading data into the warehouse are currently popular: load the data first and then clean and transform it, or clean and transform first and then load. Our ETL follows the latter route.

"Big data" today generally means Hadoop, but consider what happens if we skip pre-cleaning and transformation: raw data goes into Hadoop and is cleaned and transformed only through MapReduce before analysis, so garbage data consumes a great deal of disk space. This silently raises hardware costs: with too little memory, processing is slow; with ample memory but a weak CPU, speed still suffers. Although Hadoop in theory solves big problems with clusters of commodity machines, faster individual nodes still raise overall throughput. ETL therefore remains an indispensable data exchange tool even in a big data environment.

There are many ETL products on the market, such as Informatica, but relatively complete open-source options are few; the best known among them is Pentaho's open-source Kettle. The tool is widely used, and because it is open source we can learn not only simple ETL usage from it but also the principles of ETL, and study the source code to learn even more.

Highlight one: Kettle is widely used; learning it well can lead to a good job.
Highlight two: This course covers not only simple, practical usage but also secondary development, and includes a development template to improve the quality of your work.
Highlight three: Some big data processing techniques are woven in and used in conjunction with the currently popular Hadoop.
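The clean-and-transform-first route described above can be illustrated with a minimal sketch (purely illustrative names, not Kettle's API): records are validated and normalized before they ever reach the warehouse, so garbage rows never consume warehouse storage.

```python
# Minimal sketch of the "clean and transform first, then load" ETL route.
# All names here are illustrative; a real pipeline would read from files or
# databases and load into an actual warehouse.

def extract(raw_rows):
    """Yield raw source records (here: CSV-like strings)."""
    for line in raw_rows:
        yield line.strip()

def transform(rows):
    """Clean and transform: drop garbage rows, normalize fields."""
    for row in rows:
        parts = [p.strip() for p in row.split(",")]
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # garbage never reaches the warehouse
        yield {"name": parts[0].lower(), "amount": int(parts[1])}

def load(records, warehouse):
    """Load only the cleaned records into the target store."""
    for rec in records:
        warehouse.append(rec)

warehouse = []
source = ["Alice, 30", "BAD ROW", "Bob, x", "Carol,25"]
load(transform(extract(source)), warehouse)
print(warehouse)  # only the two valid rows survive cleaning
```

Because the two malformed rows are dropped before loading, the warehouse stores only clean data; in the load-first route, all four rows would have consumed storage before cleanup.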
Highlight four: Analysis of the Kettle source code. Even those with little interest in ETL will at least get to know the source of a well-known foreign open-source project, and Kettle itself uses many open-source projects, so there is much to learn from the tool.

What you will learn through the course:
1. The principles of the ETL process
2. The principles of the data flow engine
3. The design of dynamic data exchange between metadata and data
4. The principles of concurrent operation

Class schedule (15 hours):

1. ETL overview: open-source Kettle (1 hour)
> Introduces Kettle's position and role in big data applications.
> Explains what ETL is, briefly introduces Kettle, and demonstrates its use with examples.
> Describes the Kettle deployment process.

2. Kettle usage (1 hour)
> Details the use of Kettle Spoon.
> Kettle's trans and job entries.
> Kettle's logging and debugging tools.

3. Kettle step (transformation) process design (3 hours)
> Worked examples of Kettle's commonly used conversion and cleaning components, mainly the following plugins:
> Input plugins: Text file input, Generate rows, Table input, Fixed file input, Get data from XML.
> Output plugins: XML output, Delete, Insert/Update, Text file output, Update, Table output.
> Transform plugins: Add a checksum, Replace in string, Set field value, Unique rows (HashSet), Add constants, Add sequence, Select values, Split fields.
> Flow plugins: Abort, Switch/Case, Dummy (do nothing), Filter rows.
> Scripting plugins: Modified Java Script Value, Execute SQL script.
> Lookup plugins: File exists, Table exists, Call DB procedure.

4. Kettle job process design (2 hours)
> Worked examples of Kettle's commonly used job components, mainly the following plugins:
> General plugins: START, DUMMY, Transformation, Success.
> File management plugins: Copy files, Compare folders, Create a folder, Create file, Delete files, Delete folders, File compare, Move files, Wait for file, Zip file, Unzip file.
> Conditions plugins: Check Db connections, Check files locked, Check if a folder is empty, Checks if files exist, File exists, Table exists, Wait for.
> Scripting plugins: Shell, SQL.
> Utility plugins: Ping a host, Truncate tables.
> File transfer plugins: Upload files to FTPS, Get a file with FTPS, FTP delete.
> Kettle combined with Hadoop.

5. Kettle process performance tuning and monitoring (1 hour)
> Introduces Kettle's process monitoring functions.
> Introduces Kettle performance optimization methods.

6. Kettle embedded development (1 hour)
> How a Kettle process is embedded in our Java applications, mainly covering embedding trans and job processes in Java.

7. Kettle custom step and job plugin development (3 hours)
> Provides step and job templates to use as base projects for secondary development, improving everyone's development efficiency.
> Walks through programs that demonstrate developing step and job plugins.

8. Kettle data synchronization schemes (1 hour)
> Describes five data synchronization schemes, all of which support heterogeneous data synchronization, including a fast full-volume synchronization scheme and an incremental synchronization scheme.

9. Kettle partitioning, clustering, and their principles (1 hour)
> Introduces Kettle's partitioning principle and explains its configuration and usage.
> Introduces Kettle's clustering principle and explains its configuration, usage, and monitoring methods.

10. Kettle source code analysis and secondary development (1 hour)
> Introduces how to import the Kettle src into Eclipse, and how to package and run it.
> Analyzes Kettle's package structure and execution flow, and explains Kettle's operating principles.
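The "data flow engine" and "concurrent operation" principles listed among the learning goals can be illustrated with a small sketch: like Kettle's engine, each step runs in its own thread and hands rows to the next step through a bounded buffer, so all steps work concurrently on a stream of rows. This is a conceptual illustration only, not Kettle's actual implementation.

```python
import threading
import queue

SENTINEL = object()  # marks the end of the row stream

def run_step(transform_fn, inbox, outbox):
    """One pipeline step: read rows from inbox, transform, write to outbox."""
    while True:
        row = inbox.get()
        if row is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown downstream
            break
        outbox.put(transform_fn(row))

# Bounded queues play the role of the row buffers between step threads.
q1, q2, q3 = (queue.Queue(maxsize=10) for _ in range(3))

steps = [
    threading.Thread(target=run_step, args=(lambda r: r * 2, q1, q2)),
    threading.Thread(target=run_step, args=(lambda r: r + 1, q2, q3)),
]
for t in steps:
    t.start()

for row in range(5):        # the "input step" feeds rows in
    q1.put(row)
q1.put(SENTINEL)

results = []
while True:                 # the "output step" drains the final buffer
    row = q3.get()
    if row is SENTINEL:
        break
    results.append(row)

for t in steps:
    t.join()

print(results)  # [1, 3, 5, 7, 9]
```

The bounded queues also show why such engines need tuning: a slow step fills its input buffer and back-pressure stalls the upstream steps, which is one reason buffer sizes matter for throughput.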
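The incremental synchronization scheme mentioned in section 8 commonly relies on a high-water mark, such as a last-modified timestamp: each run copies only rows changed since the previous run. A minimal sketch follows; the table shapes and field names are hypothetical and are not taken from the course's five schemes.

```python
# Incremental sync via a high-water mark (e.g. a last-modified timestamp).
# Table shapes and field names are illustrative only.

def incremental_sync(source_rows, target, last_sync_ts):
    """Copy only rows modified after last_sync_ts; return the new watermark."""
    new_watermark = last_sync_ts
    for row in source_rows:
        if row["modified"] > last_sync_ts:
            target[row["id"]] = row  # upsert the changed row
            new_watermark = max(new_watermark, row["modified"])
    return new_watermark

source = [
    {"id": 1, "modified": 100, "value": "a"},
    {"id": 2, "modified": 205, "value": "b"},
    {"id": 3, "modified": 310, "value": "c"},
]
target = {}
wm = incremental_sync(source, target, last_sync_ts=200)
print(sorted(target), wm)  # only the rows modified after ts 200 are copied
```

A full-volume scheme, by contrast, would copy every row on each run; the incremental variant trades that cost for the need to maintain the watermark (and it cannot detect deletions on its own).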
