An ETL tool available for Hadoop -- Kettle

Source: Internet
Author: User

Since you share a lot of Hadoop-related content, let me introduce you to an ETL tool -- Kettle.
Kettle is an open-source ETL tool from Pentaho. Like Hadoop, it is implemented in Java, and its purpose is to handle the Extract, Transform, and Load work of data integration. Kettle uses two kinds of script files: transformations and jobs. A transformation performs the actual data processing, while a job controls the overall workflow. Transformations work on the principle of concurrent stream processing and can be run distributed across a cluster.
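To make the transformation/job distinction concrete, here is a minimal sketch of running both kinds of scripts through Kettle's Java API. It assumes the PDI libraries are on the classpath; the .ktr and .kjb file paths are hypothetical.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;
import org.pentaho.di.job.Job;
import org.pentaho.di.job.JobMeta;

public class RunKettleScripts {
    public static void main(String[] args) throws Exception {
        // Initialize the Kettle environment (loads plug-ins, steps, etc.)
        KettleEnvironment.init();

        // A transformation (.ktr) performs the actual data processing.
        TransMeta transMeta = new TransMeta("/path/to/sample.ktr"); // hypothetical path
        Trans trans = new Trans(transMeta);
        trans.execute(null);            // no command-line arguments
        trans.waitUntilFinished();
        System.out.println("Transformation errors: " + trans.getErrors());

        // A job (.kjb) controls the overall workflow (e.g. runs transformations in order).
        JobMeta jobMeta = new JobMeta("/path/to/sample.kjb", null); // hypothetical path, no repository
        Job job = new Job(null, jobMeta);
        job.start();
        job.waitUntilFinished();
        System.out.println("Job errors: " + job.getErrors());
    }
}
```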
Like Eclipse, Kettle is built on a plug-in architecture, so any individual or group can contribute plug-in code. Kettle currently supports many data sources, such as most databases on the market, text files, Excel, XML, and JSON files. On the extracted data it can perform operations such as sorting, grouping, merging, row-to-column and column-to-row pivoting, field merging and splitting, joins between different data sources (for example database tables), and import and export of database files. It also supports reading and writing files on Hadoop, as well as HBase input and output, and Hive data can be read through the Table Input step, which makes it a rare find among data integration tools.
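As an illustration of what a Table Input step does when pointed at Hive, here is a plain JDBC sketch against HiveServer2. The host, port, database, table, and credentials are all hypothetical; inside Kettle the connection details are configured in the database connection dialog rather than in code.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveTableInputIllustration {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; Kettle wraps this kind of connection for Table Input
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical host, port, database, and credentials
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // Hypothetical table and columns; this is the kind of SQL a Table Input step runs
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM demo_table LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getInt("id") + "\t" + rs.getString("name"));
            }
        }
    }
}
```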
I am currently using it at work, which is why I recommend it here. If you are using it too, you are welcome to get in touch and share experiences!
Interested readers can learn more at: kettle.pentaho.com

Here are the steps to read files from HDFS in a transformation:
1. Drag a "Hadoop File Input" step onto the transformation design canvas.
2. Double-click the step you just dropped, or right-click and choose Edit, to open the "Hadoop File Input" configuration window.
3. Click the "Browse" button to open the connection configuration window.
4. Enter the HDFS namenode address and port number.
5. Click the "Connect" button and the panel below will show your HDFS file system.
Then navigate to the directory you want to read and select the file. If you are reading multiple files, you can use a wildcard pattern (see the Hadoop API sketch after these steps for an equivalent check).
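If you are unsure which address and port to enter in step 4, a quick way to check is with the plain Hadoop Java API. The sketch below assumes a hypothetical namenode host/port and input path, and also shows the kind of wildcard pattern mentioned above; it is only a connectivity check, not part of the Kettle configuration itself.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsConnectionCheck {
    public static void main(String[] args) throws Exception {
        // Same address and port you would type into the Kettle connection dialog
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // hypothetical host/port

        FileSystem fs = FileSystem.get(conf);

        // A wildcard pattern, like the one Kettle accepts for multiple files
        FileStatus[] matches = fs.globStatus(new Path("/data/input/*.txt")); // hypothetical path
        if (matches != null) {
            for (FileStatus status : matches) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }
        fs.close();
    }
}
```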


6. Configure the file content settings (a simplified sketch of what these settings mean follows after this list):
① Select the file type.
② Set the separator between fields.
③ Set the enclosure character if the fields are quoted (the default is double quotation marks); if the fields are not enclosed, it can be cleared.
④ Specify whether the file contains a header and, if so, how many of the first lines it occupies.
⑤ Set the file format: Unix or Windows line endings.
⑥ Set the file character set; otherwise garbled characters will appear.
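The following simplified Java sketch shows roughly what settings ② through ⑥ correspond to when reading a delimited text file. The file name, separator, and charset are assumptions, and unlike Kettle it does not handle separators that appear inside enclosed fields.

```java
import java.io.BufferedReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DelimitedFileSettingsDemo {
    public static void main(String[] args) throws Exception {
        String separator = ";";                       // field separator (setting ②)
        String enclosure = "\"";                      // enclosure character (setting ③)
        int headerLines = 1;                          // number of header lines (setting ④)
        Charset charset = Charset.forName("UTF-8");   // character set (setting ⑥)

        // readLine() accepts both Unix and Windows line endings (setting ⑤)
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("sample.txt"), charset)) {
            String line;
            int lineNo = 0;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                if (lineNo <= headerLines) {
                    continue;                          // skip the header rows
                }
                String[] fields = line.split(separator);
                for (int i = 0; i < fields.length; i++) {
                    // strip the enclosure character around each field, if present
                    fields[i] = fields[i].replace(enclosure, "").trim();
                }
                System.out.println(String.join(" | ", fields));
            }
        }
    }
}
```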

7. Set the fields to be read, in the order they appear in the text from left to right. If you are reading all of them, you can leave this section empty (provided there is a column header, i.e. the file header mentioned in the previous step).
To output to HDFS, select the "Hadoop File Output" step instead. Its configuration is similar, so I will not repeat it here. If you want to practice, you can read a file from HDFS and write it out to another HDFS directory (a plain Hadoop API sketch of that exercise follows below).
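For comparison, here is the same practice exercise expressed with the plain Hadoop API rather than Kettle steps, just to show what the Hadoop File Input / Hadoop File Output pair accomplishes. The namenode address and file paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExercise {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // hypothetical namenode

        FileSystem fs = FileSystem.get(conf);

        Path source = new Path("/data/input/sample.txt");        // hypothetical source file
        Path target = new Path("/data/output/sample_copy.txt");  // hypothetical target path

        // Copy within the same HDFS, keeping the source file (deleteSource = false)
        FileUtil.copy(fs, source, fs, target, false, conf);

        fs.close();
    }
}
```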
