1 Introduction:
The project recently introduced Big Data technology to process daily online data, so we need to use Kettle to load text data from the source systems into the Hadoop environment.
2 Preparatory work:
1 First
First, find out which Hadoop versions your Kettle version supports. Since information about Kettle online is scarce, it is best to check the official wiki:
http://wiki.pentaho.com/display/BAD/Configuring+Pentaho+for+your+Hadoop+Distro+and+Version
Open this URL and scroll to the bottom of the page, for example:
In the Archive section, "from PDI 4.3", "from PDI 4.4", and "from PDI 5.0" indicate the PDI versions that support Hadoop. PDI (Pentaho Data Integration) is also known as Kettle, so PDI 4.3, 4.4, and 5.0 correspond to Kettle 4.3, 4.4, and 5.0. Each entry also covers higher releases (i.e. Kettle 5.0.x, 5.1, and 5.2 also support Hadoop).
2 Second
Different Kettle versions support different Hadoop versions. Taking Kettle 5.1 as an example, the following link describes what 5.1 supports:
http://wiki.pentaho.com/display/BAD/Configuring+Pentaho+for+your+Hadoop+Distro+and+Version
In the middle of the opened page:
"Determine the proper shim for your Hadoop distro and version" means choosing the right plugin package for your Hadoop release. The row above the table (Apache, Cloudera, Hortonworks, Intel, MapR) lists the distributors; click one to select the publisher of the Hadoop you want to connect to. Take Apache Hadoop as an example:
Version is the Hadoop release number, and Shim is the name of the plugin package Kettle provides for that release. In the Download column, "Included in 5.0, 5.1" means the plugin is built into the Kettle 5.0 and 5.1 installation packages; in short, Kettle 5.0 and 5.1 already support Apache Hadoop 0.20.x, with no additional download required. "NS" means not supported, as the note below the figure also explains.
The next figure shows the support situation for Cloudera's Hadoop. A blue download hyperlink in the Download column means the shim must be downloaded separately from the Kettle installation package, while "Included in 5.0, 5.1" means Kettle 5.0 and 5.1 support it out of the box (the plugin is built in).
The conclusion from the two figures above is that Kettle 5.1 supports Apache Hadoop 0.20.x and Cloudera Hadoop CDH4.0 through CDH5.
3 Test run:
1 First configuration work
The Hadoop version I am currently using is hadoop-2.2.0-cdh5.0, so I go with Kettle 5.1 and its built-in Hadoop plugin. Download it from the Kettle website:
After decompression it looks like this:
With the download done, the next step is configuration, which is carried out inside the Kettle installation directory:
Configuration reference: http://wiki.pentaho.com/display/BAD/Hadoop
After the page opens, click Collapse to fold up all the menu trees. "Configuring Pentaho for your Hadoop distro and version" means doing the configuration for your Hadoop version; click into it. The top of that page covers Kettle's Hadoop support, as discussed above.
Now scroll to the middle of the page, for example:
1 means the Hadoop distribution you want to connect to is already supported by Kettle, but the plugin is not built in and must be downloaded; in that case, see: Install Hadoop Distribution Shim.
2 means the Hadoop distribution you want to connect to is not yet supported by Kettle; you can fill in the requested information and ask Pentaho to develop a shim for it.
There is one more case: the Hadoop distribution is already supported by Kettle and the plugin is built in.
3 Now do the configuration.
3.1 Stop the application: if Kettle is running, stop it first.
3.2 Open the installation folder; in our case the application is Kettle, i.e. Spoon. The file path:
3.3 Edit the plugin.properties file.
3.4 Change one configuration value (the circled place):
Change it to the shim value for your Hadoop (the Shim column in the table above); mine is cdh50:
Save after the change:
This completes the configuration work.
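The edit in steps 3.1 to 3.4 can also be sketched from the command line. This is a minimal sketch, assuming the property to change is named active.hadoop.configuration (the key used in the big-data plugin's plugin.properties); it runs in a scratch directory so it works anywhere, and in a real install you would edit the file under data-integration/plugins/pentaho-big-data-plugin instead:

```shell
# Sketch: simulate the plugin.properties edit in a scratch directory.
# Real (assumed) location: data-integration/plugins/pentaho-big-data-plugin/plugin.properties
mkdir -p /tmp/pdi-demo
cd /tmp/pdi-demo
# A fresh install typically ships with a default shim active:
printf 'active.hadoop.configuration=hadoop-20\n' > plugin.properties
# Point the plugin at the CDH 5.0 shim (value taken from the wiki's Shim column):
sed -i 's/^active\.hadoop\.configuration=.*/active.hadoop.configuration=cdh50/' plugin.properties
cat plugin.properties
# → active.hadoop.configuration=cdh50
```

Editing the file by hand in a text editor, as shown in the screenshots, achieves exactly the same thing; the sed form is just convenient when scripting the setup.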
2 Then develop the script
Start developing the script below. Official reference: http://wiki.pentaho.com/display/BAD/Loading+Data+into+HDFS
Open Kettle by running Spoon.bat:
Create a new KJB file and drag in a Start entry.
Then drag in one more entry: Hadoop Copy Files.
Hadoop Copy Files is the step that loads data into HDFS.
The configuration inside Copy Files:
This means the path where the current KJB script is located; on my side the folder is:
The destination file takes the form hdfs://ip:port/path
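The destination URL the Copy Files step expects can be assembled as below. The NameNode address, port, and target path here are placeholders of my own (assumptions), not values from this setup; 8020 is a common NameNode RPC default, but yours may differ (check fs.defaultFS in core-site.xml):

```shell
# Sketch: building the hdfs:// destination URL for the Copy Files step.
NAMENODE_IP=192.168.1.10      # assumption: replace with your NameNode host/IP
HDFS_PORT=8020                # assumption: common NameNode RPC port; check core-site.xml
DEST_PATH=/user/kettle/input  # assumption: target directory in HDFS
echo "hdfs://${NAMENODE_IP}:${HDFS_PORT}${DEST_PATH}"
# → hdfs://192.168.1.10:8020/user/kettle/input
```

Once the job has run, the same path can be checked from the Hadoop home with bin/hadoop fs -ls against that URL.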
Before filling it in, click the Browse button to test the connection,
for example: fill in the server and port, then click Connect; if no error occurs and the red box shows hdfs://..., the connection succeeded (as shown).
Note: as long as the connection succeeds, the Kettle configuration for Hadoop is fine.
Now run the script and try it:
For example, the script runs successfully.
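Besides clicking Run in Spoon, a saved .kjb job can be launched headless with the Kitchen launcher that ships in the Kettle installation directory (kitchen.sh on Linux, Kitchen.bat on Windows). The install path and job file below are examples of my own, not from this setup; the sketch only assembles and prints the command it would run:

```shell
# Sketch: assemble the headless Kitchen command for the job built above.
KETTLE_HOME=/opt/data-integration   # assumption: your Kettle install directory
JOB_FILE=/jobs/load_to_hdfs.kjb     # assumption: the .kjb saved from Spoon
# -file selects the job to run, -level sets log verbosity (Basic, Detailed, ...)
echo "${KETTLE_HOME}/kitchen.sh -file=${JOB_FILE} -level=Basic"
# → /opt/data-integration/kitchen.sh -file=/jobs/load_to_hdfs.kjb -level=Basic
```

This is handy once the job works in Spoon and you want to schedule it (e.g. from cron) instead of running it by hand.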
Then check from the Hadoop home's bin directory:
The file was loaded successfully.
At this point, loading text data into HDFS with Kettle has succeeded!
4 Notes:
All the steps can be found on the official wiki:
http://wiki.pentaho.com/display/BAD/Hadoop
Section 1 covers configuration, section 2 loading data into the Hadoop cluster, section 3 loading data into HDFS, and the rest loading into Hive, HBase, and so on.