Pentaho Kettle 6.1: Connecting to a CDH 5.4.0 Cluster

Source: Internet
Author: User
Tags: hadoop, mapreduce

Original source: http://www.cnblogs.com/cssdongl. Reprints are welcome.

Having recently written and summarized a number of Hadoop MapReduce programs, I found that much of the logic is basically the same, which suggested using an ETL tool to configure that logic so the MapReduce code is generated and executed automatically, simplifying both the existing work and what comes later. Pentaho Kettle is easy to get started with and its Hadoop support has proven reasonably mature in my testing, so here I am recording the configuration process and the pitfalls I ran into.
Kettle can be downloaded from the official website, but the official site makes you register before downloading and the speed is unstable, so I recommend downloading from this link instead, where every version is available. I am using PDI (Pentaho Data Integration) 6.1, and the cluster I need to connect to is Hadoop 2.6.0-cdh5.4.0.
Open the link, go into the 6.1 folder, download pdi-ce-6.1.0.1-196.zip, unzip it, enter the data-integration root directory, run Spoon.bat, and wait for Kettle to start.

I. Preparatory work

Before configuring a PDI connection to a big data source, check that the version of the source you need to connect to is compatible with the corresponding Pentaho component; Pentaho publishes a compatibility matrix for this purpose.

As the matrix shows, the PDI downloaded above (which corresponds to PDI Spoon in the matrix) supports the mainstream data sources such as CDH, MapR, EMR, and HDP. The cluster I am connecting to is CDH 5.4, which falls within the supported range.

II. Configure the Pentaho component shims

My understanding of shims is that they are the series of adapters Pentaho provides for connecting to each kind of source. Where they are configured depends on the Pentaho component; for PDI Spoon the location is data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations. Note that this is where you place the shim corresponding to your data source; if you have several kinds of data sources, configure the corresponding shim for each.

For example, since I am connecting to CDH 5.4.0, I first empty the contents of the cdh55 directory, then download the corresponding shim, extract it, and copy it into that directory. The shims can be downloaded from

https://sourceforge.net/projects/pentaho/files/Big%20Data%20Shims

Go into the folder for the matching PDI version and download the shim for your CDH version; in my case that is pentaho-hadoop-shims-cdh54-package-61.2016.04.01-196-dist.zip. Open the zip and double-click Install.bat to unpack the shim, then copy everything under the extracted cdh54 directory into the default cdh55 folder under hadoop-configurations. (Strictly speaking cdh55 should be renamed to cdh54, but after I renamed the folder PDI could not find the configuration; there must be a setting for this somewhere that I have not found yet, and I would appreciate it if anyone who knows could tell me.)
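For reference, the shim that PDI loads is normally selected by the active.hadoop.configuration entry in plugin.properties under the pentaho-big-data-plugin directory, so renaming the folder would presumably also require updating that entry. A minimal sketch, assuming the folder had indeed been renamed to cdh54:

    # data-integration/plugins/pentaho-big-data-plugin/plugin.properties
    # sketch only; assumes the shim folder was renamed from cdh55 to cdh54
    active.hadoop.configuration=cdh54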

  Be sure to download the matching shim here; otherwise, even with the correct CDH connection information configured, PDI will report all kinds of inexplicable errors during use.

III. Edit the cluster configuration files

After completing the previous step, enter the cdh55 directory and copy hive-site.xml, mapred-site.xml, hbase-site.xml, core-site.xml, hdfs-site.xml, yarn-site.xml and the other configuration files from the CDH 5.4 cluster into the current directory, overwriting the existing ones. Then make a few necessary changes.
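After the copy, the shim directory should look roughly like the sketch below (the exact contents vary with the shim version):

    data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh55/
        config.properties
        core-site.xml
        hdfs-site.xml
        yarn-site.xml
        mapred-site.xml
        hive-site.xml
        hbase-site.xml
        lib/    (jars shipped with the shim)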

Modify hive-site.xml so that the Hive metastore setting is consistent with the cluster:

<property>
    <name>hive.metastore.uris</name>
    <value>(change to the cluster's metastore thrift address)</value>
</property>
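For reference, the value is normally a thrift URI of the form thrift://&lt;metastore host&gt;:&lt;port&gt;, where 9083 is the default Hive metastore port; the hostname below simply reuses the one from the yarn-site.xml example further down and is only illustrative:

<property>
    <name>hive.metastore.uris</name>
    <value>thrift://clouderamanager.cdh5.test:9083</value>
</property>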

Modify mapred-site.xml; if the following properties are not present, add them and keep the values consistent with the cluster (mapreduce.app-submission.cross-platform is what allows jobs to be submitted from a Windows client to a Linux cluster):

<property>
    <name>mapreduce.jobhistory.address</name>
    <value>(change to the cluster's JobHistory server address)</value>
</property>
<property>
    <name>mapreduce.app-submission.cross-platform</name>
    <value>true</value>
</property>
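Again for reference, the JobHistory address takes the form host:port, where 10020 is the default JobHistory server port; the hostname is the same illustrative placeholder as above:

<property>
    <name>mapreduce.jobhistory.address</name>
    <value>clouderamanager.cdh5.test:10020</value>
</property>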

Modify the values of the corresponding yarn-site.xml properties; if they are not present, add them and keep them consistent with the cluster:

<property>
    <name>yarn.application.classpath</name>
    <value>$HADOOP_CLIENT_CONF_DIR,$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>clouderamanager.cdh5.test</value>
</property>
<property>
    <name>yarn.resourcemanager.address</name>
    <value>clouderamanager.cdh5.test:8032</value>
</property>
<property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>clouderamanager.cdh5.test:8033</value>
</property>

Modify config.properties and add the following property. Note that my CDH 5.4 test cluster does not have Kerberos authentication enabled:

authentication.superuser.provider=no_auth

If Kerberos authentication is turned on, you need to modify more parameters.

IV. Create a new cluster connection and test it

After completing the above configuration, start Spoon.bat and enter the PDI development interface. Select Tools -> Hadoop Distribution from the menu bar, choose Cloudera CDH 5.4, click OK, and then restart PDI.

In the View tab on the left you will see Hadoop clusters; right-click it and choose New Cluster.

Fill in the corresponding cluster connection information (you can refer to the *.xml configuration files copied into the shim directory), then click "Test" to test it.
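As a rough guide only, the dialog typically ends up looking something like the sketch below; the cluster name is arbitrary, the hostname simply reuses the one from the yarn-site.xml example, and the ports are the usual CDH defaults, so substitute your own cluster's values:

    Cluster Name:  cdh54-test
    HDFS           hostname: clouderamanager.cdh5.test    port: 8020
    JobTracker     hostname: clouderamanager.cdh5.test    port: 8032
    ZooKeeper      hostname: clouderamanager.cdh5.test    port: 2181
    Oozie URL:     http://clouderamanager.cdh5.test:11000/oozie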

  

Make sure all the test results turn green, which indicates the configuration succeeded; any red result means the connection information does not match the cluster.
