Source: http://www.cnblogs.com/cssdongl. Reprints are welcome.
Recently I summed up the Hadoop MapReduce programs I have been writing and found that much of the logic is basically the same, so I thought an ETL tool could be used to configure that logic and have the MapReduce code generated and executed automatically, which would simplify both the existing work and what comes later. Pentaho Kettle is easy to get started with and its Hadoop support is relatively mature, so after testing it I am logging the configuration process and the pits I ran into.
Kettle can be downloaded from the official website, but the site makes you register and the download speed is not stable, so I recommend downloading from this link instead; every version is available there. I use PDI (Pentaho Data Integration) 6.1, and the cluster that needs to be connected is hadoop2.6.0-cdh5.4.0.
Open the link, enter the 6.1 folder, and download pdi-ce-6.1.0.1-196.zip. Unzip it, go into the data-integration root directory, start Spoon.bat, and wait for Kettle to start successfully.
One. Preparatory work
Before configuring a PDI connection to a big data source, you need to check that the version of the source you want to connect to and the corresponding Pentaho component are compatible, as follows.
As you can see, the previously downloaded PDI (which corresponds to PDI Spoon in the table above) basically supports the mainstream data sources such as CDH, MapR, EMR and HDP. The cluster I am connecting to is CDH5.4, which is within the supported range.
Two. Configure the Pentaho component shims
Shims, as I understand them, are the set of adapters Pentaho provides for connecting to each kind of source. Where they are configured depends on the Pentaho component; for PDI Spoon the location is data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations. Be careful to configure the shim that corresponds to your data source here; if you connect to several kinds of data sources, each needs its matching shim.
For example, I am currently connecting to CDH5.4.0, so I first empty the contents of the cdh55 folder, then download the matching shim, extract it, and copy it into that directory. The shims can be downloaded from
https://sourceforge.net/projects/pentaho/files/Big%20Data%20Shims
Select the appropriate PDI version to enter and download the shim for your CDH version; what I need is pentaho-hadoop-shims-cdh54-package-61.2016.04.01-196-dist.zip. Open the zip and double-click Install.bat to extract the shim, then copy everything under the extracted cdh54 directory into the default cdh55 folder under hadoop-configurations. (Strictly speaking, cdh55 should simply be renamed to cdh54, but after renaming the folder PDI could not find the configuration; there must be somewhere this is set that I have not found yet, so if anyone knows, please tell me.)
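For what it is worth, I believe the folder PDI looks for is chosen by the active.hadoop.configuration entry in data-integration/plugins/pentaho-big-data-plugin/plugin.properties, so renaming the folder would presumably also need something like the following. This is an assumption on my part, not something I have verified:

# assumed, not verified: point PDI at a shim folder renamed to cdh54
active.hadoop.configuration=cdh54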
You must download the matching shim here; otherwise, even if you configure the CDH connection information correctly in PDI, you will still hit all kinds of inexplicable errors later when you actually use it.
Three. Edit the cluster configuration files
After completing the previous step, enter the cdh55 directory and copy hive-site.xml, mapred-site.xml, hbase-site.xml, core-site.xml, hdfs-site.xml, yarn-site.xml and the other configuration files from the CDH5.4 cluster into the current directory, overwriting what is there. Then make a few necessary changes.
Modify hive-site.xml so that the Hive metastore matches the cluster:
<property>
    <name>hive.metastore.uris</name>
    <value>change this to the cluster's thrift address</value>
</property>
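Purely as an illustration (use whatever address your cluster actually exposes), a filled-in value might look like the following, assuming the metastore runs on the same host as the ResourceManager in the yarn-site.xml example below and listens on the default metastore port 9083:

<property>
    <name>hive.metastore.uris</name>
    <value>thrift://clouderamanager.cdh5.test:9083</value>
</property>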
Modify mapred-site.xml; if the following properties are missing, add them and keep them consistent with the cluster:
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>change this to the cluster's JobHistory address</value>
</property>
<property>
    <name>mapreduce.app-submission.cross-platform</name>
    <value>true</value>
</property>
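As I understand it, the cross-platform property is what allows jobs submitted from a Windows client such as Spoon.bat to run on a Linux cluster. For illustration only, a filled-in JobHistory address might look like this, assuming the history server runs on the ResourceManager host and uses the default port 10020:

<property>
    <name>mapreduce.jobhistory.address</name>
    <value>clouderamanager.cdh5.test:10020</value>
</property>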
Modify the values of the corresponding properties in yarn-site.xml; if they are missing, add them and keep them consistent with the cluster:
<property>
    <name>yarn.application.classpath</name>
    <value>$HADOOP_CLIENT_CONF_DIR,$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>clouderamanager.cdh5.test</value>
</property>
<property>
    <name>yarn.resourcemanager.address</name>
    <value>clouderamanager.cdh5.test:8032</value>
</property>
<property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>clouderamanager.cdh5.test:8033</value>
</property>
Modify config.properties and add the following property. Note that my CDH5.4 test cluster does not have Kerberos authentication enabled:
authentication.superuser.provider=no_auth
If Kerberos authentication is turned on, you need to modify more parameters.
Four. Create a new cluster connection and test
After completing the above configuration, start Spoon.bat and enter the PDI development interface. In the menu bar select Tools -> Hadoop Distribution, choose Cloudera CDH5.4, click OK, and then restart PDI.
In the View tab on the left you will see Hadoop clusters; right-click it and choose New cluster, as shown below.
Configure the corresponding cluster connection information (you can refer to the *-site.xml configuration files you copied into the shim directory), then click Test, as follows.
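As a rough sketch of what gets filled in (values taken from the yarn-site.xml above and from CDH defaults, so treat them as illustrative rather than exact):

HDFS hostname: clouderamanager.cdh5.test, port: 8020 (CDH's default NameNode RPC port)
JobTracker hostname: clouderamanager.cdh5.test, port: 8032 (matching yarn.resourcemanager.address)
ZooKeeper / Oozie: only fill these in if those services are actually used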
Make sure all the results turn green, which indicates the configuration is successful; if anything is red, the connection information is definitely inconsistent with the cluster.