Apache Spark Source Code Reading (12): Building a Hive on Spark Runtime Environment

You are welcome to reprint this article. Please indicate the source: huichiro.

Preface

Hive is an open source data warehouse tool built on Hadoop. It provides HiveQL, a SQL-like language that lets data analysts work on massive data sets stored in HDFS without knowing much about MapReduce. This feature has made it widely popular.

An important module in the overall Hive framework is the execution module, which is implemented on top of Hadoop's MapReduce computing framework, so its processing speed is not very satisfactory. Thanks to Spark's excellent processing speed, some people have successfully run HiveQL on Spark, which became the well-known Shark open source project.

In Spark 1.0, Spark itself provides Hive support. This article does not analyze how Spark provides that support; instead it focuses on how to build a Hive on Spark test environment.

Installation overview

The installation process is divided into the following steps:

  1. Build a Hadoop cluster (the cluster consists of three machines: one master and two slaves)
  2. Compile Spark 1.0 with support for Hadoop 2.4.0 and Hive
  3. Run Hive on Spark test cases (Spark and the Hadoop NameNode run on the same machine)
Hadoop Cluster Construction

Create a Virtual Machine

Create KVM-based virtual machines. With the graphical management interface provided by libvirt, creating the three virtual machines is very convenient (a command-line sketch follows the list below). Memory and IP addresses are allocated as follows:

  1. Master 2G 192.168.122.102
  2. Slave1 4G 192.168.122.103
  3. Slave2 4G 192.168.122.104
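
The article itself uses the libvirt GUI; as a rough command-line alternative, a hypothetical virt-install invocation for one of the slaves might look like the sketch below. The disk and ISO paths are placeholders, not taken from the original setup.

# hypothetical CLI alternative to the libvirt GUI; paths below are placeholders
virt-install --name slave1 --ram 4096 --vcpus 2 \
  --disk path=/var/lib/libvirt/images/slave1.img,size=20 \
  --cdrom /path/to/archlinux.iso \
  --network network=default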

The process of installing the OS on the virtual machines is skipped; I am using Arch Linux. After installing the OS, make sure that the following software has been installed:

  1. JDK
  2. OpenSSH
Create user groups and users

On each machine, create a user group named hadoop and add a user named hduser. The bash commands are as follows:

groupadd hadoop
useradd -b /home -m -g hadoop hduser
passwd hduser
Logon without a password

When starting the datanode or nodemanager on the slave machines, you would otherwise need to enter the user name and password each time. To avoid this, create a password-free login with the following commands. Note that the password-free login only needs to work one way, from the master to the slaves.

cd $HOME/.ssh
ssh-keygen -t dsa

Copy id_dsa.pub to authorized_keys and upload it to the $HOME/.ssh directory on slave1 and slave2.

cp id_dsa.pub authorized_keys
# make sure that the $HOME/.ssh directory of hduser has been created on slave1 and slave2
scp authorized_keys slave1:$HOME/.ssh
scp authorized_keys slave2:$HOME/.ssh
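
sshd is strict about file permissions. The original steps do not mention this, but if the password-free login does not take effect, it is usually worth checking the permissions on each slave, for example:

# assumption, not part of the original steps: run as hduser on slave1 and slave2 if ssh still prompts for a password
chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys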
Change /etc/hosts on each machine

Add the following content to the /etc/hosts file on master, slave1, and slave2.

192.168.122.102 master
192.168.122.103 slave1
192.168.122.104 slave2

After the modification is complete, run ssh slave1 on the master to test it. If you log on to slave1 directly without being asked for a password, the configuration above is working.
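
For reference, the test described above amounts to nothing more than the following commands, run as hduser on the master:

# should open a shell on slave1 without prompting for a password
ssh slave1
exit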

Download hadoop 2.4.0

Log on to the master as hduser and run the following commands:

cd /home/hduser
wget http://mirror.esocc.com/apache/hadoop/common/hadoop-2.4.0/hadoop-2.4.0.tar.gz
mkdir yarn
tar zvxf hadoop-2.4.0.tar.gz -C yarn
Modify the Hadoop configuration files

Add the following content to .bashrc:
export HADOOP_HOME=/home/hduser/yarn/hadoop-2.4.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
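
These settings take effect on the next login; to apply them to the current shell, simply re-source the file:

# reload .bashrc in the current session so $HADOOP_HOME and friends are defined
source $HOME/.bashrc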
Modify $HADOOP_HOME/libexec/hadoop-config.sh

Add the following line at the beginning of the hadoop-config.sh file:

export JAVA_HOME=/opt/java
Modify $HADOOP_CONF_DIR/yarn-env.sh

Add the following content at the beginning of yarn-env.sh:

export JAVA_HOME=/opt/java
export HADOOP_HOME=/home/hduser/yarn/hadoop-2.4.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
XML Configuration File Modification

File 1: $HADOOP_CONF_DIR/core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hduser/yarn/hadoop-2.4.0/tmp</value>
  </property>
</configuration>

File 2: $HADOOP_CONF_DIR/hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

File 3: $HADOOP_CONF_DIR/mapred-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

File 4: $HADOOP_CONF_DIR/yarn-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8040</value>
  </property>
</configuration>

File 5: $HADOOP_CONF_DIR/slaves

Add the following content to the file:

slave1
slave2
Create tmp directory

Create the tmp directory under $HADOOP_HOME:

mkdir $HADOOP_HOME/tmp
Copy the yarn directory to slave1 and slave2

The configuration changes made so far were all done on the master machine; copy the changed content to slave1 and slave2.

for target in slave1 slave2
do
    scp -r yarn $target:~/
    scp $HOME/.bashrc $target:~/
done

Isn't batch copying with a loop convenient?

Format namenode

Format namenode on the master machine

bin/hadoop namenode -format
Start a cluster
sbin/hadoop-daemon.sh start namenode
sbin/hadoop-daemons.sh start datanode
sbin/yarn-daemon.sh start resourcemanager
sbin/yarn-daemons.sh start nodemanager
sbin/mr-jobhistory-daemon.sh start historyserver

Note: daemon.sh starts the service only on the local machine, while daemons.sh starts it on all cluster nodes.
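
For completeness, the same scripts also accept stop; a minimal sketch for shutting the cluster down again, mirroring the start commands above, would be:

# stop the services in roughly the reverse order of starting them
sbin/mr-jobhistory-daemon.sh stop historyserver
sbin/yarn-daemons.sh stop nodemanager
sbin/yarn-daemon.sh stop resourcemanager
sbin/hadoop-daemons.sh stop datanode
sbin/hadoop-daemon.sh stop namenode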

Verify that the hadoop cluster is correctly installed

Run a wordcount example. The specific steps are not listed here; for details, refer to Article 11 in this series.
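
Before running wordcount, a quick sanity check (not part of the original steps) is to run jps on each node and confirm that the daemons started above are present:

# on the master, jps should list NameNode, ResourceManager and JobHistoryServer
# on slave1/slave2, jps should list DataNode and NodeManager
jps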

Compile Spark 1.0

Compiling Spark is quite simple; most compilation failures can be traced to failures in downloading the dependent jar packages.

To enable Spark 1.0 to support Hadoop 2.4.0 and Hive, compile with the following command:

SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt assembly

If everything goes well, spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar will be generated under the assembly directory.
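
A simple way to confirm the build succeeded is to check for the jar directly; the path below is the same one reused later when creating the running package:

# run from $SPARK_HOME after the sbt assembly finishes
ls assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar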

Create a running package

After compilation, the files under the $SPARK_HOME directory are still very large, more than two gigabytes in total. Only the following directories and files are actually needed at runtime.

  1. $SPARK_HOME/bin
  2. $SPARK_HOME/sbin
  3. $SPARK_HOME/lib_managed
  4. $SPARK_HOME/conf
  5. $SPARK_HOME/assembly/target/scala-2.10

Copy the contents of the preceding directories to /tmp/spark-dist and create a compressed package.

mkdir /tmp/spark-dist
for i in $SPARK_HOME/{bin,sbin,lib_managed,conf,assembly/target/scala-2.10}
do
  cp -r $i /tmp/spark-dist
done
cd /tmp/
tar czvf spark-1.0-dist.tar.gz spark-dist
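
Optionally, a quick check (not in the original write-up) that the assembly jar actually made it into the package:

# list the archive contents and look for the assembly jar built earlier
tar tzf spark-1.0-dist.tar.gz | grep spark-assembly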
Upload the running package to the master machine.

Upload the generated running package to the master (192.168.122.102)

scp spark-1.0-dist.tar.gz hduser@192.168.122.102:~/
Run hive on spark Test Cases

After all the toil above, we finally arrive at the most exciting moment.

Log on to the master host as hduser and decompress spark-1.0-dist.tar.gz.

# after logging in to the master as hduser
tar zxvf spark-1.0-dist.tar.gz
cd spark-dist

Change conf/spark-env.sh

export SPARK_LOCAL_IP=127.0.0.1
export SPARK_MASTER_IP=127.0.0.1
The simplest example

Start the shell with the bin/spark-shell command and run the following Scala code:

// An existing SparkContext (sc) is already provided by spark-shell.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Importing the SQL context gives access to all the public SQL functions and implicit conversions.
import hiveContext._
hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
hql("FROM src SELECT key, value").collect().foreach(println)

If everything goes well, the last hql statement prints the key and value pairs from the src table.

References
  1. Steps to install hadoop 2.x release (yarn or next-gen) on multi-node cluster
