You are welcome to reprint this article; please credit the source, huichiro.
Preface
Hive is an open-source data warehouse tool built on Hadoop. It provides HiveQL, an SQL-like language that lets data analysts work with massive data sets stored in HDFS without having to know much about MapReduce. This feature has made it very popular.
An important module in the overall Hive framework is the execution module, which is implemented on top of Hadoop's MapReduce computing framework, so its processing speed is not very satisfactory. Thanks to Spark's excellent performance, some people successfully ran HiveQL execution on Spark, which became the well-known open-source project Shark.
In Spark 1.0, Spark itself provides Hive support. This article does not analyze how Spark implements that support; instead it focuses on how to build a Hive on Spark test environment.
Installation overview
The installation process is divided into the following steps:
- Build a Hadoop cluster (the cluster consists of three machines: one master and two slaves)
- Compile Spark 1.0 with support for Hadoop 2.4.0 and Hive
- Run a Hive on Spark test case (Spark and the Hadoop namenode run on the same machine)
Hadoop cluster construction
Create the virtual machines
Create three KVM-based virtual machines using the graphical management interface provided by libvirt, which is very convenient. Memory and IP addresses are allocated as follows (a command-line sketch follows the list):
- Master 2G 192.168.122.102
- Slave1 4G 192.168.122.103
- Slave2 4G 192.168.122.104
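If you prefer the command line over the GUI, virt-install can create an equivalent guest. The sketch below is illustrative only; the disk size, ISO path, and network name are assumptions rather than values from this setup:
virt-install --name master --ram 2048 --vcpus 1 \
  --disk size=20 --cdrom /path/to/archlinux.iso \
  --network network=default --graphics vnc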
The process of installing the OS on the virtual machines is skipped; I am using Arch Linux. After installing the OS, make sure that the following software is installed (a quick check is sketched after the list):
- JDK
- OpenSSH
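A quick way to verify both are present; the pacman package names are an assumption based on current Arch Linux and may differ on other distributions:
java -version     # should print the installed JDK version
ssh -V            # should print the OpenSSH version
# if either is missing, on Arch Linux something like:
# pacman -S jdk8-openjdk openssh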
Create user groups and users
Create a user group named hadoop on each machine and add a user named hduser to it. The bash commands are as follows:
groupadd hadoop
useradd -b /home -m -g hadoop hduser
passwd hduser
Logon without a password
When starting the datanode or nodemanager on a slave machine, you would otherwise need to enter a password. To avoid entering the password every time, set up password-free login with the following commands. Note that the password-free login is one-way, from the master to the slaves.
cd $HOME/.ssh
ssh-keygen -t dsa
Copy id_dsa.pub to authorized_keys and upload it to the $HOME/.ssh directory on slave1 and slave2.
cp id_dsa.pub authorized_keys
# make sure the $HOME/.ssh directory for hduser already exists on slave1 and slave2
scp authorized_keys slave1:$HOME/.ssh
scp authorized_keys slave2:$HOME/.ssh
Change /etc/hosts on each machine
Add the following content to the /etc/hosts file on master, slave1, and slave2.
192.168.122.102 master
192.168.122.103 slave1
192.168.122.104 slave2
After the modification is complete, run ssh slave1 on the master to test it. If you are logged on to slave1 directly without being prompted for a password, the configuration above was successful.
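For example (the first connection may ask you to confirm the host key):
ssh slave1
exit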
Download hadoop 2.4.0
Log on to the master as hduser and run the following commands:
cd /home/hduser
wget http://mirror.esocc.com/apache/hadoop/common/hadoop-2.4.0/hadoop-2.4.0.tar.gz
mkdir yarn
tar zvxf hadoop-2.4.0.tar.gz -C yarn
Modify the Hadoop configuration files. First, add the following content to .bashrc:
export HADOOP_HOME=/home/hduser/yarn/hadoop-2.4.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
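After saving .bashrc, reload it so the variables take effect in the current shell:
source ~/.bashrc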
Modify $HADOOP_HOME/libexec/hadoop-config.sh
Add the following content at the beginning of the hadoop-config.sh file:
export JAVA_HOME=/opt/java
Modify $HADOOP_CONF_DIR/yarn-env.sh
Add the following content at the beginning of yarn-env.sh:
export JAVA_HOME=/opt/java
export HADOOP_HOME=/home/hduser/yarn/hadoop-2.4.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
XML Configuration File Modification
File 1: $HADOOP_CONF_DIR/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hduser/yarn/hadoop-2.4.0/tmp</value>
  </property>
</configuration>
File 2: $HADOOP_CONF_DIR/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
File 3: $HADOOP_CONF_DIR/mapred-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
File 4: $HADOOP_CONF_DIR/yarn-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8040</value>
  </property>
</configuration>
File 5: $HADOOP_CONF_DIR/slaves
Add the following content to the file:
slave1
slave2
Create tmp directory
Create the tmp directory under $HADOOP_HOME:
mkdir $HADOOP_HOME/tmp
Copy the yarn directory to slave1 and slave2
The configuration changes made so far all took place on the master machine; copy everything that changed to slave1 and slave2.
for target in slave1 slave2
do
  scp -r yarn $target:~/
  scp $HOME/.bashrc $target:~/
done
Format namenode
Format namenode on the master machine
bin/hadoop namenode -format
Start the cluster
sbin/hadoop-daemon.sh start namenode
sbin/hadoop-daemons.sh start datanode
sbin/yarn-daemon.sh start resourcemanager
sbin/yarn-daemons.sh start nodemanager
sbin/mr-jobhistory-daemon.sh start historyserver
Note: the daemon.sh scripts run only on the local machine, while the daemons.sh scripts run on all cluster nodes.
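To confirm that the daemons came up, you can run jps (shipped with the JDK) on each node; with the layout above, the expected processes are roughly as commented below:
jps
# on master: NameNode, ResourceManager, JobHistoryServer
# on slave1 and slave2: DataNode, NodeManager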
Verify that the hadoop cluster is correctly installed
Run a wordcount example. The specific steps are not listed here; for details, refer to the 11th article in this series.
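For a quick smoke test, one possible sequence using the examples jar bundled with Hadoop 2.4.0 is sketched below; the input and output paths are arbitrary choices, not values taken from the referenced article:
bin/hdfs dfs -mkdir -p /user/hduser/input
bin/hdfs dfs -put etc/hadoop/*.xml /user/hduser/input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar wordcount /user/hduser/input /user/hduser/output
bin/hdfs dfs -cat /user/hduser/output/part-r-00000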
Compile spark 1.0
Compiling Spark is still quite simple. Most compilation failures can be attributed to failed downloads of dependent jar packages.
To enable Spark 1.0 to support Hadoop 2.4.0 and Hive, compile with the following command:
SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt assembly
If everything goes well, spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar will be generated under the assembly directory.
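A quick way to confirm that the assembly jar was produced (the path matches the sbt output layout used in the packaging step below):
ls -lh assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar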
Create a running package
After compilation, the contents of the $SPARK_HOME directory are still very large, more than 2 GB in total. Only the following directories and files are actually needed at run time:
- $SPARK_HOME/bin
- $SPARK_HOME/sbin
- $SPARK_HOME/lib_managed
- $SPARK_HOME/conf
- $SPARK_HOME/assembly/target/scala-2.10
Copy the contents of the preceding directories to /tmp/spark-dist and create a compressed package:
mkdir /tmp/spark-dist
for i in $SPARK_HOME/{bin,sbin,lib_managed,conf,assembly/target/scala-2.10}
do
  cp -r $i /tmp/spark-dist
done
cd /tmp/
tar czvf spark-1.0-dist.tar.gz spark-dist
Upload the running package to the master machine.
Upload the generated running package to the master (192.168.122.102)
scp spark-1.0-dist.tar.gz hduser@192.168.122.102:~/
Run the hive on spark test case
After all the preparation above, we finally arrive at the most exciting moment.
Decompress spark-1.0-dist.tar.gz as the hduser user on the master host.
# after logging in to the master as hduser
tar zxvf spark-1.0-dist.tar.gz
cd spark-dist
Modify conf/spark-env.sh:
export SPARK_LOCAL_IP=127.0.0.1
export SPARK_MASTER_IP=127.0.0.1
The simplest example
Start the shell with the bin/spark-shell command, then run the following Scala code:
val sc: SparkContext // An existing SparkContext.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Importing the SQL context gives access to all the public SQL functions and implicit conversions.
import hiveContext._

hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
hql("FROM src SELECT key, value").collect().foreach(println)
If everything goes well, the last hql statement prints the keys and values stored in the src table.
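Assuming the table was loaded from the standard kv1.txt sample file that ships with the Spark examples, the printed rows should look something like the following (the exact ordering may differ):
[238,val_238]
[86,val_86]
[311,val_311]
...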