In big data scenarios, be aware that query and statistical analysis with Hive can have very high latency: a complex statistical analysis may need more than an hour to run, although compared with performing the same analysis on MySQL or another relational database, the execution is still much faster. SQL-like query statements written in HiveQL are parsed by the Hive query parser and translated into MapReduce programs that run on the Hadoop platform, so they inherit the characteristic latency problem of the MapReduce compute engine: map tasks write their intermediate results to files. If a HiveQL statement is complex and is translated into more than one MapReduce job, a large amount of intermediate map output is written to disk, with essentially no data sharing between the jobs.
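As an illustration (the tables and query here are hypothetical, used only to make the point concrete), a HiveQL statement that combines a join with a grouped aggregation is typically translated into several MapReduce stages; `EXPLAIN` shows the stage plan without running the query:

```shell
# Hypothetical tables (orders, users), for illustration only.
# EXPLAIN prints Hive's execution plan; a query like this is
# usually split into multiple MapReduce stages (one for the join,
# another for the aggregation), each writing intermediate map
# output to files with no data sharing between jobs.
hive -e "
EXPLAIN
SELECT u.city, COUNT(*) AS cnt
FROM orders o
JOIN users u ON o.user_id = u.id
GROUP BY u.city;
"
```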
If you use the Spark compute platform instead, computation is based on Spark's RDD data-set model, which reduces the overhead of generating intermediate result data during the calculation: Spark keeps data in memory so that subsequent operations can share it, reducing the latency of disk read/write I/O. In addition, with a Spark on YARN deployment mode you can take full advantage of data locality on the Hadoop cluster's DataNode nodes, reducing the communication overhead of transferring data across the network.
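As a minimal sketch of this difference, in the Spark shell an RDD can be cached in memory after its first use and then shared by later actions (the HDFS path below is an assumption):

```shell
# Run inside spark-shell; the HDFS path is hypothetical.
# cache() keeps the RDD in memory after the first action, so the
# second action reuses the in-memory data instead of re-reading
# the file or materializing intermediate results on disk.
spark-shell <<'EOF'
val logs = sc.textFile("hdfs:///data/access.log").cache()
val total  = logs.count()                              // first action builds and caches the RDD
val errors = logs.filter(_.contains("ERROR")).count()  // reuses the cached RDD
println(s"total=$total errors=$errors")
EOF
```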
Software preparation
The versions of the related software used here for testing and validation are as follows:

- CentOS 6.6 (Final)
- JDK 1.7.0_25
- Maven 3.2.1
- Hadoop 2.2.0
- Spark 1.3.1
- Hive 0.12.0
- MySQL Server 5.5.8
In addition, you need a working Hadoop cluster and an installed Hive client that can correctly run query analyses against Hive; the installation process is not described here, as you can refer to many documents on the Web. Since we are using the latest version, Spark 1.3.1, and want it to work with our existing Hadoop 2.2.0 platform, we need to recompile and build Spark; this is explained in more detail below.
The topology of the cluster environment used here is shown in the following table:
| Node | Service Name | Description |
| --- | --- | --- |
| Hadoop1 | Spark Master / Spark Driver | Spark cluster |
| Hadoop2 | DataNode / NodeManager | Hadoop cluster |
| Hadoop3 | DataNode / NodeManager | Hadoop cluster |
| Hadoop4 | Hive | Hive client |
| Hadoop5 | Spark Worker | Spark cluster |
| Hadoop6 | Spark Worker / NameNode / ResourceManager / Secondary NameNode | Spark cluster / Hadoop cluster |
| 10.10.4.130 | MySQL | Used to store Hive metadata |
The nodes above all have the same configuration; since they are test machines, the configuration is relatively low. We deployed the Spark cluster's Workers and the Hadoop cluster's NodeManager/DataNode on separate nodes, so there is no data locality when computing with Spark; a Spark on YARN deployment mode might therefore yield better computational performance.
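For reference, under a Spark on YARN deployment a job would be submitted roughly as follows (Spark 1.3-era syntax; the class name and jar are placeholders, and resource sizes should match your machines):

```shell
# Submitting to YARN instead of the standalone master lets the
# scheduler place executors on the DataNodes that hold the input
# blocks, restoring data locality. Class and jar are hypothetical.
spark-submit \
  --master yarn-client \
  --num-executors 4 \
  --executor-memory 1g \
  --class com.example.MyApp \
  myapp.jar
```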
Compiling, installing, and configuring Spark
First, download the Spark source archive from an Apache mirror and unpack it:
```shell
wget http://mirror.bit.edu.cn/apache/spark/spark-1.3.1/spark-1.3.1.tgz
tar xvzf spark-1.3.1.tgz
```
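To build Spark 1.3.1 against Hadoop 2.2.0 with Hive support, the Maven invocation looks roughly like this (profile names follow the Spark 1.x build documentation; adjust the memory settings to your machine):

```shell
# Give Maven enough memory for the Spark build (MaxPermSize is
# needed on JDK 7).
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M"
cd spark-1.3.1
# -Pyarn enables YARN support; -Phadoop-2.2 / -Dhadoop.version pin
# the Hadoop 2.2.0 client libraries; -Phive / -Phive-thriftserver
# add Hive integration for Spark SQL.
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 \
    -Phive -Phive-thriftserver -DskipTests clean package
```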