Spark 1.3.1 and Hive integration for query analysis

Source: Internet
Author: User

In big-data scenarios, be aware that query analysis with Hive has very high computational latency: a complex statistical analysis may need more than an hour to run, although compared with analysis in a relational database such as MySQL, execution over large datasets is still much faster. Queries written in HiveQL are processed by the Hive query parser and translated into MapReduce programs that run on the Hadoop platform, so they inherit the latency characteristic of the MapReduce compute engine: map-side intermediate results are written to files. If a HiveQL statement is complex enough to be translated into more than one MapReduce job, a large amount of intermediate map output is written to disk between jobs, with essentially no data sharing.
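As a hedged illustration (the table and column names below are hypothetical, not from the original setup), a HiveQL statement that joins and then aggregates will typically be compiled into more than one chained MapReduce job in Hive 0.12:

```shell
# Hypothetical example: a join followed by GROUP BY and ORDER BY
# usually compiles into several chained MapReduce jobs, each of which
# writes its intermediate map output to disk.
hive -e "
SELECT u.city, COUNT(*) AS cnt
FROM user_visits v
JOIN users u ON v.user_id = u.id
GROUP BY u.city
ORDER BY cnt DESC
LIMIT 10;
"
```

Running `EXPLAIN` on such a statement in the Hive CLI shows the stage plan and how many jobs it breaks into.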
If you use the Spark compute platform instead, computation based on Spark's RDD data-set model reduces the overhead of materializing intermediate results: Spark keeps data in memory so that subsequent operations can share it, cutting the latency of disk read/write I/O. In addition, with the Spark-on-YARN deployment mode you can take full advantage of data locality on the Hadoop cluster's DataNode hosts, reducing the communication overhead of data transfer.
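As a minimal sketch (the master URL and table name are assumptions, not taken from the original cluster), the same kind of query can be run through Spark 1.3.1's HiveContext from spark-shell, with the table cached in memory so that subsequent queries share the data instead of re-reading disk:

```shell
# Sketch only: the standalone master URL and the table name are assumptions.
# HiveContext reads table definitions from the Hive metastore;
# cacheTable() keeps the table in memory for reuse by later queries.
~/spark-1.3.1/bin/spark-shell --master spark://hadoop1:7077 <<'EOF'
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.cacheTable("user_visits")
hiveContext.sql("SELECT COUNT(*) FROM user_visits").collect().foreach(println)
EOF
```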

Software preparation

The versions of the related software used here for testing and validation are as follows: CentOS-6.6 (Final), jdk-1.7.0_25, Maven-3.2.1, Hadoop-2.2.0, Spark-1.3.1, Hive-0.12.0, mysql-server-5.5.8.

This assumes you have already built a working Hadoop cluster and installed a Hive client that can correctly run query analysis; the installation process is not described here, and many documents on the web cover it. Since we are using the latest Spark-1.3.1 and want it to run on our existing Hadoop 2.2.0 platform, we need to rebuild Spark from source, which is explained in more detail below.
The topology of the cluster environment used is shown in the following table:

Node         Service                                                         Role
Hadoop1      Spark Master / Spark Driver                                     Spark cluster
Hadoop2      DataNode / NodeManager                                          Hadoop cluster
Hadoop3      DataNode / NodeManager                                          Hadoop cluster
Hadoop4      Hive                                                            Hive client
Hadoop5      Spark Worker                                                    Spark cluster
Hadoop6      Spark Worker / NameNode / ResourceManager / Secondary NameNode  Spark cluster / Hadoop cluster
10.10.4.130  MySQL                                                           Stores Hive metadata

The nodes above have identical hardware configurations; because they are test machines, the configuration is relatively low. We deployed the Spark Workers and the Hadoop NodeManager/DataNode processes on separate nodes, so Spark computations get no data-locality benefit here; a Spark-on-YARN deployment, which colocates compute with the DataNodes, may therefore achieve better computational performance.
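Pointing Hive's metastore at the MySQL node listed above is normally done in hive-site.xml. The following is a minimal sketch, not the original configuration; the database name, user, and password are assumptions:

```xml
<!-- hive-site.xml fragment: metastore backed by MySQL on 10.10.4.130 -->
<!-- The database name, user, and password below are assumptions. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://10.10.4.130:3306/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive_password</value>
</property>
```

The MySQL JDBC driver jar must also be placed on Hive's classpath (typically under the Hive lib directory).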

Compiling, installing, and configuring Spark

First, download the Spark source archive from the official site:

cd ~/
wget http://mirror.bit.edu.cn/apache/spark/spark-1.3.1/spark-1.3.1.tgz
tar xvzf spark-1.3.1.tgz
mv
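From the unpacked source directory, Spark can then be built against Hadoop 2.2.0 with Hive support. This is a hedged sketch: the profiles and MAVEN_OPTS sizing follow the Spark 1.3.x build documentation, and the directory name assumes the archive was unpacked in place:

```shell
cd ~/spark-1.3.1
# Give Maven enough memory for the Spark build (per the Spark build docs;
# adjust to your machine).
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
# -Phadoop-2.2 with hadoop.version=2.2.0 targets our existing Hadoop platform;
# -Phive and -Phive-thriftserver enable Hive integration and the JDBC server.
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 \
    -Phive -Phive-thriftserver -DskipTests clean package
```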
