In big data scenarios, be aware that query and statistical analysis with Hive can have very high latency: a complex statistical analysis may need more than an hour to run, although compared with performing the same analysis on MySQL or another relational database, the execution is still much faster. SQL-like query statements written in HiveQL are parsed by the Hive query parser and translated into MapReduce programs that run on the Hadoop platform, so they inherit the characteristic latency problem of the MapReduce compute engine: map tasks write their intermediate results to files. If a HiveQL statement is complex and is translated into more than one MapReduce job, a large amount of intermediate map output is written to disk, with essentially no data sharing between the jobs.
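As an illustration (the tables and query here are hypothetical, used only to make the point concrete), a HiveQL statement that combines a join with a grouped aggregation is typically translated into several MapReduce stages; `EXPLAIN` shows the stage plan without running the query:

```shell
# Hypothetical tables (orders, users), for illustration only.
# EXPLAIN prints Hive's execution plan; a query like this is
# usually split into multiple MapReduce stages (one for the join,
# another for the aggregation), each writing intermediate map
# output to files with no data sharing between jobs.
hive -e "
EXPLAIN
SELECT u.city, COUNT(*) AS cnt
FROM orders o
JOIN users u ON o.user_id = u.id
GROUP BY u.city;
"
```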
If you use the Spark compute platform instead, computation is based on Spark's RDD data-set model, which reduces the overhead of generating intermediate result data during the calculation: Spark keeps data in memory so that subsequent operations can share it, reducing the latency of disk read/write I/O. In addition, with a Spark on YARN deployment mode you can take full advantage of data locality on the Hadoop cluster's DataNode nodes, reducing the communication overhead of transferring data across the network.
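As a minimal sketch of this difference, in the Spark shell an RDD can be cached in memory after its first use and then shared by later actions (the HDFS path below is an assumption):

```shell
# Run inside spark-shell; the HDFS path is hypothetical.
# cache() keeps the RDD in memory after the first action, so the
# second action reuses the in-memory data instead of re-reading
# the file or materializing intermediate results on disk.
spark-shell <<'EOF'
val logs = sc.textFile("hdfs:///data/access.log").cache()
val total  = logs.count()                              // first action builds and caches the RDD
val errors = logs.filter(_.contains("ERROR")).count()  // reuses the cached RDD
println(s"total=$total errors=$errors")
EOF
```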
Software preparation
The versions of the related software used here for testing and validation are as follows:

- CentOS 6.6 (Final)
- JDK 1.7.0_25
- Maven 3.2.1
- Hadoop 2.2.0
- Spark 1.3.1
- Hive 0.12.0
- MySQL Server 5.5.8
In addition, you need a working Hadoop cluster and an installed Hive client that can correctly run query analyses against Hive; the installation process is not described here, as you can refer to many documents on the Web. Since we are using the latest version, Spark 1.3.1, and want it to work with our existing Hadoop 2.2.0 platform, we need to recompile and build Spark; this is explained in more detail below.
The topology of the cluster environment used here is shown in the following table:
| Node | Service Name | Description |
| --- | --- | --- |
| Hadoop1 | Spark Master / Spark Driver | Spark cluster |
| Hadoop2 | DataNode / NodeManager | Hadoop cluster |
| Hadoop3 | DataNode / NodeManager | Hadoop cluster |
| Hadoop4 | Hive | Hive client |
| Hadoop5 | Spark Worker | Spark cluster |
| Hadoop6 | Spark Worker / NameNode / ResourceManager / Secondary NameNode | Spark cluster / Hadoop cluster |
| 10.10.4.130 | MySQL | Used to store Hive metadata |
The nodes above all have the same configuration; since they are test machines, the configuration is relatively low. We deployed the Spark cluster's Workers and the Hadoop cluster's NodeManager/DataNode on separate nodes, so there is no data locality when computing with Spark; a Spark on YARN deployment mode might therefore yield better computational performance.
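For reference, under a Spark on YARN deployment a job would be submitted roughly as follows (Spark 1.3-era syntax; the class name and jar are placeholders, and resource sizes should match your machines):

```shell
# Submitting to YARN instead of the standalone master lets the
# scheduler place executors on the DataNodes that hold the input
# blocks, restoring data locality. Class and jar are hypothetical.
spark-submit \
  --master yarn-client \
  --num-executors 4 \
  --executor-memory 1g \
  --class com.example.MyApp \
  myapp.jar
```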
Compiling, installing, and configuring Spark
First, download the Spark source archive from an Apache mirror and unpack it:
```shell
wget http://mirror.bit.edu.cn/apache/spark/spark-1.3.1/spark-1.3.1.tgz
tar xvzf spark-1.3.1.tgz
```
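To build Spark 1.3.1 against Hadoop 2.2.0 with Hive support, the Maven invocation looks roughly like this (profile names follow the Spark 1.x build documentation; adjust the memory settings to your machine):

```shell
# Give Maven enough memory for the Spark build (MaxPermSize is
# needed on JDK 7).
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M"
cd spark-1.3.1
# -Pyarn enables YARN support; -Phadoop-2.2 / -Dhadoop.version pin
# the Hadoop 2.2.0 client libraries; -Phive / -Phive-thriftserver
# add Hive integration for Spark SQL.
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 \
    -Phive -Phive-thriftserver -DskipTests clean package
```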