The previous section described SparkSQL's operating architecture; the following sections turn to its use. Before introducing the use of SparkSQL, we need to build a SparkSQL test environment. This test environment involves Hadoop's HDFS, Hive, Spark, and related data files:
- Hadoop version: 2.2.0
- Hive version: 0.13
- Spark version: 1.1.0-RC3
- MySQL version: 5.6.12
- Test data download location: http://pan.baidu.com/s/1eQCbT30#path=%252Fblog (sparksql_data.zip)
Test environment:
This test environment is built around one physical machine with 16 GB of memory and a 4-core/8-thread CPU. hadoop1, hadoop2, and hadoop3 are VirtualBox VMs that form the Hadoop cluster and the Spark cluster; the physical machine wyy acts as the client, where code is written and compute jobs are submitted. The overall test environment is configured as follows:
| Machine name | Configuration | Role | Software installed |
| --- | --- | --- | --- |
| hadoop1 | 4 GB memory, 1 core | Hadoop: NN/DN; Spark: Master/Worker | /app/hadoop/hadoop220, /app/hadoop/spark110, /app/scala2104, /usr/java/jdk1.7.0_21 |
| hadoop2 | 4 GB memory, 1 core | Hadoop: DN; Spark: Worker; hive0.13 client | /app/hadoop/hadoop220, /app/hadoop/spark110, /app/hadoop/hive013, /app/scala2104, /usr/java/jdk1.7.0_21 |
| hadoop3 | 4 GB memory, 1 core | Hadoop: DN; Spark: Worker; hive0.13 metastore service; MySQL server | /app/hadoop/hadoop220, /app/hadoop/spark110, /app/hadoop/hive013, /app/scala2104, /usr/java/jdk1.7.0_21, mysql5.6.12 |
| wyy | 16 GB memory, 4 cores | Client; hive0.13 client | /app/hadoop/hadoop220, /app/hadoop/spark110, /app/hadoop/hive013 |
The hadoop220, spark110, and hive013 installation directories above are owned by user hadoop (group hadoop); the other installation directories are owned by root:root.
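These ownership settings can be applied with chown. A minimal sketch, assuming the paths from the table above; the commands are only printed here, since they need to be run as root on each node:

```shell
# Build the chown commands for the hadoop-owned installation directories
# (paths follow the environment table; run the printed commands as root on each node).
DIRS="/app/hadoop/hadoop220 /app/hadoop/spark110 /app/hadoop/hive013"
CMDS=""
for d in $DIRS; do
  CMDS="${CMDS}chown -R hadoop:hadoop $d
"
done
printf '%s' "$CMDS"
```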
Test environment build order
1: Virtual cluster construction (hadoop1, hadoop2, hadoop3)
A: For the hadoop2.2.0 cluster build, see the blog post "hadoop2.2.0 test environment build" or the video at http://pan.baidu.com/s/1qwqfy4c (extract password: xv4i).
B: For MySQL installation, see the blog post "mysql5.6.12 for Linux installation".
C: For Hive installation, see the blog post "Hive 0.11.0 remote mode build"; the hive0.13 used in this test installs the same way as hive0.11. Hive is installed on hadoop3, hadoop2, and wyy. hadoop3 runs the metastore service; hadoop2 and wyy act as hive clients after configuring the metastore URIs.
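On the hive clients (hadoop2 and wyy), the remote metastore is reached through the `hive.metastore.uris` property. A minimal `hive-site.xml` fragment, assuming the metastore service on hadoop3 listens on Hive's default port 9083 (adjust if your metastore uses another port):

```xml
<!-- hive-site.xml on hadoop2 and wyy: use the remote metastore on hadoop3 -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://hadoop3:9083</value>
</property>
```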
D: For the Spark standalone cluster build, see the blog post "Spark1.0.0 on Standalone mode deployment". Note that this test uses spark1.1.0, whose deployment-package generation script make-distribution.sh takes different parameters; in spark1.1.0 the make-distribution.sh usage is:
```shell
./make-distribution.sh [--name] [--tgz] [--with-tachyon] <maven build options>
```
The parameters mean:
- `--with-tachyon`: whether the Tachyon in-memory file system is supported; without this parameter it is not supported.
- `--tgz`: generate spark-$VERSION-bin.tar.gz in the root directory; without this parameter no tgz file is generated, only the /dist directory.
- `--name NAME`: combined with `--tgz`, generates a spark-$VERSION-bin-$NAME.tgz deployment package; without this parameter, NAME defaults to the Hadoop version number.
- `<maven build options>`: options usable with the Maven build, such as `-P` and `-D`. For this build, a spark1.1.0 deployment package based on hadoop2.2.0 and YARN, integrating hive, ganglia, and kinesis-asl, can be generated with:
```shell
./make-distribution.sh --tgz --name 2.2.0 -Pyarn -Phadoop-2.2 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive
```
This finally generates the deployment package spark-1.1.0-bin-2.2.0.tgz, which is then installed according to the test-environment plan.
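Installing the package according to the plan can be sketched as follows; the hostnames and the /app/hadoop target path come from the environment table above, and the commands are printed rather than executed, since they need the cluster nodes to be reachable:

```shell
# Sketch: unpack the generated package into the planned path, then copy the
# tree to the other nodes. Hostnames and paths follow the environment table.
PKG=spark-1.1.0-bin-2.2.0.tgz
TARGET=/app/hadoop/spark110
INSTALL="tar -xzf $PKG
mv spark-1.1.0-bin-2.2.0 $TARGET
scp -r $TARGET hadoop2:/app/hadoop/
scp -r $TARGET hadoop3:/app/hadoop/"
printf '%s\n' "$INSTALL"
```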
2: Client construction
The client wyy runs Ubuntu while the spark virtual cluster runs CentOS, and the two operating systems use different default Java installation directories. So when installing Java on Ubuntu, we deliberately changed the Java installation path to match CentOS; otherwise, after every scp of the virtual cluster's configuration files, JAVA_HOME would have to be modified in the Hadoop and Spark run configuration files. The client's hadoop2.2.0, spark1.1.0, and hive0.13 were copied directly via scp from the virtual cluster, placed in the same directories with the same user attributes. The development tool is IntelliJ IDEA; programs are compiled, packaged, and copied to spark1.1.0's root directory /app/hadoop/spark110, then submitted to the virtual cluster with spark-submit.
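A spark-submit call of the kind described might look as follows. The class name week2.SQLOnSpark, the jar name, and the memory setting are placeholder assumptions for your own build; the master URL assumes the Spark master runs on hadoop1 at the standalone default port 7077. The command is printed rather than executed:

```shell
# Sketch: submit a locally built jar from the client wyy to the standalone cluster.
# week2.SQLOnSpark and sqlonspark.jar are placeholders for your own build artifacts.
SPARK_HOME=/app/hadoop/spark110
SUBMIT="$SPARK_HOME/bin/spark-submit --master spark://hadoop1:7077 \
--class week2.SQLOnSpark --executor-memory 1g $SPARK_HOME/sqlonspark.jar"
echo "$SUBMIT"
```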
3: File data preparation
Start hadoop2.2.0 (only HDFS needs to be started), then upload the data files to the corresponding directory.
4: Hive data preparation
In hive, define a database saledata and three tables tbldate, tblstock, and tblstockdetail, then load the data. The specific commands:
```sql
CREATE DATABASE saledata;
USE saledata;

-- date.txt defines the date classification, assigning each day its month, week,
-- quarter, etc. attributes: date, year-month, year, month, day, day-of-week,
-- week-of-year, quarter, ten-day period, half-month
CREATE TABLE tbldate (dateID string, theyearmonth string, theyear string,
  themonth string, thedate string, theweek string, theweeks string,
  thequot string, thetenday string, thehalfmonth string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

-- stock.txt defines the order headers: order number, trading location, trading date
CREATE TABLE tblstock (ordernumber string, locationid string, dateID string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

-- stockdetail.txt defines the order details: order number, row number, item,
-- quantity, price, amount
CREATE TABLE tblstockdetail (ordernumber string, rownum int, itemid string,
  qty int, price int, amount int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

-- load the data
LOAD DATA LOCAL INPATH '/home/mmicky/mboo/myclass/doc/sparksql/data/date.txt' INTO TABLE tbldate;
LOAD DATA LOCAL INPATH '/home/mmicky/mboo/myclass/doc/sparksql/data/stock.txt' INTO TABLE tblstock;
LOAD DATA LOCAL INPATH '/home/mmicky/mboo/myclass/doc/sparksql/data/stockdetail.txt' INTO TABLE tblstockdetail;
```
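Once loaded, the tables can be sanity-checked from the hive CLI on any client node; for example (the exact row counts depend on the downloaded data files):

```sql
-- Quick checks after loading; run in the hive CLI.
USE saledata;
SHOW TABLES;
SELECT COUNT(*) FROM tblstock;
SELECT * FROM tbldate LIMIT 3;
```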
Finally, the relevant data can be seen in HDFS.
5: Start enjoying the SparkSQL journey...
The Spark Big Data Fast Computing Platform course (Phase III) will be available soon; this article is part of the new course material.