sparkSQL1.1: Setting up the test environment

The operating architecture of sparkSQL was described earlier; the use of sparkSQL is described in the following articles. Before introducing how to use sparkSQL, we need to build a sparkSQL test environment. This test environment involves Hadoop's HDFS, Hive, Spark, and the related data files:
    • Hadoop version is 2.2.0
    • Hive version is 0.13
    • Spark version is 1.1.0-RC3
    • MySQL version is 5.6.12
    • Test data download location: http://pan.baidu.com/s/1eQCbT30#path=%252Fblog (file sparksql_data.zip)

Test environment:
This test environment is built on a single physical machine with 16 GB of memory and a 4-core, 8-thread CPU. hadoop1, hadoop2, and hadoop3 are VirtualBox VMs that form the Hadoop cluster and the Spark cluster; the physical machine wyy acts as the client, used for writing code and submitting compute jobs. The overall test environment is configured as follows:
    • hadoop1: 4 GB memory, 1 core; roles: Hadoop NameNode/DataNode, Spark Master/Worker; software installed: /app/hadoop/hadoop220, /app/hadoop/spark110, /app/scala2104, /usr/java/jdk1.7.0_21
    • hadoop2: 4 GB memory, 1 core; roles: Hadoop DataNode, Spark Worker, hive 0.13 client; software installed: /app/hadoop/hadoop220, /app/hadoop/spark110, /app/hadoop/hive013, /app/scala2104, /usr/java/jdk1.7.0_21
    • hadoop3: 4 GB memory, 1 core; roles: Hadoop DataNode, Spark Worker, hive 0.13 metastore service, MySQL server; software installed: /app/hadoop/hadoop220, /app/hadoop/spark100, /app/hadoop/hive013, /app/scala2104, /usr/java/jdk1.7.0_21, mysql5.6.12
    • wyy: 16 GB memory, 4 cores; roles: client, hive 0.13 client; software installed: /app/hadoop/hadoop220, /app/hadoop/spark110, /app/hadoop/hive013
The hadoop220, spark, and hive installation directories above are owned by user hadoop (group hadoop); the other installation directories are owned by root:root.
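As a rough sketch of setting the stated ownership (assuming the directories already exist on each machine; run as root):
chown -R hadoop:hadoop /app/hadoop/hadoop220 /app/hadoop/spark110 /app/hadoop/hive013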
Test environment build order
1: Virtual cluster construction (hadoop1, hadoop2, hadoop3)
A: Hadoop 2.2.0 cluster build: refer to the blog post "hadoop2.2.0 test environment build" or the video at http://pan.baidu.com/s/1qwqfy4c (extraction password: xv4i).
B: MySQL installation: refer to the blog post "mysql5.6.12 for Linux installation".
C: Hive installation: refer to the blog post "Hive 0.11.0 remote mode build"; the hive 0.13 used in this test is installed in the same way as hive 0.11. Hive is installed on hadoop3, hadoop2, and wyy. hadoop3 starts the metastore service; hadoop2 and wyy act as hive clients after configuring the metastore URIs (a sketch follows).
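As a minimal sketch (not taken from the referenced blog post), starting the metastore on hadoop3 and pointing the clients at it could look like this, assuming the default Thrift port 9083:
# on hadoop3: start the hive metastore service (its hive-site.xml points at the local MySQL metastore database)
nohup hive --service metastore > metastore.log 2>&1 &
# on hadoop2 and wyy: in hive-site.xml, set hive.metastore.uris to thrift://hadoop3:9083 so they run as remote-mode clients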
D: Spark 1.1.0 Standalone cluster build: refer to the blog post "Spark1.0.0 on Standalone mode deployment". Note that this test uses spark1.1.0, and the parameters of the deployment-package generation script make-distribution.sh have changed. In spark1.1.0, make-distribution.sh uses the format:
./make-distribution.sh [--name] [--tgz] [--with-tachyon] <maven build options>
The meaning of the parameters:
    • --with-tachyon: whether the in-memory file system Tachyon is supported; Tachyon is not supported when this parameter is omitted.
    • --tgz: generate spark-$VERSION-bin.tar.gz in the root directory; when this parameter is omitted, no tgz file is generated, only the /dist directory.
    • --name NAME: combined with --tgz, generates the deployment package spark-$VERSION-bin-$NAME.tgz; when this parameter is omitted, NAME defaults to the Hadoop version number.
    • <maven build options>: the options usable with a Maven build, such as -P and -D.
To build a spark1.1.0 deployment package based on hadoop2.2.0 and YARN that integrates hive, ganglia, and kinesis-asl, you can use the command:
./make-distribution.sh --tgz --name 2.2.0 -Pyarn -Phadoop-2.2 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive
Finally, the deployment package spark-1.1.0-bin-2.2.0.tgz is generated and installed according to the test environment plan.
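As a sketch of that installation step (the target directories follow the environment table above; the remote-copy commands are assumptions, not the original instructions, and paths should be adjusted per the plan):
tar -zxf spark-1.1.0-bin-2.2.0.tgz
mv spark-1.1.0-bin-2.2.0 /app/hadoop/spark110
# copy the unpacked directory to the other nodes as the hadoop user
scp -r /app/hadoop/spark110 hadoop@hadoop2:/app/hadoop/
scp -r /app/hadoop/spark110 hadoop@hadoop3:/app/hadoop/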
2: Client building
The client wyy runs Ubuntu, while the spark virtual cluster runs CentOS, and the default Java installation directories of the two operating systems differ. When installing Java on Ubuntu, we therefore deliberately used the same installation path as on CentOS; otherwise, after every scp of the virtual cluster's configuration files, JAVA_HOME would have to be modified in the Hadoop and Spark run configuration files. The client's hadoop2.2.0, spark1.1.0, and hive0.13 are copied directly via scp from the virtual cluster and placed in the same directories with the same user attributes. IntelliJ IDEA is used as the development tool; applications are compiled, packaged, and copied to spark1.1.0's root directory /app/hadoop/spark110, then submitted to the virtual cluster with spark-submit (a sketch follows).
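A minimal sketch of such a submission from wyy (the jar name and main class are hypothetical placeholders; the master URL assumes the Spark Master runs on hadoop1 with the default standalone port 7077):
cd /app/hadoop/spark110
./bin/spark-submit --master spark://hadoop1:7077 --class com.example.SparkSQLDemo --executor-memory 1g sparksql-demo.jar
# com.example.SparkSQLDemo and sparksql-demo.jar stand in for the packaged application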
3: File data preparation
Start hadoop2.2.0 (only HDFS needs to be started), then upload the data files to the corresponding directory.
4: Hive data preparation
In hive, define a database saledata and three tables tbldate, tblstock, and tblstockdetail, then load the data. The specific commands are:
CREATE DATABASE saledata;
USE saledata;
-- date.txt defines the date dimension, assigning each day its month, week, quarter, etc.
-- columns: date, year-month, year, month, day, week, week of year, quarter, ten-day period, half month
CREATE TABLE tbldate (dateID string, theyearmonth string, theyear string, themonth string, thedate string, theweek string, theweeks string, thequot string, thetenday string, thehalfmonth string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
-- stock.txt defines the order header
-- columns: order number, trading location, trading date
CREATE TABLE tblstock (ordernumber string, locationid string, dateID string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
-- stockdetail.txt defines the order details
-- columns: order number, row number, item, quantity, price, amount
CREATE TABLE tblstockdetail (ordernumber string, rownum int, itemid string, qty int, price int, amount int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
-- load the data
LOAD DATA LOCAL INPATH '/home/mmicky/mboo/myclass/doc/sparksql/data/date.txt' INTO TABLE tbldate;
LOAD DATA LOCAL INPATH '/home/mmicky/mboo/myclass/doc/sparksql/data/stock.txt' INTO TABLE tblstock;
LOAD DATA LOCAL INPATH '/home/mmicky/mboo/myclass/doc/sparksql/data/stockdetail.txt' INTO TABLE tblstockdetail;
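After loading, a quick sanity check from the hive CLI on one of the clients might look like the following (the count query is an illustration, not part of the original steps):
hive -e "USE saledata; SHOW TABLES; SELECT COUNT(*) FROM tblstock;"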
Finally, the relevant data can be seen in HDFS (a sketch of checking this follows).
5: Start enjoying the sparkSQL tour...
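Assuming the default hive warehouse location /user/hive/warehouse (it would differ if hive.metastore.warehouse.dir was changed), the loaded tables can be listed with:
hdfs dfs -ls /user/hive/warehouse/saledata.db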
The Spark Big Data Fast Computing Platform (Phase III) course will be launched soon; this material is part of the new course material.
