"Note" This series of articles and the use of the installation package/test data can be in the "big gift--spark Getting Started Combat series" Get 1, compile spark
Spark can be compiled in two ways, SBT and Maven, and the deployment package is then generated with the make-distribution.sh script. SBT compilation requires Git to be installed, while Maven compilation requires the Maven tool; both need network access to download dependencies. In our comparison SBT compilation was slower (possible reasons: 1. the runs happened at different times -- SBT during the day, Maven late at night -- so dependency download speeds differed; 2. Maven downloads large files with multiple threads while SBT uses a single process). The Maven build succeeded after roughly 3 to 4 hours.
1.1 Compiling Spark (SBT)
1.1.1 Installing Git (compiling from source)
1. Download the Git installation package from one of the following addresses:
http://www.onlinedown.net/softdown/169333_2.htm
https://www.kernel.org/pub/software/scm/git/
On CentOS, Git can instead be installed directly with yum install git.
Because the source is fetched over HTTPS, curl-devel also needs to be installed; it can be obtained from the following address:
http://rpmfind.net/linux/rpm2html/search.php?query=curl-devel
On CentOS, it can be installed directly with yum install curl-devel.
2. Upload and unpack Git
Upload the git-1.7.6.tar.gz package to the /home/hadoop/upload directory, unpack it, and move it to the /app directory:
$ cd /home/hadoop/upload/
$ tar -xzf git-1.7.6.tar.gz
$ mv git-1.7.6 /app
$ ll /app
3. Compiling and installing Git
As the root user, compile and install Git in the same path:
# yum install curl-devel
# cd /app/git-1.7.6
# ./configure
# make
# make install
4. Add Git to the PATH
Open /etc/profile and add Git's path to the PATH variable:
export GIT_HOME=/app/git-1.7.6
export PATH=$PATH:$JAVA_HOME/bin:$MAVEN_HOME/bin:$GIT_HOME/bin
Log in again or run source /etc/profile to make the settings take effect, then use the git command to check that the configuration is correct.
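For example, a quick sanity check (assuming the steps above completed without errors) is:
$ which git
$ git --version
The second command should report version 1.7.6 if the newly compiled Git is the one found first on the PATH.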
1.1.2 Download Spark source code and upload it
1. The Spark source code can be downloaded from the following addresses:
http://spark.apache.org/downloads.html
http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0.tgz
git clone https://github.com/apache/spark.git
Upload the downloaded spark-1.1.0.tgz source package to the /home/hadoop/upload directory using the tool introduced in section 1.1.3.1.
2. Unzip on the master node
$ cd /home/hadoop/upload/
$ tar -xzf spark-1.1.0.tgz
3. Rename spark-1.1.0 and move it to the /app/complied directory:
$ mv spark-1.1.0 /app/complied/spark-1.1.0-sbt
$ ls /app/complied
1.1.3 Compiling code
Compiling the Spark source code requires downloading dependency packages from the Internet, so the machine must stay connected to the network throughout the compilation. Compile by running the following commands:
$ cd /app/complied/spark-1.1.0-sbt
$ sbt/sbt assembly -Pyarn -Phadoop-2.2 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive
The whole build runs more than a dozen compilation tasks and may need to be restarted several times; it can take several hours, or even more than ten, to finish (mainly depending on how quickly the dependency packages download).
1.2 Compiling Spark (Maven)
1.2.1 Installing Maven and configuring parameters
It is best to install Maven 3.0 or later before compiling, and to add the following settings to the /etc/profile configuration file:
export MAVEN_HOME=/app/apache-maven-3.0.5
export PATH=$PATH:$JAVA_HOME/bin:$MAVEN_HOME/bin:$GIT_HOME/bin
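After reloading the profile, a simple way to confirm that Maven is picked up correctly (assuming the installation path above) is:
$ source /etc/profile
$ mvn -version
This should print the Maven version (3.0.5 here) together with the Java version it uses.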
1.2.2 Download Spark source code and upload it
1. The Spark source code can be downloaded from the following addresses:
http://spark.apache.org/downloads.html
http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0.tgz
git clone https://github.com/apache/spark.git
Upload the downloaded spark-1.1.0.tgz source package to the /home/hadoop/upload directory using the tool introduced in section 1.1.3.1.
2. Unzip on the master node
$ cd /home/hadoop/upload/
$ tar -xzf spark-1.1.0.tgz
3. Rename spark-1.1.0 and move it to the /app/complied directory:
$ mv spark-1.1.0 /app/complied/spark-1.1.0-mvn
$ ls /app/complied
1.2.3 Compiling code
Compiling the Spark source code requires downloading dependency packages from the Internet, so the machine must stay connected to the network throughout the compilation. Compile by running the following commands:
$ cd /app/complied/spark-1.1.0-mvn
$ export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m"
$ mvn -Pyarn -Phadoop-2.2 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive -DskipTests clean package
The compilation runs about 24 tasks, and the whole process took about 1 hour and 45 minutes.
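As a rough sanity check that the build produced the Spark assembly, you can list the assembly output directory; the path and file name below are an assumption based on a Scala 2.10 / Hadoop 2.2 build and may differ slightly in your environment:
$ ls -lh assembly/target/scala-2.10/
A jar with a name similar to spark-assembly-1.1.0-hadoop2.2.0.jar should appear there.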
1.3 Generating a Spark deployment package
The Spark source root directory contains a make-distribution.sh script for generating the deployment package. It is invoked as follows:
./make-distribution.sh [--name] [--tgz] [--with-tachyon] <maven build options>
- --name: together with --tgz, produces a deployment package named spark-$VERSION-bin-$NAME.tgz; if this option is omitted, NAME defaults to the Hadoop version number
- --tgz: generates spark-$VERSION-bin.tgz in the root directory; without this option no tgz file is produced, only the /dist directory
- --with-tachyon: adds support for the Tachyon in-memory file system; without this option Tachyon is not supported
Examples:
1. Generate a deployment package that supports YARN, Hadoop 2.2.0, and Hive:
./make-distribution.sh --tgz --name 2.2.0 -Pyarn -Phadoop-2.2 -Phive
2. Generate a deployment package that supports YARN, Hadoop 2.2.0, Hive, and Ganglia:
./make-distribution.sh --tgz --name 2.2.0 -Pyarn -Phadoop-2.2 -Pspark-ganglia-lgpl -Phive
1.3.1 Generating a deployment package
Use the following command to generate the Spark deployment package. Because the script assumes JDK 1.6 by default, it asks at the start whether to continue; just answer y.
$ cd /app/complied/spark-1.1.0-mvn/
$ ./make-distribution.sh --tgz --name 2.2.0 -Pyarn -Phadoop-2.2 -Pspark-ganglia-lgpl -Phive
Generating the Spark deployment package runs about 24 compilation tasks and took roughly 1 hour and 38 minutes.
1.3.2 View Build Results
The generated deployment package is placed in the Spark source root directory, with a file name like spark-1.1.0-bin-2.2.0.tgz.
2. Installing Spark
2.1 Uploading and unpacking the Spark installation package
1. Use the spark-1.1.0-bin-2.2.0.tgz file compiled in the previous step as the installation package (you can also download the original package from the Web, or a package built for 64-bit Hadoop), and upload it to the /home/hadoop/upload directory using the tool introduced in section 1.3.1 of "Spark Compilation and Deployment (Part 1)".
2. Unzip on the master node
$ cd /home/hadoop/upload/
$ tar -xzf spark-1.1.0-bin-2.2.0.tgz
3. Rename Spark and move it to the /app/hadoop directory:
$ mv spark-1.1.0-bin-2.2.0 /app/hadoop/spark-1.1.0
$ ll /app/hadoop
2.2 Configuring /etc/profile
1. Open the configuration file /etc/profile:
$ sudo vi /etc/profile
2. Define SPARK_HOME and add the Spark paths to the PATH variable:
export SPARK_HOME=/app/hadoop/spark-1.1.0
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
2.3 Configuring conf/slaves
1. Open the configuration file conf/slaves:
$ cd /app/hadoop/spark-1.1.0/conf
$ sudo vi slaves
2. Add the slave node entries:
hadoop1
hadoop2
hadoop3
2.4 Configuring conf/spark-env.sh
1. Open the configuration file conf/spark-env.sh:
$ cd /app/hadoop/spark-1.1.0/conf
$ cp spark-env.sh.template spark-env.sh
$ sudo vi spark-env.sh
2. Add the Spark environment settings, making hadoop1 the master node:
export SPARK_MASTER_IP=hadoop1
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=512m
2.5 Distributing the Spark program to each node
1. On the hadoop1 machine, enter the /app/hadoop directory and copy the Spark folder to the hadoop2 and hadoop3 machines with the following commands:
$ cd /app/hadoop
$ scp -r spark-1.1.0 hadoop@hadoop2:/app/hadoop/
$ scp -r spark-1.1.0 hadoop@hadoop3:/app/hadoop/
2. Check from the slave nodes whether the copy succeeded, for example as shown below.
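A simple check, following the host names and user used in this article, is to list the target directory on each slave over SSH:
$ ssh hadoop@hadoop2 ls /app/hadoop
$ ssh hadoop@hadoop3 ls /app/hadoop
Both listings should contain the spark-1.1.0 directory.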
2.6 Start Spark
$ cd /app/hadoop/spark-1.1.0/sbin
$ ./start-all.sh
2.7 Verify Startup
At this point, the processes running on hadoop1 are Master and Worker.
The processes running on hadoop2 and hadoop3 include only Worker.
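One quick way to confirm this (assuming the JDK's jps tool is on the PATH of each node) is to run it on every machine:
$ jps
On hadoop1 the output should include Master and Worker; on hadoop2 and hadoop3 only Worker should appear (alongside any HDFS processes that may also be running).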
Check the hadoop1 node's listening ports with the netstat -nlt command.
Enter http://hadoop1:8080 in the browser to open the Spark cluster status page (note: add an exception for hadoop* in the network/proxy settings, otherwise the name is sent to external DNS for resolution and the page cannot be reached).
2.8 Verifying client connections
Log in to the hadoop1 node, enter Spark's bin directory, and connect to the cluster with spark-shell:
$ cd /app/hadoop/spark-1.1.0/bin
$ spark-shell --master spark://hadoop1:7077 --executor-memory 500m
The command specifies only the executor memory and not the number of cores, so the client takes all the cores in the cluster and allocates 500 MB of memory on each node (see the sketch below for how to also cap the cores).
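As a sketch of how to also cap the cores (the --total-executor-cores option is listed in the spark-submit parameter reference in section 3.2; the value 2 here is only an illustration):
$ spark-shell --master spark://hadoop1:7077 --executor-memory 500m --total-executor-cores 2
Started this way, the shell would take only 2 of the cluster's 3 cores, leaving one core free for other applications.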
3. Testing Spark
3.1 Testing with spark-shell
Here we test the WordCount program that everyone knows from Hadoop. The MapReduce implementation of WordCount needs three parts (a map, a reduce, and a job), while in Spark it can be done in a single line. Here is how it is implemented:
3.1.1 Starting HDFS
$ cd /app/hadoop/hadoop-2.2.0/sbin
$ ./start-dfs.sh
Observe the startup with jps: the processes running on hadoop1 are NameNode, SecondaryNameNode, and DataNode.
The process running on hadoop2 and hadoop3 is DataNode.
3.1.2 Uploading data to HDFS
Upload the Hadoop configuration file core-site.xml to HDFS as a test file:
$ hadoop fs -mkdir -p /user/hadoop/testdata
$ hadoop fs -put /app/hadoop/hadoop-2.2.0/etc/hadoop/core-site.xml /user/hadoop/testdata
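A simple way to confirm that the upload succeeded is to list the test directory:
$ hadoop fs -ls /user/hadoop/testdata
The listing should show core-site.xml.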
3.1.3 Start Spark
$ cd /app/hadoop/spark-1.1.0/sbin
$ ./start-all.sh
3.1.4 Start Spark-shell
On the Spark client (here, the hadoop1 node), use spark-shell to connect to the cluster:
$ cd /app/hadoop/spark-1.1.0/bin
$ ./spark-shell --master spark://hadoop1:7077 --executor-memory 512m --driver-memory 500m
3.1.5 Running the WordCount script
The WordCount script is written in Scala; the following is a one-line implementation:
scala> sc.textFile("hdfs://hadoop1:9000/user/hadoop/testdata/core-site.xml").flatMap(_.split(" ")).map(x=>(x,1)).reduceByKey(_+_).map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).take(10)
To see the process more clearly, the same logic can also be run line by line:
scala> val rdd = sc.textFile("hdfs://hadoop1:9000/user/hadoop/testdata/core-site.xml")
scala> rdd.cache()
scala> val wordcount = rdd.flatMap(_.split(" ")).map(x=>(x,1)).reduceByKey(_+_)
scala> wordcount.take(10)
scala> val wordsort = wordcount.map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1))
scala> wordsort.take(10)
The word frequency statistic results are as follows:
Array[(String, Int)] = Array(("", ...), (the,7), (</property>,6), (<property>,6), (under,3), (in,3), (license,3), (this,2), (-->,2), (file.,2))
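As a small, optional extension of the line-by-line version (not part of the original walkthrough), the sorted result can also be written back to HDFS instead of only being sampled with take(10); the output path below is just an illustrative choice, and the directory must not already exist:
scala> wordsort.saveAsTextFile("hdfs://hadoop1:9000/user/hadoop/output/wordcount")
Each partition is written as a part-NNNNN file under that directory, which can then be inspected with hadoop fs -cat.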
3.1.6 Observing the run
Viewing the Spark run at http://hadoop1:8080, you can see that the cluster has 3 nodes, each with 1 core and 512 MB of memory, and that the client has been allocated all 3 cores with 512 MB of memory on each node.
Clicking the running application's ID shows that its tasks run on the hadoop2 and hadoop3 nodes and not on hadoop1, mainly because hadoop1 already carries a heavy memory load from the NameNode and the Spark client.
3.2 Testing with spark-submit
Starting with Spark 1.0.0, Spark provides an easy-to-use application deployment tool, bin/spark-submit, for quickly deploying Spark applications on local, Standalone, YARN, and Mesos. Its syntax and parameters are described below:
Usage: spark-submit [options] <app jar | python file> [app options]
Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local
  --deploy-mode DEPLOY_MODE   where the driver runs: client runs it on the local machine, cluster runs it inside the cluster
  --class CLASS_NAME          the main class of the application
  --name NAME                 the application name
  --jars JARS                 comma-separated list of local jars to place on the driver and executor classpaths
  --py-files PY_FILES         comma-separated list of .zip, .egg, and .py files to place on the PYTHONPATH for Python applications
  --files FILES               comma-separated list of files to be placed in the working directory of each executor
  --properties-file FILE      file from which to load application properties, defaults to conf/spark-defaults.conf
  --driver-memory MEM         driver memory size, default 512M
  --driver-java-options       extra Java options for the driver
  --driver-library-path       extra library path entries to pass to the driver
  --driver-class-path         driver classpath; jars added with --jars are automatically included
  --executor-memory MEM       executor memory size, default 1G
Spark standalone with cluster deploy mode only:
  --driver-cores NUM          number of cores used by the driver, default 1
  --supervise                 if set, the driver is restarted on failure
Spark standalone and Mesos only:
  --total-executor-cores NUM  total number of cores for all executors
YARN only:
  --executor-cores NUM        number of cores per executor, default 1
  --queue QUEUE_NAME          the YARN queue to submit to, default "default"
  --num-executors NUM         number of executors to launch, default 2
  --archives ARCHIVES         comma-separated list of archives to be extracted into the working directory of each executor
3.2.1 Running Script 1
This script runs one of Spark's own examples, which computes the value of pi. The command is as follows:
$ cd /app/hadoop/spark-1.1.0/bin
$ ./spark-submit --master spark://hadoop1:7077 --class org.apache.spark.examples.SparkPi --executor-memory 512m ../lib/spark-examples-1.1.0-hadoop2.2.0.jar 200
Parameter description (see the parameter reference above for details):
- --master: the master address, which can be mesos, spark, yarn, or local; here it is the Spark Standalone cluster at spark://hadoop1:7077
- --class: the main class of the application, here org.apache.spark.examples.SparkPi
- --executor-memory: the memory allocated to each executor, here 512M
- the jar to execute, here ../lib/spark-examples-1.1.0-hadoop2.2.0.jar
- the number of slices, here 200
3.2.2 Observing the run
Observation shows that the Spark cluster has 3 worker nodes and 1 running application, and that each worker node has 1 core and 512 MB of memory. Because the application does not specify a number of cores, it occupies all 3 cores of the cluster and allocates 512 MB of memory on each node.
Depending on each node's load, the number of executors run per node differs: hadoop1 ran 0 executors, while hadoop3 ran 10, of which 5 ended in the EXITED state and 5 in the KILLED state.
3.2.3 Running Script 2
This script runs the same Spark example that computes the value of pi; the difference from script 1 is that it specifies the total number of executor cores. The command is as follows:
$ cd /app/hadoop/spark-1.1.0/bin
$ ./spark-submit --master spark://hadoop1:7077 --class org.apache.spark.examples.SparkPi --executor-memory 512m --total-executor-cores 2 ../lib/spark-examples-1.1.0-hadoop2.2.0.jar 200
Parameter description (see the parameter reference above for details):
- --master: the master address, which can be mesos, spark, yarn, or local; here it is the Spark Standalone cluster at spark://hadoop1:7077
- --class: the main class of the application, here org.apache.spark.examples.SparkPi
- --executor-memory: the memory allocated to each executor, here 512M
- --total-executor-cores 2: the total number of cores allocated to the application
- the jar to execute, here ../lib/spark-examples-1.1.0-hadoop2.2.0.jar
- the number of slices, here 200
3.2.4 Observing the run
Observation shows that the Spark cluster has 3 worker nodes and 1 running application, and that each worker node has 1 core and 512 MB of memory. Because the application is limited to 2 cores, it uses only 2 of the cluster's cores.