Spark is a MapReduce-like computing framework developed by UC Berkeley AMPLab. The MapReduce framework is suitable for batch jobs, but it has inherent constraints: first, job scheduling relies on pull-based heartbeats; second, all shuffle intermediate results are written to disk. Together these cause high latency and large start-up overhead. Spark, by contrast, was built for iterative and interactive computing. First, it uses Akka, an Actor-model library, as its communication framework. Second, it uses RDDs as distributed in-memory storage: data does not need to be dumped to disk between operations, but flows through RDD partitions distributed across the memory of each node, which greatly speeds up data movement between stages. At the same time, each RDD maintains its lineage, so if an RDD partition is lost, it can be rebuilt dynamically from its parent RDDs, which guarantees fault tolerance. There is also a rich set of applications built on Spark, such as Shark, Spark Streaming, and MLbase. We have used Shark in our production environment as a supplement to Hive: it shares the Hive metastore and SerDes and is used in much the same way as Hive, and when the input data size is not very large, the same statement often runs much faster than in Hive. A separate follow-up article will cover this in detail.
[Figure: Spark software stack]
This article describes our Spark installation, as follows:
Spark can run on a unified resource scheduler such as YARN or Mesos, and it can also be deployed independently in standalone mode. Since our YARN cluster was not ready yet, we used standalone mode for the time being. It is a master/slave architecture, consisting of one Spark master and a set of Spark workers. Standalone mode only supports a FIFO scheduling policy, and by default a submitted job takes all the cores of the Spark cluster, so the cluster can only run one job at a time; set spark.cores.max to adjust this, as sketched below.
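A minimal sketch: in Spark 0.7.x, cluster-wide settings such as spark.cores.max are passed as Java system properties, for example via SPARK_JAVA_OPTS in conf/spark-env.sh (the value 8 here is only an illustration):
# Cap each application at 8 cores cluster-wide (illustrative value)
export SPARK_JAVA_OPTS="-Dspark.cores.max=8"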
Deployment environment:
Spark Master: test85.hadoop
Spark Workers: test88.hadoop, test89.hadoop, test90.hadoop, test91.hadoop
Prerequisites:
1. Ensure the master and worker nodes can SSH to each other.
2. Spark uses the Hadoop client to interact with HDFS, so the Hadoop client needs to be installed on every node.
3. Install Scala. Scala 2.10.2 conflicts with Spark 0.7.3, so only Scala 2.9.3 can be used (see the sketch below).
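A minimal sketch of installing Scala 2.9.3 (assuming /usr/local as the install prefix, matching the paths used below; the download URL may have moved since this was written):
wget http://www.scala-lang.org/downloads/distrib/files/scala-2.9.3.tgz
tar xzvf scala-2.9.3.tgz -C /usr/local
ln -s /usr/local/scala-2.9.3 /usr/local/scala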
wget http://spark-project.org/download/spark-0.7.3-prebuilt-hadoop1.tgz
tar xzvf spark-0.7.3-prebuilt-hadoop1.tgz
ln -s spark-0.7.3 spark-release
Add environment variables to /etc/profile:
export SPARK_HOME=/usr/local/spark-release
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SPARK_HOME/bin:$SCALA_HOME/bin
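Then apply the changes to the current shell, for example:
source /etc/profile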
Set up the Spark configuration file in $SPARK_HOME/conf/spark-env.sh:
export JAVA_HOME=/usr/local/jdk
export SCALA_HOME=/usr/local/scala
export SPARK_EXAMPLES_JAR=$SPARK_HOME/examples/target/scala-2.9.3/spark-examples_2.9.3-0.7.3.jar
export SPARK_SSH_OPTS="-p58422 -o StrictHostKeyChecking=no"
export SPARK_MASTER_IP=test85.hadoop
export SPARK_MASTER_WEBUI_PORT=8088
export SPARK_WORKER_WEBUI_PORT=8099
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=8g
export LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib
export SPARK_LIBRARY_PATH=/usr/local/hadoop/hadoop-release/lib/native/linux-amd64-64
SPARK_WORKER_CORES is set to the number of physical CPU cores on the worker; SPARK_WORKER_MEMORY is the total amount of physical memory available to Spark jobs on the worker node. You can check the appropriate values on each worker as shown below.
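For example (assuming standard Linux tools are available on the workers):
nproc    # number of CPU cores
free -g  # total physical memory in GB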
Add the worker addresses to the conf/slaves file:
# A Spark Worker will be started on each of the machines listed below
test88.hadoop
test89.hadoop
test90.hadoop
test91.hadoop
Synchronize the configuration files, Spark, and Scala to the entire cluster, for example as sketched below.
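A minimal sketch (assuming passwordless SSH on port 58422, as configured in SPARK_SSH_OPTS above):
for host in test88.hadoop test89.hadoop test90.hadoop test91.hadoop; do
  rsync -az -e "ssh -p 58422" /usr/local/spark-0.7.3/ $host:/usr/local/spark-0.7.3/
  rsync -az -e "ssh -p 58422" /usr/local/scala/ $host:/usr/local/scala/
  rsync -az -e "ssh -p 58422" /etc/profile $host:/etc/profile
  # recreate the spark-release symlink on each worker
  ssh -p 58422 $host "ln -sfn /usr/local/spark-0.7.3 /usr/local/spark-release"
done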
Start the Spark master: bin/start-master.sh
13/09/23 09:46:57 INFO Slf4jEventHandler: Slf4jEventHandler started
13/09/23 09:46:57 INFO ActorSystemImpl: RemoteServerStarted@akka://sparkMaster@test85.hadoop:7077
13/09/23 09:46:57 INFO Master: Starting Spark master at spark://test85.hadoop:7077
13/09/23 09:46:57 INFO IoWorker: IoWorker thread 'spray-io-worker-0' started
13/09/23 09:46:57 INFO HttpServer: akka://sparkMaster/user/HttpServer started on /0.0.0.0:8088
Start the Spark workers: bin/start-slaves.sh
13/09/23 09:47:54 INFO Slf4jEventHandler: Slf4jEventHandler started
13/09/23 09:47:55 INFO ActorSystemImpl: RemoteServerStarted@akka://sparkWorker@test89.hadoop:36665
13/09/23 09:47:55 INFO Worker: Starting Spark worker test89.hadoop:36665 with 4 cores, 8.0 GB RAM
13/09/23 09:47:55 INFO Worker: Spark home: /usr/local/spark-0.7.3
13/09/23 09:47:55 INFO Worker: Connecting to master spark://test85.hadoop:7077
13/09/23 09:47:56 INFO ActorSystemImpl: RemoteClientStarted@akka://sparkMaster@test85.hadoop:7077
13/09/23 09:47:57 INFO IoWorker: IoWorker thread 'spray-io-worker-0' started
13/09/23 09:47:58 INFO HttpServer: akka://sparkWorker/user/HttpServer started on /0.0.0.0:8099
13/09/23 09:47:58 INFO Worker: Successfully registered with master
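Optionally, verify that the daemons are up (assuming a JDK on the PATH of each node):
jps   # should list "Master" on test85.hadoop and "Worker" on each slave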
Run the example job that computes Pi: ./run spark.examples.SparkPi spark://test85.hadoop:7077
[hadoop@test85 spark-release]$ ./run spark.examples.SparkPi spark://test85.hadoop:7077
13/09/23 10:15:59 INFO Slf4jEventHandler: Slf4jEventHandler started
13/09/23 10:15:59 INFO SparkEnv: Registering BlockManagerMaster
13/09/23 10:15:59 INFO MemoryStore: MemoryStore started with capacity 323.9 MB.
13/09/23 10:15:59 INFO DiskStore: Created local directory at /tmp/spark-local-20130923101559-6a72
13/09/23 10:15:59 INFO ConnectionManager: Bound socket to port 54795 with id = ConnectionManagerId(test85.hadoop,54795)
13/09/23 10:15:59 INFO BlockManagerMaster: Trying to register BlockManager
13/09/23 10:15:59 INFO BlockManagerMaster: Registered BlockManager
13/09/23 10:15:59 INFO HttpBroadcast: Broadcast server started at http://10.1.77.85:58290
13/09/23 10:15:59 INFO SparkEnv: Registering MapOutputTracker
13/09/23 10:15:59 INFO HttpFileServer: HTTP File server directory is /tmp/spark-22ef9d2b-0e57-42e2-ae90-a9cd99233c1c
13/09/23 10:16:00 INFO IoWorker: IoWorker thread 'spray-io-worker-0' started
13/09/23 10:16:00 INFO HttpServer: akka://spark/user/BlockManagerHTTPServer started on /0.0.0.0:46611
13/09/23 10:16:00 INFO BlockManagerUI: Started BlockManager web UI at http://test85.hadoop:46611
13/09/23 10:16:00 INFO SparkContext: Added JAR /usr/local/spark-release/examples/target/scala-2.9.3/spark-examples_2.9.3-0.7.3.jar at http://10.1.77.85:51299/jars/spark-examples_2.9.3-0.7.3.jar with timestamp 1379902560222
13/09/23 10:16:00 INFO Client$ClientActor: Connecting to master spark://test85.hadoop:7077
13/09/23 10:16:00 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20130923101600-0000
13/09/23 10:16:00 INFO Client$ClientActor: Executor added: app-20130923101600-0000/0 on worker-20130923094755-test89.hadoop-36665 (test89.hadoop) with 4 cores
13/09/23 10:16:00 INFO SparkDeploySchedulerBackend: Granted executor ID app-20130923101600-0000/0 on host test89.hadoop with 4 cores, 512.0 MB RAM
13/09/23 10:16:00 INFO Client$ClientActor: Executor added: app-20130923101600-0000/1 on worker-20130923094752-test90.hadoop-39876 (test90.hadoop) with 4 cores
13/09/23 10:16:00 INFO SparkDeploySchedulerBackend: Granted executor ID app-20130923101600-0000/1 on host test90.hadoop with 4 cores, 512.0 MB RAM
13/09/23 10:16:00 INFO Client$ClientActor: Executor added: app-20130923101600-0000/2 on worker-20130923094751-test91.hadoop-53527 (test91.hadoop) with 4 cores
13/09/23 10:16:00 INFO SparkDeploySchedulerBackend: Granted executor ID app-20130923101600-0000/2 on host test91.hadoop with 4 cores, 512.0 MB RAM
13/09/23 10:16:00 INFO Client$ClientActor: Executor added: app-20130923101600-0000/3 on worker-20130923094752-test88.hadoop-43591 (test88.hadoop) with 4 cores
13/09/23 10:16:00 INFO SparkDeploySchedulerBackend: Granted executor ID app-20130923101600-0000/3 on host test88.hadoop with 4 cores, 512.0 MB RAM
13/09/23 10:16:00 INFO SparkContext: Starting job: reduce at SparkPi.scala:22
13/09/23 10:16:00 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:22) with 2 output partitions (allowLocal=false)
13/09/23 10:16:00 INFO DAGScheduler: Final stage: Stage 0 (map at SparkPi.scala:18)
13/09/23 10:16:00 INFO DAGScheduler: Parents of final stage: List()
13/09/23 10:16:00 INFO DAGScheduler: Missing parents: List()
13/09/23 10:16:00 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at SparkPi.scala:18), which has no missing parents
13/09/23 10:16:00 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[1] at map at SparkPi.scala:18)
13/09/23 10:16:00 INFO ClusterScheduler: Adding task set 0.0 with 2 tasks
13/09/23 10:16:00 INFO Client$ClientActor: Executor updated: app-20130923101600-0000/2 is now RUNNING
13/09/23 10:16:00 INFO Client$ClientActor: Executor updated: app-20130923101600-0000/3 is now RUNNING
13/09/23 10:16:00 INFO Client$ClientActor: Executor updated: app-20130923101600-0000/1 is now RUNNING
13/09/23 10:16:00 INFO Client$ClientActor: Executor updated: app-20130923101600-0000/0 is now RUNNING
13/09/23 10:16:02 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka://sparkExecutor@test90.hadoop:44054/user/Executor] with ID 1
13/09/23 10:16:02 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 1: test90.hadoop (preferred)
13/09/23 10:16:02 INFO TaskSetManager: Serialized task 0.0:0 as 1339 bytes in ms
13/09/23 10:16:02 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor 1: test90.hadoop (preferred)
13/09/23 10:16:02 INFO TaskSetManager: Serialized task 0.0:1 as 1339 bytes in 1 ms
13/09/23 10:16:02 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka://sparkExecutor@test91.hadoop:34433/user/Executor] with ID 2
13/09/23 10:16:02 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka://sparkExecutor@test89.hadoop:53079/user/Executor] with ID 0
13/09/23 10:16:02 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager test91.hadoop:49214 with 323.9 MB RAM
13/09/23 10:16:02 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager test90.hadoop:33628 with 323.9 MB RAM
13/09/23 10:16:02 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka://sparkExecutor@test88.hadoop:38074/user/Executor] with ID 3
13/09/23 10:16:03 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager test88.hadoop:55313 with 323.9 MB RAM
13/09/23 10:16:03 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager test89.hadoop:37899 with 323.9 MB RAM
13/09/23 10:16:03 INFO TaskSetManager: Finished TID 1 in 1128 ms (progress: 1/2)
13/09/23 10:16:03 INFO TaskSetManager: Finished TID 0 in 1175 ms (progress: 2/2)
13/09/23 10:16:03 INFO DAGScheduler: Completed ResultTask(0, 1)
13/09/23 10:16:03 INFO DAGScheduler: Completed ResultTask(0, 0)
13/09/23 10:16:03 INFO DAGScheduler: Stage 0 (map at SparkPi.scala:18) finished in 2.939 s
13/09/23 10:16:03 INFO SparkContext: Job finished: reduce at SparkPi.scala:22, took 2.976491771 s
Pi is roughly 3.14368
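SparkPi also accepts an optional second argument for the number of slices (partitions) to split the work into; the run above used the default of 2, and more slices give a more accurate estimate, for example:
./run spark.examples.SparkPi spark://test85.hadoop:7077 100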
References:
http://spark.incubator.apache.org/docs/latest/spark-standalone.html
http://rdc.taobao.com/team/jm/archives/1823
http://zhuguangbin.github.io/blog/2013/07/16/spark-deploy/
Original article: http://blog.csdn.net/lalaguozhe/article/details/11921493. Please credit the source when reposting.