Spark is a MapReduce-like computing framework developed by UC Berkeley's AMPLab. The MapReduce framework is well suited to batch jobs, but it has inherent constraints: first, pull-based heartbeat job scheduling; second, all shuffle intermediate results are written to disk. The result is high latency and large start-up overhead. Spark, by contrast, was built for iterative and interactive computation. First, it uses the actor model (Akka) as its communication framework. Second, it uses RDDs as distributed in-memory datasets: data does not need to be dumped to disk between operations, but instead flows through RDD partitions held in the memory of each node, which greatly speeds up data movement between stages. At the same time, each RDD maintains its lineage, so if an RDD partition is lost it can be reconstructed automatically from its parent RDDs, which guarantees fault tolerance.

On top of Spark there is a rich set of applications, for example Shark, Spark Streaming, and MLbase. We have used Shark in our production environment as a supplement to Hive. It shares the Hive metastore and SerDes and is used in much the same way as Hive; when the input data is not very large, the same statement often runs much faster than in Hive. A follow-up article will cover it in detail.
[Figure: Spark software stack]
This article describes the following Spark installation.
Spark can run on a unified resource scheduler such as YARN or Mesos, and it can also be deployed independently in standalone mode. Because our YARN cluster is not ready yet, we use standalone mode for now. It is a master/slave architecture made up of one Spark master and a set of Spark workers. Standalone mode only supports a FIFO scheduling policy, and by default a submitted job takes all of the cores in the Spark cluster, so the cluster can run only one job at a time. Set the spark.cores.max property to adjust this, as sketched below.
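A minimal sketch of capping the cores per application, assuming Spark 0.7.x picks up -D system properties passed through SPARK_JAVA_OPTS; the value 8 is only an example:

# In conf/spark-env.sh (or exported before launching a job); 8 is an illustrative cap
export SPARK_JAVA_OPTS="-Dspark.cores.max=8"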
Deployment environment:
Spark Master: test85.hadoop
Spark Workers: test88.hadoop, test89.hadoop, test90.hadoop, test91.hadoop
1. Ensure the master can SSH to every worker node (passwordless).
2. Because Spark uses the Hadoop client to interact with HDFS, the Hadoop client must be installed on every node.
3. Install Scala. Scala 2.10.2 conflicts with this version of Spark, so only Scala 2.9.3 can be installed. (A sketch of these prerequisite steps follows the list.)
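A minimal sketch of these prerequisite steps, run on the master; the SSH port 58422 matches SPARK_SSH_OPTS below, while the Scala download URL is an assumption, so substitute your own mirror if needed:

# 1. Passwordless SSH from the master to each worker (SSH listens on port 58422 in this cluster)
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for host in test88.hadoop test89.hadoop test90.hadoop test91.hadoop; do
    cat ~/.ssh/id_rsa.pub | ssh -p 58422 "$host" 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
done

# 2. Verify the Hadoop client can reach HDFS on every node
hadoop fs -ls /

# 3. Install Scala 2.9.3 into /usr/local/scala (download URL is an assumption)
wget http://www.scala-lang.org/files/archive/scala-2.9.3.tgz
tar xzvf scala-2.9.3.tgz
mv scala-2.9.3 /usr/local/scala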
wget http://spark-project.org/download/spark-0.7.3-prebuilt-hadoop1.tgz
tar xzvf spark-0.7.3-prebuilt-hadoop1.tgz
ln -s spark-0.7.3 spark-release
Add environment variables to /etc/profile:
export SPARK_HOME=/usr/local/spark-release
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SPARK_HOME/bin:$SCALA_HOME/bin
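To have the new variables take effect in the current shell:

source /etc/profile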
Set up the Spark configuration file in $SPARK_HOME/conf/spark-env.sh:
export JAVA_HOME=/usr/local/jdk
export SCALA_HOME=/usr/local/scala
export SPARK_EXAMPLES_JAR=$SPARK_HOME/examples/target/scala-2.9.3/spark-examples_2.9.3-0.7.3.jar
export SPARK_SSH_OPTS="-p 58422 -o StrictHostKeyChecking=no"
export SPARK_MASTER_IP=test85.hadoop
export SPARK_MASTER_WEBUI_PORT=8088
export SPARK_WORKER_WEBUI_PORT=8099
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=8g
export LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib
export SPARK_LIBRARY_PATH=/usr/local/hadoop/hadoop-release/lib/native/linux-amd64-64
SPARK_WORKER_CORES is set to the number of physical CPU cores on the worker; SPARK_WORKER_MEMORY is the total amount of physical memory available to Spark jobs on the worker node.
Add the worker addresses to the slaves file ($SPARK_HOME/conf/slaves):
# A Spark Worker will be started on each of the machines listed below
test88.hadoop
test89.hadoop
test90.hadoop
test91.hadoop
Synchronize the configuration files, Spark, and Scala to the entire cluster.
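A minimal sketch of the synchronization, assuming the same /usr/local paths on every node and rsync over the non-standard SSH port used above:

for host in test88.hadoop test89.hadoop test90.hadoop test91.hadoop; do
    rsync -az -e "ssh -p 58422" /usr/local/spark-0.7.3/ "$host":/usr/local/spark-0.7.3/
    rsync -az -e "ssh -p 58422" /usr/local/scala/ "$host":/usr/local/scala/
    rsync -az -e "ssh -p 58422" /etc/profile "$host":/etc/profile
    ssh -p 58422 "$host" "ln -sf /usr/local/spark-0.7.3 /usr/local/spark-release"
done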
Start the Spark master: bin/start-master.sh
13/09/23 09:46:57 INFO Slf4jEventHandler: Slf4jEventHandler started
13/09/23 09:46:57 INFO ActorSystemImpl: RemoteServerStarted@akka://sparkMaster@test85.hadoop:7077
13/09/23 09:46:57 INFO Master: Starting Spark master at spark://test85.hadoop:7077
13/09/23 09:46:57 INFO IoWorker: IoWorker thread 'spray-io-worker-0' started
13/09/23 09:46:57 INFO HttpServer: akka://sparkMaster/user/HttpServer started on /0.0.0.0:8088
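With the master running, the workers can then be started from the master node; a sketch assuming the standalone launch script shipped with Spark 0.7.x, which reads conf/slaves:

bin/start-slaves.sh

The registered workers should then show up on the master web UI at http://test85.hadoop:8088.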