Configuring a Spark 0.9 Cluster on CentOS 6.4
Spark is a fast, general-purpose cluster computing framework. Its core is written in Scala, and it provides high-level APIs for Scala, Java, and Python; with these APIs you can easily develop parallel processing applications.
Next, we will build a Spark cluster computing environment and run a simple verification job to get a feel for Spark. Both installing the runtime environment and writing processing programs (Spark's default shell accepts Scala code directly for data processing) are, in our view, much simpler than with the Hadoop MapReduce framework, and Spark interoperates well with HDFS (reading data from HDFS and writing data back to HDFS).
Installation and Configuration
- Download and install Scala
wget http://www.scala-lang.org/files/archive/scala-2.10.3.tgz
tar xvzf scala-2.10.3.tgz
Add the SCALA_HOME environment variable to ~/.bashrc and make it take effect:
export SCALA_HOME=/usr/scala/scala-2.10.3
export PATH=$PATH:$SCALA_HOME/bin
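To make the new variables take effect in the current shell and check that Scala is on the PATH, something like the following can be used (a minimal sketch):

source ~/.bashrc
scala -version   # should report something like "Scala code runner version 2.10.3"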
- Download and install Spark
First, we configure Spark on the master node m1, and then copy and distribute the configured program files to each slave node in the cluster. Download and unpack:
wget http://d3kbcqa49mib13.cloudfront.net/spark-0.9.0-incubating-bin-hadoop1.tgz
tar xvzf spark-0.9.0-incubating-bin-hadoop1.tgz
Add the SPARK_HOME environment variable to ~/.bashrc and make it take effect:
export SPARK_HOME=/home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1
export PATH=$PATH:$SPARK_HOME/bin
Configure Spark on m1 by editing the spark-env.sh configuration file:
cd /home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/conf
cp spark-env.sh.template spark-env.sh
In this script, set SCALA_HOME to the Scala installation path, for example:
export SCALA_HOME=/usr/scala/scala-2.10.3
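Besides SCALA_HOME, spark-env.sh also accepts other standalone-mode settings in Spark 0.9, such as SPARK_MASTER_IP and SPARK_WORKER_MEMORY. The values below are only illustrative examples, not part of the original setup:

export SCALA_HOME=/usr/scala/scala-2.10.3
export JAVA_HOME=/usr/java/jdk1.7.0_45   # illustrative path; point it at your local JDK
export SPARK_MASTER_IP=m1                # host the standalone Master binds to
export SPARK_WORKER_MEMORY=2g            # memory each Worker may hand out to executors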
Modify the conf/slaves file and add the hostnames of the worker nodes, one per line, for example:
s1
s2
s3
Finally, copy and distribute the Spark program files and configuration files to the slave nodes:
scp -r ~/cloud/programs/spark-0.9.0-incubating-bin-hadoop1 shirdrn@s1:~/cloud/programs/
scp -r ~/cloud/programs/spark-0.9.0-incubating-bin-hadoop1 shirdrn@s2:~/cloud/programs/
scp -r ~/cloud/programs/spark-0.9.0-incubating-bin-hadoop1 shirdrn@s3:~/cloud/programs/
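Note that sbin/start-all.sh (used below) launches the Worker on each host listed in conf/slaves over SSH, so password-less SSH from m1 to every slave is required, and it also makes the scp commands above painless. If it is not configured yet, a typical setup looks like this:

ssh-keygen -t rsa      # accept the defaults; creates ~/.ssh/id_rsa and id_rsa.pub
ssh-copy-id shirdrn@s1
ssh-copy-id shirdrn@s2
ssh-copy-id shirdrn@s3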
Start the Spark Cluster
We will use data stored on HDFS as the input for the computation, so a Hadoop cluster must first be installed, configured, and started successfully; here we use Hadoop 1.2.1. Starting the Spark computing cluster is very easy, just execute the following commands:
cd /home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/
sbin/start-all.sh
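The standalone Master also serves a web UI, by default on port 8080 in Spark 0.9, that lists the registered Workers; it is a quick way to confirm the cluster came up. A small sketch, assuming the default port, and the command to shut the cluster down again later:

# check the Master web UI (default port 8080) from the command line or a browser
curl -s http://m1:8080 | head
# stop the whole Spark cluster
sbin/stop-all.sh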
We can see that a process named Master has been started on m1 and a process named Worker has been started on s1, as shown below (I also started the Hadoop cluster here, so its processes appear as well):
On the master node m1:
54968 SecondaryNameNode
55651 Master
55087 JobTracker
54814 NameNode

On the slave node s1:

33592 Worker
33442 TaskTracker
33336 DataNode
You can also check the logs to verify whether each process started successfully. For example:
On the master node:
tail -100f $SPARK_HOME/logs/spark-shirdrn-org.apache.spark.deploy.master.Master-1-m1.out

On the slave node:
tail -100f $SPARK_HOME/logs/spark-shirdrn-org.apache.spark.deploy.worker.Worker-1-s1.out
Spark Cluster Computing Verification
We use the access log file of my website as the example data; here are two sample lines:
27.159.254.192 - - [21/Feb/2014:11:40:46 +0800] "GET /archives/526.html HTTP/1.1" 200 12080 "http://shiyanjun.cn/archives/526.html" "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
120.43.4.206 - - [21/Feb/2014:10:37:37 +0800] "GET /archives/417.html HTTP/1.1" 200 11464 "http://shiyanjun.cn/archives/417.html/" "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
We calculate the frequency of each IP address in the file to verify that the Spark cluster computes correctly: the log file is read from HDFS, the IP address frequencies are counted, and the result is saved to a specified directory on HDFS.
First, start the Spark shell, which is used to submit the computing task:
bin/spark-shell
Only Scala code can be used inside the Spark shell.
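Note that without further configuration bin/spark-shell may run jobs in local mode (the log output below mentions executor localhost). In Spark 0.9 the shell can be attached to the standalone cluster by setting the MASTER environment variable before starting it; assuming the default master port 7077, that looks like:

MASTER=spark://m1:7077 bin/spark-shell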
Then, run the IP address frequency count by executing the following code in the Spark shell:
val file = sc.textFile("hdfs://m1:9000/user/shirdrn/wwwlog20140222.log")
val result = file.flatMap(line => line.split("\\s+.*")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
The file hdfs://m1:9000/user/shirdrn/wwwlog20140222.log above is the input log file; splitting each line on the pattern "\\s+.*" keeps only the leading IP address, which is then counted. The log output of the processing looks like this, for example:
14/03/06 21:59:22 INFO MemoryStore: ensureFreeSpace(784) called with curMem=43296, maxMem=311387750
14/03/06 21:59:22 INFO MemoryStore: Block broadcast_11 stored as values to memory (estimated size 784.0 B, free 296.9 MB)
14/03/06 21:59:22 INFO FileInputFormat: Total input paths to process : 1
14/03/06 21:59:22 INFO SparkContext: Starting job: collect at <console>:13
14/03/06 21:59:22 INFO DAGScheduler: Registering RDD 84 (reduceByKey at <console>:13)
14/03/06 21:59:22 INFO DAGScheduler: Got job 10 (collect at <console>:13) with 1 output partitions (allowLocal=false)
14/03/06 21:59:22 INFO DAGScheduler: Final stage: Stage 20 (collect at <console>:13)
14/03/06 21:59:22 INFO DAGScheduler: Parents of final stage: List(Stage 21)
14/03/06 21:59:22 INFO DAGScheduler: Missing parents: List(Stage 21)
14/03/06 21:59:22 INFO DAGScheduler: Submitting Stage 21 (MapPartitionsRDD[84] at reduceByKey at <console>:13), which has no missing parents
14/03/06 21:59:22 INFO DAGScheduler: Submitting 1 missing tasks from Stage 21 (MapPartitionsRDD[84] at reduceByKey at <console>:13)
14/03/06 21:59:22 INFO TaskSchedulerImpl: Adding task set 21.0 with 1 tasks
14/03/06 21:59:22 INFO TaskSetManager: Starting task 21.0:0 as TID 19 on executor localhost: localhost (PROCESS_LOCAL)
14/03/06 21:59:22 INFO TaskSetManager: Serialized task 21.0:0 as 1941 bytes in 0 ms
14/03/06 21:59:22 INFO Executor: Running task ID 19
14/03/06 21:59:22 INFO BlockManager: Found block broadcast_11 locally
14/03/06 21:59:22 INFO HadoopRDD: Input split: hdfs://m1:9000/user/shirdrn/wwwlog20140222.log:0+4179514
14/03/06 21:59:23 INFO Executor: Serialized size of result for 19 is 738
14/03/06 21:59:23 INFO Executor: Sending result for 19 directly to driver
14/03/06 21:59:23 INFO TaskSetManager: Finished TID 19 in 211 ms on localhost (progress: 0/1)
14/03/06 21:59:23 INFO TaskSchedulerImpl: Remove TaskSet 21.0 from pool
14/03/06 21:59:23 INFO DAGScheduler: Completed ShuffleMapTask(21, 0)
14/03/06 21:59:23 INFO DAGScheduler: Stage 21 (reduceByKey at <console>:13) finished in 0.211 s
14/03/06 21:59:23 INFO DAGScheduler: looking for newly runnable stages
14/03/06 21:59:23 INFO DAGScheduler: running: Set()
14/03/06 21:59:23 INFO DAGScheduler: waiting: Set(Stage 20)
14/03/06 21:59:23 INFO DAGScheduler: failed: Set()
14/03/06 21:59:23 INFO DAGScheduler: Missing parents for Stage 20: List()
14/03/06 21:59:23 INFO DAGScheduler: Submitting Stage 20 (MapPartitionsRDD[86] at reduceByKey at <console>:13), which is now runnable
14/03/06 21:59:23 INFO DAGScheduler: Submitting 1 missing tasks from Stage 20 (MapPartitionsRDD[86] at reduceByKey at <console>:13)
14/03/06 21:59:23 INFO TaskSchedulerImpl: Adding task set 20.0 with 1 tasks
14/03/06 21:59:23 INFO Executor: Finished task ID 19
14/03/06 21:59:23 INFO TaskSetManager: Starting task 20.0:0 as TID 20 on executor localhost: localhost (PROCESS_LOCAL)
14/03/06 21:59:23 INFO TaskSetManager: Serialized task 20.0:0 as 1803 bytes in 0 ms
14/03/06 21:59:23 INFO Executor: Running task ID 20
14/03/06 21:59:23 INFO BlockManager: Found block broadcast_11 locally
14/03/06 21:59:23 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-zero-bytes blocks out of 1 blocks
14/03/06 21:59:23 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote gets in 1 ms
14/03/06 21:59:23 INFO Executor: Serialized size of result for 20 is 19423
14/03/06 21:59:23 INFO Executor: Sending result for 20 directly to driver
14/03/06 21:59:23 INFO TaskSetManager: Finished TID 20 in 17 ms on localhost (progress: 0/1)
14/03/06 21:59:23 INFO TaskSchedulerImpl: Remove TaskSet 20.0 from pool
14/03/06 21:59:23 INFO DAGScheduler: Completed ResultTask(20, 0)
14/03/06 21:59:23 INFO DAGScheduler: Stage 20 (collect at <console>:13) finished in 0.016 s
14/03/06 21:59:23 INFO SparkContext: Job finished: collect at <console>:13, took 0.242136929 s
14/03/06 21:59:23 INFO Executor: Finished task ID 20
res14: Array[(String, Int)] = Array((27.159.254.192,28), (120.43.9.81,40), (120.43.4.206,16), (120.37.242.176,56), (64.31.25.60,2), (27.153.161.9,32), (202.43.145.163,24), (61.187.102.6,1), (117.26.195.116,12), (27.153.186.194,64), (123.125.71.91,1), (110.85.106.105,64), (110.86.184.182,36), (27.150.247.36,52), (110.86.166.52,60), (175.98.162.2,20), (61.136.166.16,1), (46.105.105.217,1), (27.150.223.49,52), (112.5.252.6,20), (121.205.242.4,76), (183.61.174.211,3), (27.153.230.35,36), (112.111.172.96,40), (112.5.234.157,3), (144.76.95.232,7), (31.204.154.144,28), (123.125.71.22,1), (80.82.64.118,3), (27.153.248.188,160), (112.5.252.187,40), (221.219.105.71,4), (74.82.169.79,19), (117.26.253.195,32), (120.33.244.205,152), (110.86.165.8,84), (117.26.86.172,136), (27.153.233.101,8), (123.12...
As you can see, the results of the map and reduce computation are output (the array above is what the shell prints when the result RDD is collected).
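If you want to see, say, the busiest client IPs, the (ip, count) pairs can be sorted by count directly in the shell. A minimal sketch, assuming the result RDD from above (the limit of 10 is arbitrary):

// swap (ip, count) into (count, ip), sort by count descending, and keep the top 10
val top10 = result.map(_.swap).sortByKey(false).take(10)
top10.foreach(println)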
Finally, to save the result to HDFS, just enter the following code:
result.saveAsTextFile("hdfs://m1:9000/user/shirdrn/wwwlog20140222.log.result")
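saveAsTextFile writes one part-NNNNN file per partition of the RDD. If a single output file is preferred, the RDD can be coalesced to one partition before saving; a minimal sketch (the -single output path is just an illustration):

// merge all partitions so only one part-00000 file is written
result.coalesce(1).saveAsTextFile("hdfs://m1:9000/user/shirdrn/wwwlog20140222.log.result-single")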
View the result data on HDFS:
[shirdrn@m1 ~]$ hadoop fs -cat /user/shirdrn/wwwlog20140222.log.result/part-00000 | head -5
(27.159.254.192,28)
(120.43.9.81,40)
(120.43.4.206,16)
(120.37.242.176,56)
(64.31.25.60,2)
Reference links
- http://spark.incubator.apache.org/examples.html
- http://spark.apache.org/docs/latest/
- http://spark.apache.org/releases/spark-release-0-9-0.html
- http://spark.apache.org/docs/latest/cluster-overview.html