Configuring a Spark 0.9 Cluster on CentOS 6.4


Spark is a fast, general-purpose cluster computing framework. Its core is written in Scala, and it provides high-level APIs for Scala, Java, and Python that make it easy to develop parallel data-processing applications.
Next, we will build a Spark cluster computing environment and run a simple verification job to get a feel for Spark. Both installing the runtime environment and writing the processing program (Spark's default shell accepts Scala code directly for data processing) are, in our view, much simpler than with the Hadoop MapReduce framework, and Spark interoperates well with HDFS (it can read data from and write data to HDFS).

Installation and configuration

  • Download and install Scala
wget http://www.scala-lang.org/files/archive/scala-2.10.3.tgz
tar xvzf scala-2.10.3.tgz

Add the SCALA_HOME environment variable to ~/.bashrc and make it take effect:

export SCALA_HOME=/usr/scala/scala-2.10.3
export PATH=$PATH:$SCALA_HOME/bin
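
To apply the change in the current shell and confirm the installation, a quick check like the following can be used (assuming Scala was unpacked to the /usr/scala/scala-2.10.3 path referenced above):

source ~/.bashrc
scala -version    # should report Scala 2.10.3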
  • Download and install Spark

First, we configure Spark on the master node m1, then copy the configured program and configuration files to each slave node in the cluster. Download and unpack the release:

wget http://d3kbcqa49mib13.cloudfront.net/spark-0.9.0-incubating-bin-hadoop1.tgz
tar xvzf spark-0.9.0-incubating-bin-hadoop1.tgz

Add the SPARK_HOME environment variable to ~/.bashrc and make it take effect:

export SPARK_HOME=/home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1
export PATH=$PATH:$SPARK_HOME/bin

Configure Spark on m1 by creating and editing the spark-env.sh configuration file:

cd /home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/conf
cp spark-env.sh.template spark-env.sh

In this script, set SCALA_HOME to the Scala installation path, for example:

export SCALA_HOME=/usr/scala/scala-2.10.3
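
spark-env.sh can also carry other standalone-mode settings. The lines below are an optional, illustrative sketch; the values are assumptions to adjust for your hardware, not part of the original setup:

export SPARK_MASTER_IP=m1        # host the standalone Master binds to
export SPARK_WORKER_CORES=2      # CPU cores each Worker may use (example value)
export SPARK_WORKER_MEMORY=1g    # memory each Worker may use (example value)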

Edit the conf/slaves file and add the hostname of each worker node, one per line, for example:

s1
s2
s3

Finally, copy and distribute Spark program files and configuration files to slave nodes:

scp -r ~/cloud/programs/spark-0.9.0-incubating-bin-hadoop1 shirdrn@s1:~/cloud/programs/
scp -r ~/cloud/programs/spark-0.9.0-incubating-bin-hadoop1 shirdrn@s2:~/cloud/programs/
scp -r ~/cloud/programs/spark-0.9.0-incubating-bin-hadoop1 shirdrn@s3:~/cloud/programs/
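
With more slave nodes, a small shell loop avoids repeating the command; a minimal sketch, assuming the same user name and directory layout on every node:

for host in s1 s2 s3; do
  scp -r ~/cloud/programs/spark-0.9.0-incubating-bin-hadoop1 shirdrn@${host}:~/cloud/programs/
done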

Start a Spark Cluster

We will use data stored on HDFS as the input for the computation, so a Hadoop cluster (Hadoop 1.2.1 here) must already be installed, configured, and running. Starting the Spark cluster itself is very easy; just execute the following commands:

cd /home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/
sbin/start-all.sh
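
To shut the Spark cluster down again later, the matching script under sbin/ can be used:

sbin/stop-all.sh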

After start-up, a process named Master runs on m1 and a process named Worker runs on each slave node such as s1, as shown below. The Hadoop cluster was also started here, so its processes appear as well.
On the master node m1:

54968 SecondaryNameNode
55651 Master
55087 JobTracker
54814 NameNode

On slave node s1:
33592 Worker
33442 TaskTracker
33336 DataNode
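
These listings come from jps. To check all nodes in one pass, a small loop over SSH can be used (a sketch assuming passwordless SSH between the nodes); the standalone Master also serves a web UI, by default at http://m1:8080, that lists the registered workers:

for host in m1 s1 s2 s3; do
  echo "== ${host} =="
  ssh ${host} jps
done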

You can also check the logs to confirm that each process started successfully, for example:

# On the master node:
tail -100f $SPARK_HOME/logs/spark-shirdrn-org.apache.spark.deploy.master.Master-1-m1.out
# On a slave node:
tail -100f $SPARK_HOME/logs/spark-shirdrn-org.apache.spark.deploy.worker.Worker-1-s1.out

Verifying Spark cluster computing

The verification uses the access log file of my website as sample data; two of its lines look like this:

27.159.254.192 - - [21/Feb/2014:11:40:46 +0800] "GET /archives/526.html HTTP/1.1" 200 12080 "http://shiyanjun.cn/archives/526.html" "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
120.43.4.206 - - [21/Feb/2014:10:37:37 +0800] "GET /archives/417.html HTTP/1.1" 200 11464 "http://shiyanjun.cn/archives/417.html/" "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"

We count how often each IP address appears in the file to verify that the Spark cluster computes correctly. The job reads the log file from HDFS, counts the IP address frequencies, and saves the result to a specified directory in HDFS.
First, start the Spark shell, which is used to submit the computation:

bin/spark-shell
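
In Spark 0.9, spark-shell runs against a local executor unless the MASTER environment variable is set (the processing log further below does show a local executor). To attach the shell to the standalone cluster instead, it can be started like this, assuming the default master port 7077:

MASTER=spark://m1:7077 bin/spark-shell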

In the Spark shell you write Scala code directly.
Next, run the IP address frequency count by entering the following code in the Spark shell:

val file = sc.textFile("hdfs://m1:9000/user/shirdrn/wwwlog20140222.log")
val result = file.flatMap(line => line.split("\\s+.*")).map(word => (word, 1)).reduceByKey((a, b) => a + b)

Here hdfs://m1:9000/user/shirdrn/wwwlog20140222.log is the input log file; splitting each line on the pattern "\\s+.*" keeps only the leading IP address field, so the pairs are (IP, 1). Collecting the result in the shell produces processing log output like this:

14/03/06 21:59:22 INFO MemoryStore: ensureFreeSpace(784) called with curMem=43296, maxMem=311387750
14/03/06 21:59:22 INFO MemoryStore: Block broadcast_11 stored as values to memory (estimated size 784.0 B, free 296.9 MB)
14/03/06 21:59:22 INFO FileInputFormat: Total input paths to process : 1
14/03/06 21:59:22 INFO SparkContext: Starting job: collect at <console>:13
14/03/06 21:59:22 INFO DAGScheduler: Registering RDD 84 (reduceByKey at <console>:13)
14/03/06 21:59:22 INFO DAGScheduler: Got job 10 (collect at <console>:13) with 1 output partitions (allowLocal=false)
14/03/06 21:59:22 INFO DAGScheduler: Final stage: Stage 20 (collect at <console>:13)
14/03/06 21:59:22 INFO DAGScheduler: Parents of final stage: List(Stage 21)
14/03/06 21:59:22 INFO DAGScheduler: Missing parents: List(Stage 21)
14/03/06 21:59:22 INFO DAGScheduler: Submitting Stage 21 (MapPartitionsRDD[84] at reduceByKey at <console>:13), which has no missing parents
14/03/06 21:59:22 INFO DAGScheduler: Submitting 1 missing tasks from Stage 21 (MapPartitionsRDD[84] at reduceByKey at <console>:13)
14/03/06 21:59:22 INFO TaskSchedulerImpl: Adding task set 21.0 with 1 tasks
14/03/06 21:59:22 INFO TaskSetManager: Starting task 21.0:0 as TID 19 on executor localhost: localhost (PROCESS_LOCAL)
14/03/06 21:59:22 INFO TaskSetManager: Serialized task 21.0:0 as 1941 bytes in 0 ms
14/03/06 21:59:22 INFO Executor: Running task ID 19
14/03/06 21:59:22 INFO BlockManager: Found block broadcast_11 locally
14/03/06 21:59:22 INFO HadoopRDD: Input split:hdfs://m1:9000/user/shirdrn/wwwlog20140222.log:0+4179514
14/03/06 21:59:23 INFO Executor: Serialized size of result for 19 is 738
14/03/06 21:59:23 INFO Executor: Sending result for 19 directly to driver
14/03/06 21:59:23 INFO TaskSetManager: Finished TID 19 in 211 ms on localhost (progress: 0/1)
14/03/06 21:59:23 INFO TaskSchedulerImpl: Remove TaskSet 21.0 from pool
14/03/06 21:59:23 INFO DAGScheduler: Completed ShuffleMapTask(21, 0)
14/03/06 21:59:23 INFO DAGScheduler: Stage 21 (reduceByKey at <console>:13) finished in 0.211 s
14/03/06 21:59:23 INFO DAGScheduler: looking for newly runnable stages
14/03/06 21:59:23 INFO DAGScheduler: running: Set()
14/03/06 21:59:23 INFO DAGScheduler: waiting: Set(Stage 20)
14/03/06 21:59:23 INFO DAGScheduler: failed: Set()
14/03/06 21:59:23 INFO DAGScheduler: Missing parents for Stage 20: List()
14/03/06 21:59:23 INFO DAGScheduler: Submitting Stage 20 (MapPartitionsRDD[86] at reduceByKey at <console>:13), which is now runnable
14/03/06 21:59:23 INFO DAGScheduler: Submitting 1 missing tasks from Stage 20 (MapPartitionsRDD[86] at reduceByKey at <console>:13)
14/03/06 21:59:23 INFO TaskSchedulerImpl: Adding task set 20.0 with 1 tasks
14/03/06 21:59:23 INFO Executor: Finished task ID 19
14/03/06 21:59:23 INFO TaskSetManager: Starting task 20.0:0 as TID 20 on executor localhost: localhost (PROCESS_LOCAL)
14/03/06 21:59:23 INFO TaskSetManager: Serialized task 20.0:0 as 1803 bytes in 0 ms
14/03/06 21:59:23 INFO Executor: Running task ID 20
14/03/06 21:59:23 INFO BlockManager: Found block broadcast_11 locally
14/03/06 21:59:23 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-zero-bytes blocks out of 1 blocks
14/03/06 21:59:23 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote gets in 1 ms
14/03/06 21:59:23 INFO Executor: Serialized size of result for 20 is 19423
14/03/06 21:59:23 INFO Executor: Sending result for 20 directly to driver
14/03/06 21:59:23 INFO TaskSetManager: Finished TID 20 in 17 ms on localhost (progress: 0/1)
14/03/06 21:59:23 INFO TaskSchedulerImpl: Remove TaskSet 20.0 from pool
14/03/06 21:59:23 INFO DAGScheduler: Completed ResultTask(20, 0)
14/03/06 21:59:23 INFO DAGScheduler: Stage 20 (collect at <console>:13) finished in 0.016 s
14/03/06 21:59:23 INFO SparkContext: Job finished: collect at <console>:13, took 0.242136929 s
14/03/06 21:59:23 INFO Executor: Finished task ID 20
res14: Array[(String, Int)] = Array((27.159.254.192,28), (120.43.9.81,40), (120.43.4.206,16), (120.37.242.176,56), (64.31.25.60,2), (27.153.161.9,32), (202.43.145.163,24), (61.187.102.6,1), (117.26.195.116,12), (27.153.186.194,64), (123.125.71.91,1), (110.85.106.105,64), (110.86.184.182,36), (27.150.247.36,52), (110.86.166.52,60), (175.98.162.2,20), (61.136.166.16,1), (46.105.105.217,1), (27.150.223.49,52), (112.5.252.6,20), (121.205.242.4,76), (183.61.174.211,3), (27.153.230.35,36), (112.111.172.96,40), (112.5.234.157,3), (144.76.95.232,7), (31.204.154.144,28), (123.125.71.22,1), (80.82.64.118,3), (27.153.248.188,160), (112.5.252.187,40), (221.219.105.71,4), (74.82.169.79,19), (117.26.253.195,32), (120.33.244.205,152), (110.86.165.8,84), (117.26.86.172,136), (27.153.233.101,8), (123.12...

As you can see, the map and reduce steps produce a count for each IP address.
Finally, to save the result to HDFS, just enter the following code:

result.saveAsTextFile("hdfs://m1:9000/user/shirdrn/wwwlog20140222.log.result")

View the result data on HDFS:

[shirdrn@m1 ~]$ hadoop fs -cat /user/shirdrn/wwwlog20140222.log.result/part-00000 | head -5
(27.159.254.192,28)
(120.43.9.81,40)
(120.43.4.206,16)
(120.37.242.176,56)
(64.31.25.60,2)
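
If a job writes several part files, they can be merged into one local file with the standard getmerge command; the local destination path below is just an example:

hadoop fs -getmerge /user/shirdrn/wwwlog20140222.log.result ./wwwlog20140222.log.result.txt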

Reference links

  • http://spark.incubator.apache.org/examples.html
  • http://spark.apache.org/docs/latest/
  • http://spark.apache.org/releases/spark-release-0-9-0.html
  • http://spark.apache.org/docs/latest/cluster-overview.html
