Spark Pseudo-Distributed & Fully Distributed Installation Guide


Contents
  0. Preface
  1. Installation Environment
  2. Pseudo-Distributed Installation
     2.1 Extract and configure environment variables
     2.2 Let the configuration take effect
     2.3 Start Spark
     2.4 Run the Spark example programs
         2.4.1 Spark-shell
         2.4.2 Running Scripts
  3. Fully Distributed Cluster Installation
     3.1 Add environment variables
     3.2 Launching the Spark cluster
     3.3 Local mode run demo
     3.4 Shell Interactive mode
  4. A Scala & Spark Example
  5. Off topic: Embracing Scala
  6. Refer: http://my.oschina.net/leejun2005/blog/394928
0. Preface

March 31 marked Spark's fifth anniversary. Starting from the first publicly released version, Spark has come through an extraordinary five years: obscure at the beginning, famous by 2013, and booming in 2014. On top of the Spark core sit libraries for distributed machine learning, SQL, streaming, and graph computation.

On April 1, Spark "officially announced" that Spark 2.0 would refactor Spark to better support mobile devices such as phones. Hashjoin, one of Databricks's founders, revealed the refactoring approach: use the Scala.js project to compile the Spark code into JavaScript and then run it on the phone in Safari or Chrome, so that one codebase can support both Android and iOS. However, for performance reasons, the underlying network module may need to be rewritten to support zero-copy. (Yes, this was an April Fool's joke. :))


OK, back to business. Spark currently supports several distributed deployment methods: Standalone deploy mode, Amazon EC2, Apache Mesos, and Hadoop YARN. The first runs Spark on its own (either single-machine or as a cluster) without depending on an external resource manager; the other three require deploying Spark onto the corresponding resource manager.

In addition to the multiple deployment modes, newer versions of Spark support a variety of Hadoop platforms: starting with version 0.8.1, Spark has supported Hadoop 1 (HDP1, CDH3), CDH4, and Hadoop 2 (HDP2, CDH5). At present, when installing Cloudera's CDH5 through Cloudera Manager, you can select the Spark service directly.

At the time of writing, the latest version of Spark is 1.3.0, so this article uses 1.3.0 to walk through both a single-machine pseudo-distributed installation and a distributed cluster installation.

1. Installation Environment

Spark 1.3.0 requires JDK 1.6 or later; we use JDK 1.6.0_32 here.
Spark 1.3.0 requires Scala 2.10 or later; we use Scala 2.11.6 here.
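As a quick sanity check (assuming java and scala are already on your PATH; this step is not in the original post), you can confirm the installed versions before continuing:

java -version
scala -version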

Remember to configure the following Scala environment variables:

vim /etc/profile

export SCALA_HOME=/home/hadoop/software/scala-2.11.4
export PATH=$SCALA_HOME/bin:$PATH
2. Pseudo-Distributed Installation

2.1 Extract and configure environment variables
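The original post does not show the extraction step itself; a minimal sketch, assuming the prebuilt Spark package and the Scala distribution have already been downloaded into /home/hadoop/software (file names and paths are examples, adjust them to your layout):

cd /home/hadoop/software
tar -xzf scala-2.11.4.tgz
tar -xzf spark-1.2.0-bin-hadoop2.4.tgz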

Edit /etc/profile or ~/.bashrc directly and add the following environment variables:

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SCALA_HOME=/home/hadoop/software/scala-2.11.4
export JAVA_HOME=/home/hadoop/software/jdk1.7.0_67
export SPARK_MASTER=localhost
export SPARK_LOCAL_IP=localhost
export HADOOP_HOME=/home/hadoop/software/hadoop-2.5.2
export SPARK_HOME=/home/hadoop/software/spark-1.2.0-bin-hadoop2.4
export SPARK_LIBRARY_PATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib:$HADOOP_HOME/lib/native
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$SCALA_HOME/bin:$SPARK_HOME/bin
2.2 Let the configuration take effect

source /etc/profile

source ~/.bashrc

2.3 Start Spark

Go to $SPARK_HOME/sbin and run:
start-all.sh
[root@centos local]# jps
7953 DataNode
8354 NodeManager
8248 ResourceManager
8104 SecondaryNameNode
10396 Jps
7836 NameNode
7613 Worker
7485 Master
If you see both a Master and a Worker process, the startup was successful.
You can view the status of the Spark cluster at http://localhost:8080/.

2.4 Run the Spark example programs

The example programs can be run in two modes.

2.4.1 Spark-shell

This mode is used for interactive programming. Use it as follows (go to the bin directory first):
./spark-shell

scala> val days = List("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
days: List[String] = List(Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday)

scala> val daysRDD = sc.parallelize(days)
daysRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>

scala> daysRDD.count()
res0: Long = 7
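As a small follow-up in the same shell (not from the original post, just to show a transformation before an action; daysRDD is the variable defined above):

scala> val longDays = daysRDD.filter(_.length > 6)
scala> longDays.collect()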
2.4.2 Running Scripts

Run the SparkPi example that ships with Spark. Note that there are two ways to do this:

./bin/run-example org.apache.spark.examples.SparkPi spark://localhost:7077
./bin/run-example org.apache.spark.examples.SparkPi local[3]

Here local[3] means running locally with 3 threads.
Running it produces output like the following:

./bin/run-example org.apache.spark.examples.SparkPi 2 spark://192.168.0.120:7077
15/03/17 19:23:56 INFO scheduler.DAGScheduler: Completed ResultTask(0, 0)
15/03/17 19:23:56 INFO scheduler.DAGScheduler: Stage 0 (reduce at SparkPi.scala:35) finished in 0.416 s
15/03/17 19:23:56 INFO spark.SparkContext: Job finished: reduce at SparkPi.scala:35, took 0.501835986 s
Pi is roughly 3.14086
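As an aside not covered in the original post: the same example can also be launched through spark-submit, which run-example wraps internally. The exact jar name under lib/ depends on the package you downloaded, so treat this as a sketch:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master local[3] lib/spark-examples-*.jar 10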
3. Fully Distributed cluster installation

In fact, the cluster installation is also very simple.

3.1 Add environment variables

cd spark-1.3.0
cp ./conf/spark-env.sh.template ./conf/spark-env.sh
vi ./conf/spark-env.sh

Add the following:

export SCALA_HOME=/usr/lib/scala-2.10.3
export JAVA_HOME=/usr/java/jdk1.6.0_31
export SPARK_MASTER_IP=10.32.21.165
export SPARK_WORKER_INSTANCES=3
export SPARK_MASTER_PORT=8070
export SPARK_MASTER_WEBUI_PORT=8090
export SPARK_WORKER_PORT=8092
export SPARK_WORKER_MEMORY=5000m

SPARK_MASTER_IP is the IP address of the master; SPARK_MASTER_PORT is the master's port; SPARK_MASTER_WEBUI_PORT is the port of the web UI used to check how the cluster is running; SPARK_WORKER_PORT is the port of each worker; SPARK_WORKER_MEMORY configures the amount of memory each worker may use.

vi ./conf/slaves, adding the hostname of one worker per line (preferably mapping each IP to a hostname in the hosts file), as follows:

10.32.21.165
10.32.21.166
10.32.21.167
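If you prefer hostnames in the slaves file, a hypothetical /etc/hosts mapping (the hostnames here are invented for illustration) would look like this on every node:

10.32.21.165 master
10.32.21.166 worker1
10.32.21.167 worker2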

Set the SPARK_HOME environment variable and add $SPARK_HOME/bin to PATH:

vi /etc/profile and add the following:

export SPARK_HOME=/usr/lib/spark-1.3.0
export PATH=$SPARK_HOME/bin:$PATH
Then synchronize the configuration and installation files to each node and let the environment variables take effect.
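A minimal sketch of that synchronization, assuming rsync and scp are available and using the worker IPs from above (the paths are the ones used in this guide; adjust as needed):

for host in 10.32.21.166 10.32.21.167; do
    # copy the Spark and Scala installations to each worker
    rsync -az /usr/lib/spark-1.3.0/ ${host}:/usr/lib/spark-1.3.0/
    rsync -az /usr/lib/scala-2.10.3/ ${host}:/usr/lib/scala-2.10.3/
    # copy the profile so the environment variables match
    scp /etc/profile ${host}:/etc/profile
done
# then run "source /etc/profile" (or log in again) on each node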
3.2 Launching the spark cluster

Execute ./sbin/start-all.sh
If start-all.sh does not bring up the related processes properly, check the error messages in the $SPARK_HOME/logs directory. You can also start the processes individually, Hadoop-style, by running the following commands:
On the master: ./sbin/start-master.sh
On each worker: ./sbin/start-slave.sh 3 spark://10.32.21.165:8070 --webui-port 8090
Then check that the processes have started by running the jps command; you should see a Worker process (and, on the master node, a Master process). You can then open the web UI at http://masterSpark:8090/ to see all the worker nodes along with their CPU counts and memory.
3.3 Local mode run demo

For example: ./bin/run-example SparkLR 2 local or ./bin/run-example SparkPi 2 local
The first example is an iterative linear-regression computation; the second computes pi.
3.4 Shell Interactive mode

./bin/spark-shell --master spark://10.32.21.165:8070. If MASTER is configured in conf/spark-env.sh (add export MASTER=spark://${SPARK_MASTER_IP}:${SPARK_MASTER_PORT}), you can start it with just ./bin/spark-shell.
Spark-shell runs as an application: it submits jobs to the Spark cluster, the cluster assigns them to specific workers for processing, and the workers read local files while processing the job.
This shell is a modified Scala REPL; while such a shell is open, you will see a running application in the web UI.
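For instance, a tiny sketch in that shell (the file path is an example; since each worker reads it locally, the file must exist at the same path on every worker):

scala> val lines = sc.textFile("/tmp/words.txt")
scala> lines.count()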

4. A Scala & Spark example

This example first generates 150,000,000 random numbers with a shell script, and then uses Spark to count the frequency of each number to see whether the random numbers are evenly distributed.

getNum(){
    c=1
    while [[ $c -le 5000000 ]]
    do
        echo $(($RANDOM/500))
        ((c++))
    done
}
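The script is cut off at this point in the source page. A minimal sketch of how the rest of the example could go, assuming getNum is invoked 30 times in parallel (30 x 5,000,000 = 150,000,000 numbers, matching the count stated above) and the output is written to a file named random.txt; the frequency count then uses standard RDD operations in spark-shell:

# assumed driver part of the script (not from the original page)
for i in $(seq 30)
do
    getNum >> random.txt &
done
wait

Then, in spark-shell:

scala> val nums = sc.textFile("random.txt")
scala> val freq = nums.map(n => (n, 1)).reduceByKey(_ + _)
scala> freq.sortByKey().collect().foreach(println)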
