Spark Learning Note 01: Installation of a Spark Cluster

Source: Internet
Author: User

I. Overview


For what Spark is and why it is worth learning, I won't repeat the basics; just see the official site: http://spark.apache.org.

Here I will only mention some of Spark's advantages:

1. Fast

Compared to Hadoop's MapReduce, Spark's memory-based operations are more than 100 times faster, and its disk-based operations are more than 10 times faster. Spark implements an efficient DAG execution engine that processes data flows efficiently by keeping intermediate results in memory.

2. Easy to use

Spark provides Java, Python, and Scala APIs and supports more than 80 high-level operators, which let users build different applications quickly. Spark also offers interactive Python and Scala shells, which make it very convenient to try out a problem-solving approach against the Spark cluster directly from the shell.

3. General

Spark provides a unified solution: it can be used for batch processing, interactive queries (Spark SQL), real-time stream processing (Spark Streaming), machine learning (MLlib), and graph computation (GraphX), and these different types of processing can be combined seamlessly in the same application. This unified solution is very attractive: after all, any company wants a single platform for the problems it encounters, reducing both the human cost of development and maintenance and the material cost of deploying multiple platforms.

4. Compatibility

Spark integrates easily with other open-source products. For example, Spark can use Hadoop's YARN or Apache Mesos as its resource manager and scheduler, and it can process all data sources supported by Hadoop, including HDFS, HBase, and Cassandra. This is especially important for users who already have a Hadoop cluster deployed, because they can use Spark's powerful processing capability without any data migration. Spark can also run without a third-party resource manager: its built-in Standalone mode provides its own resource management and scheduling framework, which further lowers the barrier to entry and makes Spark easy for everyone to deploy and use. In addition, Spark provides tools to deploy a Standalone Spark cluster on EC2.
The bottom line: Spark is an alternative to MapReduce that is compatible with HDFS and Hive, so it can be integrated into the Hadoop ecosystem to make up for MapReduce's shortcomings.

II. Cluster Installation

We do not need many virtual machines on our own computer: three are enough (one master, two workers). Of course, each Linux machine needs the JDK and HDFS installed; to install Spark 1.6.1 it is best to use JDK 7 or later.
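Since Spark 1.6.1 wants JDK 7 or later, it is worth checking the Java version on each machine first. The snippet below is a minimal sketch: the version string is hard-coded as an example, and in practice you would fill it from `java -version` as shown in the comment; the `1.x` scheme it parses is the one used by JDK 7/8.

```shell
# Check that the installed JDK is version 7 or later.
# In practice, capture the version with:
#   ver=$(java -version 2>&1 | awk -F'"' '/version/ {print $2}')
ver="1.7.0_80"                      # example value for illustration
major=$(echo "$ver" | cut -d. -f2)  # "1.7.0_80" -> "7" (JDK 7/8 "1.x" scheme)
if [ "$major" -ge 7 ]; then
  echo "JDK OK: $ver"
else
  echo "JDK too old: $ver (need 7+)"
fi
```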

Let's take a brief look at some concepts of the Spark cluster:

A Spark cluster consists only of a Master and Workers. There is a single Master, while there can be multiple Workers, and the Master and Workers keep in touch via RPC.

The Master is responsible for managing metadata, and the Workers are responsible for running tasks; the details are covered later.

There is also a Driver, which acts as the Spark cluster's client: it is mainly responsible for submitting tasks to the Spark cluster. Details are covered later.

1. Download the installation package

There is not much to say here: we download Spark 1.6.1.

2. Upload and unzip the installation package

Unzip the installation package to the desired location. (Here I am working as the spark user and extracting spark-1.6.1-bin-hadoop2.6.tgz into my home directory.)

tar -zxvf spark-1.6.1-bin-hadoop2.6.tgz -C /home/spark
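As a sanity check of the flags used above (`-z` gunzip, `-x` extract, `-v` verbose, `-f` archive file, `-C` change into the target directory before extracting), here is a self-contained demo on a throwaway archive; the `/tmp/tar-demo` paths are placeholders and not part of the Spark install.

```shell
# Build a tiny gzipped tarball, then extract it with the same flags
# used for the Spark package (-C switches to the target directory first).
mkdir -p /tmp/tar-demo/src /tmp/tar-demo/dest
echo "hello" > /tmp/tar-demo/src/README
tar -zcf /tmp/tar-demo/demo.tgz -C /tmp/tar-demo/src README
tar -zxvf /tmp/tar-demo/demo.tgz -C /tmp/tar-demo/dest
cat /tmp/tar-demo/dest/README    # -> hello
```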

3. Configure Spark

Go to the conf directory under the Spark installation directory, then rename and edit the spark-env.sh.template file:

cd conf/

mv spark-env.sh.template spark-env.sh

Add the following configuration in spark-env.sh:

vi spark-env.sh

export JAVA_HOME=[your JDK home directory]

export SPARK_MASTER_IP=[address of the node where the Spark master resides: master]

export SPARK_MASTER_PORT=7077 [this is the port drivers use to connect to the Spark cluster]
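Putting the three settings together, a minimal spark-env.sh might look like this; the JDK path and the `master` hostname are placeholders for your environment.

```shell
# Example minimal conf/spark-env.sh
export JAVA_HOME=/usr/java/jdk1.7.0_80   # placeholder: your JDK home
export SPARK_MASTER_IP=master            # placeholder: master's hostname or IP
export SPARK_MASTER_PORT=7077            # port drivers use to reach the master
```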

Rename and edit the slaves.template file:

mv slaves.template slaves

In the file, add the addresses of the child nodes (that is, the worker nodes):

vi slaves

[address of the first worker node: worker1]

[address of the second worker node: worker2]
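For example, with the two worker hostnames assumed throughout this note, the finished slaves file is just one hostname (or IP) per line:

```shell
# conf/slaves -- hostnames are placeholders for your environment
worker1
worker2
```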

Copy the configured spark to the other nodes:

scp -r spark-1.6.1-bin-hadoop2.6/ worker1:/home/spark

scp -r spark-1.6.1-bin-hadoop2.6/ worker2:/home/spark
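Both scp and the start-all.sh script used in the next step reach the workers over SSH, so passwordless SSH from the master to each worker is usually set up first. A sketch, assuming the spark user and the worker1/worker2 hostnames from above:

```shell
# Generate a key pair on the master (no passphrase) and install the
# public key on each worker so ssh/scp no longer prompt for a password.
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id spark@worker1
ssh-copy-id spark@worker2
```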

4. Test the installation by starting the cluster

Now that the Spark cluster is installed, it needs a test start. We currently have one master and two workers. Start the Spark cluster on the master: (Note: do not run start-all.sh on a worker; details later.)

/home/spark/spark-1.6.1-bin-hadoop2.6/sbin/start-all.sh

(start-all.sh starts the Master and all Workers, so nothing needs to be started on each worker.)

After startup, run the jps command on each machine: you should see the Master process on the master node and a Worker process on each child node. You can also log in to the Spark web UI to view the cluster status (the address must be the master node, because the Spark web UI runs on the master):

http://[master address]:8080/
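A quick way to confirm the web UI is up from the command line (the `master` hostname is a placeholder; expect a 200 once the cluster is running):

```shell
# Print the HTTP status code returned by the master's web UI.
curl -s -o /dev/null -w "%{http_code}\n" http://master:8080/
```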

At this point, the Spark cluster installation is complete.
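As a final smoke test, you can submit the bundled SparkPi example to the new cluster. This is a sketch assuming the install path used above and a master reachable as `master`; the examples jar name matches the prebuilt 1.6.1 / Hadoop 2.6 package, but check the lib/ directory for the exact file name.

```shell
# Submit the SparkPi example to the standalone master; the trailing "10"
# is the number of slices (partitions) used for the computation.
/home/spark/spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
  --master spark://master:7077 \
  --class org.apache.spark.examples.SparkPi \
  /home/spark/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar 10
```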

There is still a big problem, though: the master node is a single point of failure. Solving this requires ZooKeeper and at least two master nodes to achieve high availability; that is relatively simple and will be introduced later. Of course, a worker can also die, but workers' liveness is monitored by the master: the master and each worker exchange heartbeats, and once a worker dies, the tasks running on it are reassigned to other workers. This too will be described in detail later.

