Spark Research Notes (1 of 11): A Brief Introduction to Spark

Tags: Apache Mesos

Our company's Spark-based project has been online for nearly a year, and in practice Spark has proven to be an excellent distributed computing platform that genuinely improves productivity.
Starting with this note, I am sharing the Spark research report presented at an earlier seminar (it will be split into several articles due to space limitations), in the hope of helping readers who have just come into contact with Spark get started as quickly as possible.


The main text begins below.

1. Project background
The Spark project was born in 2009 in the UC Berkeley AMPLab and was open-sourced in 2010; it was later donated to the Apache Software Foundation. It is now a top-level Apache project, and its code-commit activity ranks among the highest in the entire open-source community.

2. Spark's performance
When all operations can be completed in memory, Spark delivers up to a 100x speedup over Hadoop's MapReduce computation model; even when an operation produces intermediate files on disk, it can still be roughly 10x faster.
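As a minimal sketch of the in-memory caching that underlies this speedup (the local master and HDFS input path below are assumptions for illustration, not from this article), repeated actions on a cached RDD are served from executor memory instead of re-reading the file:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheExample {
  def main(args: Array[String]): Unit = {
    // Local master for illustration only; a real cluster would use
    // spark://..., mesos://..., or YARN (see section 4.1 below).
    val conf = new SparkConf().setAppName("CacheExample").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Hypothetical input path.
    val lines = sc.textFile("hdfs:///data/input.txt")

    // cache() keeps the RDD in executor memory after the first action,
    // so later actions avoid re-reading and re-parsing the file.
    val words = lines.flatMap(_.split("\\s+")).cache()

    println(words.count())                       // first action: computes and caches
    println(words.filter(_ == "spark").count())  // second action: served from memory

    sc.stop()
  }
}
```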

3. Compiling
Download the appropriate Spark release from the official website; the pre-built binary packages are recommended, since they save you from resolving a large number of dependencies yourself.
If you need to build from source, follow the Building Spark guide in the official documentation and compile with Maven; several parameters (such as the Hadoop version and optional profiles) must be specified at build time. I will not repeat them here; refer directly to the official guide.

4. Deploying a Spark cluster
4.1 Spark cluster deployment modes

The following deployment options are now supported:
1) Standalone mode
The Spark cluster is deployed on its own and is not coupled to any existing system; the cluster manager role is played by Spark's own master instance. This is the simplest deployment mode.
In an actual deployment, multiple master instances can be started and ZooKeeper used to eliminate the single point of failure and achieve HA: ZooKeeper elects one "active" master while the remaining instances stand by; if the current master instance fails, ZooKeeper selects a new master from the standby instances.


2) Apache Mesos mode
The cluster manager role is played by the Mesos master instance, which performs resource allocation and task scheduling.
3) Hadoop yarn mode
The cluster manager role is played by the YARN ResourceManager.
Note: Spark must be compiled with YARN support (the -Pyarn Maven profile) before it can run in YARN mode.
For detailed instructions on these three deployment modes, refer to the official documentation; the master URLs an application uses to select each mode are sketched below.
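As a hedged sketch (all host names and ports below are placeholders, not endpoints from this article), the master URL passed to SparkConf is what selects the cluster manager:

```scala
import org.apache.spark.SparkConf

// 1) Standalone: Spark's own master process. For the ZooKeeper-based HA
//    described above, the masters are started with
//    spark.deploy.recoveryMode=ZOOKEEPER and spark.deploy.zookeeper.url set.
val standalone = new SparkConf()
  .setAppName("demo")
  .setMaster("spark://master-host:7077")

// 2) Apache Mesos: the Mesos master performs resource allocation.
val mesos = new SparkConf()
  .setAppName("demo")
  .setMaster("mesos://mesos-master:5050")

// 3) Hadoop YARN: the ResourceManager address comes from the Hadoop
//    configuration, so only the mode is named here ("yarn-cluster" or
//    "yarn-client" in Spark 1.x; plain "yarn" in later versions).
val yarn = new SparkConf()
  .setAppName("demo")
  .setMaster("yarn-cluster")
```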
4.2 Typical structure of a Spark cluster
According to the Cluster Mode Overview documentation, a typical Spark cluster consists of the components shown in the architecture figure there: a driver program, a cluster manager, and worker nodes hosting executor processes.

After a Spark application is submitted to the cluster, the SparkContext object created in the task script (the script runs inside the driver program, which is in fact a JVM process started on the Spark client machine) carries out the following steps in turn (a minimal sketch of this lifecycle follows the list):
1) Establish a connection to the cluster manager process
2) Request resources for the application from the cluster manager (executor processes on the worker nodes, which perform the actual computation and store data)
3) Send the application code (a JAR or .py file) to the acquired executor processes
4) After the scheduler decomposes the job into stages and the stages into tasks, send the tasks to the executor processes, which run them
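Below is a minimal driver-program sketch of that lifecycle, under stated assumptions (the standalone master URL and input path are placeholders): constructing the SparkContext covers steps 1-2, spark-submit ships the JAR for step 3, and each action triggers the job -> stages -> tasks decomposition of step 4:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverLifecycle {
  def main(args: Array[String]): Unit = {
    // Steps 1-2: creating the SparkContext connects to the cluster manager
    // and requests executor processes on the worker nodes.
    val conf = new SparkConf()
      .setAppName("DriverLifecycle")
      .setMaster("spark://master-host:7077") // placeholder standalone master

    val sc = new SparkContext(conf)

    // Step 3: the application JAR containing this class is distributed
    // to the executors when the application is submitted.

    // Step 4: each action below is a job; the scheduler splits it into
    // stages, the stages into tasks, and the tasks run on the executors.
    val counts = sc.textFile("hdfs:///data/input.txt") // hypothetical path
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)

    // Executors stay alive until the application ends (see note 3 below).
    sc.stop()
  }
}
```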
A few additional notes:
1) The SparkContext neither can nor needs to know which type of cluster manager it is talking to (one of the three above: Spark Standalone, Apache Mesos, or Hadoop YARN); as long as it can acquire executor processes through the cluster manager, it can run the Spark application.
2) Each Spark application gets its own executor processes (i.e., the processes belonging to different Spark applications are independent of each other). Advantage: applications are isolated from one another; on the scheduling side, each application creates its own SparkContext instance and each driver schedules only its own tasks, while on the execution side, the executor processes of different applications run in separate JVMs. Disadvantage: data cannot be shared between different Spark applications except through an external storage system (a sketch of such sharing follows these notes).


3) The executor processes acquired by a Spark application stay alive for the entire duration of the application, and each executor runs its computation tasks using multiple threads.
4) Because the driver program that owns the SparkContext instance is responsible for scheduling tasks on the cluster, the driver should be deployed as close to the worker nodes as possible, preferably on the same LAN.
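As a hedged illustration of note 2's disadvantage, the sketch below shares data between two independent applications through an external store; the HDFS paths and master URL are assumptions for illustration, not from the article:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Application A: writes its result to external storage (hypothetical path).
object WriterApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("WriterApp").setMaster("spark://master-host:7077"))
    sc.parallelize(1 to 100)
      .map(n => (n, n * n))
      .saveAsTextFile("hdfs:///shared/squares") // visible to other applications
    sc.stop()
  }
}

// Application B: a separate application (separate driver, separate executors)
// can only see A's data by reading it back from the external store.
object ReaderApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ReaderApp").setMaster("spark://master-host:7077"))
    println(sc.textFile("hdfs:///shared/squares").count())
    sc.stop()
  }
}
```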

To be continued: the next note will show how to submit a compute task to the Spark cluster through the Spark client.

"References"
1. Spark Overview
2. Cluster Mode Overview
3. Spark Documentation (PS: the docs contain plenty of Spark material, ranging from introductory to in-depth)

========================= EOF ====================


Copyright notice: This is an original article by the blogger and may not be reproduced without permission.
