Apache Spark Source Code Reading 1 -- Spark Paper Reading Notes

Transferred from: http://www.cnblogs.com/hseagle/p/3664933.html

Preface

Reading source code is both a very easy thing and a very difficult thing. It is easy because the code is right there: open it and you can see it. It is difficult because you have to understand why the author designed it this way in the first place, and what problems the design originally set out to solve.

Before taking a concrete look at Spark's source code, it is a good idea to read the Spark paper by Matei Zaharia, especially if you want a quick overview of Spark.

On top of the paper, the talk "Introduction to Spark Internals" given by the Spark authors at a Developer Meetup provides a more general picture of Spark's internal implementation.

With these two materials as a foundation, when you then read the source code you will know where the focus of the analysis lies and where the difficulties are.

Basic Concepts

RDD - Resilient Distributed Dataset, an elastic distributed data set

Operation - the various operations that act on an RDD, divided into transformations and actions

Job - a job contains multiple RDDs and the various operations acting on those RDDs

Stage - a job is divided into multiple stages

Partition - data partitioning; the data in an RDD can be divided into several different partitions

DAG - Directed Acyclic Graph, reflecting the dependency relationships between RDDs

Narrow dependency - a child RDD depends on a fixed set of data partitions in its parent RDD

Wide dependency - a child RDD depends on all data partitions of its parent RDD

Caching management - caching of the intermediate results of RDD computations to speed up overall processing

Programming Model

An RDD is a read-only collection of data partitions; note that it is a data set.

The operations acting on an RDD are divided into transformations and actions. A transformation changes the contents of the data set, converting data set A into data set B; an action reduces the contents of a data set to a concrete result value.

Only when an action is applied to an RDD are all the operations on that RDD and its parent RDDs submitted to the cluster for actual execution.
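A minimal sketch of this laziness (assuming a SparkContext named sc is already available; the paths are placeholders):

val lines = sc.textFile("hdfs://...")              // transformation: nothing is executed yet
val errors = lines.filter(_.contains("ERROR"))     // transformation: still only a description
val lengths = errors.map(_.length)                 // transformation: still lazy
val total = lengths.count()                        // action: the whole chain is now submitted as a job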

From code to dynamic execution, the components involved are as shown in the figure.

Demo Code

val sc = new SparkContext("spark://...", "MyJob", home, jars)
val file = sc.textFile("hdfs://...")
val errors = file.filter(_.contains("ERROR"))
errors.cache()
errors.count()
Runtime View

Whatever the static model looks like, when it runs dynamically it is ultimately made up of processes and threads.

In Spark's terminology, the static view is called the dataset view, and the dynamic view is called the partition view.

In Spark, a task corresponds to a thread, a worker is a process, and workers are managed by the driver.

So the question is: how are tasks derived from an RDD? This question is answered in detail in the next section.

Deployment View

When an action is applied to an RDD, the action is submitted as a job.

During the submission process, the DAGScheduler module computes the dependency relationships between RDDs. The dependencies between RDDs form a DAG.

Each job is divided into multiple stages. One of the main criteria for dividing stages is whether the input of the current computation is deterministic; if it is, the computation is placed in the same stage, avoiding the message-passing overhead between multiple stages.

When a stage is submitted, the TaskScheduler computes the tasks required by that stage and submits those tasks to the corresponding workers.
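A minimal sketch of how a shuffle splits a job into stages (assuming a SparkContext named sc; the path is a placeholder): flatMap and map are narrow dependencies and stay within one stage, while reduceByKey introduces a wide (shuffle) dependency, so the computation after it falls into a new stage.

val words = sc.textFile("hdfs://...").flatMap(_.split(" "))   // stage 1: narrow dependencies
val pairs = words.map(word => (word, 1))                      // stage 1: still narrow
val counts = pairs.reduceByKey(_ + _)                         // wide dependency: shuffle, starts stage 2
counts.count()                                                // action: submits the two-stage job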

Spark supports several deployment modes: 1) standalone, 2) Mesos, 3) YARN. The deployment mode is passed as an initialization parameter to the TaskScheduler.
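For illustration only, a hedged sketch of how the master URL passed to SparkContext selects the deployment mode (host names, ports, home and jars are placeholders; in YARN mode the master is usually supplied by the submission tooling rather than hard-coded like this):

val standalone = new SparkContext("spark://master:7077", "MyJob", home, jars)   // standalone cluster
val onMesos    = new SparkContext("mesos://master:5050", "MyJob", home, jars)   // Mesos cluster
val local      = new SparkContext("local[4]", "MyJob")                          // local threads, useful for testing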

RDD Interface

An RDD is made up of the following main parts; a minimal custom-RDD sketch follows the list.

    1. partitions -- the set of partitions; how many data partitions the RDD has
    2. dependencies -- the RDD's dependency relationships
    3. compute(partition) -- the computation to perform for a given data partition
    4. preferredLocations -- location preferences for each data partition
    5. partitioner -- how the computed data results are distributed
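These parts map onto methods of the abstract RDD class. Below is a minimal, hypothetical sketch of a custom RDD (the names RangePartition and RangeRDD are invented for illustration) that yields the integers 0 until n split across a number of partitions:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A hypothetical partition describing a sub-range of integers.
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

// A toy RDD illustrating the interface: getPartitions, compute, dependencies
// (empty here, passed as Nil to the superclass) and optional preferred locations.
class RangeRDD(sc: SparkContext, n: Int, slices: Int) extends RDD[Int](sc, Nil) {

  override def getPartitions: Array[Partition] =
    (0 until slices).map { i =>
      new RangePartition(i, i * n / slices, (i + 1) * n / slices)
    }.toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }

  override def getPreferredLocations(split: Partition): Seq[String] = Nil   // no locality preference
}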

Caching Mechanism

The intermediate results of RDD computations can be cached. Memory is the first choice for the cache; if memory is insufficient, the data is written to disk.

An LRU (least recently used) policy decides which content to keep in memory and which to write to disk.
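A minimal sketch of the caching API (assuming a SparkContext named sc; the path is a placeholder). cache() is shorthand for persisting in memory only, while an explicit MEMORY_AND_DISK storage level lets partitions spill to disk when memory runs short:

import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("hdfs://...").filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK)      // like cache(), but spills to disk if memory is insufficient
errors.count()                                    // the first action materializes the cache
errors.filter(_.contains("timeout")).count()      // later actions reuse the cached partitions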

Fault Tolerance

From the initial RDD to the last derived RDD, a series of processing steps take place in between. How are errors in the middle of this chain handled?

The solution Spark offers is to recompute only the failed data partitions rather than replaying the entire data set, which greatly reduces the cost of recovery.

How does an RDD know how many data partitions it has? If it is built from an HDFS file, the blocks of the HDFS file are an important basis for that calculation.
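A small sketch (assuming a SparkContext named sc; the path is a placeholder): partitions.length shows how many data partitions an RDD has, and toDebugString prints the lineage Spark would use to recompute a lost partition.

val file = sc.textFile("hdfs://...")                  // roughly one partition per HDFS block
val derived = file.filter(_.contains("ERROR")).map(_.toUpperCase)

println(derived.partitions.length)    // number of data partitions
println(derived.toDebugString)        // the lineage: what would be replayed if a partition is lost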

Cluster Management

Tasks run on a cluster. Besides the standalone deployment mode provided by Spark itself, Spark also supports YARN and Mesos.

YARN is responsible for scheduling and monitoring computing resources: it restarts failed tasks based on monitoring results and redistributes tasks when a new node joins the cluster.

For this part, refer to YARN's documentation.

Summary

When reading the source code, we need to focus on the following two main threads.

    • The static view: RDDs, transformations, and actions
    • The dynamic view: the life of a job -- each job is divided into multiple stages, each stage can contain multiple RDDs and their transformations, and these stages are mapped into tasks that are distributed across the cluster

References
    1. Introduction to Spark Internals, http://files.meetup.com/3138542/dev-meetup-dec-2012.pptx
    2. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, https://www.usenix.org/system/files/.../nsdi12-final138.pdf
    3. Lightning-Fast Cluster Computing with Spark and Shark, http://www.meetup.com/TriHUG/events/112474102/
