Apache Spark Source Code Reading 1 -- Spark Paper Reading Notes

Transferred from: http://www.cnblogs.com/hseagle/p/3664933.html

Preface

Reading source code is both a very easy thing and a very difficult thing. It is easy because the code is right there: open it and you can see it. It is difficult because you have to understand why the author designed it this way in the first place, and what problems the design originally set out to solve.

Before taking a concrete look at Spark's source code, it is a good idea to read the Spark paper by Matei Zaharia, especially if you want a quick overview of Spark.

On top of the paper, the talk "Introduction to Spark Internals" given by the Spark authors at a Developer Meetup provides a more general picture of Spark's internal implementation.

With these two materials as a foundation, when you then read the source code you will know where the focus of the analysis lies and where the difficulties are.

Basic Concepts

RDD - Resilient Distributed Dataset, an elastic distributed data set

Operation - the various operations that act on an RDD, divided into transformations and actions

Job - a job contains multiple RDDs and the various operations acting on those RDDs

Stage - a job is divided into multiple stages

Partition - data partitioning; the data in an RDD can be divided into several different partitions

DAG - Directed Acyclic Graph, reflecting the dependency relationships between RDDs

Narrow dependency - a child RDD depends on a fixed set of data partitions in its parent RDD

Wide dependency - a child RDD depends on all data partitions of its parent RDD

Caching management - caching of the intermediate results of RDD computations to speed up overall processing

Programming Model

An RDD is a read-only collection of data partitions; note that it is a data set.

The operations acting on an RDD are divided into transformations and actions. A transformation changes the contents of the data set, converting data set A into data set B; an action reduces the contents of a data set to a concrete result value.

Only when an action is applied to an RDD are all the operations on that RDD and its parent RDDs submitted to the cluster for actual execution.
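A minimal sketch of this laziness (assuming a SparkContext named sc is already available; the paths are placeholders):

val lines = sc.textFile("hdfs://...")              // transformation: nothing is executed yet
val errors = lines.filter(_.contains("ERROR"))     // transformation: still only a description
val lengths = errors.map(_.length)                 // transformation: still lazy
val total = lengths.count()                        // action: the whole chain is now submitted as a job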

From code to dynamic execution, the components involved are as shown in the figure.

Demo Code

val sc = new SparkContext("spark://...", "MyJob", home, jars)
val file = sc.textFile("hdfs://...")
val errors = file.filter(_.contains("ERROR"))
errors.cache()
errors.count()
Runtime View

Whatever the static model looks like, when it runs dynamically it is ultimately made up of processes and threads.

In Spark's terminology, the static view is called the dataset view, and the dynamic view is called the partition view.

In Spark, a task corresponds to a thread, a worker is a process, and workers are managed by the driver.

So the question is: how are tasks derived from an RDD? This question is answered in detail in the next section.

Deployment View

When an action is applied to an RDD, the action is submitted as a job.

During the submission process, the DAGScheduler module computes the dependency relationships between RDDs. The dependencies between RDDs form a DAG.

Each job is divided into multiple stages. One of the main criteria for dividing stages is whether the input of the current computation is deterministic; if it is, the computation is placed in the same stage, avoiding the message-passing overhead between multiple stages.

When a stage is submitted, the TaskScheduler computes the tasks required by that stage and submits those tasks to the corresponding workers.
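A minimal sketch of how a shuffle splits a job into stages (assuming a SparkContext named sc; the path is a placeholder): flatMap and map are narrow dependencies and stay within one stage, while reduceByKey introduces a wide (shuffle) dependency, so the computation after it falls into a new stage.

val words = sc.textFile("hdfs://...").flatMap(_.split(" "))   // stage 1: narrow dependencies
val pairs = words.map(word => (word, 1))                      // stage 1: still narrow
val counts = pairs.reduceByKey(_ + _)                         // wide dependency: shuffle, starts stage 2
counts.count()                                                // action: submits the two-stage job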

Spark supports several deployment modes: 1) standalone, 2) Mesos, 3) YARN. The deployment mode is passed as an initialization parameter to the TaskScheduler.
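For illustration only, a hedged sketch of how the master URL passed to SparkContext selects the deployment mode (host names, ports, home and jars are placeholders; in YARN mode the master is usually supplied by the submission tooling rather than hard-coded like this):

val standalone = new SparkContext("spark://master:7077", "MyJob", home, jars)   // standalone cluster
val onMesos    = new SparkContext("mesos://master:5050", "MyJob", home, jars)   // Mesos cluster
val local      = new SparkContext("local[4]", "MyJob")                          // local threads, useful for testing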

RDD Interface

An RDD is made up of the following main parts; a minimal custom-RDD sketch follows the list.

    1. partitions -- the set of partitions; how many data partitions the RDD has
    2. dependencies -- the RDD's dependency relationships
    3. compute(partition) -- the computation to perform for a given data partition
    4. preferredLocations -- location preferences for each data partition
    5. partitioner -- how the computed data results are distributed
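These parts map onto methods of the abstract RDD class. Below is a minimal, hypothetical sketch of a custom RDD (the names RangePartition and RangeRDD are invented for illustration) that yields the integers 0 until n split across a number of partitions:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A hypothetical partition describing a sub-range of integers.
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

// A toy RDD illustrating the interface: getPartitions, compute, dependencies
// (empty here, passed as Nil to the superclass) and optional preferred locations.
class RangeRDD(sc: SparkContext, n: Int, slices: Int) extends RDD[Int](sc, Nil) {

  override def getPartitions: Array[Partition] =
    (0 until slices).map { i =>
      new RangePartition(i, i * n / slices, (i + 1) * n / slices)
    }.toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }

  override def getPreferredLocations(split: Partition): Seq[String] = Nil   // no locality preference
}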

Caching Mechanism

The intermediate results of RDD computations can be cached. Memory is the first choice for the cache; if memory is insufficient, the data is written to disk.

An LRU (least recently used) policy decides which content to keep in memory and which to write to disk.
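A minimal sketch of the caching API (assuming a SparkContext named sc; the path is a placeholder). cache() is shorthand for persisting in memory only, while an explicit MEMORY_AND_DISK storage level lets partitions spill to disk when memory runs short:

import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("hdfs://...").filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK)      // like cache(), but spills to disk if memory is insufficient
errors.count()                                    // the first action materializes the cache
errors.filter(_.contains("timeout")).count()      // later actions reuse the cached partitions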

Fault Tolerance

From the initial RDD to the last derived RDD, a series of processing steps take place in between. How are errors in the middle of this chain handled?

The solution Spark offers is to recompute only the failed data partitions rather than replaying the entire data set, which greatly reduces the cost of recovery.

How does an RDD know how many data partitions it has? If it is built from an HDFS file, the blocks of the HDFS file are an important basis for that calculation.
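A small sketch (assuming a SparkContext named sc; the path is a placeholder): partitions.length shows how many data partitions an RDD has, and toDebugString prints the lineage Spark would use to recompute a lost partition.

val file = sc.textFile("hdfs://...")                  // roughly one partition per HDFS block
val derived = file.filter(_.contains("ERROR")).map(_.toUpperCase)

println(derived.partitions.length)    // number of data partitions
println(derived.toDebugString)        // the lineage: what would be replayed if a partition is lost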

Cluster Management

Tasks run on a cluster. Besides the standalone deployment mode provided by Spark itself, Spark also supports YARN and Mesos.

YARN is responsible for scheduling and monitoring computing resources: it restarts failed tasks based on monitoring results and redistributes tasks when a new node joins the cluster.

For this part, refer to YARN's documentation.

Summary

When reading the source code, we need to focus on the following two main threads.

    • The static view: RDDs, transformations, and actions
    • The dynamic view: the life of a job -- each job is divided into multiple stages, each stage can contain multiple RDDs and their transformations, and these stages are mapped into tasks that are distributed across the cluster

References
    1. Introduction to Spark Internals, http://files.meetup.com/3138542/dev-meetup-dec-2012.pptx
    2. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, https://www.usenix.org/system/files/.../nsdi12-final138.pdf
    3. Lightning-Fast Cluster Computing with Spark and Shark, http://www.meetup.com/TriHUG/events/112474102/
