Spark is a cluster computing platform that originated at the AMPLab of the University of California, Berkeley. It is based on in-memory computing and delivers better performance than Hadoop; even for disk-based workloads, iterative computation is roughly ten times faster. Spark is a rare all-round player: beyond iterative computation it also covers data warehousing, stream processing, and graph computation. Spark is now a top-level open source project of the Apache Foundation, with huge community support; its number of active developers has surpassed that of Hadoop MapReduce. Here we share Xu Peng's "Apache Spark Source" series of blog posts, which build an in-depth understanding of this popular big data computing framework from the source code.
The blog post follows.
Prologue
Reading source code is both a very easy thing and a very difficult thing. It is easy because the code is right there: open it and you can see it. It is difficult because you have to understand, through the code, why the author designed it this way and what problems the design was originally meant to solve.
Before spending time on the Spark source code itself, if you want to gain a quick, holistic understanding of Spark, reading Matei Zaharia's Spark paper is a very good choice.
On top of that paper, combined with the talk Introduction to Spark Internals given by the Spark author at a Developer Meetup, you will have a fairly general picture of Spark's internal implementation.
With these two articles as a foundation, when you then read the source code you will know where the focus and the difficulties of the analysis lie.
Basic Concepts
1. RDD -- Resilient Distributed Dataset.
2. Operation -- the operations that act on an RDD, divided into transformations and actions.
3. Job -- a job contains multiple RDDs and the various operations that act on them.
4. Stage -- a job is divided into multiple stages.
5. Partition -- a data partition; the data in an RDD can be divided into several different partitions.
6. DAG -- Directed Acyclic Graph, which reflects the dependency relationships among RDDs.
7. Narrow dependency -- the child RDD depends on a fixed set of data partitions in the parent RDD.
8. Wide dependency -- the child RDD depends on all data partitions in the parent RDD (see the sketch after this list).
9. Caching management -- caching the intermediate results of RDDs to speed up overall processing.
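To make the narrow/wide distinction concrete, here is a minimal sketch (the toy dataset and the local master are assumptions for illustration only) that inspects the dependencies of two RDDs: a map produces a narrow (one-to-one) dependency, while groupByKey introduces a wide (shuffle) dependency.

```scala
import org.apache.spark.{SparkConf, SparkContext, NarrowDependency, ShuffleDependency}

object DependencyDemo {
  def main(args: Array[String]): Unit = {
    // Local master and toy data are assumptions for illustration only.
    val sc = new SparkContext(new SparkConf().setAppName("dep-demo").setMaster("local[*]"))

    val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val mapped  = pairs.map { case (k, v) => (k, v * 2) } // narrow dependency
    val grouped = pairs.groupByKey()                      // wide (shuffle) dependency

    // Each RDD exposes its dependencies on its parent RDDs.
    def describe(name: String, deps: Seq[org.apache.spark.Dependency[_]]): Unit =
      deps.foreach {
        case _: ShuffleDependency[_, _, _] => println(s"$name -> wide (shuffle) dependency")
        case _: NarrowDependency[_]        => println(s"$name -> narrow dependency")
        case other                         => println(s"$name -> ${other.getClass.getSimpleName}")
      }

    describe("map", mapped.dependencies)
    describe("groupByKey", grouped.dependencies)

    sc.stop()
  }
}
```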
Programming Model
An RDD is a read-only collection of data partitions; note that it is a dataset.
The operations that act on an RDD fall into transformations and actions. After a transformation, the contents of the dataset change: dataset A is transformed into dataset B. After an action, the contents of the dataset are reduced to a specific value.
Only when an action is applied to an RDD are all the transformations on that RDD and its parent RDDs submitted to the cluster for actual execution.
From code to its dynamic execution, the components involved are shown in the following illustration.
Demo Code
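The original demo code is not reproduced here; the following is a minimal sketch of the kind of program being discussed (the input file path and the local master are assumptions). The transformation (filter) only builds up the RDD lineage; nothing runs on the cluster until the action (count) is invoked.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DemoJob {
  def main(args: Array[String]): Unit = {
    // Master URL and input path are placeholders for illustration.
    val conf = new SparkConf().setAppName("demo").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val lines  = sc.textFile("README.md")          // RDD backed by a file
    val errors = lines.filter(_.contains("ERROR")) // transformation: lazy, nothing runs yet
    errors.cache()                                 // mark for caching after first computation

    // Action: triggers a job, which is split into stages and tasks on the cluster.
    println(s"error lines: ${errors.count()}")

    sc.stop()
  }
}
```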
Runtime View
No matter how the static model is described, at run time it is ultimately composed of processes and threads.
In Spark terminology, the static view is called the dataset view and the dynamic view is called the partition view. Their relationship is shown in the figure.
A task in Spark corresponds to a thread, while a worker is a process; workers are managed by the driver.
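As a rough illustration of this process/thread mapping (the values below are arbitrary, and the exact effect depends on the cluster manager), the number of executor (worker) processes and the number of task threads each may run are controlled through configuration:

```scala
import org.apache.spark.SparkConf

object RuntimeViewConf {
  // A rough sketch: the values are arbitrary, and the exact effect depends on
  // the cluster manager (Standalone, YARN, or Mesos).
  val conf: SparkConf = new SparkConf()
    .setAppName("runtime-view-demo")
    .set("spark.executor.instances", "4") // number of executor (worker) processes
    .set("spark.executor.cores", "2")     // task threads each executor may run concurrently
}
```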
So the question becomes: how does a task evolve from an RDD? This will be answered in detail in the next section.
Deployment View
When an action is applied to an RDD, the action is submitted as a job.
During submission, the DAGScheduler module steps in to compute the dependencies between RDDs; these dependencies form a DAG.
Each job is divided into multiple stages. One of the main criteria for dividing stages is whether the input of the current computation step is already determined; if it is, the step is placed in the same stage, avoiding the message-passing overhead between multiple stages.
When a stage is submitted, the TaskScheduler computes the tasks required by the stage and submits them to the corresponding workers.
Spark supports the following deployment modes: Standalone, Mesos, and YARN. The deployment mode is passed in as an initialization parameter of the TaskScheduler.
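In practice the deployment mode is selected through the master URL passed to the SparkContext; the sketch below (host names and ports are placeholders) lists the usual forms, each of which leads to a different scheduler backend being set up during TaskScheduler initialization.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DeploymentDemo {
  def main(args: Array[String]): Unit = {
    // Host names and ports below are placeholders, not real endpoints.
    // Standalone: "spark://master-host:7077"
    // Mesos:      "mesos://mesos-host:5050"
    // YARN:       "yarn" (newer releases; older ones used "yarn-client"/"yarn-cluster")
    val conf = new SparkConf()
      .setAppName("deployment-demo")
      .setMaster("local[*]") // single-JVM mode, convenient for experimenting
    val sc = new SparkContext(conf)
    println(sc.master)
    sc.stop()
  }
}
```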
RDD Interface
An RDD consists of the following major parts (a simplified sketch follows the list):
partitions -- the set of partitions, i.e., how the data of an RDD is split into partitions
dependencies -- the RDD's dependencies on its parent RDDs
compute(partition) -- the computation to perform for a given partition of the dataset
preferredLocations -- location preferences for a data partition
partitioner -- how the computed results are distributed
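The five parts above can be summarized in a simplified sketch of the interface. This is a condensed illustration, not the exact definition of the abstract RDD class in the Spark source, which has more members.

```scala
import org.apache.spark.{Partition, Partitioner, Dependency, TaskContext}

// A condensed illustration of the RDD contract; the real abstract class in the
// Spark source has more members, but these five capture the parts listed above.
abstract class SimplifiedRDD[T] {
  // partitions: how the data of this RDD is split
  protected def getPartitions: Array[Partition]

  // dependencies: which parent RDDs (and which of their partitions) this RDD relies on
  protected def getDependencies: Seq[Dependency[_]]

  // compute(partition): the computation performed for one partition
  def compute(split: Partition, context: TaskContext): Iterator[T]

  // preferredLocations: data-locality hints for a partition (e.g. HDFS block hosts)
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil

  // partitioner: how the computed key-value results are distributed
  val partitioner: Option[Partitioner] = None
}
```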
Caching Mechanism
Intermediate RDD results can be cached. The cache prefers memory; if memory is insufficient, data is written to disk.
An LRU (least recently used) policy decides which content stays in memory and which is written to disk.
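For example (a minimal sketch; the dataset is made up), cache() is shorthand for keeping an RDD in memory only, while persist with MEMORY_AND_DISK allows spilling to disk when memory runs out:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachingDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("caching-demo").setMaster("local[*]"))

    val doubled = sc.parallelize(1 to 1000000).map(_ * 2)

    // cache() would be equivalent to persist(StorageLevel.MEMORY_ONLY);
    // MEMORY_AND_DISK keeps partitions in memory and spills to disk if memory is insufficient.
    doubled.persist(StorageLevel.MEMORY_AND_DISK)

    println(doubled.count()) // first action computes the RDD and populates the cache
    println(doubled.count()) // subsequent actions reuse the cached partitions

    sc.stop()
  }
}
```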
Fault Tolerance
From the initial RDD to the last derived RDD, a series of processing steps takes place in between. How does Spark handle the case where one of these intermediate steps fails?
The solution Spark provides is to recompute only the failed data partitions, without repeating the computation over the entire dataset, which greatly reduces the cost of recovery.
How does an RDD know how many data partitions it has? If the data is an HDFS file, the blocks of the HDFS file become an important basis for that calculation.
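As a small illustration (the HDFS URI below is a placeholder), the partition count of a file-backed RDD can be inspected directly; for an HDFS file it is derived from the input splits, which by default follow the file's block layout.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCountDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-demo").setMaster("local[*]"))

    // The HDFS URI below is a placeholder; any sufficiently large file will do.
    val logs = sc.textFile("hdfs://namenode:8020/data/access.log")

    // For HDFS-backed RDDs the partition count is derived from the input splits,
    // which by default follow the file's block layout.
    println(s"number of partitions: ${logs.partitions.length}")

    sc.stop()
  }
}
```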
Cluster Management
Tasks run on a cluster. Besides the Standalone deployment mode that Spark itself provides, Spark also supports YARN and Mesos internally.
YARN is responsible for scheduling and monitoring compute resources, restarting failed tasks based on the monitoring results, and distributing tasks to newly joined nodes once they are added to the cluster.
For this part, refer to the YARN documentation.
Summary
When reading the source code, focus on the following two main threads.
Static view: RDDs, transformations, and actions.
Dynamic view: the life of a job; each job is divided into multiple stages, each stage can contain multiple RDDs and their transformations, and these stages are mapped into tasks that are distributed to the cluster.