Spark: The Lightning Flint of the Big Data Age

Source: Internet
Author: User

Spark is a cluster computing platform that originated at AMPLab at the University of California, Berkeley. Built on in-memory computing, it started from iterative batch processing and has absorbed several other computational paradigms, including data warehousing, stream processing, and graph computation, which makes it a rare all-rounder.

Spark has formally applied to join the Apache Incubator, growing from a laboratory "spark" into a rising star among big data technology platforms. This article mainly describes the design ideas behind Spark. As its name suggests, Spark is an unusual "flash" in the big data world. Its characteristics can be summed up as light, fast, flexible, and clever.

Light: The Spark 0.6 core code has 20,000 lines, compared with 90,000 lines for Hadoop 1.0 and 220,000 lines for Hadoop 2.0. On the one hand, this is thanks to the conciseness and expressiveness of the Scala language; on the other hand, Spark makes good use of the infrastructure of Hadoop and Mesos (another Berkeley project that has entered the incubator, focused on dynamic resource management for clusters). Although very light, Spark does not compromise on fault-tolerant design. Its creator, Matei, has said: "Do not treat errors as exceptions." The implication is that fault tolerance is part of the infrastructure.

Fast: Spark can achieve sub-second latency on small datasets, which is unthinkable for Hadoop MapReduce (because of its "heartbeat" interval mechanism, merely starting a task takes several seconds). On large datasets, for typical iterative machine learning, ad hoc query, and graph computation workloads, the Spark versions are about 10 times faster than implementations based on MapReduce, Hive, and Pregel. In-memory computing, data locality and transfer optimization, and scheduling optimization deserve most of the credit, and the lightweight philosophy adopted from the start of the design also plays a part.

Flexible: Spark offers flexibility at several levels. At the implementation level, it makes excellent use of Scala's trait mixins (for example, the cluster scheduler and the serialization library are replaceable). At the primitive level, it allows extension with new data operators, new data sources (for example, DynamoDB in addition to HDFS), and new language bindings (Java and Python). At the paradigm level, Spark supports in-memory computing, iterative batch processing, ad hoc query, stream processing, and graph computation.

Clever: Spark is clever in borrowing momentum and leverage from others. Spark builds on Hadoop and integrates with it seamlessly; Shark (the data warehouse implementation on Spark) borrows from Hive; the graph computation layer borrows the APIs of Pregel and PowerGraph as well as PowerGraph's vertex-cut idea. Everything rides on the momentum of Scala (widely hailed as the future replacement for Java): the look and feel of Spark programming is authentic Scala, in both syntax and API. The implementation borrows leverage just as deftly. To support interactive programming, Spark only needed small changes to Scala's shell. By contrast, to support a JavaScript console for interactive MapReduce programming, Microsoft not only had to bridge the gap between Java and JavaScript thinking but also had to go to great lengths in the implementation.

After listing so many benefits, it must be pointed out that Spark is not perfect. It has an inherent limitation: it does not support fine-grained, asynchronous data processing well. It is also still young: however good its genes, it has only just started, and there is a lot of room for improvement in performance, stability, and the extensibility of its paradigms.

Computational paradigm and abstraction

Spark is, first of all, a coarse-grained data-parallel computing paradigm.

Data parallelism differs from task parallelism in the following two respects.

First, the unit of computation is a collection of data, not an individual datum. The width of the collection depends on the implementation: SIMD (single instruction, multiple data) vector instructions are typically 4 to 64 wide, GPU SIMT (single instruction, multiple threads) is typically 32 wide, and SPMD (single program, multiple data) can be wider still. Spark deals with big data and therefore uses a very coarse-grained collection called the Resilient Distributed Dataset (RDD).

Second, all the data in the collection passes through the same sequence of operators. Data parallelism is easy to program, makes high parallelism easy to obtain (the parallelism depends on the size of the data, not on the parallelism of the program logic), and maps easily onto the underlying parallel or distributed hardware. Traditional array/vector programming languages, SSE/AVX intrinsics, CUDA/OpenCL, and Ct (C++ for throughput) all belong to this category. The difference is that Spark's field of view is the entire cluster, not a single node or parallel processor.

The data-parallel paradigm means that Spark cannot perfectly support fine-grained, asynchronous update operations. Graph computation has such operations, so in this respect Spark is not as good as GraphLab (a large-scale graph computing framework). Some applications also require fine-grained log updates and data checkpoints, and for those Spark is not as good as RAMCloud (Stanford's in-memory storage and computing research project) or Percolator (Google's incremental computing technology). In return, this lets Spark concentrate on the application areas it is good at, whereas Dryad (Microsoft's early big data platform), which tried to cover everything, was not very successful.

Spark's RDD adopts the programming style of Scala collection types. It also adopts functional semantics in two ways: closures, and the immutability of RDDs. Logically, each RDD operator generates a new RDD with no side effects, so the operators are deterministic. And since all operators are idempotent, when an error occurs it is enough to re-execute the operator sequence.

Spark's computational abstraction is data flow, specifically data flow with a working set. Stream processing is a data flow model, and so is MapReduce; the difference is that MapReduce needs to maintain a working set across multiple iterations. The working set abstraction is common: iterative machine learning, interactive data mining, and graph computation all use it. To guarantee fault tolerance, MapReduce hosts the working set in stable storage (such as HDFS), at the cost of speed. HaLoop uses a loop-aware scheduler to ensure that the reduce output of the previous iteration and the map input of the current iteration sit on the same physical machine, which reduces network overhead but cannot avoid the disk I/O bottleneck.

Spark's breakthrough is to host the working set in memory while still guaranteeing fault tolerance. Memory access is orders of magnitude faster than disk access, which can greatly improve performance. The key is achieving fault tolerance, for which there are traditionally two approaches: logging and checkpointing. Because checkpointing incurs data redundancy and network communication overhead, Spark chooses to log data updates. Fine-grained log updates are not cheap and, as noted above, Spark is not good at them. Instead, Spark logs coarse-grained RDD updates, so the overhead is negligible. Given Spark's functional semantics and idempotence, recovering from a fault by replaying the update log has no side effects.
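The idea of recovering by replaying a coarse-grained update log can be sketched in a few lines of plain Python. This is a toy model and not Spark code: the names `update_log` and `replay` are hypothetical, a "partition" is just a list, and each log entry stands for one whole-dataset operator rather than a per-element update.

```python
# Toy model, not Spark code: a "partition" is a plain list, and the update
# log holds one entry per whole-dataset operator, never one per element.
source = ["INFO start", "ERROR disk full", "WARN slow", "ERROR timeout"]

# Coarse-grained log: each entry is (operator name, pure function on a partition).
update_log = [
    ("filter ERROR", lambda part: [line for line in part if "ERROR" in line]),
    ("map upper", lambda part: [line.upper() for line in part]),
]


def replay(partition, log):
    """Rebuild a derived partition by re-running every logged operator in order."""
    for _name, op in log:
        partition = op(partition)  # each operator returns a new list, no mutation
    return partition


in_memory = replay(source, update_log)   # normal computation
in_memory = None                         # simulate losing the node that held it
recovered = replay(source, update_log)   # recovery is simply a second replay

print(recovered)  # ['ERROR DISK FULL', 'ERROR TIMEOUT']
```

Because every logged operator is a pure function of its input, replaying the log any number of times yields the same result, and the log itself stays tiny however large the data is.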

Programming model

Consider a short piece of code: the textFile operator reads a log file from HDFS and returns "file" (an RDD); the filter operator selects the lines containing "ERROR" and assigns them to "errors" (a new RDD); the cache operator caches it for future use; and the count operator returns the number of lines in "errors". An RDD does not look much different from a Scala collection type, but their data models and execution models are quite different.
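To make the four steps concrete, here is a minimal sketch in plain Python. It is not the article's original listing and not the Spark API: `ToyRDD` and its methods are hypothetical stand-ins, and an in-memory list takes the place of the HDFS log file.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: wraps a function that yields the elements."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function returning an iterable
        self._cached = None          # filled in by the first action after cache()
        self._want_cache = False

    @staticmethod
    def text_file(lines):
        # Input operator: the real textFile reads from HDFS; here a list stands in.
        return ToyRDD(lambda: iter(lines))

    def filter(self, predicate):
        # Transformation operator: returns a new ToyRDD and leaves self untouched.
        return ToyRDD(lambda: (x for x in self._elements() if predicate(x)))

    def cache(self):
        # Cache operator: ask for the elements to be kept once they are computed.
        self._want_cache = True
        return self

    def count(self):
        # Action operator: returns a plain Python int.
        return sum(1 for _ in self._elements())

    def _elements(self):
        if self._cached is not None:
            return iter(self._cached)
        data = list(self._compute())
        if self._want_cache:
            self._cached = data
        return iter(data)


log_lines = ["INFO boot", "ERROR disk full", "INFO ok", "ERROR timeout"]
file = ToyRDD.text_file(log_lines)
errors = file.filter(lambda line: "ERROR" in line)
errors.cache()
print(errors.count())  # 2
```

Note that `filter` never modifies `file`; it hands back a new object, which mirrors the functional, immutable style described above.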

Figure 1 shows the RDD data model and maps the four operators used in the previous example to four operator types. A Spark program works in two spaces: Spark's RDD space and Scala's native data space. In the native data space, data takes the form of scalars (Scala basic types, shown as small orange squares), collection types (blue dashed boxes), and persistent storage (red cylinders).

Figure 1 Switching between the two spaces and the four kinds of RDD operators

The input operator (orange arrow) draws data from a Scala collection or from storage into RDD space, turning it into an RDD (solid blue box). There are roughly two kinds of input operators: those that take Scala collection types, such as parallelize, and those that take stored data, such as textFile in the previous example. The output of an input operator is an RDD in Spark space.

Because of functional semantics, an RDD passes through a transformation operator (blue arrows) to produce a new RDD. Both the input and the output of a transformation operator are RDDs. An RDD is divided into many partitions that are distributed across multiple nodes in the cluster; Figure 1 shows partitions as small blue squares. Note that a partition is a logical concept: the old and new partitions before and after a transformation may physically be the same block of memory or storage. This is an important optimization that prevents the unbounded growth in memory demand that functional immutability would otherwise cause. Some RDDs are intermediate results of the computation, and their partitions do not necessarily have memory or storage backing them. If necessary (for example, for future reuse), you can call the cache operator (the cache operator in the example, gray arrow) to materialize the partitions and keep them (gray boxes).

Some transformation operators treat the elements of an RDD as simple elements. They fall into the following categories:

Operators whose input and output are one-to-one (element-wise) and whose result RDD keeps the same partition structure, mainly map and flatMap (map followed by flattening into a one-dimensional RDD).

Operators whose input and output are one-to-one but whose result RDD has a different partition structure, such as union (merging two RDDs) and coalesce (reducing the number of partitions).

Operators that select a subset of the input elements, such as filter, distinct (remove duplicate elements), subtract (keep the elements that are in this RDD but not in the other RDD), and sample.

Another group of transformation operators works on key-value collections. They are divided into:

Element-wise operations on a single RDD, such as mapValues (which keeps the partitioning of the source RDD, unlike map).

Rearrangement of a single RDD, such as sort and partitionBy (which produces a consistent partitioning; this is important for data locality optimization, as discussed later).

Reorganization and reduction of a single RDD by key, such as groupByKey and reduceByKey.

Joining and regrouping of two RDDs by key, such as join and cogroup.

The latter three kinds of operations all involve rearranging data and are called shuffle operations.

The whole sequence of transformation operators from RDD to RDD takes place in RDD space. The important design here is lazy evaluation: no computation actually happens; Spark just keeps recording metadata. The metadata is structured as a DAG (directed acyclic graph), in which each vertex is an RDD (including the operator that produces it) and each edge runs from a parent RDD to a child RDD, representing the dependency between them. Spark gave this metadata DAG a cool name: lineage. This lineage is also the update log described earlier in the fault-tolerance design.

The lineage keeps growing until an action operator (the green arrow in Figure 1) is evaluated, at which point all the accumulated operators are executed at once. The input of an action operator is an RDD (together with all the RDDs it depends on through its lineage), and the output is native data produced by the execution: a Scala scalar, a collection, or storage. When the output of an operator is one of these types, the operator must be an action operator, whose effect is to return from RDD space to native data space.
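The interplay of lineage and lazy evaluation can be illustrated with a small plain-Python sketch. It is a toy model under stated assumptions, not Spark's implementation: `Node`, `transform`, `lineage`, and `collect` are hypothetical names, the graph is a simple linear chain (a degenerate DAG), and the global `calls` list exists only to show when operators actually run.

```python
from dataclasses import dataclass
from typing import Callable, Optional

calls = []  # records which operators actually ran, to make laziness visible


@dataclass(frozen=True)
class Node:
    """One vertex of a toy lineage graph: an operator plus its parent vertex."""
    name: str
    fn: Callable[[list], list]
    parent: Optional["Node"] = None

    def transform(self, name, fn):
        # Only metadata is recorded here; fn is not called.
        return Node(name, fn, parent=self)

    def lineage(self):
        # Walk the parent edges back to the source.
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))

    def collect(self):
        # Action: evaluate the whole recorded chain, parents first.
        data = [] if self.parent is None else self.parent.collect()
        calls.append(self.name)
        return self.fn(data)


source = Node("source", lambda _: [1, 2, 3, 4, 5, 6])
evens = source.transform("filter even", lambda xs: [x for x in xs if x % 2 == 0])
squares = evens.transform("map square", lambda xs: [x * x for x in xs])

print(squares.lineage())  # ['source', 'filter even', 'map square']
print(calls)              # [] because nothing has run yet
print(squares.collect())  # [4, 16, 36]
print(calls)              # ['source', 'filter even', 'map square']
```

The two prints of `calls` bracket the action: the list is empty while only transformations have been recorded and is filled in one go when `collect` is called.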

Action operators fall into the following classes: those that produce a scalar, such as count (returns the number of elements in the RDD), reduce, and fold/aggregate (see the documentation of the Scala operators with the same names); those that return several scalars, such as take (returns the first few elements); those that produce a Scala collection, such as collect (pours all the elements of the RDD into a Scala collection) and lookup (finds all the values for a key); and those that write to storage, such as saveAsTextFile, the counterpart of textFile above. There is also the checkpoint operator. When the lineage is very long (which happens often in graph computation), re-running the whole sequence after a failure takes a long time, so you can proactively call checkpoint to write the current data to stable storage as a checkpoint.

Two design points deserve attention here. The first is lazy evaluation. Anyone familiar with compilers knows that the wider the scope a compiler can see, the more opportunities it has for optimization. Spark is not compiled, but its scheduler does perform linear-complexity optimizations on the DAG. In particular, when several computational paradigms are mixed in Spark, the scheduler can break through the boundaries between code from different paradigms and schedule and optimize globally. For example, when Shark's SQL code is mixed with Spark's machine learning code, each part is translated into the underlying RDDs and merged into one large DAG, which opens up more opportunities for global optimization.

The other important point is that once an action operator produces native data, the computation must leave RDD space. Spark can currently track only computations on RDDs; computations on native data are invisible to it (unless Spark later provides overloads, wrappers, or implicit conversions for operations on native data types). This partially invisible code may introduce dependencies between earlier and later RDDs. For example, suppose a program calls errors.count(), stores the result in cnt, and then uses cnt-1 inside a later filter. The filter's dependence on errors.count() arises through the native data operation (cnt-1), but the scheduler cannot see that operation, and this can cause problems.

Because Spark does not provide control flow, the program must also fall back to Scala space whenever the computation logic needs conditional branching. Since the Scala language has strong support for custom control flow, it cannot be ruled out that Spark will support it in the future.

Spark also has two very useful features. One is the broadcast variable. Some data, such as a lookup table, may be reused across multiple jobs; such data is much smaller than an RDD and should not be partitioned across nodes the way an RDD is. The solution is to provide a new language construct, the broadcast variable, to mark such data. At runtime, Spark sends the contents of a broadcast variable to each node once and keeps them there, so they do not need to be sent again when reused. Compared with Hadoop's distributed cache, broadcast content can be shared across jobs. Spark committer Mosharaf, a student of the P2P veteran Ion Stoica, used a simplified implementation of BitTorrent (yes, the BT used for downloading movies). Interested readers can refer to the SIGCOMM '11 paper Orchestra. The other feature is the accumulator (derived from MapReduce's counters): it lets Spark code add to certain global variables for bookkeeping, such as recording current runtime metrics.

Running and scheduling

Figure 2 shows how a Spark program runs. It is started by the client and proceeds in two phases: in the first phase, the sequence of transformation operators is recorded and the DAG is built incrementally; the second phase is triggered by an action operator, when the DAGScheduler converts the DAG into a job and its task sets. Spark can run locally on a single node (useful for development and debugging) or on a cluster. In the latter case, the client runs on the master node and, through the cluster manager, sends the partitioned task sets to the cluster's worker/slave nodes for execution.

Figure 2 Spark program running process

Spark has traditionally been closely tied to Mesos, and it can also support Amazon EC2 and YARN. The base class of the underlying task scheduler is a trait, and different implementations of it can be mixed in for actual execution. For example, there are two scheduler implementations on Mesos: one gives all the resources of each node to Spark, and the other lets Spark jobs be scheduled alongside other jobs so that they share cluster resources. On a worker node, task threads actually run the tasks generated by the DAGScheduler, while the block manager communicates with the block manager master on the master node (making excellent use of Scala's actor model) to supply data blocks to the task threads.

The most interesting part is the DAGScheduler, whose working process is explained in detail below. A very important field in the RDD data structure is the dependency on parent RDDs. As shown in Figure 3, there are two types of dependencies: narrow dependencies and wide dependencies.

Figure 3 Narrow dependencies and wide dependencies

A narrow dependency means that each partition of the parent RDD is used by at most one partition of the child RDD. It appears in two forms: one partition of a single parent RDD corresponds to one partition of the child RDD, or partitions of two parent RDDs correspond to one partition of the child RDD. In Figure 3, map/filter and union belong to the first form, and a join over co-partitioned inputs belongs to the second.

A wide dependency means that a partition of the child RDD depends on all the partitions of the parent RDD. This is caused by shuffle operations, such as groupByKey in Figure 3 and a join over inputs that are not co-partitioned.

Narrow dependencies are good for optimization. Logically, each RDD operator is a fork/join (this join is not the join operator above, but a barrier that synchronizes multiple parallel tasks): the computation is forked to each partition, joined when it finishes, and then the next RDD operator forks and joins again. Translating this directly into a physical implementation would be very uneconomical. First, every RDD (even an intermediate result) would have to be materialized in memory or storage, which costs time and space. Second, a join is a global barrier and is very expensive, because it is held up by the slowest node. If the partitions of the child RDD depend narrowly on the partitions of the parent RDD, the classic fusion optimization can be applied and two fork/joins can be merged into one. If a run of consecutive transformation operators are all narrow dependencies, many fork/joins can be merged into one, which not only removes a large number of global barriers but also avoids materializing many intermediate RDDs, greatly improving performance. Spark calls this pipelining.
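The effect of fusing narrow operators can be sketched with Python generators. This is an illustrative toy and not Spark's scheduler: `run_unfused`, `run_pipelined`, and the three element-wise operators are hypothetical names, and a single list stands for one partition.

```python
def run_unfused(partition, operators):
    """Materialize a full intermediate list after every operator."""
    lists_built = 0
    for op in operators:
        partition = list(op(partition))   # one complete pass and one new list per operator
        lists_built += 1
    return partition, lists_built


def run_pipelined(partition, operators):
    """Chain the operators as generators so each element flows through all of them."""
    stream = iter(partition)
    for op in operators:
        stream = op(stream)               # lazily wraps the previous operator, nothing runs yet
    return list(stream), 1                # only the final result is materialized


# Narrow (element-wise) operators, written to accept any iterable.
def parse(lines):
    return (int(x) for x in lines)


def keep_even(nums):
    return (n for n in nums if n % 2 == 0)


def double(nums):
    return (n * 2 for n in nums)


ops = [parse, keep_even, double]
partition = ["1", "2", "3", "4"]
print(run_unfused(partition, ops))    # ([4, 8], 3)
print(run_pipelined(partition, ops))  # ([4, 8], 1)
```

Both versions compute the same result, but the pipelined one builds a single list instead of three, and because each partition is processed independently no barrier is needed between the operators.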

When the sequence of transformation operators meets a shuffle operation, a wide dependency occurs and pipelining stops. In the concrete implementation, the DAGScheduler backtracks through the dependency graph starting from the current operator; as soon as it encounters a wide dependency, it generates a stage to hold the operator sequence traversed so far. Within this stage, pipelining can be applied safely. It then continues backtracking from that wide dependency to generate the next stage.
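The backtracking described above can be sketched for the simplest case, a linear chain of operators. This is a hypothetical illustration, not the DAGScheduler's code: `split_into_stages` is an invented name, the operator names are only labels, and each operator is tagged by hand with the kind of dependency it has on its parent.

```python
def split_into_stages(chain):
    """chain: list of (operator, dependency_on_parent) in execution order.

    Walks backward from the last operator; whenever the edge to the parent is
    wide, the operators collected so far are closed off as one stage.
    """
    stages, current = [], []
    for op, dep in reversed(chain):
        current.append(op)
        if dep == "wide":
            stages.append(list(reversed(current)))
            current = []
    if current:                       # operators left over before the first shuffle
        stages.append(list(reversed(current)))
    return list(reversed(stages))     # report the stages in execution order


chain = [
    ("textFile", None),               # source, no parent
    ("map", "narrow"),
    ("reduceByKey", "wide"),          # shuffle: starts a new stage
    ("filter", "narrow"),
    ("sortByKey", "wide"),            # shuffle: starts a new stage
    ("map", "narrow"),
]
for stage in split_into_stages(chain):
    print(stage)
# ['textFile', 'map']
# ['reduceByKey', 'filter']
# ['sortByKey', 'map']
```

Every stage contains only narrow edges internally, so the operators inside it can be pipelined; the stage boundaries fall exactly on the wide dependencies.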

Two questions deserve a closer look: first, how partitions are divided; second, on which nodes of the cluster the partitions should be placed. These correspond exactly to two other fields in the RDD structure: the partitioner and the preferred locations.

Partitioning is critical for shuffle operations, because it determines the type of dependency between the parent RDDs and the child RDD of the operation. As mentioned above, for the same join operator, if the inputs are co-partitioned, a consistent partitioning arrangement exists between the two parent RDDs and between the parent RDDs and the child RDD; that is, the same key is guaranteed to map to the same partition, and the result is a narrow dependency. Conversely, without co-partitioning, the result is a wide dependency.

Co-partitioning means specifying the partitioner so that consistent partitioning arrangements are produced. Pregel and HaLoop build this into the system, whereas Spark provides two partitioners by default, HashPartitioner and RangePartitioner, and lets the program specify one through the partitionBy operator. Note that for HashPartitioner to work, the key's hashCode must be valid, that is, keys with the same content must produce the same hashCode. This holds for String but not for arrays (because an array's hashCode is derived from its identity, not its contents). In that case, Spark allows the user to define a custom ArrayHashPartitioner.
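Why co-partitioning turns a join into a narrow dependency can be shown with a small plain-Python sketch. It is a toy under stated assumptions and not Spark's HashPartitioner: the function names are hypothetical, lists of (key, value) pairs stand for RDDs, and CRC32 over the key's text is used as the content-based hash.

```python
import zlib


def hash_partition(key, num_partitions):
    """Toy hash partitioner: a content-based hash, so equal keys always agree."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def partition_by(pairs, num_partitions):
    """Distribute (key, value) pairs into partitions by the hash of the key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash_partition(key, num_partitions)].append((key, value))
    return parts


def co_partitioned_join(left_parts, right_parts):
    """Join partition i with partition i only: a narrow dependency, no shuffle."""
    joined = []
    for left, right in zip(left_parts, right_parts):
        lookup = {}
        for k, v in right:
            lookup.setdefault(k, []).append(v)
        for k, v in left:
            for w in lookup.get(k, []):
                joined.append((k, (v, w)))
    return sorted(joined)


users = [("alice", 1), ("bob", 2), ("carol", 3)]
visits = [("bob", "home"), ("alice", "cart"), ("bob", "pay")]
n = 4
print(co_partitioned_join(partition_by(users, n), partition_by(visits, n)))
# [('alice', (1, 'cart')), ('bob', (2, 'home')), ('bob', (2, 'pay'))]
```

The join never looks outside partition i of either input, which is only correct because both inputs were partitioned with the same function and the same number of partitions. A hash based on object identity rather than content would break that guarantee, which is the array problem described above.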

The second question is which node a partition is placed on. This is about data locality: good locality means less network traffic. Some RDDs have preferred locations when they are created; for example, the preferred location of a HadoopRDD partition is the node where the HDFS block resides. Some RDDs or partitions are cached, and the computation should then be sent to the node that holds the cached partition. Otherwise, Spark backtracks along the RDD's lineage until it finds a parent RDD that has the preferred-location attribute, and decides the placement of the child RDD accordingly.

The concept of wide and narrow dependencies is useful not only for scheduling but also for fault tolerance. If a node fails and the operation is a narrow dependency, only the lost parent RDD partitions need to be recomputed, with no dependence on other nodes. A wide dependency requires all the partitions of the parent RDD to be present, so recomputation is expensive. Therefore, when deciding whether to use the checkpoint operator, consider not only whether the lineage is long enough but also whether there is a wide dependency; adding a checkpoint at a wide dependency gives the best value for money.
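The difference in recovery cost can be made concrete with a tiny sketch. It is a hypothetical model, not Spark code: a dependency is represented as a function from a child partition index to the list of parent partition indices it reads, and the narrow case is simplified to a one-to-one mapping.

```python
def narrow_parents(child, num_parents):
    # Narrow (simplified to one-to-one): child partition i reads only parent partition i.
    return [child]


def wide_parents(child, num_parents):
    # Wide: every child partition reads every parent partition (shuffle).
    return list(range(num_parents))


def partitions_to_recompute(lost_children, parents_of, num_parents):
    """Parent partitions that must be available to rebuild the lost child partitions."""
    needed = set()
    for child in lost_children:
        needed.update(parents_of(child, num_parents))
    return sorted(needed)


lost = [2]  # the failed node held child partition 2
print(partitions_to_recompute(lost, narrow_parents, 8))  # [2]
print(partitions_to_recompute(lost, wide_parents, 8))    # [0, 1, 2, 3, 4, 5, 6, 7]
```

With eight parent partitions, losing one child partition costs one parent partition under a narrow dependency and all eight under a wide one, which is why a checkpoint placed at a wide dependency pays off most.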


Because of space limits, this article can only introduce the basic concepts and design ideas of Spark. The content comes mainly from several Spark papers (chiefly the NSDI '12 paper "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing"), together with what my colleagues and I have learned from studying Spark and from years of research on parallel and distributed systems. Spark core member and Shark creator Reynold Xin reviewed and revised this article, for which many thanks!
