[Reprinted] Spark: Big Data in "Milliseconds"

Source: Internet
Author: User
Tags: scalar, shuffle, hadoop, mapreduce, spark, rdd

Reprinted from http://www.csdn.net/article/2013-07-08/2816149

Spark has formally applied to join the Apache Incubator, growing from a laboratory "spark" into a rising star among big data platforms. This article focuses on Spark's design ideas. True to its name, Spark shows an uncommon "millisecond" side of big data. Its characteristics can be summarized in four words: light, fast, flexible, and clever.

    • Light: The Spark 0.6 core is about 20,000 lines of code, versus roughly 90,000 lines for Hadoop 1.0 and 220,000 for Hadoop 2.0. This is thanks in part to the conciseness and expressiveness of the Scala language, and in part to Spark's good use of the infrastructure of Hadoop and Mesos (another Berkeley project, dedicated to dynamic cluster resource management). Although it is very light, Spark does not compromise on fault-tolerant design. As creator Matei puts it, "don't treat failures as exceptions": fault tolerance is part of the infrastructure.
    • Fast: Spark can achieve sub-second latency on small datasets, which is unthinkable for Hadoop MapReduce (its "heartbeat" interval mechanism alone delays task startup by several seconds). For large datasets, Spark is roughly ten times faster than MapReduce-, Hive-, and Pregel-based implementations on typical iterative machine learning, ad hoc queries, and graph computations. Credit goes to memory computing, optimizations for data locality and transport, and scheduling optimizations, and also to the "light" design philosophy adopted from the very beginning.
    • Flexible: Spark offers flexibility at different levels. At the implementation level, it makes good use of Scala traits for dynamic mixin (for example, a replaceable cluster scheduler or serialization library). At the primitive level, it allows new data operators, new data sources (such as DynamoDB support in addition to HDFS), and new language bindings (Java and Python) to be added. At the paradigm level, Spark supports memory computing, multi-iteration batch processing, ad hoc queries, stream processing, and graph computing.
    • Clever: Spark is clever about leveraging existing work. It integrates seamlessly with Hadoop; Shark (the data warehouse implementation on Spark) borrows Hive's momentum; the graph computing API borrows from Pregel and PowerGraph, along with PowerGraph's vertex-cut idea. Everything is built on Scala, widely hailed as a future replacement for Java: the look and feel of Spark programming, in both syntax and API, is plain Scala. Spark is also clever in its implementation. To support interactive programming, it only needed small changes to the Scala shell (by contrast, Microsoft's JavaScript console for interactive MapReduce programming had to bridge the conceptual gap between Java and JavaScript and go much further in implementation).

For all this praise, it must be said that Spark is not perfect. It has inherent limitations: it does not support fine-grained, asynchronous data updates well. And despite its great genes, it still has plenty of room to grow in performance, stability, and paradigm extensibility.

Computational Paradigms and Abstractions

Spark is, first of all, a coarse-grained, data-parallel computational paradigm.

Data parallelism differs from task parallelism in two respects.

    • The unit of computation is a collection of data rather than an individual item. The size of the collection depends on the implementation: SIMD (single instruction, multiple data) vector instructions typically cover 4 to 64 elements, SIMT (single instruction, multiple threads) on GPUs typically 32, and SPMD (single program, multiple data) can be wider still. Spark deals with big data, so it uses a very coarse-grained collection called a Resilient Distributed Dataset (RDD).
    • All data in the collection passes through the same operator sequence. Data parallelism is well behaved: it is easy to obtain a high degree of parallelism (tied to data size rather than to the parallelism of the program logic), and it maps easily and efficiently onto parallel or distributed hardware. Traditional array/vector programming languages, SSE/AVX intrinsics, CUDA/OpenCL, and Ct (C++ for throughput) all belong to this class. The difference is that Spark targets an entire cluster rather than a single node or parallel processor.

The data-parallel paradigm means that Spark cannot perfectly support fine-grained, asynchronously updated operations. Graph computations have such operations, so here Spark is not as good as GraphLab (a large-scale graph computing framework); nor does it match RAMCloud (Stanford's memory storage and computing research project) or Percolator (Google's incremental computing system) for applications that need fine-grained log updates and data checkpoints. This, in turn, lets Spark concentrate on the areas where it excels, rather than trying to do everything, which is where Dryad (Microsoft's earlier big data platform) was rather unsuccessful.

Spark's RDDs adopt the programming style of Scala collection types, along with functional semantics: closures, and immutability of RDDs. Logically, each RDD operator generates a new RDD with no side effects, so operators are deterministic; and because all operators are idempotent, the operator sequence can simply be re-executed when a failure occurs.

Spark's computational abstraction is data flow, but data flow with a working set. Stream processing is a pure data-flow model; so is MapReduce, the difference being that MapReduce must maintain the working set across multiple iterations. The working-set abstraction is quite common: multi-iteration machine learning, interactive data mining, and graph computing all need it. To guarantee fault tolerance, MapReduce hosts the working set on stable storage (such as HDFS), at the cost of speed. HaLoop uses a loop-aware scheduler to ensure that the reduce output of the previous iteration and the map input of the current iteration land on the same physical machine, which reduces network overhead but cannot avoid the disk I/O bottleneck.

Spark's breakthrough is to host the working set in memory while still guaranteeing fault tolerance. Memory access is several orders of magnitude faster than disk, which can greatly improve performance. The key is fault tolerance, for which there are traditionally two approaches: logging and checkpointing. Spark logs data updates, because checkpointing carries the overhead of redundant data and network traffic. Fine-grained log updates are not cheap, and, as noted above, Spark is not good at them; instead, Spark logs coarse-grained RDD updates, so the overhead is negligible. Given Spark's functional semantics and idempotent operators, fault tolerance by replaying the update log has no side effects.

Programming Model

Consider the following code: the textFile operator reads a log file from HDFS and returns "file" (an RDD); the filter operator keeps the lines containing "ERROR" and assigns them to "errors" (a new RDD); the cache operator caches them for future use; and the count operator returns the number of lines in "errors". RDDs do not look much different from Scala collection types, but their data and execution models are quite different.
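A minimal, self-contained sketch of that example, assuming the standard RDD API (the HDFS path and object name are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object LogMining {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("log-mining").setMaster("local[*]")
        val sc   = new SparkContext(conf)

        val file   = sc.textFile("hdfs://namenode:9000/logs/app.log") // input operator: storage -> RDD
        val errors = file.filter(line => line.contains("ERROR"))      // transformation: a new RDD
        errors.cache()                                                // keep the filtered partitions around
        val n = errors.count()                                        // action: RDD -> native Scala value

        println(s"ERROR lines: $n")
        sc.stop()
      }
    }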

Figure 1 shows the RDD data model and maps the four operators used in the example above onto four operator types. A Spark program works in two spaces: the Spark RDD space and the Scala native data space. In the native data space, data is represented as scalars (Scala primitive types, shown as small orange squares), collection types (blue dashed boxes), and persistent storage (a red cylinder).

Figure 1: Switching between the two spaces; the four types of RDD operators

Input operators (orange arrows) suck data from Scala collection types or from storage into the RDD space, producing RDDs (solid blue boxes). There are roughly two kinds of input operators: one takes Scala collection types as input, such as parallelize; the other takes stored data, such as the textFile in the example above. The output of an input operator is an RDD in the Spark space.

Because of the functional semantics, an RDD passed through a transformation operator (blue arrow) generates a new RDD; both the input and the output of a transformation operator are RDDs. An RDD is divided into many partitions distributed across the nodes of the cluster; Figure 1 represents a partition as a small blue square. Note that partitions are a logical concept: the old and new partitions before and after a transformation may physically be the same piece of memory or storage. This is an important optimization that prevents the unbounded growth of memory demand that functional immutability would otherwise cause. Some RDDs are intermediate results of the computation, and their partitions do not necessarily have memory or storage materialized for them; when that is needed (for future use), the cache operator (the cache in the example, indicated by the gray arrow) can be called to materialize the partitions (gray squares).

Some transformation operators treat the elements of the RDD as simple elements and fall into the following categories (a code sketch follows this list):

    • Operators whose input and output are one-to-one (element-wise) and which leave the partition structure of the resulting RDD unchanged, mainly map and flatMap (map, then flatten into a one-dimensional RDD);
    • Operators whose input and output are one-to-one but which change the partition structure of the resulting RDD, such as union (concatenating two RDDs) and coalesce (reducing the number of partitions);
    • Operators that select a subset of the input elements, such as filter, distinct (remove duplicate elements), subtract (elements present in this RDD but not in the other), and sample.
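A sketch of these operators, assuming an existing SparkContext named sc:

    val a = sc.parallelize(Seq(1, 2, 2, 3, 4, 5))
    val b = sc.parallelize(Seq(4, 5, 6))

    val mapped  = a.map(_ * 10)                      // one-to-one, partition structure unchanged
    val flat    = a.flatMap(x => Seq(x, -x))         // map, then flatten into a one-dimensional RDD
    val merged  = a.union(b)                         // partition structure changes: partitions are concatenated
    val fewer   = merged.coalesce(2)                 // fewer partitions
    val odd     = a.filter(_ % 2 == 1)               // select a subset of elements
    val uniq    = a.distinct()                       // drop duplicates
    val diff    = a.subtract(b)                      // elements in a but not in b
    val sampled = a.sample(withReplacement = false, fraction = 0.5)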

Other transformation operators work on collections of key-value pairs; they in turn fall into the following categories:

    • Element-wise operations on a single RDD, such as mapValues (which, unlike map, keeps the source RDD's partitioning);
    • Rearrangements of a single RDD, such as sort and partitionBy (which establishes a consistent partitioning, important for the data-locality optimization discussed later);
    • Reorganization and reduction of a single RDD by key, such as groupByKey and reduceByKey;
    • Join and reorganization of two RDDs by key, such as join and cogroup.

The latter three categories involve rearrangement and are known as shuffle operations; a code sketch of the key-value operators follows.
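A sketch of the key-value operators, again assuming an existing SparkContext sc (in older Spark versions, import org.apache.spark.SparkContext._ is needed to bring the pair-RDD functions into scope):

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val other = sc.parallelize(Seq(("a", "x"), ("c", "y")))

    val scaled    = pairs.mapValues(_ * 10)                    // element-wise on values; keeps the partitioning
    val sorted    = pairs.sortByKey()                          // rearrangement of a single RDD
    val hashed    = pairs.partitionBy(new HashPartitioner(4))  // impose a consistent partitioning
    val grouped   = pairs.groupByKey()                         // reorganize by key
    val summed    = pairs.reduceByKey(_ + _)                   // reorganize and reduce by key: ("a",4), ("b",2)
    val joined    = pairs.join(other)                          // ("a",(1,"x")), ("a",(3,"x"))
    val cogrouped = pairs.cogroup(other)                       // all values per key from both RDDs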

The sequence of transformations from RDD to RDD takes place entirely within the RDD space. The important design decision here is lazy evaluation: the computation does not actually happen; it is only recorded, incrementally, in metadata. The metadata is structured as a DAG (directed acyclic graph) in which each "vertex" is an RDD (together with the operator that produces it), and each "edge" from a parent RDD to a child RDD represents a dependency between them. Spark gives this metadata DAG a cool name: lineage. The lineage is also the update log used in the fault-tolerance design described earlier.
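A small illustration of lazy evaluation, assuming an existing SparkContext sc (the path is a placeholder); toDebugString prints the lineage recorded so far:

    val logs   = sc.textFile("hdfs://namenode:9000/logs/app.log")
    val errors = logs.filter(_.contains("ERROR"))
    val words  = errors.flatMap(_.split("\\s+"))

    // Nothing has run yet; only the lineage (the DAG of dependencies) has been recorded.
    println(words.toDebugString)

    // The action finally evaluates the accumulated operator sequence.
    println(words.count())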

The lineage keeps growing until an action operator (green arrow in Figure 1) is reached; it is then evaluated, and all the accumulated operators are executed at once. The input of an action operator is an RDD (plus all the RDDs it depends on along the lineage); its output is native data produced by the execution, which may be a Scala scalar, a collection type, or storage. Whenever an operator's output is one of these types, the operator must be an action, and its effect is to return from the RDD space back to the native data space.

Action operators come in several kinds: those producing a scalar, such as count (the number of elements in the RDD), reduce, and fold/aggregate (see the Scala operators of the same name); those returning a few elements, such as take (the first few elements); those producing Scala collection types, such as collect (pour all elements of the RDD into a Scala collection) and lookup (all values for a given key); and those writing to storage, such as saveAsTextFile, the counterpart of the textFile seen earlier. There is also the checkpoint operator: when the lineage grows very long (which happens frequently in graph computation), re-executing the whole sequence after a failure would take a long time, so checkpoint can be called proactively to write the current data to stable storage as a checkpoint.
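A sketch of the action operators, assuming an existing SparkContext sc (the output path is a placeholder):

    val nums = sc.parallelize(1 to 10)

    val total   = nums.reduce(_ + _)              // scalar: 55
    val folded  = nums.fold(0)(_ + _)             // scalar: 55
    val howMany = nums.count()                    // scalar: 10
    val first3  = nums.take(3)                    // a few elements: Array(1, 2, 3)
    val all     = nums.collect()                  // pour the whole RDD into a local Scala collection

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
    val forA  = pairs.lookup("a")                 // all values for key "a": Seq(1, 2)

    nums.saveAsTextFile("hdfs://namenode:9000/out/nums")  // write to storage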

Two design points deserve emphasis. The first is lazy evaluation. Anyone familiar with compilers knows that the larger the scope the compiler can see, the more opportunities it has to optimize. Spark does not compile anything, but its scheduler does optimize the DAG, in linear time. In particular, when multiple computational paradigms are mixed on Spark, the scheduler can break the boundaries between the different paradigms' code and perform global scheduling and optimization. For example, Shark's SQL code can be mixed with Spark's machine learning code; after each part is translated into the underlying RDDs, they fuse into one large DAG, which exposes more opportunities for global optimization.

The other important point is that once an action operator produces native data, control must leave the RDD space. Spark can only track computations on RDDs; computation on native data is invisible to it (unless Spark later provides overloads, wrappers, or implicit conversions for operations on native data types). This partly invisible code may introduce dependencies between RDDs, as in the following kind of code:
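A minimal sketch of the pattern (sc is an existing SparkContext; the filter predicate on the third line is only illustrative):

    val errors = sc.textFile("hdfs://namenode:9000/logs/app.log").filter(_.contains("ERROR"))
    val cnt    = errors.count()                     // action: leaves the RDD space, yields a native Long
    val subset = errors.filter(_.length > cnt - 1)  // depends on errors.count() through native data (cnt - 1)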

The filter on the third line depends on errors.count(), but the dependency is introduced through a native data operation (cnt - 1), which the scheduler cannot see; this causes problems.

Because Spark does not provide control flow, it must also drop back into Scala space when the computation logic needs conditional branching. Since the Scala language has strong support for custom control flow, future Spark support is not ruled out.

Spark also has two very useful features. One is broadcast variables. Some data, such as a lookup table, may be reused across multiple jobs; it is much smaller than an RDD and should not be partitioned across nodes the way an RDD is. The solution is a new language construct, the broadcast variable, to wrap such data. The Spark runtime sends the contents of a broadcast variable to each node once and saves it there, so it never needs to be sent again. Compared with Hadoop's distributed cache, broadcast content can be shared across jobs. Spark committer Mosharaf, a student of P2P veteran Ion Stoica, built a simplified BitTorrent (yes, the BT used to download movies) for this; interested readers can refer to the SIGCOMM '11 paper Orchestra. The other feature is the accumulator (borrowed from MapReduce's counters): it lets Spark code keep global variables for bookkeeping, for example to record current running metrics.
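A sketch of both features, using the broadcast and accumulator API of recent Spark releases (the 2013-era API differed slightly); sc is an existing SparkContext:

    val lookup = Map("a" -> 1, "b" -> 2)                 // a small read-only table reused by many tasks
    val bc     = sc.broadcast(lookup)                    // shipped to each node once and cached there

    val misses = sc.longAccumulator("lookup-misses")     // global bookkeeping, updated from within tasks

    val data = sc.parallelize(Seq("a", "b", "z"))
    val resolved = data.map { k =>
      val hit = bc.value.get(k)                          // read the broadcast copy on the worker
      if (hit.isEmpty) misses.add(1)
      hit.getOrElse(-1)
    }
    resolved.collect()                                   // run the job
    println(misses.value)                                // read the counter on the driver afterwards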

Running and Scheduling

Figure 2 shows how a Spark program runs. It is initiated by a client and proceeds in two phases: the first phase records the sequence of transformation operators and incrementally builds the DAG; the second is triggered by an action operator, at which point the DAGScheduler turns the DAG into jobs and their task sets. Spark can run on a single local node (useful for development and debugging) or on a cluster. In the latter case, the client runs on the master node and, through the Cluster Manager, sends the partitioned task sets to the worker (slave) nodes of the cluster.

Figure 2: Spark program execution process

Spark has traditionally been inseparable from Mesos, but it can also run on Amazon EC2 and YARN. The base class of the underlying task scheduler is a trait, and its different implementations can be mixed in to match the actual execution environment. For example, there are two scheduler implementations on Mesos: one grants all the resources of each node to Spark, and the other lets Spark jobs be scheduled alongside other jobs and share the cluster's resources. Task threads on the worker nodes actually run the tasks produced by the DAGScheduler, and the Block Manager on each worker communicates with the Block Manager on the master (a perfect use of Scala's actor model) to supply data blocks to the task threads.

The most interesting part is the DAGScheduler, whose workings deserve a closer look. An important field in the RDD data structure is its dependency on parent RDDs. As shown in Figure 3, there are two kinds of dependencies: narrow dependencies and wide dependencies.

Figure 3: Narrow dependencies and wide dependencies

A narrow dependency means that each partition of the parent RDD is used by at most one partition of the child RDD: either one partition of the parent corresponds to one partition of the child, or partitions of two parent RDDs correspond to one partition of the child. In Figure 3, map/filter and union belong to the first case, and a join over co-partitioned inputs belongs to the second.

With a wide dependency, a partition of the child RDD depends on all partitions of the parent RDD, because a shuffle operation is involved, such as the groupByKey in Figure 3 or a join whose inputs are not co-partitioned.

Narrow dependencies are good for optimization. Logically, every RDD operator is a fork/join (this join is not the join operator above, but the barrier that synchronizes multiple parallel tasks): the computation is forked to each partition and joined afterwards, and then forked and joined again for the next RDD operator. Translating this directly into a physical implementation would be very uneconomical: first, every RDD (even intermediate results) would have to be materialized to memory or storage, wasting time and space; second, each join is a global barrier, which is very expensive and gets dragged down by the slowest node. If the child RDD's partitions depend narrowly on the parent RDD's partitions, the classic fusion optimization can be applied and the two fork/joins merged into one; and if a whole sequence of consecutive transformations is narrowly dependent, many fork/joins can be merged into a single one. This not only reduces the number of global barriers but also avoids materializing many intermediate RDDs, which greatly improves performance. Spark calls this pipelining.
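A sketch of what gets pipelined, assuming an existing SparkContext sc (the path is a placeholder):

    val lines = sc.textFile("hdfs://namenode:9000/logs/app.log")

    // Three consecutive narrow transformations: fused into one stage, so each partition
    // is processed in a single pass with no intermediate RDD materialized.
    val firstWords = lines.map(_.toLowerCase)
                          .filter(_.contains("error"))
                          .map(_.split("\\s+").head)

    // reduceByKey needs a shuffle (wide dependency), so it ends the pipeline; a new stage starts here.
    val counts = firstWords.map(w => (w, 1)).reduceByKey(_ + _)
    counts.collect()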

When the transformation sequence hits a shuffle operation, a wide dependency arises and pipelining must stop. Concretely, the DAGScheduler backtracks from the current operator along the dependency graph, and when it meets a wide dependency it generates a stage to hold the operator sequence traversed so far. Within that stage, pipelining can be applied safely. It then continues backtracking from that wide dependency and generates the next stage.

Two further questions are worth digging into: first, how partitions are determined; second, on which cluster nodes the partitions should be placed. These correspond to two other fields in the RDD structure: the partitioner and the preferred locations.

The partitioner is critical for shuffle operations, because it determines the type of dependency between the parent RDDs and the child RDD of the operation. As mentioned above, the same join operator, if its inputs are co-partitioned, yields a consistent partition arrangement between the two parent RDDs and between parent and child, that is, the same key is guaranteed to map to the same partition, so a narrow dependency can be formed. Without co-partitioning, by contrast, it yields a wide dependency.

Co-partitioning simply means specifying the partitioner so as to produce a consistent partition arrangement. Pregel and HaLoop build this into the system; Spark provides two partitioners by default, HashPartitioner and RangePartitioner, and lets the program specify one through the partitionBy operator. Note that HashPartitioner only works if the keys' hashCode is sound, that is, keys with the same content produce the same hashCode. This holds for strings but not for arrays (an array's hashCode is derived from its identity rather than its content). In that case, Spark allows the user to supply a custom ArrayHashPartitioner.
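A sketch of co-partitioning via partitionBy, assuming an existing SparkContext sc: both parents are given the same HashPartitioner, so the subsequent join can be a narrow dependency.

    import org.apache.spark.HashPartitioner

    val p = new HashPartitioner(8)

    val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"))).partitionBy(p).cache()
    val orders = sc.parallelize(Seq((1, 9.99), (1, 5.00), (2, 3.50))).partitionBy(p).cache()

    // The same key hashes to the same partition in both parents, so join need not reshuffle them.
    val joined = users.join(orders)
    joined.collect()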

The second question concerns where partitions are placed, which is a matter of data locality: good locality means less network communication. Some RDDs come with preferred locations; for example, the preferred locations of a HadoopRDD's partitions are the nodes holding the corresponding HDFS blocks. Some RDDs or partitions are cached, and computation should be sent to the nodes where the cached partitions reside. Failing both, the scheduler walks back up the RDD's lineage until it finds a parent RDD with the preferred-location property, and places the child RDD's partitions accordingly.

The wide/narrow dependency concept is not only used in scheduling; it is also useful for fault tolerance. If a node fails and the operation is narrowly dependent, only the lost parent partitions need to be recomputed, with no dependence on other nodes. With a wide dependency, all partitions of the parent RDD are needed, which makes recovery expensive. So when calling the checkpoint operator, consider not only whether the lineage is long enough, but also whether there is a wide dependency; adding a checkpoint right after a wide dependency gives the best value for money.
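A sketch of checkpointing right after a wide dependency, assuming an existing SparkContext sc (paths are placeholders):

    sc.setCheckpointDir("hdfs://namenode:9000/checkpoints")   // stable storage for checkpoints

    val edges = sc.textFile("hdfs://namenode:9000/graph/edges")
                  .map(_.split("\\s+"))
                  .map(a => (a(0), a(1)))

    val byVertex = edges.groupByKey()   // wide dependency: a shuffle
    byVertex.cache()                    // keep it in memory for later iterations
    byVertex.checkpoint()               // also write it to stable storage on the next action
    byVertex.count()                    // triggers the job and the checkpoint; lineage is truncated here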

Conclusion

Due to space limitations, this article has only been able to introduce Spark's basic concepts and design ideas. The content draws on several Spark papers (chiefly the NSDI '12 paper "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing"), on the experience my colleagues and I have gained using Spark, and on years of research in parallel and distributed systems. A Spark core member and creator of Shark reviewed and revised this article; my sincere thanks to him!

Spark starts from a high point and aims high, but its journey has only just begun. Spark is committed to building an open ecosystem (http://spark-project.org/, https://wiki.apache.org/incubator/SparkProposal) and would love to have you on board!
