Spark has officially applied to join the Apache Incubator and has grown from a laboratory flash of inspiration into a rising star of the big data technology landscape. This article describes Spark's design philosophy. True to its name, Spark shows a kind of "lightning flash" rarely seen in big data. Its characteristics can be summarized in four words: light, fast, flexible, and clever.
Light: Spark 0.6 has about 20,000 lines of core code, compared with 90,000 lines for Hadoop 1.0 and 220,000 for Hadoop 2.0. This is thanks, on the one hand, to the conciseness and expressiveness of Scala, and on the other, to Spark making good use of the Hadoop and Mesos infrastructure (Mesos is another Berkeley project accepted into the Apache Incubator, focused on dynamic cluster resource management). Although very light, Spark does not compromise on fault tolerance. Matei Zaharia, Spark's creator, says: "Don't treat errors as special cases." In other words, fault tolerance is part of the infrastructure.
Fast: Spark can achieve sub-second latency on small data sets, which is unimaginable for Hadoop MapReduce (because of its "heartbeat" interval mechanism, merely starting a task incurs a delay of several seconds). For large data sets, on typical iterative machine learning, ad-hoc query, and graph computing applications, Spark is 10 to 100 times faster than the corresponding MapReduce, Hive, and Pregel implementations. In-memory computing, data locality and transmission optimization, and scheduling optimization all contribute, as does the lightweight philosophy adhered to from the very start of the design.
Flexible: Spark offers flexibility at different levels. At the implementation level, it makes full use of Scala's trait dynamic mixin strategy (for example, replaceable cluster schedulers and serialization libraries). At the primitive level, it allows new operators, new data sources (for example, DynamoDB in addition to HDFS), and new language bindings (Java and Python) to be added. At the paradigm level, Spark supports memory computing, multi-iteration batch processing, ad-hoc queries, stream processing, and graph computing.
Clever: Spark is clever in leveraging existing momentum and strength. It rides on Hadoop and integrates with it seamlessly; Shark (the data warehouse implementation on Spark) leverages Hive; for graph computing it borrows the APIs of Pregel and PowerGraph as well as PowerGraph's vertex-cut idea. Everything rides on Scala (widely seen as a future replacement for Java): the look and feel of Spark programming is plain Scala, in both syntax and API. The implementation is equally resourceful: to support interactive programming, Spark only needs small modifications to Scala's shell (by contrast, to support a JavaScript console for interactive MapReduce programming, Microsoft not only had to bridge the conceptual gap between Java and JavaScript but also had to do a great deal of implementation work).
After listing so many advantages, we still need to point out that Spark is not perfect. It has inherent limitations: it does not support fine-grained, asynchronous data updates well. It also has the limits of youth: even with great genes, Spark is just getting started, and there is still much room for improvement in performance, stability, and the range of paradigms it supports.
Computing Paradigm and Abstraction
Spark is, first and foremost, a data-parallel computing paradigm.
The difference between data parallelism and task parallelism lies in the following two aspects.
The subject of computation is a data set rather than individual data items. The size of the set depends on the implementation: SIMD (single instruction, multiple data) vector instructions typically handle 4 to 64 elements, GPU SIMT (single instruction, multiple threads) typically 32, and SPMD (single program, multiple data) can be wider. Spark deals with big data and therefore uses very coarse-grained sets called Resilient Distributed Datasets (RDDs).
All data in the set goes through the same operator sequence. This makes it easy to obtain high parallelism (which is related to the size of the data, not to the parallelism of the program logic) and to map the computation efficiently onto the underlying parallel or distributed hardware. Traditional array/vector programming languages, SSE/AVX intrinsics, CUDA/OpenCL, and Ct (C++ for throughput) all belong to this category. The difference is that Spark's scope is the entire cluster rather than a single node or a single parallel processor.
The data-parallel paradigm means that Spark cannot perfectly support fine-grained, asynchronous update operations. Graph computing has such operations, so in this respect Spark is not as good as GraphLab (a large-scale graph computing framework); nor is it as good as RAMCloud (Stanford's in-memory storage and computing research project) or Percolator (Google's incremental computing technology) for applications that require fine-grained log updates and data checkpoints. This, in turn, lets Spark focus on the application domains it is good at; Dryad (Microsoft's early big data platform), which tried to cover everything, was not particularly successful.
Spark's RDDs adopt the Scala collection programming style, and they also adopt functional semantics: closures, and RDDs that cannot be modified. Logically, every RDD operator generates a new RDD and has no side effects, so operators are also said to be deterministic. Because all operators are idempotent, when an error occurs you only need to re-execute the operator sequence.
Spark's computing abstraction is a dataflow, and moreover a dataflow with a working set. Stream processing is a dataflow model, and so is MapReduce; the difference is that, across multiple iterations, MapReduce has to maintain the working set itself. The working-set abstraction is very common, for example in multi-iteration machine learning, interactive data mining, and graph computing. To guarantee fault tolerance, MapReduce carries the working set in stable storage (such as HDFS), at the cost of being slow. HaLoop uses a loop-aware scheduler to ensure that the reduce output of the previous iteration and the map input of the current iteration land on the same physical machine; this reduces network overhead, but the disk I/O bottleneck cannot be avoided.
Spark's breakthrough is to use memory to carry the working set while still guaranteeing fault tolerance. Memory access is orders of magnitude faster than disk access, which can greatly improve performance. The key is fault tolerance, for which there are traditionally two approaches: logging and checkpointing. Spark logs data updates, because checkpointing incurs data redundancy and network communication overhead. Fine-grained log updates are not cheap, however, and as mentioned earlier Spark is not good at them; instead, Spark logs coarse-grained RDD updates, so the overhead is negligible. Given Spark's functional semantics and idempotence, fault tolerance by replaying the logged updates produces no side effects.
Programming Model
Let's look at a piece of code: the textFile operator reads a log file from HDFS and returns "file" (an RDD); the filter operator keeps the lines containing "ERROR" and assigns them to "errors" (a new RDD); the cache operator caches it for later use; and the count operator returns the number of lines in "errors". RDDs look similar to Scala collection types, but their data and execution models are quite different.
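The snippet itself did not survive translation; the following is a minimal sketch of the example as described above, assuming a SparkContext named sc and a placeholder HDFS path:

```scala
// Minimal reconstruction of the example described above; "hdfs://..." is a placeholder path.
val file = sc.textFile("hdfs://...")            // input operator: read log lines from HDFS into an RDD
val errors = file.filter(_.contains("ERROR"))   // transformation: keep only the lines containing "ERROR"
errors.cache()                                  // mark the RDD to be cached for later reuse
errors.count()                                  // action: return the number of error lines
```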
Figure 1 shows the RDD data model and maps the four operators used in the example above onto four operator types. Spark programs work in two spaces: the Spark RDD space and the Scala native data space. In the native data space, data comes in three forms: scalars (that is, Scala basic types, shown as small orange squares), collection types (blue dashed boxes), and persistent storage (red cylinders).
Figure 1: Switching between the two spaces, and the four different types of RDD operators
Input operators (orange arrows) ingest Scala collections or stored data into the RDD space, turning them into RDDs (blue solid boxes). There are two kinds of input operators: parallelize and, as in the example above, textFile. The output of an input operator is an RDD in the Spark space.
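A small sketch of the two kinds of input operators, again assuming a SparkContext named sc:

```scala
// parallelize: pull a native Scala collection into the RDD space
val nums = sc.parallelize(1 to 10000)
// textFile: pull data from persistent storage into the RDD space ("hdfs://..." is a placeholder)
val lines = sc.textFile("hdfs://...")
```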
Thanks to functional semantics, an RDD is turned into a new RDD by a transformation operator (blue arrows); both the input and the output of a transformation are RDDs. An RDD is divided into many partitions, which are distributed across multiple nodes of the cluster; Figure 1 shows partitions as small blue squares. Note that partitions are a logical concept: the old and new partitions before and after a transformation may physically be the same piece of memory or storage. This is an important optimization that prevents the unbounded growth of memory requirements that functional immutability would otherwise cause. Some RDDs are intermediate results whose partitions do not necessarily have corresponding memory or storage; if needed (for example, for later use), you can call the cache operator (the cache in the example, shown as a gray arrow) to materialize the partitions and keep them.
Some transformation operators treat the elements of an RDD as simple elements; they fall into the following categories (see the sketch after this list):
Element-wise, one-to-one operators whose result RDD keeps the partition structure unchanged, mainly map and flatMap (map followed by flattening into a one-dimensional RDD);
Operators whose input and output are still one-to-one but whose result RDD has a different partition structure, such as union (merges two RDDs into one) and coalesce (reduces the number of partitions);
Operators that select a subset of the input elements, such as filter, distinct (removes duplicate elements), subtract (keeps the elements that are in this RDD but not in the other), and sample (sampling).
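A sketch of these simple-element transformations, assuming two hypothetical RDD[String] values named lines and otherLines:

```scala
val words   = lines.flatMap(_.split(" "))         // one-to-one, then flattened; partition structure unchanged
val upper   = lines.map(_.toUpperCase)            // one-to-one; partition structure unchanged
val merged  = lines.union(otherLines)             // still one-to-one, but the partition structure changes
val fewer   = merged.coalesce(4)                  // reduce the number of partitions
val errs    = lines.filter(_.contains("ERROR"))   // select a subset of the elements
val uniq    = errs.distinct()                     // remove duplicate elements
val onlyA   = lines.subtract(otherLines)          // elements present in lines but not in otherLines
val sampled = lines.sample(false, 0.1, 42)        // 10% sample without replacement, seed 42
```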
Another group of transformation operators works on key-value collections; they fall into the following categories (see the sketch after this list):
Element-wise operations on a single RDD, such as mapValues (which, unlike map, keeps the partitioning of the source RDD);
Re-arrangements of a single RDD, such as sort and partitionBy (which produces a consistent partitioning; this is very important for data locality optimization and is discussed later);
Regrouping and reduction of a single RDD by key, such as groupByKey and reduceByKey;
Joining and regrouping of two RDDs by key, such as join and cogroup.
The last three categories all involve re-arranging data and are called shuffle operations.
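A sketch of the key-value transformations, assuming two hypothetical RDDs, pairs: RDD[(String, Int)] and other: RDD[(String, Double)]; the HashPartitioner import follows current Spark package names:

```scala
import org.apache.spark.HashPartitioner

val doubled   = pairs.mapValues(_ * 2)                     // element-wise on the values; keeps the source partitioning
val sorted    = pairs.sortByKey()                          // re-arrange a single RDD by key
val byHash    = pairs.partitionBy(new HashPartitioner(8))  // impose a consistent partitioning
val summed    = pairs.reduceByKey(_ + _)                   // regroup and reduce by key (shuffle)
val grouped   = pairs.groupByKey()                         // regroup by key (shuffle)
val joined    = pairs.join(other)                          // RDD[(String, (Int, Double))]; shuffles unless co-partitioned
val cogrouped = pairs.cogroup(other)                       // group both RDDs by key
```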
The sequence of transformations from RDD to RDD happens inside the RDD space. The most important design decision here is lazy evaluation: the computation does not actually happen; Spark only keeps recording metadata. The metadata structure is a DAG (directed acyclic graph), in which each "vertex" is an RDD (together with the operator that produces it), and each "edge" goes from a parent RDD to a child RDD, representing the dependency between them. Spark gave this metadata DAG a cool name: lineage. This lineage is also the update log mentioned in the fault tolerance discussion.
The lineage keeps growing until it meets an action operator (the green arrows in Figure 1); then evaluation is required, and all the operators accumulated so far are executed at once. The input of an action operator is an RDD (together with all the RDDs it depends on along the lineage); the output is native data produced by the execution, which may be a Scala scalar, a collection, or storage. When an operator's output is one of these types, it must be an action operator, and its effect is to return from the RDD space to the native data space.
Action operators fall into the following groups: those that produce scalar values, such as count (returns the number of elements in the RDD), reduce, and fold/aggregate (see the Scala operators of the same name), or that return a few scalar values, such as take (returns the first few elements); those that produce Scala collection types, such as collect (pours all the elements of the RDD into a Scala collection) and lookup (finds all values for a given key); and those that write to storage, such as saveAsTextFile, the counterpart of textFile. There is also the checkpoint operator: when the lineage becomes very long (which often happens in graph computing), re-executing the whole sequence after a failure takes a long time, so you can proactively call checkpoint to write the current data to stable storage as a checkpoint.
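A sketch of the action operators, reusing the hypothetical errors, nums, and pairs RDDs from the earlier sketches; the storage paths are placeholders:

```scala
val n        = errors.count()             // scalar: number of elements
val total    = nums.reduce(_ + _)         // scalar: sum of the elements
val firstTen = errors.take(10)            // a few scalar values: the first 10 elements
val allLines = errors.collect()           // Scala collection holding every element of the RDD
val vals     = pairs.lookup("ERROR")      // all values associated with a given key
errors.saveAsTextFile("hdfs://...")       // write the RDD back to storage

sc.setCheckpointDir("hdfs://...")         // where checkpoints are stored (placeholder path)
errors.checkpoint()                       // write the RDD to stable storage as a checkpoint
```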
There are two design points here. The first is lazy evaluation. Anyone familiar with compilers knows that the larger the scope a compiler can see, the more opportunities it has to optimize. Although Spark does not compile anything, its scheduler in effect performs a linear-complexity optimization over the DAG. In particular, when multiple computing paradigms are mixed in one Spark application, the scheduler can break the boundaries between code of different paradigms and perform global scheduling and optimization. For example, when Shark's SQL code is mixed with Spark's machine learning code, both are translated down to the underlying RDDs and merge into one large DAG, which gives the scheduler more opportunities for global optimization.
The other key point is that, once an action operator has produced native data, execution leaves the RDD space. At present Spark can only track computation on RDDs; computation on native data is invisible to it (unless Spark later provides operator overloading, wrappers, or implicit conversions for native data types). This invisible code may introduce dependencies between the RDDs before and after it, as in the code below.
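Since the original snippet was lost, here is a hypothetical reconstruction of the situation described in the next paragraph (the specific filter predicate using cnt - 1 is invented for illustration):

```scala
val errors = file.filter(_.contains("ERROR"))   // RDD space
val cnt    = errors.count()                     // action: leaves the RDD space, yields a native Int
val result = file.filter(_.length > cnt - 1)    // third line: depends on errors only through the
                                                // native value (cnt - 1), invisible to the scheduler
```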
The filter on the third line depends on errors.count(), but that dependency passes through a native-data operation (cnt - 1); because the scheduler cannot see that operation, problems can arise.
Because Spark does not provide control flow, when the computation logic needs conditional branches it must return to the Scala space. Since the Scala language has strong support for custom control flow, Spark may also support it in the future.
Spark also has two very useful features. One is the broadcast variable. Some data, such as a lookup table, may be reused across multiple jobs; such data is much smaller than an RDD and should not be partitioned across nodes the way an RDD is. The solution is a new language construct, the broadcast variable, to wrap such data. At run time, Spark sends the content wrapped by a broadcast variable to every node once and saves it there; it never needs to be sent again. Compared with Hadoop's distributed cache, broadcast content can be shared across jobs. Mosharaf, a Spark contributor who studied under the P2P veteran Ion Stoica, used BitTorrent (yes, the same BT used to download movies) to simplify the implementation; interested readers can refer to the SIGCOMM '11 paper on Orchestra. The other feature is the accumulator (inherited from MapReduce's counters): it allows Spark code to add some global variables for bookkeeping, such as recording current running metrics.
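A small sketch of both features, assuming a SparkContext sc and the hypothetical RDD[String] named lines; the accumulator call follows the classic sc.accumulator API (newer releases use sc.longAccumulator instead):

```scala
val lookupTable = Map("INFO" -> 0, "WARN" -> 1, "ERROR" -> 2)
val bcTable = sc.broadcast(lookupTable)        // shipped to every node once and cached there
val unknown = sc.accumulator(0)                // global bookkeeping counter, added to by the workers

lines.foreach { line =>
  val level = line.split(" ")(0)
  if (!bcTable.value.contains(level)) unknown += 1   // read the broadcast value, bump the accumulator
}
println(unknown.value)                         // only the driver reads the accumulated value
```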
Running and Scheduling
Figure 2 shows how a Spark program runs. It is started by a client and has two phases: the first phase records the transformation operator sequence and incrementally builds the DAG; the second phase is triggered by an action operator, in which the DAGScheduler turns the DAG into a job and its task sets. Spark supports running on a single local node (useful for development and debugging) as well as on a cluster. In the latter case, the client runs on the master node and, through the cluster manager, sends the partitioned task sets to the worker/slave nodes of the cluster for execution.
Figure 2: How a Spark program runs
Spark has traditionally been "inseparable" from Mesos, but it can also run on Amazon EC2 and on YARN. The base class of the underlying task scheduler is a trait, and its different implementations can be mixed in for actual execution. For example, there are two schedulers on Mesos: one gives all the resources of each node to Spark; the other lets Spark jobs be scheduled alongside other jobs and share the cluster's resources with them. On the worker nodes, task threads actually run the tasks generated by the DAGScheduler, and the block manager communicates with the block manager master on the master node (making good use of Scala's actor model) to supply data blocks to the task threads.
The most interesting part is the DAGScheduler, so let's look at how it works in more detail. A very important field in the RDD data structure is the dependency on parent RDDs. As shown in Figure 3, there are two kinds of dependencies: narrow dependencies and wide dependencies.
Figure 3: Narrow dependencies and wide dependencies
A narrow dependency means that each partition of the parent RDD is used by at most one partition of the child RDD. It appears either as one partition of a parent RDD corresponding to one partition of the child RDD, or as partitions of two parent RDDs corresponding to one partition of the child RDD. In Figure 3, map/filter and union belong to the first kind, and join over co-partitioned inputs belongs to the second.
A wide dependency means that a partition of the child RDD depends on all partitions of the parent RDD. Wide dependencies arise from shuffle-class operations, such as groupByKey in Figure 3 and join over inputs that are not co-partitioned.
Narrow dependencies are good for optimization. Logically, every RDD operator is a fork/join (this "join" is not the join operator above, but the barrier that synchronizes multiple parallel tasks): the computation is forked out to every partition, joined when it finishes, and then forked/joined again for the next RDD operator. Translating this directly into the physical implementation would be very uneconomical: first, every RDD (even intermediate results) would have to be materialized into memory or storage, which costs time and space; second, the join acts as a global barrier and is very expensive, because it is held back by the slowest node. If the partitions of the child RDD depend only narrowly on the partitions of the parent RDD, the two fork/joins can be combined into one, which is the classic fusion optimization; if a consecutive sequence of transformations is all narrow dependencies, many fork/joins can be merged into one. This not only removes a large number of global barriers, it also avoids materializing many intermediate RDDs, which improves performance enormously. Spark calls this pipelining.
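A sketch of where a pipeline ends, reusing the hypothetical lines RDD; the exact stage layout is decided by the DAGScheduler:

```scala
// map and filter are narrow dependencies: they fuse into a single stage and run
// as one pipeline over each partition, with no intermediate RDD materialized.
val keyed = lines.map(_.toLowerCase)
                 .filter(_.contains("error"))
                 .map(line => (line.split(" ")(0), 1))

// reduceByKey introduces a wide (shuffle) dependency, so the pipeline ends here
// and the DAGScheduler starts a new stage after the shuffle boundary.
val counts = keyed.reduceByKey(_ + _)
```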
As soon as the sequence of transformations meets a shuffle operation, a wide dependency appears and the pipeline ends. Concretely, the DAGScheduler traces backwards from the current operator along the dependency graph; as soon as it meets a wide dependency, it creates a stage to hold the operator sequence traversed so far, since pipelining can be applied safely within a stage. It then starts tracing again from that wide dependency to build the next stage.
Two questions are worth exploring: first, how partitions are defined; second, on which nodes of the cluster partitions should be placed. These correspond to two other fields of the RDD structure: the partitioner and the preferred locations.
Partitioning is critical for shuffle-class operations, because it determines the type of dependency between the parent RDDs and the child RDD of the operation. As mentioned above, for a join over co-partitioned inputs, the two parent RDDs, and the parents and the child, can form a consistent partition arrangement in which the same key is guaranteed to be mapped to the same partition; this yields narrow dependencies. Without co-partitioning, the result is a wide dependency.
Co-partitioning means specifying a partitioner in advance to produce a consistent partition arrangement. Pregel and HaLoop build this into the system; Spark provides two partitioners by default, HashPartitioner and RangePartitioner, and lets the program specify one via the partitionBy operator. Note that for HashPartitioner to work properly, the key's hashCode must be meaningful, that is, keys with the same content must produce the same hashCode. This holds for String, but not for arrays (whose hashCode is derived from their identity rather than their content). In that case, Spark allows the user to define a custom ArrayHashPartitioner.
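A sketch of co-partitioning two hypothetical pair RDDs, pairsA and pairsB, before joining them, so that the join itself becomes a narrow dependency:

```scala
import org.apache.spark.HashPartitioner

val part   = new HashPartitioner(8)
val left   = pairsA.partitionBy(part).cache()   // shuffle once, then keep the partitioned result
val right  = pairsB.partitionBy(part).cache()   // same partitioner: the same key goes to the same partition index
val joined = left.join(right)                   // co-partitioned inputs: the join is a narrow dependency
```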
The second question is which node a partition is placed on. This is about data locality: good locality means less network communication. Some RDDs already have preferred locations when they are created; for example, the preferred locations of a HadoopRDD's partitions are the nodes holding the corresponding HDFS blocks. If an RDD or a partition has been cached, the computation should be sent to the node where the cached partition lives. Otherwise, Spark traces back up the RDD's lineage to find a parent RDD with the preferred-location attribute and uses it to decide where to place the child RDD.
The notions of wide and narrow dependencies are useful not only for scheduling but also for fault tolerance. If a node fails and the computation involved is a narrow dependency, only the lost parent partitions need to be recomputed, with no dependence on other nodes. A wide dependency, by contrast, requires all partitions of the parent RDD, so recomputation is very expensive. Therefore, when deciding whether to call the checkpoint operator, you should consider not only whether the lineage is long enough but also whether it contains wide dependencies; adding a checkpoint after a wide dependency delivers the most value.
Conclusion
Due to space constraints, this article has introduced only Spark's basic concepts and design ideas. The content is drawn from several Spark papers (principally the NSDI '12 paper "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing"), as well as my own and my colleagues' experience with Spark and years of research on parallel/distributed systems. Reynold Xin, a core Spark member and the creator of Shark, reviewed and revised this article. Thank you!
Spark has a high starting point and ambitious goals, but its journey has only just begun. Spark is committed to building an open ecosystem (http://spark-project.org/, https://wiki.apache.org/incubator/SparkProposal) and is eager to work with everyone!