Background
In the previous article, we introduced the incremental computing nature of Galaxy, which is managed internally by the framework, and briefly contrasted it with Storm. This article goes further into Galaxy's incremental model, describes Galaxy SQL and Galaxy Operator, both built on top of that model, and compares Galaxy with Spark Streaming from an incremental-computing perspective.
Galaxy MRM Increment vs. Spark Streaming
MRM stands for MapReduceMerge: it adds a merge operation on top of MapReduce. The merge phase can interact with state, reading and writing the old value of a key, and the merge interface has rollback semantics. In a stream-computing scenario, data is split into batches by time or by record count. Within a batch, ordinary MapReduce operations apply; between batches, cross-batch aggregation is needed. Compare this with the updateStateByKey operation in Spark Streaming: within a DStream, the RDD (that is, the batch) for each time interval can update the state inside the task through this interface. A Galaxy merge is essentially an add, and the corresponding rollback is a delete. In database terms, the two together are equivalent to an update, and both are keyed on a primary key. So Galaxy's merge does the same job as updateStateByKey in Spark Streaming, but there is a big difference between the two.
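As a rough illustration of the map, reduce, and merge phases described above, here is a minimal Python sketch. The names CountMerger, map_reduce, and merge are hypothetical and are not Galaxy's actual interface; the sketch only shows a per-batch reduce followed by a keyed merge that reads the old value and writes the new one.

```python
from collections import defaultdict


class CountMerger:
    """Hypothetical merge stage: keeps a running count per key across batches."""

    def __init__(self):
        self.state = defaultdict(int)  # key -> value accumulated so far

    def merge(self, key, batch_value):
        old = self.state[key]        # read the old value for this key
        new = old + batch_value      # combine it with this batch's partial result
        self.state[key] = new        # write the new value back to state
        return old, new


def map_reduce(batch):
    """Ordinary in-batch aggregation: count occurrences of each key."""
    partial = defaultdict(int)
    for key in batch:
        partial[key] += 1
    return dict(partial)


merger = CountMerger()
updates = []
for batch in (["a", "a", "b"], ["a", "c"]):
    for key, cnt in sorted(map_reduce(batch).items()):
        old, new = merger.merge(key, cnt)
        updates.append((key, old, new))
```

Each entry in updates carries both the old and the new value of a key, which is the information a downstream stage would need in order to treat the change as a delete of the old value followed by an add of the new one.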
The state that Galaxy exposes to a compute task is exclusive at the thread level, whereas the state in Spark Streaming is shared globally within the task. The advantage of thread-level exclusivity is that the data of one batch, once shuffled by key, arrives at different merge compute nodes, and none of them blocks the others' computation. The updateStateByKey operation in Spark Streaming, by contrast, blocks the computation of other RDDs. Although Spark Streaming can process the individual RDDs within a DStream, as soon as a state operation is involved the computation ends up blocked along the time sequence. The StateRDD at one point in time depends on the result of its parent StateRDD at the previous point in time, so every key in the batch is blocked by the state operation. Galaxy's merge is therefore finer-grained with respect to this barrier: add and delete are separate processes, the keys within a batch are computed on different threads, and state is exclusive within each thread.
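The idea of exclusive state after a key shuffle can be sketched as follows. This is an assumption-laden toy, not Galaxy's runtime: partition, shuffle, and merge_partition are made-up names, and state owned by a partition index stands in for thread-exclusive state. Because every key maps to exactly one partition, no worker touches another worker's state and no lock is needed.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 3


def partition(key):
    """Deterministically assign a key to one worker."""
    return zlib.crc32(key.encode()) % NUM_WORKERS


def shuffle(batch):
    """Split one batch into per-worker record lists by key."""
    parts = [[] for _ in range(NUM_WORKERS)]
    for key, value in batch:
        parts[partition(key)].append((key, value))
    return parts


# One private state per worker; a worker only ever touches its own entry.
states = [defaultdict(int) for _ in range(NUM_WORKERS)]


def merge_partition(worker_id, records):
    state = states[worker_id]
    for key, value in records:
        state[key] += value  # cross-batch sum; no lock because the state is private


batches = [[("a", 1), ("b", 2), ("c", 3)], [("a", 10), ("c", 30)]]
with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    for batch in batches:
        # Workers process their own partition of the batch concurrently.
        list(pool.map(merge_partition, range(NUM_WORKERS), shuffle(batch)))

totals = {k: v for s in states for k, v in s.items()}
```

For simplicity the sketch waits for a whole batch to finish before starting the next one; it is meant only to show state ownership, not Galaxy's scheduling.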
Galaxy has three kinds of model: MapOnlyModel, MapReduceModel, and MapReduceMergeModel. You can use the M and MR models for ordinary stream computation or small batches, and the MRM model when cross-batch operations are needed. Models can be chained together in arbitrary combinations. Compared with MapReduce, the interface is quite flexible, perhaps too flexible: the price of that flexibility is a more complex computational model.
Galaxy SQL
Galaxy SQL is a stream SQL, a kind not currently available elsewhere in the industry. Its syntax is close to Hive SQL, but constructs whose semantics do not fit an unbounded data stream, such as LIMIT and ORDER BY, are not supported.
Intel has a project called StreamSQL that combines Spark Streaming with Spark SQL. It uses the SchemaRDD from Spark SQL to attach schema metadata to the RDDs that flow in through Spark Streaming. Backed by Spark Streaming, this StreamSQL can run SQL computations over sliding windows. However, it does not support true cross-batch incremental semantics (as opposed to cross-batch computation over a fixed window). Galaxy SQL can do truly incremental streaming SQL.
To give the simplest example:
INSERT INTO T2 SELECT T1.a AS k, COUNT(T1.b) AS cnt FROM T1 GROUP BY T1.a;
SELECT COUNT(cnt) FROM T2 GROUP BY T2.cnt;
The first statement computes a count grouped by column a of T1. In the second statement, the grouping column of T2 is cnt, the count value produced from T1. Now picture this in a streaming scenario: the first time around, the count for some key comes out as 100; at the next point in time, the count for the same key comes out as 200. By then the cnt value 100 has already flowed into T2 and been included in its results. Now that 100 has been updated to 200, computing with the new value 200 is easy; the question is how to undo the contribution that the earlier 100 made in T2.
If you think about it, StreamSQL cannot run this kind of SQL, essentially because Spark Streaming does not support such an operation. The merge phase of the Galaxy computing framework can perform a rollback, undoing the earlier, now "wrong" state, and that is what lets Galaxy SQL do distributed streaming SQL.
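The 100-to-200 case can be traced with a small Python sketch. It is only a model of the idea, under the assumption that the first stage emits a delete for the old count followed by an add for the new one; merge_t1, merge_t2, and the two dictionaries are hypothetical names, not Galaxy SQL internals.

```python
from collections import defaultdict

t2 = {}                       # state of the first statement: k -> cnt
histogram = defaultdict(int)  # state of the second statement: cnt -> number of keys


def merge_t1(key, batch_count):
    """First statement: update the count for a key and emit change events."""
    old = t2.get(key)
    new = (old or 0) + batch_count
    t2[key] = new
    events = []
    if old is not None:
        events.append(("delete", old))  # rollback: retract the stale count
    events.append(("add", new))         # merge: contribute the fresh count
    return events


def merge_t2(events):
    """Second statement: apply delete/add events to the count-of-counts."""
    for op, cnt in events:
        histogram[cnt] += 1 if op == "add" else -1
        if histogram[cnt] == 0:
            del histogram[cnt]  # drop groups that no longer have any rows


merge_t2(merge_t1("x", 100))    # first batch: key x has count 100
after_first = dict(histogram)

merge_t2(merge_t1("x", 100))    # second batch: key x grows to 200
after_second = dict(histogram)
```

After the second batch, the group for cnt = 100 is gone and only the group for cnt = 200 remains, which is the undo behaviour that the question in the text asks for.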
Galaxy Operator
Galaxy Operator is a DAG encapsulation layer on top of the Galaxy MRM programming interface, designed for both ease of use and expressive power.
The operator layer is eventually mapped onto multiple Galaxy MRM models. This lets users focus on computational logic and shields them from the more complex MRM model, especially the merge phase.
The operator layer is equivalent to a physical execution plan, so it can apply physical-plan optimizations such as node merging and predicate pushdown. In principle, I believe the kinds of execution-plan optimization done by Hive and by Spark's Catalyst can be carried out on this operator-layer DAG. With this operator layer in place, any DSL can theoretically be mapped onto the Galaxy computing framework and run there.
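To make "node merging" concrete, here is a minimal sketch of one such optimization: fusing adjacent map nodes in a linear plan. The plan representation (a list of (kind, function) pairs) and the fuse_maps function are assumptions made for illustration; they do not reflect how Galaxy represents or optimizes its DAG.

```python
def fuse_maps(plan):
    """Collapse runs of adjacent ("map", fn) nodes into a single map node."""
    fused = []
    for kind, fn in plan:
        if kind == "map" and fused and fused[-1][0] == "map":
            prev = fused[-1][1]
            # Compose the previous map with this one: x -> fn(prev(x)).
            fused[-1] = ("map", lambda x, f=prev, g=fn: g(f(x)))
        else:
            fused.append((kind, fn))
    return fused


plan = [
    ("map", lambda x: x + 1),
    ("map", lambda x: x * 2),
    ("shuffle", None),
    ("map", str),
]
optimized = fuse_maps(plan)
kinds = [kind for kind, _ in optimized]
```

The two leading map nodes become one node that computes (x + 1) * 2, while the map after the shuffle stays separate because a shuffle boundary sits between them.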
The operator layer provides five classes of orthogonal basic operators: map, reduce, merge, shuffle, and union. These basic operators can be combined to derive more advanced operators.
Note that operators of the reduce class aggregate data within the current batch. Reduce under incremental semantics is not the same as reduce in MapReduce under batch semantics: the incremental reduce applies only to the current batch, whereas the reduce in MapReduce covers the full data set across batches and is closer to merge under incremental semantics. Operators of the merge class perform cross-batch aggregation. merge() corresponds to the merge phase of the MRM model; it interacts with the old value and is an operation characteristic of the incremental scenario. It is commonly used to implement UDAF operations such as count and sum, or to implement top, distinct, and join-like operations.
Operators of the union class are for scenarios where multiple streams are merged. The union() operation combines multiple streams into one output stream and requires the columns of every stream to be aligned and consistent. The mix() operation also merges multiple streams into one, but internally it marks whether each record came from the left or the right stream, so the columns of the streams may differ; in-batch or cross-batch aggregation operations can then be attached downstream. mix() is an interface designed specifically for aggregation operations.
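The difference between the two can be sketched in Python. The functions below are illustrative stand-ins that borrow the names union and mix from the text; their signatures, the tuple-based rows, and the "left"/"right" tags are assumptions, not Galaxy's operator API.

```python
from collections import defaultdict


def union(*streams):
    """Concatenate streams; every row must have the same number of columns."""
    rows = [row for stream in streams for row in stream]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("union requires all streams to have aligned columns")
    return rows


def mix(left, right):
    """Merge two streams, tagging each row with the side it came from."""
    return [("left", row) for row in left] + [("right", row) for row in right]


clicks = [("u1", "home"), ("u2", "cart")]            # (user, page)
orders = [("u1", 30.0, "EUR"), ("u1", 12.5, "EUR")]  # (user, amount, currency)

same_shape = union(clicks, [("u3", "home")])

# Downstream aggregation on the mixed stream: per-user clicks and order total.
summary = defaultdict(lambda: {"clicks": 0, "spent": 0.0})
for side, row in mix(clicks, orders):
    user = row[0]
    if side == "left":
        summary[user]["clicks"] += 1
    else:
        summary[user]["spent"] += row[1]
```

Calling union(clicks, orders) would raise ValueError because the column counts differ, whereas mix accepts both streams and leaves it to the aggregation step to interpret each side.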
Functionally, the operator layer is analogous to the Spark RDD. The Spark RDD has two core values. First, at the API level, it gets around the overly abstract and awkward native interfaces of the MapReduce model by providing a variety of transformations and actions that are easy for developers to understand and use. Second, at the computation level, persisting RDDs allows intermediate data to be reused during batch computation, which is why Spark first became known as an in-memory computing framework for iterative computation and data reuse. As for the Galaxy operator layer: on the one hand, like the Spark RDD, its API design borrows from the design ideas of FlumeJava and balances ease of use with expressive power; on the other hand, Galaxy's incremental computing model is "stateful computing", which naturally reuses "state" across batches of real-time data (in the merge phase).
Follow-up
When time permits, I hope to introduce Galaxy's task model, state management, and fault tolerance.
Talk about Ali Incremental Computing framework Galaxy: Incremental Computing Model (II)