On the operator layer of distributed computing


This article records my understanding of, and thoughts on, the operator layer of distributed computing. My recent work is related to this area: the company has a self-developed stream computing framework that needs an operator layer built on top of it. I mainly analyze how operators are implemented on streaming systems, comparing existing computing frameworks and ongoing industry projects, and looking at both the surface of the matter and the deeper meaning and imaginative space behind it.

Trend

Yahoo!'s Pig on Storm project lets Pig Latin execute on the Storm streaming engine, ultimately allowing Pig Latin to be used in mixed streaming and batch computing scenarios. It should be said that Spark, Summingbird, and Pig are all trying to do the same thing: use their DSLs or primitives to express (near) real-time and offline data processing on both streaming and batch engines.

Spark itself relies on RDDs to realize Spark Streaming, a small-batch (micro-batch) form of stream computation: a DStream is a sequence of RDDs, so the APIs for writing batch jobs and streaming jobs in Spark are naturally unified.
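A toy illustration of why micro-batching unifies the two APIs (this is plain Python, not Spark's actual API): the same user-written transformation runs unchanged on a whole dataset in batch mode and on each micro-batch of a stream.

```python
# Not Spark itself: a sketch of how treating a stream as a sequence of
# small batches lets one transformation serve both batch and streaming.

def transform(batch):
    """The user's logic, written once: keep even numbers and square them."""
    return [x * x for x in batch if x % 2 == 0]

def run_batch(dataset):
    # Batch mode: the whole dataset is one big "RDD".
    return transform(dataset)

def run_streaming(micro_batches):
    # Streaming mode: a "DStream" is just a sequence of small batches,
    # each processed with the exact same transform.
    return [transform(mb) for mb in micro_batches]

data = [1, 2, 3, 4, 5, 6]
batch_result = run_batch(data)
stream_result = run_streaming([[1, 2], [3, 4], [5, 6]])
```

Flattening the streaming results reproduces the batch result, which is exactly the property that makes the two APIs interchangeable for stateless operators.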

Summingbird unifies, at the API level, jobs on Storm and Hadoop; on Hadoop the tasks are written with Cascading, so Summingbird plays more of an adapter role, although it is also billed as a Lambda Architecture solution.

Summary: on the surface, the DSL needs to support different computing engines to achieve operator-level mixing, and this is the trend. So where is the difficulty in implementation?

Challenge

Implementing Pig Latin on a streaming system is tricky because the DSL was born in batch computing scenarios, so some relational operations carry semantic ambiguity at the streaming level; see the preliminary Pig on Storm discussions. For FILTER, FOREACH, and UNION, and even for slightly more complex stateful operations like DISTINCT and LIMIT, there is no ambiguity between the batch and streaming scenes, and the implementations do not differ much in difficulty. But for a join of two streams under SQL semantics, or GROUP and CROSS under Pig semantics, the streaming implementations are inconsistent and the primitives themselves are defined differently.
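A minimal sketch (invented names, not any engine's real API) of why a two-stream join needs extra semantics on a streaming system: unlike a batch join over two finite tables, the engine must decide which elements are even eligible to match, typically by bounding each stream with a window.

```python
# A two-stream join only becomes well-defined on unbounded data once a
# window bound is chosen; the same records join or don't depending on it.
from collections import defaultdict

def windowed_join(left_events, right_events, window):
    """Join (timestamp, key, value) events whose timestamps differ by <= window."""
    right_by_key = defaultdict(list)
    for ts, key, val in right_events:
        right_by_key[key].append((ts, val))

    results = []
    for ts, key, val in left_events:
        for r_ts, r_val in right_by_key[key]:
            if abs(ts - r_ts) <= window:  # the window makes the join well-defined
                results.append((key, val, r_val))
    return results

left = [(1, "a", "L1"), (10, "b", "L2")]
right = [(2, "a", "R1"), (50, "b", "R2")]
```

With `window=5` only the "a" records match; with `window=100` both keys match. A batch join has no such parameter, which is exactly the semantic gap the text describes.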

To implement a DSL or a FlumeJava-like layer on a streaming system, the key is implementing UDAFs, and implementing UDAFs involves computing across batches. This essentially requires support from the engine. For example, Trident has a spout coordinator for flow control and provides a degree of transactionality; to aggregate across batches you can use Trident state, i.e. auxiliary storage, by invoking an operation such as persistentAggregate. If the engine does not support this, as with the native Storm interface, there is no way to build a streaming DSL.
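A sketch of a cross-batch UDAF in the persistentAggregate style (the class and function names here are illustrative, not Trident's real API): each batch is reduced locally, then the partial result is merged into durable state that outlives the batch.

```python
# Cross-batch aggregation via auxiliary storage: the per-batch reduce is
# ordinary streaming work; the merge into KeyValueState is what needs
# engine support (transactionality, flow control) in a real system.

class KeyValueState:
    """Stand-in for Trident state / an external KV store."""
    def __init__(self):
        self._store = {}

    def get(self, key, default=0):
        return self._store.get(key, default)

    def put(self, key, value):
        self._store[key] = value

def persist_aggregate(state, batch, key_fn, value_fn):
    """Aggregate one batch, merge into state, and return the updated totals."""
    partial = {}
    for record in batch:                      # local, per-batch reduce
        k = key_fn(record)
        partial[k] = partial.get(k, 0) + value_fn(record)
    for k, v in partial.items():              # cross-batch merge into state
        state.put(k, state.get(k) + v)
    return {k: state.get(k) for k in partial}

state = KeyValueState()
persist_aggregate(state, [("a", 1), ("a", 2), ("b", 3)],
                  lambda r: r[0], lambda r: r[1])
totals = persist_aggregate(state, [("a", 4)], lambda r: r[0], lambda r: r[1])
```

The second call sees the first batch's totals through the state, which is precisely the "across batches" capability a plain per-tuple streaming interface cannot express.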

Spark is different. Because Spark itself is a batch computing framework rather than a streaming system, Spark Streaming can implement a DSL, and can even be combined with Spark SQL to run SQL in streaming form.

Summary: the difficulty of implementing a DSL on a streaming system lies in the UDAF, which is essentially cross-batch computation. How can cross-batch computation on a stream be abstracted into a pattern?

Incremental computation

Incremental computation can, in theory, subsume batch computation, stream computation, and iterative computation. How should we understand that? Incremental computation can be expressed as NewValue = function(CurrentValue, OldValue), where NewValue is saved as the new OldValue and continues to be combined with the newly arrived CurrentValue. This OldValue is the result of the incremental computation.

What does incremental computation have to do with implementing operators on the streaming systems discussed above? The incremental model is a form of cross-batch computation: function can be understood as an operator, CurrentValue as the result of the current batch, and OldValue as the accumulated UDAF result.

Can this model only be implemented by streaming systems? No; a batch computing framework can also do it, at worst recomputing NewValue from scratch each time. If Hadoop MapReduce does it, each MR job's input is treated as a batch and the cross-batch result is saved separately. With RDDs the situation is different: the model suits RDDs very well, because iterative computation can be seen as a kind of incremental computation, and RDDs are very good at building DAGs for iterative computation, although each step produces a new, immutable RDD.

How does a streaming system implement this incremental model? That is the crystallized wisdom of my former boss and colleagues, which I am not in a position to detail. Implementing it is actually not hard; the difficulty is that the framework must make OldValue fault-tolerant. RDDs do not worry about fault tolerance, because lineage records how to recompute them, in parallel if need be. Storm and Trident also do not worry about it, because they hand the fail logic to the user!
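The incremental model NewValue = function(CurrentValue, OldValue) can be sketched as a fold over batches (plain Python, illustrative names only):

```python
# A toy incremental engine: OldValue survives across batches, and each
# newly arrived batch's result is folded into it. Fault-tolerance of
# old_value is exactly what a real engine must add around this loop.

def incremental_run(batches, function, initial):
    """Fold each batch's result into the accumulated OldValue."""
    old_value = initial
    history = []
    for batch in batches:
        current_value = sum(batch)                       # this batch's own result
        old_value = function(current_value, old_value)   # NewValue becomes OldValue
        history.append(old_value)
    return old_value, history

# A running sum over three batches of events.
final, history = incremental_run(
    batches=[[1, 2], [3], [4, 5]],
    function=lambda current, old: current + old,
    initial=0,
)
```

Swapping in a different `function` (max, count, a sketch merge, etc.) gives other cross-batch UDAFs without changing the loop, which is why the model can serve as the abstraction for an operator layer.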
Our group's current incremental computing engine has done this, and has been working on checkpoint optimizations.

Summary: at the level of the computational model, implementing an incremental computing engine on a streaming system is a necessary condition for a rich operator layer and for streaming SQL. Is there any intrinsic flaw in an incremental computing model implemented on a stream?

Deep RDD

Earlier, at a Hangzhou Spark Meetup, while sharing Spark SQL I mentioned Spark RDD's two most important properties: the richness of its primitives and its ability to represent data. The former makes Spark programming very easy; the latter makes computation results reusable and lets RDDs fit the MR model, the iterative computation model, and the BSP model. On the strength of these two points, Spark Core can easily derive SQL, machine learning, graph computing, and stream computing products.

Looking back at streaming systems such as Storm: making the primitives simple, rich, and easy to use is not hard; the question is whether your data can be reused. What are the advantages of reuse? In the case of RDDs, saved memory and concurrent computing power. RDDs were designed from the start to be immutable, digesting MapReduce internally while exposing a wealth of transformations and actions. The RDD paper also compares RDDs with DSM (distributed shared memory) along multiple dimensions. Before designing RDDs, Matei had worked on the Hadoop MapReduce source code; combine that with the design differences from DSM in other systems at the time, and with Google's FlumeJava and Microsoft's DryadLINQ at the API level, and the result is the blend that became RDDs. So far, only Spark has realized this.

The operator layer I recently implemented on our incremental computing engine also references the designs of FlumeJava, Trident, and RDDs, and is still being tested. As I said at the beginning, for Pig on Storm, swapping the engine is only the surface. The meaning behind it is operator-level mixing, and the ultimate imaginative space is a unified DAG layer, with Pig, Hive, SQL, and other DSLs on top and different computing systems plugged in underneath. Achieving this is not hard; the hard part may not be a technical problem at all.

Summary: RDDs have two decisive advantages, ease of use and data reuse, that other systems find hard to match; the second, especially, is the essence of RDDs.

Contrast with Storm

After Marz built Storm and ElephantDB, he proposed, based on his understanding, a solution for "beating the CAP theorem". In the Lambda Architecture he proposed, Storm takes the streaming position, while the ad hoc serving layer is HBase. Measured against the vision of our current incremental computing framework, I expect the streaming and ad hoc layers to be unified by a single incremental computing engine. Why?

Query = Function (All Data)

When the data is static and the query moves, that is ad hoc computation; when the data moves and the query is static, that is stream computation; when both the data and the query move, that is continuous computation. Storm does the second; an incremental computing framework can do the third. Storm's topology submission is a serious problem: by the time Nimbus has pulled up the bolts and spouts, the moment has long passed. Storm does fit stream computation, because the essence of a stream is messages. Storm's abstractions of the topology, the message channels between bolts, and the ack mechanism are very good; this layer of abstraction satisfies stream computation, but the worker layer and the scheduling layer are far from satisfying scenarios where a changing query still needs stream computation. The framework we are now building will satisfy this, thereby unifying streaming, batch, and iterative computation and going beyond current stream computation; not only StreamSQL but also DSLs on streams can be implemented through the operator layer.
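The three modes above can be contrasted in a toy sketch (illustrative names only; "query" here is just a predicate-and-sum):

```python
# Ad hoc: static data, moving queries. Streaming: moving data, static query.
# Continuous: both move, so every step may bring new data AND a new query.

def ad_hoc(stored_data, queries):
    """Static data, queries move: each new query scans the stored data."""
    return [sum(x for x in stored_data if q(x)) for q in queries]

def streaming(events, query):
    """Moving data, static query: one fixed query folds over arriving events."""
    total = 0
    for x in events:
        if query(x):
            total += x
    return total

def continuous(arrivals):
    """Both move: each step pairs a new event with a possibly new query."""
    state, answers = [], []
    for event, query in arrivals:
        state.append(event)
        answers.append(sum(x for x in state if query(x)))
    return answers

evens = lambda x: x % 2 == 0
odds = lambda x: x % 2 == 1
```

The third mode is the hard one for Storm: its topology, once submitted, fixes the query, whereas `continuous` must accept a new query at any step while the data keeps flowing.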

Summary: how can the data-moves and query-moves scenarios be unified? The imaginative space of incremental computation is huge, and the importance of the operator layer stands out.

Original link: http://blog.csdn.net/pelick/article/details/39577785
