The Operator Layer in Distributed Computing

Source: Internet
Author: User
Tags: hadoop, mapreduce

This article presents my understanding of, and thoughts on, the operator layer in distributed computing. My recent development work touches on exactly this problem: the streaming computing framework our company is building requires an operator layer. I mainly analyze how operators are implemented on stream systems, compare existing computing frameworks with projects under way in the industry, and discuss the surface trends, the underlying challenges, and the possibilities they open up.


Trend

Yahoo!'s pig on storm project lets Pig Latin execute on the Storm stream engine, so that Pig Latin can ultimately serve mixed stream and batch computing scenarios. It should be said that Spark, Summingbird, and Pig are all trying to do the same thing: express (near) real-time and offline data processing on stream and batch engines through a single DSL or set of primitives.


Spark relies on RDD to implement small-batch stream computing in Spark Streaming: a DStream is simply a sequence of RDDs. The APIs for writing batch jobs and streaming jobs on Spark are therefore naturally unified.
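To make the point concrete, here is a minimal sketch (plain Python, not Spark itself; the names and data are invented) of why micro-batching unifies the two APIs: the same per-batch transformation runs unchanged over one static dataset or over every micro-batch of a stream.

```python
def word_count(batch):
    """One 'RDD-style' transformation: list of lines -> {word: count}."""
    counts = {}
    for line in batch:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# Batch job: the whole dataset is a single batch.
static_data = ["a b a", "b c"]
batch_result = word_count(static_data)

# Streaming job: a DStream is just a sequence of small batches (RDDs),
# so the identical function is applied to each micro-batch in turn.
stream = [["a b a"], ["b c"]]
stream_results = [word_count(micro_batch) for micro_batch in stream]
```

The design choice to sell here is that nothing in `word_count` knows whether it is running in batch or streaming mode; that is precisely the unification the article attributes to Spark.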


Summingbird unifies Storm and Hadoop jobs at the API level, using Cascading to write the Hadoop jobs. In character, Summingbird is more of an adaptation layer, although it is also billed as a Lambda architecture solution.


Summary: on the surface, a DSL needs to support different computing engines so that operators can be mixed across them. This is the trend. So where does the difficulty of implementation lie?


Challenges

Implementing Pig Latin on a stream system means taking a DSL born in the batch computing scenario and giving it stream semantics, and for some relational operations this creates semantic ambiguity (for details, see the preliminary discussion of pig on storm). For filter, foreach, union, and even slightly more complex stateful operations such as distinct and limit, the batch and stream semantics agree, so there is little difference or difficulty in implementation. However, for operations like a SQL-semantics join over two streams, or a Pig-semantics group over multiple streams, the stream implementations diverge, and even the definition of the primitive itself differs.
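A toy illustration of that ambiguity (the data, window size, and function names are all invented for this sketch): a batch join has one obvious meaning because it sees both complete relations, whereas a join over two streams must pick a window, and different windows produce different answers for the same input.

```python
def batch_join(left, right):
    """Inner join on key over two complete relations: unambiguous."""
    return sorted((k, lv, rv) for k, lv in left for k2, rv in right if k == k2)

left  = [(1, "a"), (2, "b")]
right = [(1, "x"), (2, "y")]
full = batch_join(left, right)   # every matching pair, exactly once

def windowed_stream_join(left_stream, right_stream, window):
    """Join two streams of (time, key, value), matching only tuples
    whose arrival times differ by at most `window` time units."""
    out = []
    for lt, lk, lv in left_stream:
        for rt, rk, rv in right_stream:
            if lk == rk and abs(lt - rt) <= window:
                out.append((lk, lv, rv))
    return sorted(out)

lstream = [(0, 1, "a"), (10, 2, "b")]
rstream = [(1, 1, "x"), (99, 2, "y")]
# A small window misses the (2, "b", "y") pair; a large window finds it.
# The "correct" result depends on a parameter batch join never needed.
small = windowed_stream_join(lstream, rstream, window=5)
large = windowed_stream_join(lstream, rstream, window=100)
```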


The key to implementing a DSL, or something like FlumeJava, on a stream system is implementing UDAFs, and implementing a UDAF involves computing across batches. In essence this requires engine support. For example, Trident has a spout coordinator for throttling and transaction handling, so when you want a cross-batch UDAF you can use Trident state, that is, auxiliary storage, through operations such as persistentAggregate. If the engine offers no such support, as with the native Storm interface, it cannot host a stream DSL.
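Here is a sketch of the pattern described above, modeled loosely on Trident's persistentAggregate (the dict-backed store is a stand-in of my own for whatever durable state backend the engine provides): a cross-batch UDAF reads its old value from state, folds in the current batch, and writes the result back.

```python
class KeyValueState:
    """Stand-in for Trident state / auxiliary storage."""
    def __init__(self):
        self._store = {}

    def get(self, key, default=0):
        return self._store.get(key, default)

    def put(self, key, value):
        self._store[key] = value

def persist_aggregate(state, batch, key="count"):
    """Cross-batch count UDAF: newValue = oldValue + len(current batch).
    Without engine-managed state, the count could not outlive one batch."""
    new_value = state.get(key) + len(batch)
    state.put(key, new_value)
    return new_value

state = KeyValueState()
persist_aggregate(state, ["e1", "e2"])    # first batch: count is 2
total = persist_aggregate(state, ["e3"])  # second batch: count carries over
```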


Spark itself is not a stream system. It can implement a DSL on Spark Streaming, and even run streaming SQL in combination with Spark SQL, precisely because Spark is a batch computing framework: its micro-batches let it be driven as a stream-like DSL.


Conclusion: in terms of implementation, the difficulty of a DSL on a stream system lies in UDAFs, which are essentially cross-batch computation. So what kind of model can abstract cross-batch computation on a stream?


In theory, incremental computing can subsume batch computing, stream computing, and iterative computing. How does it solve this problem? Incremental computing can be expressed as

newValue = function(currentValue, oldValue)

where newValue is stored as the next oldValue, new currentValues keep arriving to combine with it, and this inherited oldValue is the incremental computing result.
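A minimal sketch of this model (plain Python; the per-batch computation and the running-sum `function` are illustrative choices, and any associative combine would work the same way):

```python
def run_incremental(batches, function, initial):
    """Fold a sequence of batches through newValue = function(currentValue, oldValue)."""
    old_value = initial
    history = []
    for batch in batches:
        current_value = sum(batch)                       # the per-batch computation
        old_value = function(current_value, old_value)   # newValue, carried forward
        history.append(old_value)                        # what a cross-batch UDAF emits
    return history

results = run_incremental(
    batches=[[1, 2], [3], [4, 5]],
    function=lambda current, old: current + old,
    initial=0,
)
# results == [3, 6, 15]: each entry inherits the previous oldValue
```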
How does incremental computing relate to the operator implementations on stream systems discussed above? This incremental model is a form of cross-batch computation: function can be understood as an operator, currentValue as the result of one batch, and oldValue as the accumulated UDAF result.
Can only a stream system implement this model? No, a batch computing framework can as well, by storing newValue to disk each round. With Hadoop MR, each MR run is treated as one batch, and the results of different batches are saved separately. With RDD things look different, yet the model still fits, because iterative computing can be seen as a type of incremental computing: RDD is very good at building a DAG to complete iterative computing, but every computed result is a new, immutable RDD.
How does a stream system implement this incremental computing model? This is where the accumulated wisdom of our predecessors and colleagues comes in. The implementation itself is not hard; the difficulty is that oldValue must be made fault-tolerant inside the computing framework. RDD does not need to worry much about fault tolerance, because lineage records how each dataset was derived and, at worst, it can be recomputed in parallel. Storm and Trident do not worry about it either, because they hand the fail logic over to the user! The incremental computing engine in our group has completed this work and has been pushing hard on checkpoint optimization.
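A minimal sketch of that fault-tolerance requirement (an in-memory dict stands in for durable storage, and all names here are invented): the framework checkpoints the carried oldValue after each batch, so a restarted worker resumes from the last checkpoint instead of losing the accumulated result.

```python
checkpoint_store = {}

def process_batch(batch_id, batch, old_value):
    """One incremental step; the checkpoint makes oldValue survive a crash."""
    new_value = sum(batch) + old_value
    checkpoint_store[batch_id] = new_value   # checkpoint after the batch commits
    return new_value

def recover():
    """Resume from the most recent checkpoint instead of replaying history."""
    if not checkpoint_store:
        return 0
    return checkpoint_store[max(checkpoint_store)]

v = process_batch(0, [1, 2, 3], 0)   # running value: 6
v = process_batch(1, [4], v)         # running value: 10
restored = recover()                 # a restarted worker picks up 10
```

Real engines must also keep the checkpoint write atomic with batch completion (otherwise a crash between the two double-counts a batch), which is exactly why the text says this belongs inside the framework rather than in user code.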
Conclusion: in terms of the computing model, an incremental computing engine on a stream system is a necessary condition for a rich operator layer and for streaming SQL. Is there any essential defect in the stream-based incremental computing model?

In-Depth: RDD

Earlier, when I shared Spark SQL at the Spark Meetup in Hangzhou, I mentioned the two most important properties of Spark's RDD: rich primitives and data representation capability. The former makes Spark programming very easy; the latter makes computed results reusable and lets RDD fit the MR model, the iterative computing model, and the BSP model. On these two points, Spark core can readily derive SQL products, machine learning products, graph computing products, and stream computing products.

By contrast, stream systems such as Storm also offer easy-to-use primitives. The question is: can you reuse your data?! What does reuse buy? Taking RDD as the example, it saves memory and enables concurrent computation. RDD was immutable from the start of its design; it absorbed the MapReduce model into its computation and exposed a wide range of transformations and actions. The RDD paper also compares RDD with DSM (Distributed Shared Memory) along multiple dimensions. It should be said that Matei's experience with the Hadoop MapReduce source code before designing RDD, combined with a design that deliberately differed from the DSM of other systems at the time, plus the API-level ideas of Google's FlumeJava and Microsoft's DryadLINQ, finally came together in RDD. So far only Spark has realized this.
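A toy model of the lineage idea credited to RDD above (this is my own illustration, not Spark's implementation): each dataset is immutable and records how it was derived, so a lost result can simply be recomputed from its parents rather than replicated the way DSM state must be.

```python
class MiniRDD:
    """Immutable dataset node in a lineage DAG."""
    def __init__(self, data=None, parent=None, fn=None):
        self._data = data        # only source RDDs hold raw data
        self._parent = parent    # lineage: where this RDD came from
        self._fn = fn            # lineage: how it was derived

    def map(self, fn):
        # Transformations never mutate; they extend the lineage DAG.
        return MiniRDD(parent=self, fn=fn)

    def compute(self):
        # Recompute from lineage; a cache layer could short-circuit this.
        if self._parent is None:
            return list(self._data)
        return [self._fn(x) for x in self._parent.compute()]

source = MiniRDD(data=[1, 2, 3])
derived = source.map(lambda x: x * 10).map(lambda x: x + 1)
first = derived.compute()       # materialize the result
# "Losing" the result costs nothing but recomputation along the DAG:
recovered = derived.compute()   # identical, rebuilt purely from lineage
```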

Recently I implemented an operator layer on our incremental computing engine, designed with reference to FlumeJava, Trident, and RDD; it is still being tested. As I said at the beginning, pig on storm is the surface; what it implies is the mixing of operators. The ultimate vision is a unified DAG that carries Pig, Hive, SQL, and other DSLs on top and connects to different computing systems below. Implementing it is not hard; the real difficulty may not be a technical problem at all.


Conclusion: RDD's two decisive advantages, ease of use and data reuse, are hard for other systems to achieve. The second in particular is the essence of RDD.


Comparison with Storm

Nathan Marz implemented Storm and, after ElephantDB, proposed a solution in "How to beat the CAP theorem" based on his understanding. In his Lambda Architecture, Storm is positioned for stream processing, while the serving layer for ad-hoc style queries is HBase. Under the vision of our current incremental computing framework, I believe the stream and ad-hoc layers can be unified by the incremental computing engine. Why?


query = function(all data)


A query over static data is ad-hoc computing; a continuous query over dynamic data is stream computing. Storm fits the second case, and an incremental computing framework can serve as a third option spanning both. Storm topology submission is a serious problem: by the time Nimbus has distributed the bolts and spouts, you have waited an age. Storm is indeed suited to stream computing. Why? Because the essence of a stream is messages: Storm abstracts the topology layer, the message channels between bolts, and the ack mechanism. That layer of abstraction satisfies stream computing, but the worker and scheduling layers fall far short of stream scenarios where the queries themselves keep changing. The framework we are building will address this. From there, streaming, batch, and iteration are unified, going beyond today's stream computing: not just streaming SQL, but every DSL on a stream can be implemented through the operator layer.
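A sketch of the "query = function(all data)" unification argued for above (invented names and data): the query-driven path recomputes over all static data, while the data-driven path maintains the same answer incrementally as records arrive. Because both yield identical results, one incremental engine could in principle serve both layers.

```python
def adhoc_query(all_data):
    """Query-driven: literally run function(all data) from scratch."""
    return sum(all_data)

class IncrementalQuery:
    """Data-driven: fold each arriving record into a maintained answer."""
    def __init__(self):
        self.value = 0

    def on_record(self, record):
        self.value += record
        return self.value

data = [5, 1, 7]
batch_answer = adhoc_query(data)   # one-shot query over static data

engine = IncrementalQuery()
for record in data:
    stream_answer = engine.on_record(record)   # continuously maintained
# stream_answer == batch_answer: the two layers agree on the result
```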


Conclusion: how can data-driven and query-driven scenarios be handled in a unified way? Incremental computing leaves enormous room for imagination, and the importance of the operator layer stands out.

