Twitter Open-Sources Summingbird: Unified Batch and Stream Processing with Near-Native Code

Depending on the use case, big data processing is gradually evolving toward two extremes: batch processing and stream processing. Stream processing focuses on real-time analysis of data, with Storm and S4 as its representative tools. Batch processing focuses on long-term data mining; its typical tool is Hadoop, which grew out of Google's three well-known papers.

With the "explosion" of data, companies are racking their brains over big data processing, aiming to make it faster and more accurate. Summingbird, a recently released open-source tool, breaks this pattern and moves a step closer to seamless integration of stream and batch processing.

Development background

As is well known, Twitter has largely completed its transition to a service-oriented architecture, and its many services have different data processing requirements. One situation is therefore unavoidable: services such as trending topics and search need real-time processing from the outset, while the full value of the data only emerges after deep mining, that is, batch processing. Reducing the cost of moving between the two is clearly important, and Summingbird emerged to address it.

Summingbird

Introduction

On September 3, Twitter open-sourced Summingbird, a big data processing system that reduces conversion overhead by unifying batch and stream processing.

According to Twitter's introduction of Summingbird, developers can write MapReduce jobs in Summingbird using code that is very close to native Scala or Java. Here is a word count example in pure Scala:
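
The listing is missing from this copy of the article. A minimal sketch of what such a pure Scala word count might look like follows. The `toWords` tokenizer and the `PureScalaWordCount` wrapper object are assumptions made for illustration, not the original code.

```scala
import scala.collection.mutable

object PureScalaWordCount {
  // Hypothetical tokenizer: lowercases the sentence and splits on whitespace.
  def toWords(sentence: String): List[String] =
    sentence.toLowerCase.split("\\s+").filter(_.nonEmpty).toList

  // Emit (word, 1) for every word, then add each count into the mutable store.
  def wordCount(source: Iterable[String], store: mutable.Map[String, Long]): Unit =
    source
      .flatMap(sentence => toWords(sentence).map(_ -> 1L))
      .foreach { case (k, v) => store.update(k, store.getOrElse(k, 0L) + v) }
}
```

The "map" step is the `flatMap` that produces key-value pairs, and the "reduce" step is the per-key addition into `store`.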

Word counting in Summingbird requires code like this:
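
The Summingbird listing is missing from this copy of the article. The sketch below follows the shape of the word count shown in the Summingbird project README, as best it can be reconstructed: the job is written against an abstract `Platform`, reads from a `Producer`, and aggregates with `sumByKey` into the platform's `Store`. It assumes the summingbird-core library is on the classpath and Scala 2 is used. Exact signatures may differ between versions, and `toWords` is a hypothetical helper.

```scala
import com.twitter.summingbird.{Platform, Producer}

object SummingbirdWordCount {
  // Hypothetical tokenizer: lowercases the sentence and splits on whitespace.
  def toWords(sentence: String): List[String] =
    sentence.toLowerCase.split("\\s+").filter(_.nonEmpty).toList

  // Written once against an abstract platform P; the chosen platform
  // (batch, real-time, or in-memory) decides how the job actually runs.
  def wordCount[P <: Platform[P]](source: Producer[P, String], store: P#Store[String, Long]) =
    source
      .flatMap(sentence => toWords(sentence).map(_ -> 1L))
      .sumByKey(store) // per-key sum replaces the explicit store.update loop
}
```

Compared with a plain Scala version, only the types of `source` and `store` and the final aggregation call change.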

It is easy to see that the two share the same logic and almost identical code. The difference is that a Summingbird program can run as a batch job (using Scalding) or as a real-time job (using Storm). It can also run in a hybrid of the two modes, which gives the application strong fault tolerance.

Core Concepts

Summingbird jobs produce two types of data: streams and snapshots. A stream contains the full history of the data, while a store contains a snapshot of the system at a given point in time. The Summingbird core is implemented through the following components:

Producer -- The Producer is Summingbird's data stream abstraction. It is passed to a specific platform, which compiles it into a streaming MapReduce job.

Platform -- A Platform instance can be implemented for any streaming MapReduce library. The Summingbird library ships with Platform support for Storm, Scalding, and in-memory execution.

Source -- A Source represents a source of data. Each platform has its own definition of a data source; for example, the Memory platform defines Source[T] as any TraversableOnce[T].

Store -- A Store is where the "reduce" step of a Summingbird streaming MapReduce job takes place. It contains a snapshot of the aggregated value for every key.

Sink -- A Sink lets you materialize a stream of the values produced by a Producer. Unlike a Store, a Sink is a stream rather than a snapshot.

Service -- A Service lets the user perform a "lookup join" or "leftJoin" against the current values in a Producer stream. The joined values can come from a snapshot of another Store, from the stream of another Sink, or even from some other asynchronous function.

Plan -- A Plan is produced by calling platform.plan(producer) and is the final executable form of the MapReduce stream. For Storm, the Plan is a StormTopology instance that the user can run through the means Storm provides. For the Memory platform, the Plan is an in-memory stream containing the output of the supplied Producer.
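
The difference between a Store, a Sink, and a Service can be sketched on plain Scala collections. Everything in the block below (the type aliases, `ConceptsSketch`, `run`, and the click-event data shape) is a toy stand-in invented for illustration and is not Summingbird's API.

```scala
import scala.collection.mutable

object ConceptsSketch {
  // Toy stand-ins, NOT Summingbird types.
  type Store[K, V] = mutable.Map[K, V]  // snapshot: current aggregate per key
  type Sink[T] = mutable.Buffer[T]      // stream: every value, in arrival order
  type Service[K, V] = K => Option[V]   // lookup used for a left join

  def run(
      events: Iterable[(String, Long)], // hypothetical (userId, clicks) events
      names: Service[String, String],
      sink: Sink[(String, (Long, Option[String]))],
      store: Store[String, Long]
  ): Unit =
    events.foreach { case (user, clicks) =>
      // Left join: the event is kept even when the lookup finds nothing.
      sink += (user -> (clicks -> names(user)))
      // Reduce: fold the event into the per-key snapshot.
      store.update(user, store.getOrElse(user, 0L) + clicks)
    }
}
```

After a run, the sink holds one entry per input event, while the store holds one entry per key with the summed value.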

For more information, please visit: Core concepts of Summingbird

Related projects

Summingbird has spawned a number of subprojects that deserve attention:

Algebird -- An abstract algebra library for Scala. Many of Algebird's data structures implement Monoid, which makes them well suited to aggregation in Summingbird.

Bijection -- Summingbird uses the Injection type from the Bijection project to share serialization between different clients and execution platforms.

Chill -- Summingbird's Storm and Scalding platforms both use the Kryo library for serialization. Chill is a useful complement to Kryo: it includes many configuration options and provides usage patterns for Storm, Scala, and Hadoop. Chill is also used by Spark, from the Berkeley AMPLab.

Tormenta -- Tormenta provides a type-safe layer over Storm's Scheme and Spout interfaces.

Storehaus -- Summingbird's client is implemented on top of Storehaus's asynchronous key-value store. The Storm platform uses Storehaus's MergeableStore trait to perform real-time aggregation into several common backing stores, including Memcache and Redis.
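
To illustrate why monoids suit this kind of aggregation, here is a minimal sketch with a hand-rolled `Monoid` trait. It is a stand-in for the idea and not Algebird's actual API; `MonoidSketch`, `sum`, and `count` are hypothetical names. Because the combine operation is associative, partial results computed separately (for example, one by a batch job and one by a real-time job) can be merged into the same answer as a single pass over all the data.

```scala
object MonoidSketch {
  // Hand-rolled stand-in for the monoid idea; NOT Algebird's API.
  trait Monoid[T] {
    def zero: T
    def plus(a: T, b: T): T
  }

  implicit val longMonoid: Monoid[Long] = new Monoid[Long] {
    val zero: Long = 0L
    def plus(a: Long, b: Long): Long = a + b
  }

  // Maps form a monoid whenever their values do: merge key by key.
  implicit def mapMonoid[K, V](implicit vm: Monoid[V]): Monoid[Map[K, V]] =
    new Monoid[Map[K, V]] {
      val zero: Map[K, V] = Map.empty
      def plus(a: Map[K, V], b: Map[K, V]): Map[K, V] =
        b.foldLeft(a) { case (acc, (k, v)) =>
          acc.updated(k, vm.plus(acc.getOrElse(k, vm.zero), v))
        }
    }

  // Combine any collection of values that has a monoid.
  def sum[T](items: Iterable[T])(implicit m: Monoid[T]): T =
    items.foldLeft(m.zero)(m.plus)

  // Word count expressed as a monoid sum of single-entry maps.
  def count(words: Iterable[String]): Map[String, Long] =
    sum(words.map(w => Map(w -> 1L)))
}
```

Splitting the input at any point, counting each part on its own, and summing the two resulting maps yields the same counts as counting the whole input at once.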

Future plans

Support more platforms, with Spark and Akka first in line

Pluggable optimizations for the Producer layer

Support for filterable data sources, such as Parquet

Add more advanced math and machine learning code to the Producer primitives

Implement more extensions through related projects

Release more tutorials based on public data sources
