Depending on the usage scenario, big data processing has gradually evolved toward two extremes: batch processing and stream processing. Stream processing focuses on real-time analysis of data, with Storm and S4 as representative tools; batch processing focuses on long-term data mining, and its typical tool is Hadoop, derived from the three famous Google papers.
With the explosion of data, companies are racking their brains over big data processing, aiming to be faster and more accurate. However, a recently open-sourced tool, Summingbird, changes the rhythm, taking a step closer to seamless integration of stream and batch processing.
Development background
As we all know, Twitter's systems have largely completed the transition to a service-oriented architecture, and different services have different requirements for data processing. Inevitably, situations like this arise: services such as trending topics and search need real-time processing at first, while the deeper value of the data must later be extracted through deep mining, i.e. batch processing. The importance of reducing the cost of converting between the two is obvious, and Summingbird emerged to address it.
Summingbird
Related introduction
On September 3, Twitter open-sourced a big data processing system called Summingbird, which reduces conversion overhead by consolidating batch and stream processing.
From Twitter's introduction of Summingbird, we also learn that developers can write MapReduce jobs on Summingbird in code very close to native Scala or Java. Here is an example of word counting in pure Scala:
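The original listing did not survive; a minimal, self-contained sketch of word counting in plain Scala, close in spirit to Twitter's published example (the helper `toWords` is an assumption), might look like this:

```scala
import scala.collection.mutable

// Split a sentence into lowercase words (helper assumed by the example).
def toWords(sentence: String): Seq[String] =
  sentence.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

// Word count in plain Scala: flatMap each sentence into (word, 1L)
// pairs, then fold the counts into a mutable map.
def wordCount(source: Iterable[String],
              store: mutable.Map[String, Long]): Unit =
  source.flatMap(sentence => toWords(sentence).map(_ -> 1L))
    .foreach { case (k, v) =>
      store.update(k, store.getOrElse(k, 0L) + v)
    }

val store = mutable.Map.empty[String, Long]
wordCount(Seq("hello world", "hello summingbird"), store)
// store now holds the count of each word
```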
While doing word counting in Summingbird requires code like this:
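This listing is also missing; based on Twitter's announcement, the platform-generic Summingbird version looks roughly like the following. It assumes the summingbird library on the classpath and the same hypothetical `toWords` helper, so it is a sketch rather than a self-contained program:

```scala
import com.twitter.summingbird.{Platform, Producer}

// The same word count, written against Summingbird's Producer
// abstraction. It is generic in the Platform parameter P, so the
// identical code can be planned onto Storm (real-time) or
// Scalding (batch).
def wordCount[P <: Platform[P]](
    source: Producer[P, String],
    store: P#Store[String, Long]) =
  source
    .flatMap(sentence => toWords(sentence).map(_ -> 1L))
    .sumByKey(store)
```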
It is easy to see that the two have the same logic and almost identical code. The difference is that you can run the Summingbird job either in "batch" mode (using Scalding) or in "real-time" mode (using Storm); you can also run a hybrid of the two modes to give your application unmatched fault tolerance.
Core Concepts
Summingbird jobs produce two types of data: streams and snapshots. A stream contains the full history of the data, while a store contains a snapshot of the system's state at a given time. The Summingbird core is implemented through a number of components:
Producer -- a Producer is Summingbird's data-flow abstraction; it is passed to a specific Platform, which compiles it into a MapReduce stream.
Platform -- a Platform instance can be implemented for any streaming MapReduce library; the Summingbird library ships Platform support for Storm, Scalding, and in-memory processing.
Source -- a Source represents a source of data. Each platform has its own definition of a data source; for example, the memory platform defines Source[T] as any TraversableOnce[T].
Store -- a Store is where the "reduce" of Summingbird's MapReduce stream takes place; it holds a snapshot of the aggregated value for every key.
Sink -- unlike a Store, a Sink lets you materialize the un-aggregated stream of values represented by a Producer. A Sink is a stream, not a snapshot.
Service -- a Service lets the user perform a "lookup join" or "leftJoin" against current values in the Producer stream. The joined value can be a snapshot from another Store, a stream from another Sink, or even some other asynchronous function.
Plan -- a Plan, produced by calling platform.plan(producer), is the final realization of the MapReduce stream. For Storm, the Plan is a StormTopology instance that the user can execute with the tools Storm provides. For the memory platform, the Plan is an in-memory stream containing the output delivered by the Producer.
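Wiring these components together on the memory platform might look like the following sketch. It assumes the summingbird-core library is on the classpath, and the exact helper names (`Producer.source`, `Memory#plan`/`run`) are assumptions that may differ across versions:

```scala
import com.twitter.summingbird.Producer
import com.twitter.summingbird.memory.Memory
import scala.collection.mutable

// For the memory platform, Store[K, V] is a mutable.Map.
val store = mutable.Map.empty[String, Long]

// For the memory platform, Source[T] is any TraversableOnce[T].
val source = Producer.source[Memory, String](List("hello world", "hello"))

// The Producer graph: the same shape as the word-count job above.
val job = source
  .flatMap(s => s.split("\\s+").map(_ -> 1L))
  .sumByKey(store)

// Compile the Producer into a platform-specific Plan, then run it;
// afterwards, store holds the aggregated counts.
val memory = new Memory
memory.run(memory.plan(job))
```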
For more information, please visit: Core concepts of Summingbird
Related projects
Summingbird has spawned a number of subprojects that deserve attention:
Algebird -- a Scala abstract algebra library. Algebird implements many data structures as monoids so that Summingbird can aggregate them.
Bijection -- Summingbird uses the Injection type class from the Bijection project to share serialization between different clients and execution platforms.
Chill -- Summingbird's Storm and Scalding platforms use the Kryo library for serialization. Chill is a useful complement to Kryo, adding many configuration options and providing usage modules for Storm, Scala, and Hadoop. Chill is also used by Spark from Berkeley's AMP Lab.
Tormenta -- Tormenta provides a type-safe layer over Storm's scheme and spout interfaces.
Storehaus -- the Summingbird client is implemented on top of Storehaus's asynchronous key-value stores. The Storm platform uses Storehaus's MergeableStore trait to do real-time aggregation against common backing stores, including Memcache and Redis.
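The monoid abstraction at the heart of Algebird can be illustrated with a small, self-contained Scala sketch. This is an illustrative re-implementation, not Algebird's actual code:

```scala
// A minimal Monoid type class: an associative combine plus an identity.
trait Monoid[T] {
  def zero: T
  def plus(a: T, b: T): T
}

implicit val longMonoid: Monoid[Long] = new Monoid[Long] {
  def zero = 0L
  def plus(a: Long, b: Long) = a + b
}

// Any Map whose values form a monoid is itself a monoid: keys are
// merged and their values combined. This is exactly the shape of
// aggregation a word-count store needs.
implicit def mapMonoid[K, V](implicit m: Monoid[V]): Monoid[Map[K, V]] =
  new Monoid[Map[K, V]] {
    def zero = Map.empty[K, V]
    def plus(a: Map[K, V], b: Map[K, V]) =
      b.foldLeft(a) { case (acc, (k, v)) =>
        acc.updated(k, m.plus(acc.getOrElse(k, m.zero), v))
      }
  }

def sumAll[T](xs: Seq[T])(implicit m: Monoid[T]): T =
  xs.foldLeft(m.zero)(m.plus)

// Because the combine is associative, partial sums can be computed
// on different machines (or in batch vs. real-time) and merged later.
val partial1 = Map("hello" -> 1L, "world" -> 1L)
val partial2 = Map("hello" -> 1L)
val merged = sumAll(Seq(partial1, partial2))
```

Associativity is what lets Summingbird merge batch-computed and real-time-computed partial results into one consistent answer.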
Future plans
Support for more platforms, with Spark and Akka first in line
Pluggable optimization of the Producer layer
Support for filtered data sources, such as Parquet
More advanced math and machine-learning code injected into the Producer primitives
More extensions implemented through the related projects
More tutorials released using public data sources