Apache Beam: The next-generation standard for big data processing


Apache Beam (formerly Google Cloud Dataflow) is an Apache incubator project that Google contributed to the Apache Software Foundation in February 2016. Following MapReduce, GFS, and BigQuery, it is considered another major contribution from Google to the open source community in the area of big data processing. The main goal of Apache Beam is to unify the programming paradigms for batch and stream processing, providing a simple, flexible, feature-rich, and highly expressive SDK for processing unbounded, out-of-order, web-scale data sets. The project focuses on the programming paradigm and interface definitions for data processing and does not involve the implementation of any specific execution engine; the intent is that data processing programs developed with Beam can run on any distributed computing engine. This article introduces the Beam Model, Apache Beam's programming paradigm, and shows how the Beam SDK can be used to express distributed data processing business logic conveniently and flexibly. The hope is that readers will come away with a preliminary understanding of Apache Beam, as well as of how a distributed data processing system can handle an out-of-order, unbounded data stream.

Apache Beam Basic Architecture

As distributed data processing has continued to evolve, new techniques keep being proposed and more and more distributed data processing frameworks have emerged, from the earliest Hadoop MapReduce, to Apache Spark and Apache Storm, and more recently Apache Flink, Apache Apex, and others. A new framework may bring higher performance, richer functionality, or lower latency, but the cost for users to switch to it is also significant: a new data processing framework must be learned and all business logic rewritten. The idea behind solving this problem has two parts. First, a unified programming paradigm that captures the requirements of distributed data processing, such as the need to unify batch and stream processing. Second, the distributed data processing tasks produced under this paradigm should be able to run on any distributed execution engine, so that users can freely switch the execution engine and execution environment of their tasks. Apache Beam was proposed precisely to solve these problems.

Apache Beam consists mainly of the Beam SDK and Beam Runners. The Beam SDK defines the API for writing the business logic of distributed data processing tasks, and the resulting data processing pipeline is handed to a specific Beam Runner for execution. The API currently supported by Apache Beam is implemented in Java, with a Python version under development. The execution engines supported by Apache Beam include Apache Flink, Apache Spark, and Google Cloud Dataflow; support for additional engines such as Apache Storm, Apache Hadoop, and Apache Gearpump is being discussed or developed. The basic architecture is shown below:

Figure 1: Apache Beam architecture

It is important to note that although the Apache Beam community would very much like every Beam execution engine to support the complete feature set defined by the Beam SDK, this may not be achievable in practice. For example, a MapReduce-based runner would obviously have difficulty implementing the stream-processing-related features. Google Cloud Dataflow is currently the execution engine with the most complete support for the Beam SDK feature set, and among the open source execution engines, Apache Flink has the most complete support.
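To make the separation between SDK and runner concrete, the following is a minimal sketch, not taken from the Beam source, of how a pipeline is bootstrapped and handed to a runner with the Beam Java SDK. The class name BeamPipelineSkeleton is hypothetical, and the exact runner names accepted on the command line vary by Beam release; the point is that the business logic built on the pipeline does not change when the runner changes.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class BeamPipelineSkeleton {
  public static void main(String[] args) {
    // The runner (e.g. --runner=FlinkRunner, SparkRunner, DataflowRunner) and its settings
    // are read from the command line; the pipeline-building code itself does not change.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);

    // [... apply sources, transforms, and sinks here, as in the examples below ...]

    pipeline.run();
  }
}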

Beam Model

The Beam Model is Beam's programming paradigm, the design idea behind the Beam SDK. Before introducing the Beam Model, let us briefly review the problem domain it addresses and some basic concepts.

    1. Data. The data to be processed in distributed data processing can be divided into two kinds: bounded data sets and unbounded data streams. A bounded data set, such as a file in HDFS or an HBase table, has the characteristic that the data already exists and is generally persistent; it does not suddenly disappear. An unbounded data stream, such as a Kafka topic or a Twitter stream from the Twitter API, is characterized by data that arrives dynamically, is endless, and is not persisted. In general, batch frameworks are designed to handle bounded data sets, while stream processing frameworks are designed to handle unbounded data streams. A bounded data set can be seen as a special case of an unbounded data stream, so from the point of view of data processing logic there should be no difference between the two. For example, suppose microblog data contains a timestamp and a forward count, and the user wants to sum the forwards per hour; this business logic should be able to run on both a bounded data set and an unbounded data stream, and the data source should have no effect on how the business logic is implemented (see the sketch after this list).
    2. Time. Processing time is the time at which a piece of data enters the distributed processing framework, while event time is the time at which the data was generated. These two times are usually different; for example, for a streaming task that processes microblog data, a post published at 2016-06-01 12:00:00 may, after some network transmission delay, enter the stream processing system at 2016-06-01 12:01:30. Batch tasks usually compute over the full data set and pay little attention to the time attribute of the data, but for stream processing tasks the data stream is endless and a full computation is impossible, so data is usually computed over windows; for most streaming tasks, dividing the data into windows by time is probably the most common requirement.
    3. Out-of-order data. The order in which a stream processing framework receives data may not be strictly the chronological order of event time. If a time window is defined based on processing time, then the order in which data arrives is the order of the data, so there is no out-of-order problem. However, for windows defined on event time, data generated earlier may arrive after data generated later, which is very common with distributed data sources. In this case, how to determine which data is late, and how to handle late data, is usually a tricky problem.
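As a rough sketch of the first point, written in the same placeholder style as the examples later in this article: the hourly forward-count sum is expressed once, and only the input step differs between a bounded file and an unbounded stream. The Weibo class, its getters, and the weiboEvents name are hypothetical stand-ins for the microblog record.

weiboEvents
  [... input: a bounded file or an unbounded stream ...]
  [... parse into Weibo objects ...]
  // use the publish time carried in each record as the event time
  .apply(WithTimestamps.of((Weibo w) -> new Instant(w.getPublishTime())))
  // 1-hour windows based on that event time
  .apply(Window.<Weibo>into(FixedWindows.of(Duration.standardHours(1))))
  // key by user and sum the forward counts within each window
  .apply(MapElements
      .via((Weibo w) -> KV.of(w.getUser(), w.getForwards()))
      .withOutputType(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.longs())))
  .apply(Sum.<String>longsPerKey());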

The target of the Beam Model is an unbounded, out-of-order data stream; data that happens to arrive in order, or a bounded data set, can both be regarded as special cases of an unbounded, out-of-order stream. The Beam Model summarizes the questions a user needs to answer when defining data processing along the following four dimensions:

    1. What. What results are computed? For example, a sum, a join, or training a machine learning model. Specified in the Beam SDK by the operators in the pipeline.
    2. Where. Over what range of data are results computed? For example, a window based on processing time, a window based on event time, a sliding window, and so on. Specified in the Beam SDK by the windowing in the pipeline.
    3. When. When are results emitted? For example, for a 1-hour event-time window, output the current window's result every minute. Specified in the Beam SDK by watermarks and triggers in the pipeline.
    4. How. How is late data handled? For example, late data may be emitted as an incremental result, or combined with the results already computed for the window and emitted as a total. Specified in the Beam SDK by the accumulation mode.

The Beam Model abstracts these "WWWH" four dimensions into the Beam SDK. When building data processing business logic with the Beam SDK, at each step the user only needs to call the appropriate API along these four dimensions according to the business requirements to generate a distributed data processing pipeline, which is then submitted to a specific execution engine. The "WWWH" abstraction is concerned solely with the business logic itself and has nothing to do with how the distributed task is executed.
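As a schematic sketch, not a complete pipeline, the four dimensions map onto the Beam Java SDK roughly as follows; scores stands for an already-parsed PCollection of per-key score pairs, and the window size, lateness, and trigger here are arbitrary.

scores  // a PCollection<KV<String, Integer>>
  .apply(Window.<KV<String, Integer>>into(
          FixedWindows.of(Duration.standardHours(1)))      // Where: 1-hour event-time windows
      .triggering(AfterWatermark.pastEndOfWindow())        // When: emit when the watermark passes the window
      .withAllowedLateness(Duration.standardMinutes(30))   // When: how long to keep waiting for late data
      .accumulatingFiredPanes())                           // How: later firings refine earlier results
  .apply(Sum.<String>integersPerKey());                    // What: the computation itself (a per-key sum)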

Beam SDK

Unlike Apache Flink or Apache Spark, the Beam SDK uses the same set of APIs to represent data sources, output targets, and operators. Four data processing tasks based on the Beam SDK are described below; through them, readers can see how the Beam Model describes batch and stream processing tasks in a unified and flexible way. All four tasks handle statistics for the mobile gaming domain:

    1. User score. A batch task that computes each user's total score over a bounded data set.
    2. Team score per hour. A batch task that computes each team's score per hour over a bounded data set.
    3. Leaderboard. A stream processing task with two statistics: each team's score per hour, and each user's real-time cumulative historical score.
    4. Game state. A stream processing task that computes each team's score per hour, as well as more complex hourly statistics such as each user's online time per hour.

Note: The sample code comes from the Beam source code; see Apache/incubator-beam for the exact location. Some of the analysis draws on Beam's official documentation; see the reference links for details.

The following analyzes the business logic of each task along the "WWWH" four dimensions of the Beam Model, and shows through code how the four dimensions are expressed with the Beam SDK.

User Score

Computing each user's total historical score is a very simple task. Here we simply implement it as a batch task; whenever updated user scores are needed, the batch task is re-executed. For the user score task, the "WWWH" analysis is as follows: What, sum the scores, grouped by user; Where, over all of the data in a single global window; When, output the result once all data has been processed; How, not applicable, since a batch task over a bounded data set has no late data.

Based on this "WWWH" analysis, the Beam Java SDK code for the user score batch task is as follows:

gameEvents
  [... input ...]
  [... parse ...]
  .apply("ExtractUserScore", new ExtractAndSumScore("user"))
  [... output ...];

ExtractAndSumScore implements the logic described in "What": group by user, then sum the scores. The relevant code is as follows:

gameInfo
  .apply(MapElements
      .via((GameActionInfo gInfo) -> KV.of(gInfo.getKey(field), gInfo.getScore()))
      .withOutputType(
          TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers())))
  .apply(Sum.<String>integersPerKey());

MapElements determines the key and value, here the user and the score, and Sum then groups by key and sums the scores. Beam supports combining multiple operations on data into a single composite operation, which not only makes the business logic clearer, but also allows the combined logic to be reused in multiple places.
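For reference, the snippet above is the body of such a composite transform. A sketch of how it is declared in the incubator-beam example looks roughly like this; the field parameter selects whether to group by user or by team, and note that the incubator-era SDK overrides apply(), which later releases renamed to expand().

public static class ExtractAndSumScore
    extends PTransform<PCollection<GameActionInfo>, PCollection<KV<String, Integer>>> {

  private final String field;

  ExtractAndSumScore(String field) {
    this.field = field;  // "user" or "team": the column to group by
  }

  @Override
  public PCollection<KV<String, Integer>> apply(PCollection<GameActionInfo> gameInfo) {
    return gameInfo
      .apply(MapElements
          .via((GameActionInfo gInfo) -> KV.of(gInfo.getKey(field), gInfo.getScore()))
          .withOutputType(
              TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers())))
      .apply(Sum.<String>integersPerKey());
  }
}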

Team score per hour

By counting each team's score per hour, the team with the highest score can be rewarded. This adds a windowing requirement, but we can still implement it as a batch task. The "WWWH" analysis of this task is as follows: What, sum the scores, grouped by team; Where, 1-hour fixed windows based on event time; When, output the results once all data has been processed; How, not applicable, as before.

Compared with the first user score task, the "Where" part now answers the question "Over what range of data are results computed?", and in the "What" part the grouping condition changes from user to team. This is also reflected in the code:

gameEvents
  [... input ...]
  [... parse ...]
  .apply("AddEventTimestamps", WithTimestamps.of((GameActionInfo i)
      -> new Instant(i.getTimestamp())))
  .apply("FixedWindowsTeam", Window.<GameActionInfo>into(
      FixedWindows.of(Duration.standardMinutes(windowDuration))))
  .apply("ExtractTeamScore", new ExtractAndSumScore("team"))
  [... output ...];

"Addeventtimestamps" defines how to extract eventtime data from raw data, "Fixedwindowsteam" defines a 1-hour fixed window and then reuses the Extractandsumscore class, Just change the grouped columns from the user to the team. For the hourly Team score task, a new business logic is introduced about the "where" part of the window definition, but as you can see from the code, the implementation of the "where" section and the implementation of the "What" section are completely independent, and the user only needs to add two new lines of code about "where" Very simple and clear.

Leaderboard

The first two tasks are batch tasks over bounded data sets. For the leaderboard we also need to count user scores and hourly team scores, but from a business perspective we expect the results in real time. For Apache Beam, the only difference between a batch task and a stream processing task with the same processing logic is the task's input and output; the business logic of the pipeline in between needs no change at all. For the leaderboard task in this example, we not only want it to satisfy the same business logic as the first two examples, but also to meet more customized business requirements, for example:

    1. An important feature of a stream processing task relative to a batch task is that it can return results in closer to real time. For example, when computing the hourly team score over a 1-hour window, the default is to output the final result only after all data for the hour has arrived; a stream processing system should also support emitting partial results when only part of the data for the window has arrived, so that the user can get analysis results in real time.
    2. The results must remain consistent with those of the batch task. Because of out-of-order data, how does the system decide that all data for a given window has arrived (the watermark)? How is late data handled? How are results emitted: as totals, as increments, or side by side? A stream processing system should provide mechanisms that let the user achieve eventual correctness of the results while satisfying low-latency requirements.

These two problems correspond to the "When" and "How" questions used to define the user's data analysis requirements. "When" depends on how often the user wants to receive results; in answering "When", the output can basically be divided into four stages:

    1. Early. Before the window ends, determines when intermediate results are emitted.
    2. On-time. When the window ends, the result computed from the window's data is emitted. Because of out-of-order data, determining when the window ends may rely on the user's estimate based on additional knowledge, and data belonging to the window may still arrive after the window is considered closed.
    3. Late. After the window has ended, late data may still arrive; this stage determines when results are emitted for such data.
    4. Final. The maximum tolerated lateness, for example 1 hour. Once the final waiting time is reached, the final result is emitted, no further late data is accepted, and the window's state data is cleared.

For the hourly team score stream processing task, the expected business logic in this example is: compute each team's score over 1-hour windows based on event time; within an hourly window, emit the current team score every 5 minutes; for late data, emit the updated team score every 10 minutes; data arriving more than 2 hours after the end of its window is unlikely to occur and, if it does, is simply discarded. Expressed along the "WWWH" dimensions: What, sum the scores, grouped by team; Where, 1-hour fixed windows on event time; When, early results every 5 minutes, on-time results at the watermark, late results every 10 minutes, with a maximum lateness of 2 hours; How, late results accumulate into the results already emitted.

In the Beam SDK implementation, the user's business logic along the "WWWH" dimensions of the Beam Model can be expressed independently and directly:

gameEvents
  [... input ...]
  .apply("LeaderboardTeamFixedWindows", Window
      .<GameActionInfo>into(FixedWindows.of(Duration.standardMinutes(60)))
      .triggering(AfterWatermark.pastEndOfWindow()
          .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
              .plusDelayOf(Duration.standardMinutes(5)))
          .withLateFirings(AfterProcessingTime.pastFirstElementInPane()
              .plusDelayOf(Duration.standardMinutes(10))))
      .withAllowedLateness(Duration.standardMinutes(120))
      .accumulatingFiredPanes())
  .apply("ExtractTeamScore", new ExtractAndSumScore("team"))
  [... output ...]

"LeaderboardTeamFixedWindows" corresponds to "Where" and defines the window; the trigger corresponds to "When" and defines the conditions under which results are output; the accumulation mode corresponds to "How" and defines what the output results contain; and "ExtractTeamScore" corresponds to "What" and defines the computation logic.

Summary

Apache Beam's Beam Model is an elegant abstraction for processing unbounded, out-of-order data streams, and its description of data processing along the "WWWH" four dimensions is clear and reasonable. The Beam Model unifies the processing of unbounded data streams and bounded data sets, and it clarifies the programming paradigm for unbounded data streams, expanding the business scope that stream processing systems can cover, for example with support for event-time/session windows and for out-of-order data. The API designs of Apache Flink, Apache Spark Streaming, and other projects draw more and more on the Apache Beam Model, and as Beam Runner implementations their compatibility with the Beam SDK keeps improving. This article has introduced the Beam Model and how to design data processing tasks based on it, hoping to give readers a preliminary understanding of the Apache Beam project. Since Apache Beam is still incubating in the Apache Incubator, readers can also follow its progress and status through the official website or the mailing lists.
