A preliminary study of Apache Beam


Author: Luxianghao

Source: http://www.cnblogs.com/luxianghao/p/9010748.html (please credit the source when reposting, thank you)

Disclaimer: this article represents only personal opinions; if anything is wrong, please correct me.

---

One: Introduction

In February 2016, Google announced that Beam (formerly Google DataFlow) had been donated to the Apache Foundation for incubation, and it has since become a top-level Apache open source project.

Beam is a unified programming framework that supports both batch processing and stream processing; pipelines built with the Beam programming model can run on multiple compute engines (Apache Apex, Apache Flink, Apache Spark, Google Cloud Dataflow, and so on).

Big data traces its origins to three papers Google published beginning in 2003 (GoogleFS, MapReduce, and BigTable), known in history as the "troika." Unfortunately, Google published the papers but not the source code. Even so, the Apache open source community flourished around these ideas, producing Hadoop, Spark, Apache Flink, and other products, while Google internally used closed-source systems such as BigTable, Spanner, and MillWheel. This time, however, Google did not just publish a paper and then disappear: it open-sourced Beam in a high-profile way. As the saying goes, "first-class companies set standards," and the benefits of open source are considerable, such as raising the company's influence, pooling ideas, and shared maintenance.

Two: Beam's advantages

1 Unified

Beam provides a unified programming model. For programming guidance, refer to the official programming-guide, and get started through quickstart-java and the wordcount-example.

2 Portable

Programmers do not need to worry, while coding, about which computing platform the code will eventually run on (Spark, Flink, or another engine); once coding is complete, the platform is selected on the command line.

3 Extensible

As mentioned under 2 Portable, DirectRunner and SparkRunner are just two examples: in Beam, the runners, IO connectors, transform libraries, and even the SDKs can all be customized, giving a high degree of extensibility.

4 Supports both batch and stream processing

Whether a program is written with Beam for batch or for stream processing, against a bounded dataset or an unbounded one, it can be executed without modification. (Beam solves this by introducing the concepts of triggers and windows.)

5 High Abstraction

Beam is highly abstract, built around a DAG (directed acyclic graph); programmers do not need to force their code into a map-shuffle-reduce shape and can directly perform higher-level operations such as counting, joining, and projecting.

6 Multi-language support

Java and Python are officially supported today, and SDKs for more languages will be developed later.

Three: Beam's components

Let's start with the overall architecture diagram (see the figure in the original article).

1 Beam programming model

Beam's programming model is an abstraction distilled by Google engineers from many big-data processing projects such as MapReduce, FlumeJava, and MillWheel; to learn more about it, refer to the related blog posts and papers: Streaming 101, Streaming 102, and the VLDB paper. The programming model consists of several core concepts, listed below (a minimal code sketch follows the list):

    1. PCollection: a dataset, representing the collection of data to be processed; it can be a bounded dataset or an unbounded data stream.
    2. PTransform: a computation, representing the process of turning an input dataset into an output dataset.
    3. Pipeline: the pipeline, representing the whole data-processing task; it can be visualized as a directed acyclic graph (DAG), with PCollections as nodes and transforms as edges.
    4. PipelineRunner: the executor, specifying where and how the pipeline will run.
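
To make these four concepts concrete, here is a minimal sketch of mine (not from the original article) using the Beam Java SDK; the class name and the sample data are arbitrary:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.PCollection;

    public class CoreConcepts {
      public static void main(String[] args) {
        // PipelineRunner: chosen through the options (DirectRunner by default).
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

        // Pipeline: the DAG that holds the whole job.
        Pipeline p = Pipeline.create(options);

        // PCollection: a (here bounded) dataset flowing through the pipeline.
        PCollection<String> words = p.apply(Create.of("hello", "beam", "hello"));

        // PTransform: turns one PCollection into another.
        PCollection<Long> total = words.apply(Count.globally());

        p.run().waitUntilFinish();
      }
    }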

PTransform also includes many operations, such as the following (a short sketch of the first two follows the list):

    • ParDo: Beam's generic parallel-processing PTransform, equivalent to the map step in the map/shuffle/reduce style; it can be used for filtering, type conversion, extracting parts of the data, computing over each element, and so on.
    • GroupByKey: aggregates key/value pairs, equivalent to the shuffle step in the map/shuffle/reduce style; it gathers the values that share the same key.
    • CoGroupByKey: aggregates multiple collections; its function is similar to GroupByKey.
    • Combine: processes the data in a collection, for example with Sum, Min, and Max (predefined in the SDK), or with a new class you build yourself.
    • Flatten: merges multiple datasets into a single dataset.
    • Partition: splits one dataset into several smaller datasets.
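
As a hedged sketch of ParDo and GroupByKey (my example, assuming an existing PCollection<String> named "lines"), a ParDo can split lines into (word, count) pairs and a GroupByKey can gather the counts per word:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // ParDo: the "map" step; emit one (word, 1L) pair per word.
    PCollection<KV<String, Long>> pairs = lines.apply(
        ParDo.of(new DoFn<String, KV<String, Long>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            for (String word : c.element().split("\\s+")) {
              if (!word.isEmpty()) {
                c.output(KV.of(word, 1L));
              }
            }
          }
        }));

    // GroupByKey: the "shuffle" step; collect all counts for the same word.
    PCollection<KV<String, Iterable<Long>>> grouped = pairs.apply(GroupByKey.create());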

In addition, there are core concepts such as the following (a sketch combining all three follows the list):

    • Windowing: divides the elements of a PCollection into subsets by timestamp.
    • Watermark: marks the point after which data arriving later than the allowed delay is simply discarded.
    • Triggers: determine when to emit the aggregated result of each window.
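
As a hedged sketch of these three concepts together (my example, assuming an existing timestamped PCollection<String> named "events"), a fixed one-minute window that fires at the watermark and drops data more than 30 seconds late could look like this:

    import org.joda.time.Duration;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;

    // Windowing: group elements into fixed one-minute windows by timestamp.
    PCollection<String> windowed = events.apply(
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            // Trigger: emit a window's aggregate when the watermark passes its end.
            .triggering(AfterWatermark.pastEndOfWindow())
            // Watermark/lateness: data later than 30 seconds is discarded.
            .withAllowedLateness(Duration.standardSeconds(30))
            .discardingFiredPanes());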

The Beam programming model can be summed up simply as:

[Output PCollection] = [Input PCollection].apply([Transform])
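
Concretely (a sketch of mine, assuming a Pipeline named "p" and the local pom.xml used by the later examples), each apply() takes an input PCollection and yields an output PCollection:

    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.values.PCollection;

    // [Output PCollection] = [Input PCollection].apply([Transform])
    PCollection<String> lines = p.apply(TextIO.read().from("pom.xml"));
    PCollection<Long> total = lines.apply(Count.globally());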

Google engineers also abstracted Beam programming scenarios into four questions, known as WWWH:

    • What results are being computed? The corresponding abstraction is PTransform.

    • Where in event time are they computed? The corresponding abstraction is windows.

    • When in processing time are the results emitted? The corresponding abstractions are watermarks and triggers.

    • How do refinements of the results relate to each other? The corresponding abstraction is accumulation.

Note: the phrasing here follows Streaming 102; a purely literal translation might not convey the intended meaning, so corrections are welcome if anything is off.

2 SDK

Beam supports building pipelines with SDKs in multiple languages; Java and Python are currently supported, and the Java SDK's support is relatively more complete.

3 Runner

Beam supports running pipelines on multiple distributed backends and currently provides the following PipelineRunners:

    • DirectRunner: executes the pipeline locally
    • ApexRunner: runs the pipeline on a YARN cluster (or in embedded mode)
    • DataflowRunner: runs the pipeline on Google Cloud Dataflow
    • FlinkRunner: runs the pipeline on a Flink cluster
    • SparkRunner: runs the pipeline on a Spark cluster
    • MRRunner: not yet available on the Beam GitHub master branch, but there is an mr-runner branch; see BEAM-165 for details
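
The runner is normally picked on the command line (see the examples in the next section), but as a hedged sketch of mine it can also be pinned in code through the options:

    import org.apache.beam.runners.direct.DirectRunner;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    PipelineOptions options = PipelineOptionsFactory.create();
    // Swap in SparkRunner.class, FlinkRunner.class, etc.; each lives in its
    // own runner artifact, which must be on the classpath.
    options.setRunner(DirectRunner.class);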

Four: An example

Let's get a feel for Beam through the official WordCount example; for details, refer to quickstart-java and the wordcount-example.

1 Getting the relevant code

mvn archetype:generate \
      -DarchetypeGroupId=org.apache.beam \
      -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
      -DarchetypeVersion=2.1.0 \
      -DgroupId=org.example \
      -DartifactId=word-count-beam \
      -Dversion="0.1" \
      -Dpackage=org.apache.beam.examples \
      -DinteractiveMode=false

2 Related Documents

$ cd word-count-beam/
$ ls
pom.xml    src

$ ls src/main/java/org/apache/beam/examples/
DebuggingWordCount.java    WindowedWordCount.java    common
MinimalWordCount.java      WordCount.java

3 Execute with DirectRunner

mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=pom.xml --output=counts"
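
If the run succeeds, the word counts land in shard files whose names begin with the --output prefix, something like the following (from my own runs of the quickstart; the shard count varies by run and machine):

$ ls counts*
counts-00000-of-00003    counts-00001-of-00003    counts-00002-of-00003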

4 Submit to spark

Mode 1

mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -Pspark-runner

Mode 2

spark-submit --class org.apache.beam.examples.WordCount --master local \
     target/word-count-beam-bundled-0.1.jar \
     --runner=SparkRunner --inputFile=pom.xml --output=counts

Mode 3

spark-submit --class org.apache.beam.examples.WordCount --master yarn --deploy-mode cluster \
     word-count-beam-bundled-0.1.jar \
     --runner=SparkRunner --inputFile=/home/yarn/software/java/license

More details on SparkRunner are given here. Note that in mode 3, reading the input file from HDFS runs into some problems, which we will cover separately; the example above can instead point at a file that actually exists on the physical machine, which ensures the program runs properly.

Five: References

Programming Guide: https://beam.apache.org/documentation/programming-guide
WordCount Example: https://beam.apache.org/get-started/wordcount-example/
Javadoc: https://beam.apache.org/documentation/sdks/javadoc/2.1.0/
Streaming 101: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
Streaming 102: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
