Apache Beam Anatomy

Source: Internet
Author: User
Tags: apache flink, apache beam

1. Overview

In the wave of big data, technologies are updated and iterated very frequently. Thanks to open source, big data developers have a wealth of tools to choose from, but that abundance also makes it harder to pick the right tool. The techniques used for big data processing are often diverse and depend entirely on the business requirements: MapReduce for batch processing, Flink for real-time streaming, Spark SQL for interactive SQL, and so on. It is easy to imagine how combining these open source frameworks, tools, libraries, and platforms for a given workload adds complexity, and that is a headache for big data developers. What I want to share today is a solution for integrating these resources: Apache Beam.

2. content

Apache Beam, originally called Apache Dataflow, was created when Google and its partners donated a large amount of core code to the Apache Software Foundation to incubate the project. Most of the project comes from the Cloud Dataflow SDK, and it features the following:

    • A unified programming paradigm for both batch and stream data processing
    • The ability to run on any supported execution engine

So what problems can Apache Beam solve, and what are its application scenarios? We can illustrate this with a diagram, as shown here:

From this diagram, we can clearly see the development of the whole technology landscape: one part belongs to the Google camp, the other to the Apache camp. When developing big data applications, we sometimes use Google's frameworks, APIs, libraries, and platforms, and at other times we use Apache's, such as HBase, Flink, and Spark. Integrating these resources used to be a headache; Apache Beam provides a very convenient solution for integrating them.

2.1 Vision

Below, let us look at Beam's execution flow through a flowchart, as shown here:

From it, we can clearly see that the execution process involves the following roles and steps:

    1. End Users: choose a programming language they are familiar with and submit their application
    2. SDK Writers: make the Beam programming model available in specific programming languages
    3. Library Writers: provide libraries of reusable transforms and formats built on the Beam model
    4. Runner Writers: support executing Beam data processing pipelines in a distributed environment
    5. IO Providers: make their storage and messaging systems usable from Beam data processing pipelines
    6. DSL Writers: create higher-level data processing pipelines and languages on top of Beam

2.2 SDK

The Beam SDKs provide a unified programming model for processing datasets of any size, including bounded datasets and unbounded streaming data. The Apache Beam SDKs use the same classes to represent bounded and unbounded data, and the same transforms to operate on that data. Beam provides several SDKs, and you can choose a familiar one to build your data processing pipeline; as figure 2.1 above shows, Beam currently supports development in Java, Python, and other languages.
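
To make this concrete, here is a minimal, hypothetical sketch in Java (the class name and file names are placeholders, and a recent Beam Java SDK is assumed; this is not code taken from the quickstart itself):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class UnifiedModelSketch {
      public static void main(String[] args) {
        // Command-line flags such as --runner are parsed into the pipeline options.
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // A bounded source. The same PCollection type and the same transforms would
        // also apply to an unbounded stream (for example read from a Kafka or Pub/Sub IO).
        PCollection<String> lines = p.apply(TextIO.read().from("input.txt"));

        // Write the data back out; any transforms you need would go in between.
        lines.apply(TextIO.write().to("output"));

        p.run().waitUntilFinish();
      }
    }

The point of the sketch is only that the pipeline, the PCollection, and the transforms are written once; whether the data is bounded or unbounded, and which engine executes it, are decided by the source and the runner, not by the pipeline code.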

2.3 Pipeline Runners

The engine that runs a Beam pipeline is the distributed processing engine of your choice; a compatible API translates your Beam application so that it can run efficiently on the specified distributed processing engine. Thus, when running a Beam program, you can choose a distributed processing engine according to your own needs (see the sketch after the following list for how that choice is expressed in code). Beam currently supports the following pipeline runners:

    • Apache Apex
    • Apache Flink
    • Apache Spark
    • Google Cloud Dataflow
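
As an illustration of that choice, the sketch below (hypothetical class name, assuming the Beam Java SDK with the Flink runner artifact on the classpath) shows the two usual ways to select a runner: via the --runner flag or programmatically on the pipeline options.

    import org.apache.beam.runners.flink.FlinkRunner;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RunnerSelectionSketch {
      public static void main(String[] args) {
        // Reads flags such as --runner=FlinkRunner and --inputFile=... from the command
        // line; if no --runner is given, the DirectRunner is used by default.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

        // Alternatively, pick the execution engine in code rather than on the command line.
        options.setRunner(FlinkRunner.class);

        Pipeline p = Pipeline.create(options);
        // ... build the pipeline here, then run it on the selected engine:
        p.run().waitUntilFinish();
      }
    }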

3. Example

This example uses the Java SDK; you can try running it on different execution engines.

3.1 Development environment
    • Download and install JDK 7 or later, and make sure the JAVA_HOME environment variable is set correctly
    • Download and install Maven for building and packaging

The installation steps above are not the focus of this blog post, so I will not repeat them here; if anything is unclear, you can browse the documentation on the official websites for installation instructions.

3.2 Download the sample code

The Apache Beam source code is hosted on GitHub and can be downloaded from: https://github.com/apache/beam

Then generate the sample project with Maven; the command is as follows:

 $ mvn archetype:generate \
       -DarchetypeRepository=https://repository.apache.org/content/groups/snapshots \
       -DarchetypeGroupId=org.apache.beam \
       -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
       -DarchetypeVersion=LATEST \
       -DgroupId=org.example \
       -DartifactId=word-count-beam \
       -Dversion="0.1" \
       -Dpackage=org.apache.beam.examples \
       -DinteractiveMode=false

At this point, the command has created a folder word-count-beam containing a pom.xml and the associated code files. You can inspect it as follows:

$ cd word-count-beam/
$ ls
pom.xml    src
$ ls src/main/java/org/apache/beam/examples/
DebuggingWordCount.java    WindowedWordCount.java    common
MinimalWordCount.java    WordCount.java

3.3 Running the WordCount sample code

A Beam program can run on multiple execution engines, including the ApexRunner, FlinkRunner, SparkRunner, and DataflowRunner. There is also the DirectRunner, which executes locally without any special configuration, so it is convenient for testing.

Below, you can select the engine on which you want to execute the program, as needed:

    1. Configure the relevant engine
    2. Use the corresponding command: specify the engine type with the --runner=<runner> parameter (the default is the DirectRunner), add any engine-specific parameters, and specify the output file and output directory. Make sure the output location is accessible to the execution engine; for example, a local file directory cannot be accessed by an external cluster.
    3. Run the sample program
3.3.1 Direct
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner
3.3.2 Apex
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=pom.xml --output=counts --runner=ApexRunner" -Papex-runner
3.3.3 Flink-local
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=FlinkRunner --inputFile=pom.xml --output=counts" -Pflink-runner
3.3.4 Flink-cluster
$ mvn package exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=FlinkRunner --flinkMaster=<flink master> --filesToStage=target/word-count-beam-bundled-0.1.jar \
                  --inputFile=/path/to/quickstart/pom.xml --output=/tmp/counts" -Pflink-runner

You can then monitor the running job by visiting http://<flink master>:8081.

3.3.5 Spark
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -Pspark-runner
3.3.6 Dataflow
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://<your-gcs-bucket>/tmp \
                  --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
     -Pdataflow-runner
3.4 Running Results

When the program finishes running, you will see one or more files whose names start with counts; the exact number depends on the execution engine. When you look at the contents of a file, each unique word is followed by its number of occurrences; the order of the lines is not fixed, and leaving the order unspecified is a common way for distributed engines to improve efficiency.
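
For intuition only, here is a rough Java sketch of the counting step that produces these "word: count" lines (the class and method names are illustrative, assuming the Beam Java SDK; it is a simplification, not code copied from the bundled WordCount.java):

    import java.util.Arrays;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.FlatMapElements;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class CountingSketch {
      // Turns a collection of text lines into "word: count" output lines.
      static PCollection<String> countWords(PCollection<String> lines) {
        return lines
            // Split every line into individual words.
            .apply(FlatMapElements.into(TypeDescriptors.strings())
                .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
            // Group identical words and count them. The grouping is distributed
            // across workers, so the order of the resulting keys is not deterministic.
            .apply(Count.perElement())
            // Format each (word, count) pair as one output line.
            .apply(MapElements.into(TypeDescriptors.strings())
                .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()));
      }
    }

Because the counting groups keys across workers, the counts are written out in whatever order the workers finish, which is why the output order differs between runs and engines.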

3.4.1 Direct
$ ls counts*
$ more counts*
...
3.4.2 Apex
$ cat counts*
...
3.4.3 Flink-local
$ ls counts*
$ more counts*
...
3.4.4 Flink-cluster
$ ls /tmp/counts*
$ more /tmp/counts*
...
3.4.5 Spark
$ ls counts*
$ more counts*
...
3.4.6 Dataflow
$ gsutil ls gs://<your-gcs-bucket>/counts*
$ gsutil cat gs://<your-gcs-bucket>/counts*
smother'st: 1
Revelry: 1
bashfulness: 1
Bashful: 1
Below: 2
barrenly: 1
...
4. Summary

Apache Beam is mainly intended for embarrassingly parallel data processing tasks: by splitting a dataset into many sub-datasets that can each be processed independently, the processing of the overall dataset is parallelized. Of course, you can also use Beam for extract, transform, and load (ETL) tasks and data integration tasks: data is read from different storage media or data sources, converted into the target data format, and finally loaded into the new system.
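
As a rough, hypothetical sketch of such an ETL-style pipeline (the class name and the source and sink paths are placeholders, assuming the Beam Java SDK): read records, convert each one independently, and write the result to a new location.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    public class EtlSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(TextIO.read().from("source-records.txt"))        // extract from a source
         .apply(ParDo.of(new DoFn<String, String>() {
           @ProcessElement
           public void processElement(ProcessContext c) {
             // Transform: each record is converted independently, so the work parallelizes.
             c.output(c.element().trim().toUpperCase());
           }
         }))
         .apply(TextIO.write().to("converted-records"));         // load into the new location

        p.run().waitUntilFinish();
      }
    }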

5. Concluding Remarks

That is all I have to share in this post. If you run into any problems while studying, you can join the discussion group or send me an e-mail, and I will do my best to answer your questions. Let us encourage one another!
