Apache Beam Anatomy

Source: Internet
Author: User
Tags: apache flink, apache beam

1. Overview

In the wave of big data, technologies are updated and iterated very frequently. Thanks to open source, big data developers have a wealth of tools to choose from, but that abundance also makes it harder to pick the right tool. The techniques used for big data processing are often diverse and depend entirely on the business requirements: MapReduce for batch processing, Flink for real-time streaming, Spark SQL for interactive SQL, and so on. It is easy to imagine how combining these open source frameworks, tools, libraries, and platforms for a given workload adds complexity, and that is a headache for big data developers. What I want to share today is a solution for integrating these resources: Apache Beam.

2. content

Apache Beam, originally called Apache Dataflow, was created when Google and its partners donated a large amount of core code to the Apache Software Foundation to incubate the project. Most of the project comes from the Cloud Dataflow SDK, and it features the following:

    • A unified programming paradigm for both batch and stream data processing
    • The ability to run on any supported execution engine

So what problems can Apache Beam solve, and what are its application scenarios? We can illustrate this with a diagram, as shown here:

From this diagram, we can clearly see the development of the whole technology landscape: one part belongs to the Google camp, the other to the Apache camp. When developing big data applications, we sometimes use Google's frameworks, APIs, libraries, and platforms, and at other times we use Apache's, such as HBase, Flink, and Spark. Integrating these resources used to be a headache; Apache Beam provides a very convenient solution for integrating them.

2.1 Vision

Below, let us look at Beam's execution flow through a flowchart, as shown here:

From it, we can clearly see that the execution process involves the following roles and steps:

    1. End Users: choose a programming language they are familiar with and submit their application
    2. SDK Writers: make the Beam programming model available in specific programming languages
    3. Library Writers: provide libraries of reusable transforms and formats built on the Beam model
    4. Runner Writers: support executing Beam data processing pipelines in a distributed environment
    5. IO Providers: make their storage and messaging systems usable from Beam data processing pipelines
    6. DSL Writers: create higher-level data processing pipelines and languages on top of Beam

2.2 SDK

The Beam SDKs provide a unified programming model for processing datasets of any size, including bounded datasets and unbounded streaming data. The Apache Beam SDKs use the same classes to represent bounded and unbounded data, and the same transforms to operate on that data. Beam provides several SDKs, and you can choose a familiar one to build your data processing pipeline; as figure 2.1 above shows, Beam currently supports development in Java, Python, and other languages.
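
To make this concrete, here is a minimal, hypothetical sketch in Java (the class name and file names are placeholders, and a recent Beam Java SDK is assumed; this is not code taken from the quickstart itself):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class UnifiedModelSketch {
      public static void main(String[] args) {
        // Command-line flags such as --runner are parsed into the pipeline options.
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // A bounded source. The same PCollection type and the same transforms would
        // also apply to an unbounded stream (for example read from a Kafka or Pub/Sub IO).
        PCollection<String> lines = p.apply(TextIO.read().from("input.txt"));

        // Write the data back out; any transforms you need would go in between.
        lines.apply(TextIO.write().to("output"));

        p.run().waitUntilFinish();
      }
    }

The point of the sketch is only that the pipeline, the PCollection, and the transforms are written once; whether the data is bounded or unbounded, and which engine executes it, are decided by the source and the runner, not by the pipeline code.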

2.3 Pipeline Runners

The engine that runs a Beam pipeline is the distributed processing engine of your choice; a compatible API translates your Beam application so that it can run efficiently on the specified distributed processing engine. Thus, when running a Beam program, you can choose a distributed processing engine according to your own needs (see the sketch after the following list for how that choice is expressed in code). Beam currently supports the following pipeline runners:

    • Apache Apex
    • Apache Flink
    • Apache Spark
    • Google Cloud Dataflow
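
As an illustration of that choice, the sketch below (hypothetical class name, assuming the Beam Java SDK with the Flink runner artifact on the classpath) shows the two usual ways to select a runner: via the --runner flag or programmatically on the pipeline options.

    import org.apache.beam.runners.flink.FlinkRunner;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RunnerSelectionSketch {
      public static void main(String[] args) {
        // Reads flags such as --runner=FlinkRunner and --inputFile=... from the command
        // line; if no --runner is given, the DirectRunner is used by default.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

        // Alternatively, pick the execution engine in code rather than on the command line.
        options.setRunner(FlinkRunner.class);

        Pipeline p = Pipeline.create(options);
        // ... build the pipeline here, then run it on the selected engine:
        p.run().waitUntilFinish();
      }
    }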

3. Example

This example uses the Java SDK; you can try running it on different execution engines.

3.1 Development environment
    • Download and install JDK 7 or later, and make sure the JAVA_HOME environment variable is set correctly
    • Download and install Maven for building and packaging

The installation steps above are not the focus of this blog post, so I will not repeat them here; if anything is unclear, you can browse the documentation on the official websites for installation instructions.

3.2 Download the sample code

The Apache Beam source code is hosted on GitHub and can be downloaded from: https://github.com/apache/beam

Then generate the sample project with Maven; the command is as follows:

 $ mvn archetype:generate \
       -DarchetypeRepository=https://repository.apache.org/content/groups/snapshots \
       -DarchetypeGroupId=org.apache.beam \
       -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
       -DarchetypeVersion=LATEST \
       -DgroupId=org.example \
       -DartifactId=word-count-beam \
       -Dversion="0.1" \
       -Dpackage=org.apache.beam.examples \
       -DinteractiveMode=false

At this point, the command has created a folder word-count-beam containing a pom.xml and the associated code files. You can inspect it as follows:

$ cd word-count-beam/
$ ls
pom.xml    src
$ ls src/main/java/org/apache/beam/examples/
DebuggingWordCount.java    WindowedWordCount.java    common
MinimalWordCount.java    WordCount.java

3.3 Running the WordCount sample code

A Beam program can run on multiple execution engines, including the ApexRunner, FlinkRunner, SparkRunner, and DataflowRunner. There is also the DirectRunner, which executes locally without any special configuration, so it is convenient for testing.

Below, you can select the engine on which you want to execute the program, as needed:

    1. Configure the relevant engine
    2. Use the corresponding command: specify the engine type with the --runner=<runner> parameter (the default is the DirectRunner), add any engine-specific parameters, and specify the output file and output directory. Make sure the output location is accessible to the execution engine; for example, a local file directory cannot be accessed by an external cluster.
    3. Run the sample program
3.3.1 Direct
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner
3.3.2 Apex
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=pom.xml --output=counts --runner=ApexRunner" -Papex-runner
3.3.3 Flink-local
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=FlinkRunner --inputFile=pom.xml --output=counts" -Pflink-runner
3.3.4 Flink-cluster
$ mvn package exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=FlinkRunner --flinkMaster=<flink master> --filesToStage=target/word-count-beam-bundled-0.1.jar \
                  --inputFile=/path/to/quickstart/pom.xml --output=/tmp/counts" -Pflink-runner

You can then monitor the running job by visiting http://<flink master>:8081.

3.3.5 Spark
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -Pspark-runner
3.3.6 Dataflow
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://<your-gcs-bucket>/tmp \
                  --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
     -Pdataflow-runner
3.4 Running Results

When the program finishes running, you will see one or more files whose names start with counts; the exact number depends on the execution engine. When you look at the contents of a file, each unique word is followed by its number of occurrences; the order of the lines is not fixed, and leaving the order unspecified is a common way for distributed engines to improve efficiency.
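
For intuition only, here is a rough Java sketch of the counting step that produces these "word: count" lines (the class and method names are illustrative, assuming the Beam Java SDK; it is a simplification, not code copied from the bundled WordCount.java):

    import java.util.Arrays;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.FlatMapElements;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class CountingSketch {
      // Turns a collection of text lines into "word: count" output lines.
      static PCollection<String> countWords(PCollection<String> lines) {
        return lines
            // Split every line into individual words.
            .apply(FlatMapElements.into(TypeDescriptors.strings())
                .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
            // Group identical words and count them. The grouping is distributed
            // across workers, so the order of the resulting keys is not deterministic.
            .apply(Count.perElement())
            // Format each (word, count) pair as one output line.
            .apply(MapElements.into(TypeDescriptors.strings())
                .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()));
      }
    }

Because the counting groups keys across workers, the counts are written out in whatever order the workers finish, which is why the output order differs between runs and engines.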

3.4.1 Direct
$ ls counts*
$ more counts*
...
3.4.2 Apex
$ cat counts*
...
3.4.3 Flink-local
$ ls counts*
$ more counts*
...
3.4.4 Flink-cluster
$ ls /tmp/counts*
$ more /tmp/counts*
...
3.4.5 Spark
$ ls counts*
$ more counts*
...
3.4.6 Dataflow
$ gsutil ls gs://<your-gcs-bucket>/counts*
$ gsutil cat gs://<your-gcs-bucket>/counts*
smother'st: 1
Revelry: 1
bashfulness: 1
Bashful: 1
Below: 2
barrenly: 1
...
4. Summary

Apache Beam is mainly intended for embarrassingly parallel data processing tasks: by splitting a dataset into many sub-datasets that can each be processed independently, the processing of the overall dataset is parallelized. Of course, you can also use Beam for extract, transform, and load (ETL) tasks and data integration tasks: data is read from different storage media or data sources, converted into the target data format, and finally loaded into the new system.
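
As a rough, hypothetical sketch of such an ETL-style pipeline (the class name and the source and sink paths are placeholders, assuming the Beam Java SDK): read records, convert each one independently, and write the result to a new location.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    public class EtlSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(TextIO.read().from("source-records.txt"))        // extract from a source
         .apply(ParDo.of(new DoFn<String, String>() {
           @ProcessElement
           public void processElement(ProcessContext c) {
             // Transform: each record is converted independently, so the work parallelizes.
             c.output(c.element().trim().toUpperCase());
           }
         }))
         .apply(TextIO.write().to("converted-records"));         // load into the new location

        p.run().waitUntilFinish();
      }
    }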

5. Concluding Remarks

That is all I have to share in this post. If you run into any problems while studying, you can join the discussion group or send me an e-mail, and I will do my best to answer your questions. Let us encourage one another!
