1. Overview
In the wave of big data, technology iterates very quickly. Thanks to open source, big data developers have a wealth of tools at their disposal, but that abundance also makes it harder to choose the right one. The techniques used for big data processing are often diverse, depending on the business requirements: MapReduce for batch processing, Flink for real-time streaming, Spark SQL for interactive SQL, and so on. It is easy to imagine how combining these open source frameworks, tools, libraries, and platforms to match a given workload grows in complexity, and this is a real headache for big data developers. What I want to share today is a solution for integrating these resources: Apache Beam.
2. Content
Apache Beam grew out of Google Cloud Dataflow: Google and its partners donated a large amount of core code to Apache to incubate the project, and most of it comes from the Cloud Dataflow SDK. Its key features are the following:
- A unified programming paradigm for batch and stream processing
- The ability to run on any supported execution engine
So what problems can Apache Beam solve, and what are its application scenarios? The figure below illustrates this:
From the figure, we can clearly see two lines of technical development: one part is the Google camp, the other the Apache camp. When developing big data applications, we sometimes use Google's frameworks, APIs, libraries, and platforms, and sometimes Apache's, such as HBase, Flink, and Spark. Integrating these resources is a headache, and Apache Beam provides a very convenient solution for exactly that.
2.1 Vision
Below, we look at Beam's execution flow through a flowchart, as shown here:
From the flowchart, we can clearly see that executing a process involves the following roles:
- End Users: write and submit applications in a programming language they are familiar with
- SDK Writers: make the Beam model available in new programming languages
- Library Writers: provide reusable components and format conversions built on the Beam model
- Runner Writers: support Beam data processing pipelines in distributed execution environments
- IO Providers: make Beam data processing pipelines work efficiently with all kinds of storage
- DSL Writers: create higher-level data processing pipelines
2.2 SDK
The Beam SDKs provide a unified programming model that can handle datasets of any size, including bounded (batch) datasets and unbounded streaming data. The Apache Beam SDKs use the same classes to represent bounded and unbounded data, and the same transforms to operate on both. Beam provides several SDKs, and you can choose a familiar one to build your data processing pipeline; as the figure in section 2.1 shows, Beam currently supports development in Java, Python, and other languages.
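To give a feel for the unified model, here is a minimal word-count sketch using the Java SDK; treat it as illustrative rather than exact, since API details vary between Beam versions. The same transforms would apply unchanged to an unbounded source such as a Kafka topic:

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalWordCount {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply(TextIO.read().from("pom.xml"))                        // a bounded source; a streaming IO would slot in here for unbounded data
     .apply(FlatMapElements.into(TypeDescriptors.strings())
         .via(line -> Arrays.asList(line.split("[^\\p{L}]+"))))   // split each line into words
     .apply(Count.perElement())                                   // the same transform works for batch and streaming
     .apply(MapElements.into(TypeDescriptors.strings())
         .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
     .apply(TextIO.write().to("counts"));                         // write "word: count" lines

    p.run().waitUntilFinish();
  }
}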
2.3 Pipeline Runners
The execution engine for a Beam pipeline is the distributed processing engine of your choice: a compatible layer translates your Beam application so that it can run efficiently on the specified engine. Thus, when running a Beam program, you can choose a distributed processing engine according to your own needs (see the sketch after this list). Beam currently supports the following pipeline execution engines:
- Apache Apex
- Apache Flink
- Apache Spark
- Google Cloud Dataflow
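As an illustration of how that choice surfaces in code, the runner is normally picked through PipelineOptions parsed from the command line; the fragment below would replace the first two lines of main() in the sketch from section 2.2:

// Parse --runner=<runner> (and any engine-specific flags) from the command line;
// when no --runner is given, Beam falls back to the local DirectRunner.
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);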
3. Example
This example uses the Java SDK; you can try running it on different execution engines.
3.1 Development environment
- Download and install JDK 7 or later, and check that the JAVA_HOME environment variable is set
- Download and install Maven for packaging
The installation steps above are not the focus of this blog, so I will not go over them here; if anything is unclear, you can browse the documentation on the official websites.
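As a quick sanity check (assuming both tools are on your PATH), you can verify the installed versions like this:

$ java -version
$ mvn -version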
3.2 Download the sample code
The Apache Beam source code is hosted on GitHub, and the corresponding source can be downloaded from: https://github.com/apache/beam
Then, generate the sample project with the following command:
$ mvn archetype:generate \
      -DarchetypeRepository=https://repository.apache.org/content/groups/snapshots \
      -DarchetypeGroupId=org.apache.beam \
      -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
      -DarchetypeVersion=LATEST \
      -DgroupId=org.example \
      -DartifactId=word-count-beam \
      -Dversion="0.1" \
      -Dpackage=org.apache.beam.examples \
      -DinteractiveMode=false
At this point, the command creates a folder word-count-beam that contains a pom.xml and the associated code files. You can list them as follows:
$ cd word-count-beam/
$ ls
pom.xml  src
$ ls src/main/java/org/apache/beam/examples/
DebuggingWordCount.java  WindowedWordCount.java  common
MinimalWordCount.java    WordCount.java
3.3 Running the WordCount sample code
A Beam program can run on multiple Beam execution engines, including the ApexRunner, FlinkRunner, SparkRunner, and DataflowRunner. There is also the DirectRunner, which executes locally without special configuration, making it handy for testing.
Below, you can select the engine on which to run the program, as needed:
- Configure the relevant engine
- Use the corresponding command: specify the engine type with the --runner=<runner> parameter (the default is the DirectRunner), add engine-related parameters, and specify the input file and output directory; of course, you need to make sure these are accessible to the execution engine (for example, a local file directory cannot be accessed by an external cluster)
- Run the sample program
3.3.1 Direct
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
      -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner
3.3.2 Apex
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
      -Dexec.args="--inputFile=pom.xml --output=counts --runner=ApexRunner" -Papex-runner
3.3.3 Flink-local
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
      -Dexec.args="--runner=FlinkRunner --inputFile=pom.xml --output=counts" -Pflink-runner
3.3.4 Flink-cluster
$ mvn package exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
      -Dexec.args="--runner=FlinkRunner --flinkMaster=<flink master> \
      --filesToStage=target/word-count-beam-bundled-0.1.jar \
      --inputFile=/path/to/quickstart/pom.xml --output=/tmp/counts" -Pflink-runner
You can then monitor the running application by visiting http://<flink master>:8081.
3.3.5 Spark
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
      -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -Pspark-runner
3.3.6 Dataflow
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
      -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://<your-gcs-bucket>/tmp \
      --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
      -Pdataflow-runner
3.4 Running Results
When the program finishes running, you can see multiple output files whose names start with counts; how many there are depends on the execution engine. Looking at the contents of a file, each unique word is followed by its number of occurrences (a line like "beam: 5"), but the order is not fixed, which is a common way for distributed engines to improve efficiency.
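The multiple files come from output sharding: each engine writes its results in parallel shards. As a side note (assuming the Java SDK's TextIO, as in the sketch in section 2.2, where "formatted" is the PCollection of "word: count" lines), you can trade parallelism for convenience and force a single output file:

// Forcing a single shard writes one counts file instead of many,
// at the cost of parallel writes; useful only for small test outputs.
formatted.apply(TextIO.write().to("counts").withNumShards(1));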
3.4.1 Direct
$ ls counts*
$ more counts*
...
3.4.2 Apex
$ cat counts*
...
3.4.3 Flink-local
$ ls counts*
$ more counts*
...
3.4.4 Flink-cluster
$ ls /tmp/counts*
$ more /tmp/counts*
...
3.4.5 Spark
$ ls counts*
$ more counts*
...
3.4.6 Dataflow
$ gsutil ls gs://<your-gcs-bucket>/counts*
$ gsutil cat gs://<your-gcs-bucket>/counts*
smother'st: 1
Revelry: 1
bashfulness: 1
Bashful: 1
Below: 2
barrenly: 1
...
4. Summary
Apache Beam is mainly aimed at embarrassingly parallel data processing tasks: by splitting a dataset into many sub-datasets that can each be processed independently, the dataset as a whole can be processed in parallel. Of course, you can also use Beam for extraction, transformation, and loading tasks and for data integration tasks (an ETL process): reading data from different storage media or data sources, converting it to the required data format, and finally loading it into the new system.
5. Concluding Remarks
That is all I want to share in this blog. If you run into any problems while studying, you can join the discussion group or send me an e-mail, and I will do my best to answer you. Let us encourage each other!