Author: luxianghao
Source: http://www.cnblogs.com/luxianghao/p/9010748.html (please credit the source when reprinting, thank you)
Disclaimer: This article represents only personal opinions; corrections are welcome if anything is wrong.
---
One Introduction
In February 2016, Google announced that Beam (formerly Google DataFlow) had been donated to the Apache Foundation for incubation; it has since become an Apache top-level open source project.
Beam is a unified programming framework that supports both batch and stream processing; pipelines built with the Beam programming model can run on multiple compute engines (Apache Apex, Apache Flink, Apache Spark, Google Cloud Dataflow, and so on).
Big data originated with the three papers Google published starting in 2003, GoogleFS, MapReduce, and BigTable, known historically as the "troika". Unfortunately, Google published the papers but not their source code. Even so, the Apache open source community flourished around them, producing Hadoop, Spark, Apache Flink, and other products, while Google internally used the closed-source BigTable, Spanner, and MillWheel. This time, however, Google did not just publish a paper and disappear: it open-sourced Beam in a high-profile way. As the saying goes, "first-class companies set standards", and the benefits of open source are considerable, such as increasing the company's influence, pooling ideas, and sharing maintenance.
Two Advantages of Beam
1 Unified
Beam provides a unified programming model; for programming guidance refer to the official programming-guide, and get started through quickstart-java and wordcount-example.
2 Portable
Programmers do not need to worry while coding about whether the code will eventually run on Spark, Flink, or another computing platform; once coding is complete, the target platform is chosen on the command line.
3 Extensible
As mentioned under 2 Portable, DirectRunner and SparkRunner have already appeared; in Beam, the runner, the IO connectors, the transform library, and even the SDK itself can all be customized, so extensibility is high.
4 Batch and stream processing support
Whether the programmer writes Beam code for batch or stream processing, for a bounded dataset or an unbounded one, the program can be executed without modification. (Beam addresses this by introducing the concepts of triggers and windows.)
5 High abstraction
Beam abstracts computation at a high level as a DAG (directed acyclic graph); programmers do not need to force their code into map-shuffle-reduce form and can directly perform higher-level operations such as counting, joining, and projecting.
6 Multi-language support
Java and Python are officially supported at present, and SDKs for more languages will be developed later.
Three Components of Beam
Let's start with an overall architecture diagram.
1 Beam programming model
Beam's programming model is the abstraction that Google's engineers distilled from many big-data processing projects such as MapReduce, FlumeJava, and MillWheel; if you want to learn more about it, refer to the related documents and papers: Streaming 101, Streaming 102, and the VLDB paper. The programming model includes several core concepts, as follows:
- PCollection: a dataset, representing the collection of data to be processed; it may be a bounded dataset or an unbounded data stream.
- PTransform: a computation step, representing the process of turning an input dataset into an output dataset.
- Pipeline: the pipeline, representing the whole data-processing job; it can be visualized as a directed acyclic graph (DAG) in which PCollections are the nodes and PTransforms are the edges.
- PipelineRunner: the executor, specifying where and how the pipeline should run.
PTransform also covers many operations, such as the following (a code sketch follows the list):
- ParDo: the generic parallel-processing PTransform, equivalent to the map in map/shuffle/reduce style; it can be used for filtering, type conversion, extracting parts of the data, computing over each element, and so on.
- GroupByKey: aggregates key/value pairs, equivalent to the shuffle in map/shuffle/reduce style; it groups together the values that share the same key.
- CoGroupByKey: aggregates multiple collections; its function is similar to GroupByKey.
- Combine: combines the data in a collection, for example Sum, Min, and Max (predefined in the SDK); you can also build a new class of your own.
- Flatten: merges multiple datasets into a single dataset.
- Partition: splits one dataset into several smaller datasets.
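To make these transforms concrete, here is a minimal word-count sketch in the Java SDK. It is an illustration only, assuming Beam's Java SDK 2.x is on the classpath; the class name MiniWordCount, the input file pom.xml, and the output prefix counts are placeholders of my own choosing.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class MiniWordCount {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.read().from("pom.xml"))           // PCollection<String>, one element per line
     .apply(ParDo.of(new DoFn<String, String>() {    // ParDo: split each line into words
       @ProcessElement
       public void processElement(ProcessContext c) {
         for (String word : c.element().split("[^\\p{L}]+")) {
           if (!word.isEmpty()) {
             c.output(word);
           }
         }
       }
     }))
     .apply(Count.<String>perElement())              // Combine-style transform: count per word
     .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {  // format each KV pair as text
       @ProcessElement
       public void processElement(ProcessContext c) {
         c.output(c.element().getKey() + ": " + c.element().getValue());
       }
     }))
     .apply(TextIO.write().to("counts"));            // write the results to text files

    p.run().waitUntilFinish();
  }
}
```

Each apply adds one PTransform edge to the pipeline's DAG, which is exactly the [Output PCollection] = [Input PCollection].apply([Transform]) pattern summarized below.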
In addition, there are core concepts such as the following (a windowing sketch follows the list):
- Windowing: divides the elements of a PCollection into subsets according to their timestamps.
- Watermark: marks how late delayed data may be; data arriving later than the watermark allows is simply discarded.
- Triggers: determine when to emit the aggregated result of each window.
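As an illustration of how these fit together, the sketch below is my own example rather than anything from the original article; it assumes Beam's Java SDK 2.x, and the class name WindowingSketch, the events parameter, and the durations are all arbitrary assumptions. It applies one-minute fixed windows, fires on the watermark with late firings, and discards data more than five minutes late:

```java
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowingSketch {
  // Windowing: group elements into 1-minute windows by their timestamps.
  static PCollection<KV<String, Long>> window(PCollection<KV<String, Long>> events) {
    return events.apply(
        Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1)))
            // Trigger: emit a result when the watermark passes the end of the window...
            .triggering(AfterWatermark.pastEndOfWindow()
                // ...and emit again for each element that arrives late.
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            // Watermark slack: data more than 5 minutes late is discarded.
            .withAllowedLateness(Duration.standardMinutes(5))
            // Late firings accumulate with previously emitted results.
            .accumulatingFiredPanes());
  }
}
```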
The Beam programming model can be summed up simply as:
[Output PCollection] = [Input PCollection].apply([Transform])
Google engineers also abstracted the questions a Beam program answers into four, known as WWWH:
- What is being computed? The corresponding abstraction is PTransform.
- Where in event time is it computed? The corresponding abstraction is windows.
- When are the results emitted? The corresponding abstractions are watermarks and triggers.
- How do refinements of the results relate? The corresponding abstraction is accumulation.
Note: The translation here draws on Streaming 102; a purely literal translation might not convey the intended meaning, so corrections are welcome where anything is off.
2 SDK
Beam supports constructing pipelines with SDKs in multiple languages; Java and Python are currently supported, and support for the Java SDK is relatively better.
3 Runner
Beam supports running pipelines on multiple distributed backends and currently provides the following PipelineRunners (a sketch of selecting a runner in code follows the list):
- DirectRunner: executes the pipeline locally
- ApexRunner: runs the pipeline on a YARN cluster (or in embedded mode)
- DataflowRunner: runs the pipeline on Google Cloud Dataflow
- FlinkRunner: runs the pipeline on a Flink cluster
- SparkRunner: runs the pipeline on a Spark cluster
- MRRunner: not yet available on the main branch of Beam's GitHub repository, but there is an mr-runner branch; for details see BEAM-165
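Besides the --runner command-line flag shown in the examples below, a runner can also be pinned in code through PipelineOptions. The following sketch is my own illustration, assuming the beam-runners-spark module is on the classpath; the class name RunnerSelection is hypothetical.

```java
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerSelection {
  public static void main(String[] args) {
    // Parse the remaining flags (e.g. --inputFile) from the command line,
    // but fix the runner programmatically instead of via --runner.
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);

    Pipeline p = Pipeline.create(options);
    // ... build the pipeline here, then:
    p.run().waitUntilFinish();
  }
}
```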
Four Examples
Let's get some hands-on experience with Beam through the official WordCount example; for details, refer to quickstart-java and wordcount-example.
1 Getting the relevant code
mvn archetype:generate \
    -DarchetypeGroupId=org.apache.beam \
    -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
    -DarchetypeVersion=2.1.0 \
    -DgroupId=org.example \
    -DartifactId=word-count-beam \
    -Dversion="0.1" \
    -Dpackage=org.apache.beam.examples \
    -DinteractiveMode=false
2 Related files
$ cd word-count-beam/
$ ls
pom.xml  src
$ ls src/main/java/org/apache/beam/examples/
common  DebuggingWordCount.java  MinimalWordCount.java  WindowedWordCount.java  WordCount.java
3 Execute with DirectRunner
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--inputFile=pom.xml --output=counts"
4 Submit to Spark
Mode 1
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -Pspark-runner
Mode 2
spark-submit --class org.apache.beam.examples.WordCount --master local target/word-count-beam-bundled-0.1.jar --runner=SparkRunner --inputFile=pom.xml --output=counts
Mode 3
spark-submit --class org.apache.beam.examples.WordCount --master yarn --deploy-mode cluster word-count-beam-bundled-0.1.jar --runner=SparkRunner --inputFile=/home/yarn/software/java/license
Details of SparkRunner will be covered separately. Note that in mode 3, reading the input file from HDFS runs into some problems, which will also be discussed separately; in the example above, you can point --inputFile at a file that actually exists on the physical machine to make sure the program runs properly.
Five References
- Programming Guide: https://beam.apache.org/documentation/programming-guide
- WordCount Example: https://beam.apache.org/get-started/wordcount-example/
- Javadoc: https://beam.apache.org/documentation/sdks/javadoc/2.1.0/
- Streaming 101: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
- Streaming 102: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102