Overview: This article walks through the Apache Beam WordCount program in practice, interprets its source code, and runs and debugs WordCount in two ways: from IntelliJ IDEA and from the terminal. Apache Beam provides an advanced, unified programming model for big data batch and stream processing, and its pipelines can execute on multiple large-scale data processing engines. Full project GitHub source code
I am responsible for my company's big data processing architecture. The variety of frameworks involved greatly increases development cost, so a unified programming model is urgently needed. Apache Beam fills that role: write a program once and execute it on many engines. I am therefore sharing the results of my experiments here.
1. Apache Beam Programming in Practice: Preface, Apache Beam Features and Key Concepts
Apache Beam became Apache's new top-level project on January 10, 2017.
1.1.Apache Beam Features:
- Unified: Uses a single programming model for both batch and streaming use cases.
- Portable: Executes pipelines in multiple execution environments, including Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.
- Extensible: Write and share new SDKs, IO connectors, and transformation libraries.
Partially translated from the official website: Apache Beam official website
1.2. Apache Beam Key Concepts:
1.2.1. Apache Beam SDKs
The SDKs are the main development API and provide a unified programming model for batch and stream processing. As of 2017 the Java SDK is supported, and the Python SDK is under active development.
1.2.2. Apache Beam Pipeline Runners (Beam executors)
Runners support multiple big data computing frameworks: Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. A pipeline is written once with Apache Beam and can then execute on any of these frameworks.
1.2.3. The details of what each runner supports are as follows
2. Apache Beam Programming in Practice: Apache Beam Source Code Interpretation
The project is based on Maven and IntelliJ IDEA. For pom.xml, see the full project GitHub source code. You can import the full project directly through IDEA's project import feature, wait for Maven to download the dependencies, and then follow the steps below to run it.
2.1. Source code analysis: how Apache Beam processes a data flow
Key steps:
- Create the Pipeline
- Apply transforms to the Pipeline
- Read the input file
- Apply a ParDo transform
- Apply transforms provided by the SDK (for example, Count)
- Write the output
- Run the Pipeline
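The key steps above can be mirrored in plain Java, without any Beam dependency, to make the data flow concrete. The sketch below is only an in-memory analogy: the class `LocalWordCountSketch` and its method names are hypothetical, a hard-coded list stands in for the input file, and real Beam transforms run on a runner rather than on a local list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory analogy of the WordCount steps; this is not Beam code.
public class LocalWordCountSketch {

    // Analogy of the ParDo step: split one line into words.
    static List<String> extractWords(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.split("[^a-zA-Z']+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Analogy of the SDK-provided Count step: occurrences per word.
    static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : extractWords(line)) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Analogy of the formatting step: one "word: count" row per entry.
    static List<String> formatCounts(Map<String, Long> counts) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            rows.add(entry.getKey() + ": " + entry.getValue());
        }
        return rows;
    }

    public static void main(String[] args) {
        // "Read input": sample lines instead of a file.
        List<String> lines = Arrays.asList("to be or not to be", "", "that is the question");
        // "Write output": print instead of writing result files.
        for (String row : formatCounts(countWords(lines))) {
            System.out.println(row);
        }
    }
}
```

Each helper corresponds to one step in the list, so the chain `formatCounts(countWords(lines))` reads in the same order as the chained `apply` calls in a Beam pipeline.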
2.2. Source code analysis. See the full project GitHub source code, which includes WordCount, pom.xml, and so on.
/**
 * MIT.
 * Author: wangxiaolei (Wang Xiaole).
 * Date: 17-2-20.
 * Project: ApacheBeamWordCount.
 */

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.transforms.Aggregator;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class WordCount {

    /**
     * 1. a. Writing the pipeline logic as a DoFn keeps the code very concise.
     *    b. Splits the input text into words and outputs them.
     */
    static class ExtractWordsFn extends DoFn<String, String> {
        private final Aggregator<Long, Long> emptyLines =
                createAggregator("emptyLines", Sum.ofLongs());

        @ProcessElement
        public void processElement(ProcessContext c) {
            if (c.element().trim().isEmpty()) {
                emptyLines.addValue(1L);
            }

            // Split the line of text into words.
            String[] words = c.element().split("[^a-zA-Z']+");

            // Output each non-empty word into the resulting PCollection.
            for (String word : words) {
                if (!word.isEmpty()) {
                    c.output(word);
                }
            }
        }
    }

    /**
     * 2. Formats the data, converting each word and its count into a printable string.
     */
    public static class FormatAsTextFn extends SimpleFunction<KV<String, Long>, String> {
        @Override
        public String apply(KV<String, Long> input) {
            return input.getKey() + ": " + input.getValue();
        }
    }

    /**
     * 3. Word count: a PTransform (PCollection transform) that converts a PCollection
     * of text lines into a PCollection of word counts.
     */
    public static class CountWords
            extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
        @Override
        public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
            // Convert lines of text into individual words.
            PCollection<String> words = lines.apply(ParDo.of(new ExtractWordsFn()));

            // Count the number of occurrences of each word.
            PCollection<KV<String, Long>> wordCounts =
                    words.apply(Count.<String>perElement());

            return wordCounts;
        }
    }

    /**
     * 4. Custom options can be defined, for example the input and output file paths.
     */
    public interface WordCountOptions extends PipelineOptions {

        /**
         * Input file option. The path can be passed as a command-line argument;
         * the default is gs://apache-beam-samples/shakespeare/kinglear.txt
         */
        @Description("Path of the file to read from")
        @Default.String("gs://apache-beam-samples/shakespeare/kinglear.txt")
        String getInputFile();
        void setInputFile(String value);

        /**
         * Output path of the result file. Specify it in IntelliJ IDEA's run
         * configuration or on the command line, for example ./pom.xml
         */
        @Description("Path of the file to write to")
        @Required
        String getOutput();
        void setOutput(String value);
    }

    /**
     * 5. Execute the program.
     */
    public static void main(String[] args) {
        WordCountOptions options = PipelineOptionsFactory.fromArgs(args)
                .withValidation()
                .as(WordCountOptions.class);
        Pipeline p = Pipeline.create(options);

        p.apply("ReadLines", TextIO.Read.from(options.getInputFile()))
                .apply(new CountWords())
                .apply(MapElements.via(new FormatAsTextFn()))
                .apply("WriteCounts", TextIO.Write.to(options.getOutput()));

        p.run().waitUntilFinish();
    }
}
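The empty-word check inside the word-extraction DoFn matters because of how Java's `String.split` treats separators at the start of a line. The following standalone sketch applies the same regular expression to a few sample lines; the class name `SplitBehaviorSketch` and the sample strings are made up for illustration, and no Beam classes are involved.

```java
import java.util.Arrays;

// Shows what String.split returns for the WordCount separator regex.
public class SplitBehaviorSketch {
    // One or more characters that are neither letters nor apostrophes.
    static final String WORD_SEPARATOR = "[^a-zA-Z']+";

    static String[] split(String line) {
        return line.split(WORD_SEPARATOR);
    }

    public static void main(String[] args) {
        // Apostrophes are not separators, so "Lear's" stays one token.
        System.out.println(Arrays.toString(split("King Lear's daughters")));
        // A separator at the start of the line yields an empty first token.
        System.out.println(Arrays.toString(split("  Enter KENT")));
        // An empty line yields a single empty token.
        System.out.println(Arrays.toString(split("")));
    }
}
```

Without the emptiness check, indented and blank lines would each contribute an empty string to the word counts.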
3. Running the WordCount program on Spark, Flink, Apex, and other big data frameworks
Use the full project GitHub source code (recommended; make sure the modules in pom.xml load successfully. Developing big data programs inside the IDE makes debugging easier and gives a better development experience).
3.1. Running the pipeline on the Spark big data framework in IntelliJ IDEA (Community Edition)
3.2. In IntelliJ IDEA (Community Edition), Apex, Flink, and the other supported big data frameworks can also run the WordCount pipeline; see the full project GitHub source code
- Apex execution
- Flink execution, etc.
4. Running from the terminal (not recommended: the first download is very slow and the development experience is poor)
4.1. The following command downloads the official example source code. The first download is slow; if it fails, retry a few times. (Recommended instead: download the full project GitHub source code and run it in IntelliJ IDEA as described above.)
mvn archetype:generate -DarchetypeRepository=https://repository.apache.org/content/groups/snapshots -DarchetypeGroupId=org.apache.beam -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples -DarchetypeVersion=LATEST -DgroupId=org.example -DartifactId=word-count-beam -Dversion="0.1" -Dpackage=org.apache.beam.examples -DinteractiveMode=false
4.2. Package and Execute
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -Pspark-runner
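The `-Dexec.args` value above passes options in `--name=value` form, and each name corresponds to a getter/setter pair on an options interface (for example `--inputFile` and `getInputFile()`). As a rough illustration of that naming convention only, the toy parser below turns such arguments into a map and derives the matching getter name. `ArgsSketch`, `parse`, and `getterFor` are hypothetical helpers written for this example; this is not how Beam's `PipelineOptionsFactory` is implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration of the --name=value convention; not Beam code.
public class ArgsSketch {

    // Parse "--name=value" arguments into an insertion-ordered map.
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            // Require the "--" prefix and at least one character before '='.
            if (!arg.startsWith("--") || eq < 3) {
                throw new IllegalArgumentException("Expected --name=value, got: " + arg);
            }
            options.put(arg.substring(2, eq), arg.substring(eq + 1));
        }
        return options;
    }

    // Derive the getter name for an option, e.g. "inputFile" -> "getInputFile".
    static String getterFor(String name) {
        return "get" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
    }

    public static void main(String[] args) {
        String[] sample = {"--runner=SparkRunner", "--inputFile=pom.xml", "--output=counts"};
        for (Map.Entry<String, String> option : parse(sample).entrySet()) {
            System.out.println(getterFor(option.getKey()) + "() -> " + option.getValue());
        }
    }
}
```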
4.3. Results of a successful run
4.3.1. Output showing a successful run
4.3.2. WordCount output results