Apache Beam WordCount: Hands-On Programming and Source Code Walkthrough

Last Update:2018-03-31 Source: Internet

Author: User

Tags apache flink apache beam


Overview: This article walks through an Apache Beam WordCount program and its source code, and shows two ways to run and debug it: from IntelliJ IDEA and from the terminal. Apache Beam provides an advanced, unified programming model for batch and stream processing of big data, and its pipelines can execute on several large-scale data processing engines. The full project source code is on GitHub.


I am responsible for my company's big data processing architecture. The variety of processing frameworks in use greatly increases development cost, so a unified programming model is urgently needed. Apache Beam meets that need: write a program once and run it anywhere. I am sharing the results of my experiments here.

1. Apache Beam Programming: Preface, Features, and Key Concepts

Apache Beam became Apache's new top-level project on January 10, 2017.

1.1. Apache Beam Features

Unified: Use a single programming model for both batch and streaming use cases.
Portable: Execute pipelines in multiple execution environments, including Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.
Extensible: Write and share new SDKs, IO connectors, and transformation libraries.
Partially translated from the official Apache Beam website.

1.2. Apache Beam Key Concepts

1.2.1. Apache Beam SDKs

The SDKs are the main development API and provide a unified programming model for batch and stream processing. As of 2017, the Java SDK is supported and a Python SDK is under active development.

1.2.2. Apache Beam Pipeline Runners

Runners execute Beam pipelines. The supported runners target several big data computing frameworks, including Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. A program written once with Apache Beam can therefore execute on multiple computing frameworks.

1.2.3. Runner support details

2. Apache Beam Programming in Practice: Source Code Walkthrough

The project is based on Maven and IntelliJ IDEA; see pom.xml in the full project source code on GitHub. You can import the full project directly with IDEA's project import feature, wait for Maven to download the dependencies, and then follow the steps below.

2.1. Source code analysis: how Apache Beam processes a data flow

Key steps:

Create the pipeline
Apply transforms to the pipeline
Read the input file
Apply a ParDo transform
Apply a transform provided by the SDK (for example, Count)
Write the output
Run the pipeline
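
The steps above can be sketched without a Beam runtime. The following is a minimal plain-Java analogue, not Beam code: an in-memory list stands in for the input file, the Streams API stands in for the transforms, and the names `PipelineStepsSketch`, `countWords`, and `format` are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java analogue of the WordCount steps; hypothetical names, no Beam dependency.
final class PipelineStepsSketch {

    // ParDo-like step followed by a Count-like step:
    // split each line into words, drop empty tokens, count occurrences per word.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.split("[^a-zA-Z']+")))
            .filter(word -> !word.isEmpty())
            .collect(Collectors.groupingBy(
                Function.identity(), TreeMap::new, Collectors.counting()));
    }

    // Formatting step: turn each (word, count) pair into "word: count".
    static List<String> format(Map<String, Long> counts) {
        return counts.entrySet().stream()
            .map(entry -> entry.getKey() + ": " + entry.getValue())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "Read" step: an in-memory list stands in for the input file.
        List<String> lines =
            Arrays.asList("to be or not to be", "", "that is the question");
        // "Write" step: print the formatted counts instead of writing a file.
        format(countWords(lines)).forEach(System.out::println);
    }
}
```

In Beam, each of these stages is a transform applied to a PCollection, and nothing executes until the pipeline is run; here the stream runs eagerly in a single JVM.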

2.2. Source code walkthrough. The full project source code on GitHub includes WordCount, pom.xml, and more.

```java
/**
 * MIT
 * Author: wangxiaolei (Wang Xiaole).
 * Date: 17-2-20.
 * Project: ApacheBeamWordCount.
 */
// Written against the Beam Java SDK available at the time (early 2017).
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.transforms.Aggregator;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class WordCount {

    /**
     * 1. a. Writing the pipeline logic as a DoFn keeps the code concise.
     *    b. Splits each line of input text into words and outputs them.
     */
    static class ExtractWordsFn extends DoFn<String, String> {
        private final Aggregator<Long, Long> emptyLines =
            createAggregator("emptyLines", Sum.ofLongs());

        @ProcessElement
        public void processElement(ProcessContext c) {
            if (c.element().trim().isEmpty()) {
                emptyLines.addValue(1L);
            }
            // Split the line of text into words
            String[] words = c.element().split("[^a-zA-Z']+");
            // Output each non-empty word to the PCollection
            for (String word : words) {
                if (!word.isEmpty()) {
                    c.output(word);
                }
            }
        }
    }

    /**
     * 2. Formats each word and its count as a printable string.
     */
    public static class FormatAsTextFn extends SimpleFunction<KV<String, Long>, String> {
        @Override
        public String apply(KV<String, Long> input) {
            return input.getKey() + ": " + input.getValue();
        }
    }

    /**
     * 3. Word count: a PTransform (PCollection transform) that converts a
     *    PCollection of text lines into a PCollection of per-word counts.
     */
    public static class CountWords
        extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
        @Override
        public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
            // Convert lines of text into individual words
            PCollection<String> words = lines.apply(ParDo.of(new ExtractWordsFn()));
            // Count the number of occurrences of each word
            PCollection<KV<String, Long>> wordCounts =
                words.apply(Count.<String>perElement());
            return wordCounts;
        }
    }

    /**
     * 4. Custom options, for example the input and output file paths.
     */
    public interface WordCountOptions extends PipelineOptions {
        /**
         * Input file option. The path can be passed as a command-line argument
         * (for example ./pom.xml); the default is
         * gs://apache-beam-samples/shakespeare/kinglear.txt.
         */
        @Description("Path of the file to read from")
        @Default.String("gs://apache-beam-samples/shakespeare/kinglear.txt")
        String getInputFile();
        void setInputFile(String value);

        /**
         * Output path of the result file. Set it in IntelliJ IDEA's run
         * configuration or on the command line.
         */
        @Description("Path of the file to write to")
        @Required
        String getOutput();
        void setOutput(String value);
    }

    /**
     * 5. Run the program.
     */
    public static void main(String[] args) {
        WordCountOptions options = PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(WordCountOptions.class);
        Pipeline p = Pipeline.create(options);

        p.apply("ReadLines", TextIO.Read.from(options.getInputFile()))
            .apply(new CountWords())
            .apply(MapElements.via(new FormatAsTextFn()))
            .apply("WriteCounts", TextIO.Write.to(options.getOutput()));

        p.run().waitUntilFinish();
    }
}
```
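
One detail of `ExtractWordsFn` deserves a closer look: why the loop skips empty strings. With the pattern `[^a-zA-Z']+`, `String.split` returns a leading empty string when a line starts with a non-letter, and a single empty string for an empty line. The sketch below isolates that behaviour in plain Java; the class `SplitSketch` and the helper `extractWords` are illustrative names and not part of the project.

```java
import java.util.ArrayList;
import java.util.List;

// Isolates the tokenizing logic of ExtractWordsFn; hypothetical helper, no Beam dependency.
final class SplitSketch {

    // Splits on runs of characters that are neither ASCII letters nor apostrophes,
    // then drops the empty tokens that split() can produce.
    static List<String> extractWords(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.split("[^a-zA-Z']+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Leading spaces, punctuation, and digits all act as separators.
        System.out.println(extractWords("  King Lear's daughters, 3 of them"));
        // An empty line yields no words at all.
        System.out.println(extractWords(""));
    }
}
```

Because digits fall outside `a-zA-Z'`, numbers never appear in the word counts, while apostrophes are kept so that a token such as `Lear's` stays in one piece.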

3. Running the WordCount program on Spark, Flink, Apex, and other big data frameworks

Use the full project GitHub source code (recommended), and make sure the modules in pom.xml load successfully. Developing big data programs inside the IDE makes debugging easier and gives a better development experience.

3.1. Running the pipeline on the Spark big data framework in IntelliJ IDEA (Community Edition)

Spark execution
- Set VM Options
```
-DPspark-runner
```
- Set program arguments
```
--inputFile=pom.xml --output=counts
```

3.2. Running the WordCount pipeline on Apex, Flink, and other supported big data frameworks in IntelliJ IDEA (Community Edition). See the full project GitHub source code.

Apex execution
- Set VM options
```
-DPapex-runner
```
- Set program arguments
```
--inputFile=pom.xml --output=counts
```

Flink execution, etc.
- Set VM Options
```
-DPflink-runner
```
- Set program arguments
```
--inputFile=pom.xml --output=counts
```
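
The program arguments above follow a `--name=value` convention, which `PipelineOptionsFactory.fromArgs` maps onto the getters and setters of `WordCountOptions`. As a rough illustration of that convention only (a simplified stand-in, not Beam's actual parser), the sketch below splits such arguments into a map; `ArgsSketch` and `parse` are hypothetical names. It also shows what happens when the space between two arguments is lost.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for --name=value argument handling; not Beam's implementation.
final class ArgsSketch {

    // Parses arguments of the form --name=value into an insertion-ordered map.
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String arg : args) {
            if (!arg.startsWith("--") || !arg.contains("=")) {
                throw new IllegalArgumentException("Expected --name=value but got: " + arg);
            }
            // Split at the first '=' only; everything after it is the value.
            int eq = arg.indexOf('=');
            options.put(arg.substring(2, eq), arg.substring(eq + 1));
        }
        return options;
    }

    public static void main(String[] args) {
        // Two separate tokens: inputFile and output are both set.
        System.out.println(parse(new String[] {"--inputFile=pom.xml", "--output=counts"}));
        // One fused token: only inputFile is set, with an unintended value.
        System.out.println(parse(new String[] {"--inputFile=pom.xml--output=counts"}));
    }
}
```

In the fused case no `output` option is ever seen, which in the real program would conflict with the `@Required` annotation on `getOutput()`.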

4. Running from the terminal (not recommended: the first download is very slow and the development experience is poor)

4.1. The following command downloads the official example source code. The first download is slow; if it fails, retry a few times. (Recommended instead: download the full project GitHub source code and run it in IntelliJ IDEA as described above.)

```
mvn archetype:generate \
      -DarchetypeRepository=https://repository.apache.org/content/groups/snapshots \
      -DarchetypeGroupId=org.apache.beam \
      -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
      -DarchetypeVersion=LATEST \
      -DgroupId=org.example \
      -DartifactId=word-count-beam \
      -Dversion="0.1" \
      -Dpackage=org.apache.beam.examples \
      -DinteractiveMode=false
```

4.2. Package and Execute

```
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
      -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -Pspark-runner
```

4.3. Results of a successful run

4.3.1. Output showing successful execution

4.3.2. WordCount output: the computed results


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
