Apache Beam WordCount: Programming in Practice and Source Code Interpretation


Overview: This article works through the Apache Beam WordCount program in practice, interprets its source code, and debugs the program in two ways: from IntelliJ IDEA and from the terminal. Apache Beam provides an advanced unified programming model for big data batch and stream processing, and programs written with it can execute on multiple big data processing engines. The full project source code is on GitHub.


I am responsible for my company's big data processing architecture. The variety of frameworks in use greatly increases development cost, so a unified programming model is urgently needed. Apache Beam offers one: a single program that executes everywhere. I am therefore sharing the results of my experiments with it.

1. Apache Beam programming in practice: preface, Apache Beam features, and key concepts

Apache Beam became Apache's new top-level project on January 10, 2017.

1.1. Apache Beam features:
    • Unified: uses a single programming model for both batch and streaming use cases.
    • Portable: executes pipelines on multiple execution environments, including Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.
    • Extensible: write and share new SDKs, IO connectors, and transformation libraries.
      Partially translated from the Apache Beam official website.
1.2. Apache Beam key concepts:
1.2.1. Apache Beam SDKs

The main development API. It provides a unified programming model for batch and stream processing. As of 2017 the Java SDK is supported, and a Python SDK is under active development.

1.2.2. Apache Beam Pipeline Runners (the engines that execute Beam pipelines)

Runners exist for several big data computing frameworks: Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. You write the program once with Apache Beam and execute it on any of these frameworks.

1.2.3. Details of what each runner supports (figure not preserved)

2. Apache Beam programming in practice: Apache Beam source code interpretation

The project is built with Maven and developed in IntelliJ IDEA; see pom.xml in the full project source code on GitHub. You can import the full project directly through IDEA's project import feature, wait for Maven to download the dependencies, and then follow the interpretation steps below to run it.

2.1. Source code analysis: how Apache Beam processes a data flow

Key steps:

    • Create the pipeline
    • Apply transforms to the pipeline
    • Read the input file
    • Apply a ParDo transform
    • Apply a transform provided by the SDK (for example, Count)
    • Write out the output
    • Run the pipeline
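
The key steps above can be mirrored without any Beam dependency. The following is a minimal plain-JDK sketch, not Beam code: an in-memory list stands in for the input file, java.util.stream stands in for the transforms, and the class and method names (WordCountStagesSketch, countWords, formatCounts) are hypothetical names chosen only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-JDK analogue of the pipeline stages; no Beam classes are used.
final class WordCountStagesSketch {

    // Analogue of "apply a ParDo" followed by "apply Count":
    // split every line into words, drop empty tokens, count each distinct word.
    // A TreeMap is used only so that the iteration order is deterministic.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("[^a-zA-Z']+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    // Analogue of the formatting step: one "word: count" string per entry.
    static List<String> formatCounts(Map<String, Long> counts) {
        return counts.entrySet().stream()
                .map(entry -> entry.getKey() + ": " + entry.getValue())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Analogue of "read the input file": a fixed list of lines.
        List<String> lines = Arrays.asList("to be or not to be", "", "that is the question");
        // Analogue of "write out the output": print instead of writing files.
        formatCounts(countWords(lines)).forEach(System.out::println);
    }
}
```

In a real pipeline nothing is computed while the transforms are being applied; the work starts only at the final "run the pipeline" step. The sketch computes eagerly, which is one way it differs from Beam.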

2.2. Source code walkthrough. The full project source code on GitHub includes WordCount, pom.xml, and the other project files.
/**
 * MIT.
 * Author: wangxiaolei (Wang Xiaole).
 * Date: 17-2-20.
 * Project: ApacheBeamWordCount.
 */
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.transforms.Aggregator;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class WordCount {

    /**
     * 1. a. Writing the logic as a DoFn keeps the pipeline code very concise.
     *    b. Splits each input line of text into words and outputs them.
     */
    static class ExtractWordsFn extends DoFn<String, String> {
        // Counts blank input lines. Aggregator belongs to the pre-2.0 Beam API used by this project.
        private final Aggregator<Long, Long> emptyLines =
                createAggregator("emptyLines", Sum.ofLongs());

        @ProcessElement
        public void processElement(ProcessContext c) {
            if (c.element().trim().isEmpty()) {
                emptyLines.addValue(1L);
            }
            // Split the line of text into words.
            String[] words = c.element().split("[^a-zA-Z']+");
            // Output each non-empty word to the resulting PCollection.
            for (String word : words) {
                if (!word.isEmpty()) {
                    c.output(word);
                }
            }
        }
    }

    /** 2. Formats each word and its count as a printable string. */
    public static class FormatAsTextFn extends SimpleFunction<KV<String, Long>, String> {
        @Override
        public String apply(KV<String, Long> input) {
            return input.getKey() + ": " + input.getValue();
        }
    }

    /**
     * 3. Word count: a PTransform (PCollection transform) that converts a PCollection of
     * text lines into a PCollection of per-word counts.
     */
    public static class CountWords
            extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
        @Override
        public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
            // Convert lines of text into individual words.
            PCollection<String> words = lines.apply(ParDo.of(new ExtractWordsFn()));
            // Count the occurrences of each word.
            PCollection<KV<String, Long>> wordCounts = words.apply(Count.<String>perElement());
            return wordCounts;
        }
    }

    /** 4. Defines the program's options, for example the input and output file paths. */
    public interface WordCountOptions extends PipelineOptions {

        /**
         * Input file option. The path can be passed on the command line; the default is
         * gs://apache-beam-samples/shakespeare/kinglear.txt
         */
        @Description("Path of the file to read from")
        @Default.String("gs://apache-beam-samples/shakespeare/kinglear.txt")
        String getInputFile();
        void setInputFile(String value);

        /**
         * Output path of the result file. Specify it in IntelliJ IDEA's run configuration
         * or on the command line, such as ./pom.xml
         */
        @Description("Path of the file to write to")
        @Required
        String getOutput();
        void setOutput(String value);
    }

    /** 5. Execute the program. */
    public static void main(String[] args) {
        WordCountOptions options = PipelineOptionsFactory.fromArgs(args)
                .withValidation()
                .as(WordCountOptions.class);
        Pipeline p = Pipeline.create(options);

        p.apply("ReadLines", TextIO.Read.from(options.getInputFile()))
                .apply(new CountWords())
                .apply(MapElements.via(new FormatAsTextFn()))
                .apply("WriteCounts", TextIO.Write.to(options.getOutput()));

        p.run().waitUntilFinish();
    }
}
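
The word-splitting step in the listing above filters out empty strings after splitting. The following plain-JDK sketch shows why that filter is needed, assuming the split pattern is `[^a-zA-Z']+` (letters and apostrophes are kept, everything else is a delimiter). The class name ExtractWordsSketch and its members are hypothetical; nothing here touches Beam.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-alone illustration of the tokenization used by the word-extraction step.
final class ExtractWordsSketch {

    // Assumed split pattern: any run of characters that are not ASCII letters or apostrophes.
    static final String TOKENIZER_PATTERN = "[^a-zA-Z']+";

    // Same loop shape as the DoFn body, but collecting into a list instead of calling c.output.
    static List<String> extractWords(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.split(TOKENIZER_PATTERN)) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // String.split keeps an empty leading token when the line starts with a delimiter,
        // and returns a single empty token for an empty line.
        System.out.println(Arrays.toString("  hello".split(TOKENIZER_PATTERN)));
        System.out.println(Arrays.toString("".split(TOKENIZER_PATTERN)));

        // With the isEmpty filter, only real words remain; digits and punctuation act as delimiters.
        System.out.println(extractWords("  hello"));
        System.out.println(extractWords("Don't panic, Lear!"));
        System.out.println(extractWords("Act 3, scene 2"));
    }
}
```

Because the pattern only recognises ASCII letters, accented or non-Latin words would be broken apart by this tokenizer.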

3. Running the WordCount program on Spark, Flink, Apex, and other big data frameworks. Full project source code on GitHub. (Recommended: check that the pom.xml modules load successfully. Developing big data programs inside the IDE makes debugging easier and gives a better development experience.)

3.1. Running the pipeline on the Spark big data framework in IntelliJ IDEA (Community Edition)
    • Spark execution

      • Set VM options

        -DPspark-runner
      • Set Program arguments

        --inputFile=pom.xml --output=counts

3.2. Running the WordCount pipeline in IntelliJ IDEA (Community Edition) on Apex, Flink, and the other supported big data frameworks. Full project source code on GitHub.
  • Apex execution

    • Set VM options

      -DPapex-runner
    • Set Program arguments

      --inputFile=pom.xml --output=counts
  • Flink execution, etc.

    • Set VM options

      -DPflink-runner
    • Set Program arguments

      --inputFile=pom.xml --output=counts
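
The program arguments are `--name=value` tokens that PipelineOptionsFactory.fromArgs matches to the option getters and setters (inputFile, output). The sketch below is not Beam's parser; it is a hypothetical plain-JDK helper (OptionArgsSketch, parse, require are illustrative names) showing why each option has to be its own whitespace-separated token, as in the terminal command in section 4.2.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for command-line option parsing; not the Beam implementation.
final class OptionArgsSketch {

    // Parses tokens of the form --name=value into a name -> value map.
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (!arg.startsWith("--") || eq < 0) {
                throw new IllegalArgumentException("Expected --name=value but got: " + arg);
            }
            // Everything after the first '=' is the value, even if it contains "--".
            options.put(arg.substring(2, eq), arg.substring(eq + 1));
        }
        return options;
    }

    // Mimics a required option: fail early when it is missing.
    static String require(Map<String, String> options, String name) {
        String value = options.get(name);
        if (value == null) {
            throw new IllegalArgumentException("Missing required option --" + name);
        }
        return value;
    }

    public static void main(String[] args) {
        // Two separate tokens: both options are recognised.
        Map<String, String> separated =
                parse(new String[] {"--inputFile=pom.xml", "--output=counts"});
        System.out.println(separated);

        // One fused token: the whole tail becomes the inputFile value and output is absent.
        Map<String, String> fused = parse(new String[] {"--inputFile=pom.xml--output=counts"});
        System.out.println(fused);
        System.out.println("output present: " + fused.containsKey("output"));
    }
}
```

With the fused token, a required-option check such as require(fused, "output") would throw, which is the kind of failure to expect when the space between the two arguments is missing.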
4. Running from the terminal (not recommended: the first download is very slow and the development experience is poor)

4.1. The following command downloads the official example source code. The first run downloads slowly; if it fails, retry a few times. (Recommended instead: download the full project source code from GitHub and run it in IntelliJ IDEA as described above.)
mvn archetype:generate \
      -DarchetypeRepository=https://repository.apache.org/content/groups/snapshots \
      -DarchetypeGroupId=org.apache.beam \
      -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
      -DarchetypeVersion=LATEST \
      -DgroupId=org.example \
      -DartifactId=word-count-beam \
      -Dversion="0.1" \
      -Dpackage=org.apache.beam.examples \
      -DinteractiveMode=false


4.2. Package and Execute
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
      -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -Pspark-runner


4.3. Results of a successful run

4.3.1. Console output of the successful run (screenshot not preserved)

4.3.2. WordCount output: the computed word counts (screenshot not preserved)
