Beam Programming Series: Java SDK Quickstart (following the steps recommended on the official website)

Source: Internet
Author: User
Tags: apache beam

Without further ado, let's get straight to the practical content.

https://beam.apache.org/get-started/beam-overview/

https://beam.apache.org/get-started/quickstart-java/

Apache Beam Java SDK Quickstart

This Quickstart walks you through executing your first Beam pipeline, WordCount, written with Beam's Java SDK, on a runner of your choice.

    • Set up your development environment
    • Get the WordCount Code
    • Run WordCount
    • Inspect the results
    • Next Steps

To make getting started easier, I have translated and reorganized the official Quickstart.

This post uses the Java SDK; you can try running the example on different execution engines.

Step one: Set up the development environment
    1. Download and install the Java Development Kit (JDK), version 1.7 or later. Verify that the JAVA_HOME environment variable is set and points to your JDK installation directory.
    2. Follow Maven's installation guide to download and install Apache Maven for your operating system.
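Before building, it can help to confirm which JDK the shell will use. A minimal sketch (illustration only, not part of the Beam sample) that prints the version and JAVA_HOME and checks the "1.7 or later" requirement:

```java
// Minimal sketch: verify the JDK version and JAVA_HOME before building.
public class EnvCheck {
    // Returns the major Java version, e.g. 8 for "1.8.0_151" or 11 for "11.0.2".
    // Pre-Java-9 versions report "1.x", so the major version is the second field.
    static int majorVersion(String version) {
        String[] parts = version.split("\\.");
        int first = Integer.parseInt(parts[0]);
        return first == 1 ? Integer.parseInt(parts[1]) : first;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        System.out.println("java.version = " + version);
        System.out.println("JAVA_HOME    = " + System.getenv("JAVA_HOME"));
        if (majorVersion(version) < 7) {
            System.err.println("JDK 1.7 or later is required.");
        }
    }
}
```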

Step two: Get the WordCount sample code

The easiest way to get a copy of the WordCount pipeline is to use the following command, which generates a simple Maven project containing a Beam-based WordCount sample:

The Apache Beam source code is hosted on GitHub and can be downloaded from: https://github.com/apache/beam

Then generate the sample project with the Maven archetype. One form of the command pulls the archetype from the Apache snapshot repository:

$ mvn archetype:generate \
      -DarchetypeRepository=https://repository.apache.org/content/groups/snapshots \
      -DarchetypeGroupId=org.apache.beam \
      -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
      -DarchetypeVersion= \
      -DgroupId=org.example \
      -DartifactId=word-count-beam \
      -Dversion="0.1" \
      -Dpackage=org.apache.beam.examples \
      -DinteractiveMode=false

The recommended command, however, uses the latest stable release:

$ mvn archetype:generate \
      -DarchetypeGroupId=org.apache.beam \
      -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
      -DarchetypeVersion=2.1.0 \
      -DgroupId=org.example \
      -DartifactId=word-count-beam \
      -Dversion="0.1" \
      -Dpackage=org.apache.beam.examples \
      -DinteractiveMode=false

(2.1.0 is the most recent stable Beam release at the time of writing.)

This creates a directory named word-count-beam that contains a simple pom.xml file and a sample pipeline that counts the individual words in a text file.

$ cd word-count-beam/
$ ls
pom.xml    src

$ ls src/main/java/org/apache/beam/examples/
DebuggingWordCount.java    WindowedWordCount.java    common
MinimalWordCount.java    WordCount.java

For a detailed introduction to the Beam concepts used in these examples, read the WordCount Example Walkthrough article. Here we focus only on how to execute WordCount.java.
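WordCount.java builds a full Beam pipeline, but the computation itself is simple. A plain-Java sketch of the same word-frequency logic (illustration only, no Beam involved; the actual tokenizer regex in the Beam example may differ):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: the word-frequency logic that the Beam WordCount
// pipeline distributes across workers, written as plain sequential Java.
public class WordCountSketch {
    // Splits on non-word characters and tallies each token.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.split("[^\\p{L}\\p{N}']+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("to be or not to be");
        counts.forEach((w, c) -> System.out.println(w + ": " + c));
    }
}
```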

Run the WordCount sample code

A Beam program can run on one of several Beam execution engines, including the ApexRunner, FlinkRunner, SparkRunner, and DataflowRunner. There is also the DirectRunner, which executes locally without special configuration and is therefore convenient for testing.

Below, select the engine (that is, the runner) you want to execute the program on, then:

    1. Configure the engine, making sure you have set up the runner correctly.
    2. Use the appropriate command: specify the engine with the --runner=<runner> parameter (the default is DirectRunner), add any engine-specific parameters, and specify the input file and output location. Make sure the output location is accessible to the execution engine; for example, a local file path cannot be accessed by a remote cluster.
    3. Run the sample program: your first WordCount pipeline.
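The --runner=<runner> selection described above can be pictured as flag parsing with a default. A hypothetical plain-Java sketch (Beam's real option parsing is done by its PipelineOptionsFactory and handles far more than this):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of how "--runner=<runner>" style flags select an
// execution engine, defaulting to DirectRunner when the flag is absent.
public class RunnerArgs {
    static String runnerFor(String[] args) {
        Map<String, String> flags = new HashMap<>();
        for (String arg : args) {
            if (arg.startsWith("--") && arg.contains("=")) {
                int eq = arg.indexOf('=');
                flags.put(arg.substring(2, eq), arg.substring(eq + 1));
            }
        }
        return flags.getOrDefault("runner", "DirectRunner");
    }

    public static void main(String[] args) {
        System.out.println("Selected runner: " + runnerFor(args));
    }
}
```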

Direct
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner

Apex
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=pom.xml --output=counts --runner=ApexRunner" -Papex-runner

Flink-local
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=FlinkRunner --inputFile=pom.xml --output=counts" -Pflink-runner

Flink-cluster
$ mvn package exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=FlinkRunner --flinkMaster=<flink master> \
                  --filesToStage=target/word-count-beam-bundled-0.1.jar \
                  --inputFile=/path/to/quickstart/pom.xml --output=/tmp/counts" -Pflink-runner

You can then monitor the running job by visiting the Flink dashboard at http://<flink master>:8081.

Spark
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -Pspark-runner

Dataflow
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
                  --gcpTempLocation=gs://<your-gcs-bucket>/tmp \
                  --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
     -Pdataflow-runner

Run results

Once the pipeline finishes, you can view the results. You will notice that there are multiple output files whose names begin with count; how many such files there are is determined by the runner, which makes it convenient for the runner to perform efficient distributed execution.

When you look at a file's contents, you will see the number of occurrences after each unique word. The order of elements in the output may differ from what is shown here, because the Beam model generally does not guarantee ordering; this gives runners the freedom to optimize for efficiency.
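Because element order is not guaranteed, comparing output across runs or runners should be order-insensitive. A small sketch (illustration only) that parses "word: count" lines into a map, so two shuffled outputs with the same tallies compare equal:

```java
import java.util.HashMap;
import java.util.Map;

// Order-insensitive view of WordCount output: parse "word: count" lines
// into a map, so two files with the same tallies compare equal even when
// the runner emitted the lines in a different order.
public class CountsParser {
    static Map<String, Integer> parse(String output) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : output.split("\n")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                counts.put(line.substring(0, colon).trim(),
                           Integer.parseInt(line.substring(colon + 1).trim()));
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = parse("api: 9\nold: 4\napache: 2");
        Map<String, Integer> b = parse("apache: 2\napi: 9\nold: 4");
        System.out.println("Equal regardless of order: " + a.equals(b));
    }
}
```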

Direct
$ ls counts*
$ more counts*
api: 9
bundled: 1
old: 4
apache: 2
the: 1
limitations: 1
foundation: 1
...
Apex
$ cat counts*
beam: 1
have: 1
simple: 1
skip: 4
PAssert: 1
...
Flink-local
$ ls counts*
$ more counts*
the: 1
api: 9
old: 4
apache: 2
limitations: 1
bundled: 1
foundation: 1
...
Flink-cluster
$ ls /tmp/counts*
$ more /tmp/counts*
the: 1
api: 9
old: 4
apache: 2
limitations: 1
bundled: 1
foundation: 1
...
Spark
$ ls counts*
$ more counts*
beam: 27
sf: 1
fat: 1
job: 1
limitations: 1
require: 1
of: 11
profile: 10
...
Dataflow
$ gsutil ls gs://<your-gcs-bucket>/counts*
$ gsutil cat gs://<your-gcs-bucket>/counts*
feature: 15
smother'st: 1
revelry: 1
bashfulness: 1
bashful: 1
below: 2
deserves: 32
barrenly: 1
...

Summary

Apache Beam is aimed primarily at embarrassingly parallel data-processing tasks: by splitting a dataset into multiple sub-datasets that can each be processed independently, the overall dataset is processed in parallel. You can also use Beam for extract, transform, and load (ETL) tasks and data integration: reading data from different storage media or data sources, converting it into another format, and loading it into a new system.
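The split-process-merge idea in the summary can be sketched with plain Java parallel streams (an illustration of the principle only, not of how Beam actually shards work across a cluster):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustration of the split/merge principle: chunks of the dataset are
// counted independently (possibly on different threads) and the partial
// tallies are merged, which is also why output order is arbitrary.
public class ParallelCount {
    static Map<String, Long> count(String[] lines) {
        return Arrays.stream(lines)
                .parallel()                                   // process chunks independently
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(),
                                               Collectors.counting()));
    }

    public static void main(String[] args) {
        String[] lines = {"beam beam pipeline", "pipeline runner"};
        System.out.println(count(lines));
    }
}
```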

