Not much preamble; let's get straight to the good stuff!
https://beam.apache.org/get-started/beam-overview/
https://beam.apache.org/get-started/quickstart-java/
Apache Beam Java SDK Quickstart
This Quickstart will walk you through executing your first Beam pipeline to run WordCount, written using Beam's Java SDK, on a runner of your choice.
- Set up your development environment
- Get the WordCount Code
- Run WordCount
- Inspect the results
- Next Steps
To make the Quickstart easier to follow, I have translated it into Chinese and organized it here.
This post uses the Java SDK; you can try running the example on different execution engines.
Step one: Set up the development environment
- Download and install the Java Development Kit (JDK) version 1.7 or later. Check that the JAVA_HOME environment variable is set and points to your JDK installation directory.
- Follow Maven's installation guide to download and install Apache Maven for your operating system.
Step two: Get the WordCount example code
The easiest way to get a copy of the WordCount pipeline code is to use the following command to generate a simple Maven project containing the Beam WordCount examples:
The Apache Beam source code is hosted on GitHub and can be downloaded from https://github.com/apache/beam
To generate the sample project from the Maven archetype, run a command like the following (this pulls the most recent snapshot version):
$ mvn archetype:generate \
      -DarchetypeRepository=https://repository.apache.org/content/groups/snapshots \
      -DarchetypeGroupId=org.apache.beam \
      -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
      -DarchetypeVersion=LATEST \
      -DgroupId=org.example \
      -DartifactId=word-count-beam \
      -Dversion="0.1" \
      -Dpackage=org.apache.beam.examples \
      -DinteractiveMode=false
The recommended command uses the latest stable release instead:
$ mvn archetype:generate \
      -DarchetypeGroupId=org.apache.beam \
      -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
      -DarchetypeVersion=2.1.0 \
      -DgroupId=org.example \
      -DartifactId=word-count-beam \
      -Dversion="0.1" \
      -Dpackage=org.apache.beam.examples \
      -DinteractiveMode=false
That is because, at the time of writing, the newest Beam release is 2.1.0.
This creates a directory named word-count-beam that contains a simple pom.xml file and a series of example pipelines that count the words in text files.
$ cd word-count-beam/
$ ls
pom.xml  src
$ ls src/main/java/org/apache/beam/examples/
DebuggingWordCount.java  WindowedWordCount.java  common
MinimalWordCount.java    WordCount.java
For a detailed introduction to the Beam concepts used in these examples, read the WordCount Example Walkthrough. Here we focus only on how to execute WordCount.java.
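Before running it, it helps to see the shape of such a pipeline. The following is a condensed sketch in the style of the bundled MinimalWordCount example, not the exact contents of WordCount.java; the class name SketchWordCount is illustrative, and it assumes the FlatMapElements/MapElements API of a recent Beam Java SDK:

package org.apache.beam.examples;

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class SketchWordCount {
  public static void main(String[] args) {
    // Command-line arguments such as --runner=FlinkRunner become pipeline options.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from("pom.xml"))
        // Split each line into words on any run of non-letter characters.
        .apply(FlatMapElements.into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        .apply(Filter.by((String word) -> !word.isEmpty()))
        // Count the occurrences of each distinct word.
        .apply(Count.perElement())
        // Format each word/count pair as "word: count".
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        .apply(TextIO.write().to("counts"));

    p.run().waitUntilFinish();
  }
}

WordCount.java builds this same shape out of reusable pieces: a composite CountWords transform, a custom options interface, and a metric that tracks empty lines.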
Step three: Run the WordCount example
A Beam program can run on multiple execution engines, including the ApexRunner, FlinkRunner, SparkRunner, and DataflowRunner. There is also the DirectRunner, which executes locally without special configuration and is therefore convenient for testing.
Below, choose the engine you want the program to run on, that is, which runner to use, and then:
- Configure the engine: make sure you have configured the runner correctly.
- Use the appropriate command: specify the engine with the --runner=<runner> argument (the default is the DirectRunner), add any engine-specific arguments, and specify the input file and output location. Make sure the paths are accessible to the execution engine; for example, an external cluster cannot access a local file directory. (See the options sketch after this list.)
- Run the sample program, your first WordCount pipeline.
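To see how arguments such as --inputFile and --output reach the program, here is a hypothetical options interface in the spirit of the one WordCount.java declares; the annotations and defaults shown are illustrative assumptions, not the example's verbatim code. Beam turns each getter/setter pair into a command-line argument of the same name:

import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;

// Illustrative options interface; Beam generates the implementation at runtime.
public interface WordCountOptions extends PipelineOptions {
  @Description("Path of the file to read from")
  @Default.String("pom.xml")
  String getInputFile();
  void setInputFile(String value);

  @Description("Prefix of the output files to write to")
  @Default.String("counts")
  String getOutput();
  void setOutput(String value);
}

In main(), PipelineOptionsFactory.fromArgs(args).withValidation().as(WordCountOptions.class) parses the command line, and --runner works on top of this because WordCountOptions extends PipelineOptions.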
Direct
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner
Apex
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=pom.xml --output=counts --runner=ApexRunner" -Papex-runner
Flink-local
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=FlinkRunner --inputFile=pom.xml --output=counts" -Pflink-runner
Flink-cluster
$ mvn package exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=FlinkRunner --flinkMaster=<flink master> \
                  --filesToStage=target/word-count-beam-bundled-0.1.jar \
                  --inputFile=/path/to/quickstart/pom.xml --output=/tmp/counts" -Pflink-runner
You can monitor the running job by visiting the Flink dashboard at http://<flink master>:8081.
Spark
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -Pspark-runner
Dataflow
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
                  --gcpTempLocation=gs://<your-gcs-bucket>/tmp \
                  --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
     -Pdataflow-runner
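If you are curious which settings those Dataflow-specific arguments populate, a small hypothetical check like the one below prints them back (DataflowOptionsCheck is my name for it, and it assumes the beam-runners-google-cloud-dataflow-java dependency that the -Pdataflow-runner profile brings in):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DataflowOptionsCheck {
  public static void main(String[] args) {
    // --project and --gcpTempLocation from the command line are parsed
    // onto DataflowPipelineOptions before any pipeline is built.
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation().as(DataflowPipelineOptions.class);
    System.out.println("GCP project: " + options.getProject());
    System.out.println("Temp location: " + options.getGcpTempLocation());
  }
}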
Step four: Inspect the results
Once the pipeline finishes running, you can view the results. You will notice that there may be multiple output files whose names begin with counts. How many such files there are is decided by the runner; this gives the runner the flexibility to perform efficient distributed execution. When you look at the contents of the files, you will see the number of occurrences displayed after each unique word. The order of elements in a file is not fixed, because the Beam model generally does not guarantee ordering, again so that runners can optimize for efficiency.
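How many counts-* files appear is a property of the write step. As a minimal sketch, assuming TextIO's withNumShards setting from a recent Beam Java SDK (the class name SingleShardWrite is mine), you could force a single output file at the cost of parallel writes:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class SingleShardWrite {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(Create.of("beam: 1", "have: 1", "simple: 1"))
        // By default the runner chooses the shard count, which is why WordCount
        // produces several counts-* files. Forcing one shard yields a single
        // counts-00000-of-00001 file instead, giving up parallel writes.
        .apply(TextIO.write().to("counts").withNumShards(1));
    p.run().waitUntilFinish();
  }
}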
Direct
$ ls counts*
$ more counts*
api: 9
bundled: 1
old: 4
apache: 2
the: 1
limitations: 1
foundation: 1
...
Apex
$ cat counts*
beam: 1
have: 1
simple: 1
skip: 4
PAssert: 1
...
Flink-local
$ ls counts*
$ more counts*
the: 1
api: 9
old: 4
apache: 2
limitations: 1
bundled: 1
foundation: 1
...
Flink-cluster
$ ls /tmp/counts*
$ more /tmp/counts*
the: 1
api: 9
old: 4
apache: 2
limitations: 1
bundled: 1
foundation: 1
...
Spark
$ ls counts*
$ more counts*
beam: 27
sf: 1
fat: 1
job: 1
limitations: 1
require: 1
of: 11
profile: 10
...
Dataflow
$ gsutil ls gs://<your-gcs-bucket>/counts*
$ gsutil cat gs://<your-gcs-bucket>/counts*
feature: 15
smother'st: 1
revelry: 1
bashfulness: 1
bashful: 1
below: 2
deserves: 32
barrenly: 1
...
Summary
Apache Beam is aimed primarily at embarrassingly parallel data processing tasks: a dataset is split into many sub-datasets that can each be processed independently, so the dataset as a whole is processed in parallel. You can also use Beam for extract, transform, and load (ETL) tasks and for data integration: reading data from different storage media or data sources, transforming it into a more desirable format, and loading it into a new system.
Beam Programming Series: Java SDK Quickstart (following the official website's recommended steps)