Not much preamble; let's get straight to the good stuff!
https://beam.apache.org/get-started/beam-overview/
https://beam.apache.org/get-started/quickstart-java/
Apache Beam Java SDK Quickstart
This Quickstart will walk you through executing your first Beam pipeline to run WordCount, written using Beam's Java SDK, on a runner of your choice.
- Set up your development environment
- Get the WordCount Code
- Run WordCount
- Inspect the results
- Next Steps
To make the Quickstart easier to follow, I have translated it into Chinese and organized it here.
This post uses the Java SDK; you can try running the example on different execution engines.
Step one: Set up the development environment
- Download and install the Java Development Kit (JDK) version 1.7 or later. Check that the JAVA_HOME environment variable is set and points to your JDK installation directory.
- Follow Maven's installation guide to download and install Apache Maven for your operating system.
Step two: Get the WordCount example code
The easiest way to get a copy of the WordCount pipeline code is to use the following command to generate a simple Maven project containing the Beam WordCount examples:
The Apache Beam source code is hosted on GitHub and can be downloaded from https://github.com/apache/beam
To generate the sample project from the Maven archetype, run a command like the following (this pulls the most recent snapshot version):
$ mvn archetype:generate \
      -DarchetypeRepository=https://repository.apache.org/content/groups/snapshots \
      -DarchetypeGroupId=org.apache.beam \
      -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
      -DarchetypeVersion=LATEST \
      -DgroupId=org.example \
      -DartifactId=word-count-beam \
      -Dversion="0.1" \
      -Dpackage=org.apache.beam.examples \
      -DinteractiveMode=false
The recommended command uses the latest stable release instead:
$ mvn archetype:generate \
      -DarchetypeGroupId=org.apache.beam \
      -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
      -DarchetypeVersion=2.1.0 \
      -DgroupId=org.example \
      -DartifactId=word-count-beam \
      -Dversion="0.1" \
      -Dpackage=org.apache.beam.examples \
      -DinteractiveMode=false
That is because, at the time of writing, the newest Beam release is 2.1.0.
This creates a directory named word-count-beam that contains a simple pom.xml file and a series of example pipelines that count the words in text files.
$ cd word-count-beam/
$ ls
pom.xml  src
$ ls src/main/java/org/apache/beam/examples/
DebuggingWordCount.java  WindowedWordCount.java  common
MinimalWordCount.java    WordCount.java
For a detailed introduction to the Beam concepts used in these examples, read the WordCount Example Walkthrough. Here we focus only on how to execute WordCount.java.
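Before running it, it helps to see the shape of such a pipeline. The following is a condensed sketch in the style of the bundled MinimalWordCount example, not the exact contents of WordCount.java; the class name SketchWordCount is illustrative, and it assumes the FlatMapElements/MapElements API of a recent Beam Java SDK:

package org.apache.beam.examples;

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class SketchWordCount {
  public static void main(String[] args) {
    // Command-line arguments such as --runner=FlinkRunner become pipeline options.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from("pom.xml"))
        // Split each line into words on any run of non-letter characters.
        .apply(FlatMapElements.into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        .apply(Filter.by((String word) -> !word.isEmpty()))
        // Count the occurrences of each distinct word.
        .apply(Count.perElement())
        // Format each word/count pair as "word: count".
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        .apply(TextIO.write().to("counts"));

    p.run().waitUntilFinish();
  }
}

WordCount.java builds this same shape out of reusable pieces: a composite CountWords transform, a custom options interface, and a metric that tracks empty lines.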
Step three: Run the WordCount example
A Beam program can run on multiple execution engines, including the ApexRunner, FlinkRunner, SparkRunner, and DataflowRunner. There is also the DirectRunner, which executes locally without special configuration and is therefore convenient for testing.
Below, choose the engine you want the program to run on, that is, which runner to use, and then:
- Configure the engine: make sure you have configured the runner correctly.
- Use the appropriate command: specify the engine with the --runner=<runner> argument (the default is the DirectRunner), add any engine-specific arguments, and specify the input file and output location. Make sure the paths are accessible to the execution engine; for example, an external cluster cannot access a local file directory. (See the options sketch after this list.)
- Run the sample program, your first WordCount pipeline.
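To see how arguments such as --inputFile and --output reach the program, here is a hypothetical options interface in the spirit of the one WordCount.java declares; the annotations and defaults shown are illustrative assumptions, not the example's verbatim code. Beam turns each getter/setter pair into a command-line argument of the same name:

import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;

// Illustrative options interface; Beam generates the implementation at runtime.
public interface WordCountOptions extends PipelineOptions {
  @Description("Path of the file to read from")
  @Default.String("pom.xml")
  String getInputFile();
  void setInputFile(String value);

  @Description("Prefix of the output files to write to")
  @Default.String("counts")
  String getOutput();
  void setOutput(String value);
}

In main(), PipelineOptionsFactory.fromArgs(args).withValidation().as(WordCountOptions.class) parses the command line, and --runner works on top of this because WordCountOptions extends PipelineOptions.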
Direct
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner
Apex
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=pom.xml --output=counts --runner=ApexRunner" -Papex-runner
Flink-local
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=FlinkRunner --inputFile=pom.xml --output=counts" -Pflink-runner
Flink-cluster
$ mvn package exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=FlinkRunner --flinkMaster=<flink master> \
                  --filesToStage=target/word-count-beam-bundled-0.1.jar \
                  --inputFile=/path/to/quickstart/pom.xml --output=/tmp/counts" -Pflink-runner
You can monitor the running job by visiting the Flink dashboard at http://<flink master>:8081.
Spark
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -Pspark-runner
Dataflow
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
                  --gcpTempLocation=gs://<your-gcs-bucket>/tmp \
                  --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
     -Pdataflow-runner
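If you are curious which settings those Dataflow-specific arguments populate, a small hypothetical check like the one below prints them back (DataflowOptionsCheck is my name for it, and it assumes the beam-runners-google-cloud-dataflow-java dependency that the -Pdataflow-runner profile brings in):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DataflowOptionsCheck {
  public static void main(String[] args) {
    // --project and --gcpTempLocation from the command line are parsed
    // onto DataflowPipelineOptions before any pipeline is built.
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation().as(DataflowPipelineOptions.class);
    System.out.println("GCP project: " + options.getProject());
    System.out.println("Temp location: " + options.getGcpTempLocation());
  }
}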
Step four: Inspect the results
Once the pipeline finishes running, you can view the results. You will notice that there may be multiple output files whose names begin with counts. How many such files there are is decided by the runner; this gives the runner the flexibility to perform efficient distributed execution. When you look at the contents of the files, you will see the number of occurrences displayed after each unique word. The order of elements in a file is not fixed, because the Beam model generally does not guarantee ordering, again so that runners can optimize for efficiency.
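How many counts-* files appear is a property of the write step. As a minimal sketch, assuming TextIO's withNumShards setting from a recent Beam Java SDK (the class name SingleShardWrite is mine), you could force a single output file at the cost of parallel writes:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class SingleShardWrite {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(Create.of("beam: 1", "have: 1", "simple: 1"))
        // By default the runner chooses the shard count, which is why WordCount
        // produces several counts-* files. Forcing one shard yields a single
        // counts-00000-of-00001 file instead, giving up parallel writes.
        .apply(TextIO.write().to("counts").withNumShards(1));
    p.run().waitUntilFinish();
  }
}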
Direct
$ ls counts*
$ more counts*
api: 9
bundled: 1
old: 4
apache: 2
the: 1
limitations: 1
foundation: 1
...
Apex
$ cat counts*
beam: 1
have: 1
simple: 1
skip: 4
PAssert: 1
...
Flink-local
$ ls counts*
$ more counts*
the: 1
api: 9
old: 4
apache: 2
limitations: 1
bundled: 1
foundation: 1
...
Flink-cluster
$ ls /tmp/counts*
$ more /tmp/counts*
the: 1
api: 9
old: 4
apache: 2
limitations: 1
bundled: 1
foundation: 1
...
Spark
$ ls counts*
$ more counts*
beam: 27
sf: 1
fat: 1
job: 1
limitations: 1
require: 1
of: 11
profile: 10
...
Dataflow
$ gsutil ls gs://<your-gcs-bucket>/counts*
$ gsutil cat gs://<your-gcs-bucket>/counts*
feature: 15
smother'st: 1
revelry: 1
bashfulness: 1
bashful: 1
below: 2
deserves: 32
barrenly: 1
...
Summary
Apache Beam is aimed primarily at embarrassingly parallel data processing tasks: a dataset is split into many sub-datasets that can each be processed independently, so the dataset as a whole is processed in parallel. You can also use Beam for extract, transform, and load (ETL) tasks and for data integration: reading data from different storage media or data sources, transforming it into a more desirable format, and loading it into a new system.
Beam Programming Series: Java SDK Quickstart (following the official website's recommended steps)