Standalone applications (translated from Learning Spark: Lightning-Fast Big Data Analysis)


So far in this quick tour of Spark, we haven't discussed how to use Spark in a standalone application. Apart from running it interactively, you can link Spark into standalone applications in Java, Scala, or Python. The main difference from using Spark in the shell is that you need to initialize your own SparkContext in the program.


The process of linking to Spark varies by language. In Java and Scala, you add a Maven dependency on spark-core to your application. At the time this book was written, the latest Spark version was 1.2.0, and its Maven coordinates were:

groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.2.0
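
In a Maven-built project, these coordinates translate into a dependency entry in your pom.xml roughly like the following (a sketch derived from the coordinates above, not reproduced from the book):

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.2.0</version>
</dependency>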


Maven is a popular package-management tool for Java-based languages that lets you link to libraries published in public repositories. You can use Maven itself to build your project, or use other tools that can talk to Maven repositories, including Scala's sbt and Gradle. Popular integrated development environments such as Eclipse also let you add a Maven dependency directly to a project.


In Python, you simply write applications as Python scripts, but you must run them using the bin/spark-submit script that ships with Spark. The bin/spark-submit script pulls in the Spark dependencies needed in Python; it sets up the environment variables that Spark's Python API requires in order to function. Run your script as shown in Example 2-6.

Example 2-6. Running a Python script
bin/spark-submit my_script.py

(Note that on Windows you need to use a backslash instead of a forward slash.)


Initializing a SparkContext

Once your application has linked to Spark, you need to import the Spark packages in your program and create a SparkContext. You do so by first creating a SparkConf object to configure your application, and then building a SparkContext from it. Examples 2-7 through 2-9 show this process in each of the languages Spark supports.

Example 2-7. Initializing Spark in Python
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)

Example 2-8. Initializing Spark in Scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf().setMaster("local").setAppName("My App")
val sc = new SparkContext(conf)

Example 2-9. Initializing Spark in Java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
SparkConf conf = new SparkConf().setMaster("local").setAppName("My App");
JavaSparkContext sc = new JavaSparkContext(conf);


These examples show the minimal way to initialize a SparkContext, where you pass only two parameters:

1. The cluster URL ("local" in these examples) tells Spark how to connect to a cluster; local is a special value that runs Spark on one thread of the local machine, without connecting to a cluster. (A sketch of passing a real cluster URL follows this list.)

2. The application name ("My App" in these examples) identifies your application on the cluster manager's UI when you connect to a cluster.
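
For illustration only (this is not an example from the book), pointing the same Scala initialization at a standalone cluster instead of local mode might look like the following, where spark://masterhost:7077 is a placeholder master URL:

// Connect to a (hypothetical) standalone cluster master instead of local mode.
val conf = new SparkConf().setMaster("spark://masterhost:7077").setAppName("My App")
val sc = new SparkContext(conf)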

In addition to these two parameters, there are other parameters you can use to configure how your application executes or to add code to be shipped to the cluster, but we will not cover those until later chapters of the book.


After you have initialized a SparkContext, you can use all the methods we showed earlier to create RDDs (for example, from a text file) and manipulate them.


Finally, to shut down Spark, you can either call the stop() method on your SparkContext, or simply exit the application (for example, with System.exit(0) or sys.exit()).
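
As a minimal Scala sketch (assembled from the pieces above rather than reproduced from the book, with MyApp as a placeholder name), a complete application shell with an explicit shutdown might look like:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object MyApp {
  def main(args: Array[String]): Unit = {
    // Configure and create the SparkContext, as in Example 2-8.
    val conf = new SparkConf().setMaster("local").setAppName("My App")
    val sc = new SparkContext(conf)
    // ... create and transform RDDs here ...
    sc.stop()  // shut Spark down cleanly once the work is finished
  }
}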


This brief overview of Spark should be enough to let you run a standalone Spark application on your own computer. For more advanced configuration, Chapter 7 covers how to connect your application to a cluster, including packaging your application so that its code is automatically shipped to the worker nodes. For now, refer to the Quick Start Guide in the official Spark documentation.


Building standalone applications

The introductory chapters of a big data book would be incomplete without a word count example. On a single machine, implementing word count is simple, but in a distributed framework it is a common example because it involves reading data on a large number of worker nodes and combining it. We will look at building and packaging a word count program with both sbt and Maven. All of our examples can be built together, but to illustrate a stripped-down build with the fewest dependencies, we have a smaller project in the learning-spark-examples/mini-complete-example directory, as you can see in Examples 2-10 (Java) and 2-11 (Scala).

Example 2-10. Word count Java program (don't worry about the details yet)

// Create a SparkConf and a Java Spark context.
SparkConf conf = new SparkConf().setAppName("WordCount");
JavaSparkContext sc = new JavaSparkContext(conf);
// Load our input data.
JavaRDD<String> input = sc.textFile(inputFile);
// Split each line up into words.
JavaRDD<String> words = input.flatMap(
  new FlatMapFunction<String, String>() {
    public Iterable<String> call(String x) {
      return Arrays.asList(x.split(" "));
    }});
// Transform into (word, count) pairs and count the words.
JavaPairRDD<String, Integer> counts = words.mapToPair(
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String x) {
      return new Tuple2(x, 1);
    }}).reduceByKey(new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer x, Integer y) { return x + y; }});
// Save the word counts back out to a text file, causing evaluation.
counts.saveAsTextFile(outputFile);


Example 2-11. Word count Scala program (don't worry about the details yet)

// Create a SparkConf and a Scala Spark context.
val conf = new SparkConf().setAppName("WordCount")
val sc = new SparkContext(conf)
// Load our input data.
val input = sc.textFile(inputFile)
// Split each line up into words.
val words = input.flatMap(line => line.split(" "))
// Transform into (word, count) pairs and count the words.
val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
// Save the word counts back out to a text file, causing evaluation.
counts.saveAsTextFile(outputFile)


We can build these applications with very simple build files using either sbt (Example 2-12) or Maven (Example 2-13). We have marked the Spark core dependency as provided so that, when we later package our application into an assembly JAR, the spark-core JAR is not included, since it is already present on the classpath of the worker nodes.

Example 2-12. SBT Build File

Name: = "Learning-spark-mini-example"
Version: = "0.0.1"
Scalaversion: = "2.10.4"
Additional libraries
Librarydependencies ++= Seq (
"Org.apache.spark" percent "Spark-core"% "1.2.0"% "provided"
)


Example 2-13. Maven Build File
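
A minimal pom.xml for this project, sketched here as an assumption consistent with the sbt file above and the JAR name used in Example 2-15 (the groupId is inferred from the package names in Examples 2-14 and 2-15), might look like:

<project>
  <modelVersion>4.0.0</modelVersion>
  <!-- groupId inferred from the example package names (assumption) -->
  <groupId>com.oreilly.learningsparkexamples.mini</groupId>
  <artifactId>learning-spark-mini-example</artifactId>
  <version>0.0.1</version>
  <packaging>jar</packaging>
  <dependencies>
    <!-- Spark dependency; the "provided" scope keeps it out of the assembly JAR -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.2.0</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</project>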

Tip

The spark-core package is marked as provided so that it is not bundled into the assembly JAR along with our application. Chapter 7 discusses this in more detail.


Once our build files are defined, we can easily package our application and run it with the bin/spark-submit script. spark-submit sets up a number of environment variables that Spark requires. From the mini-complete-example directory, we can build both the Scala (Example 2-14) and Java (Example 2-15) versions of the program.

Example 2-14. Scala build and run
sbt clean package
$SPARK_HOME/bin/spark-submit \
  --class com.oreilly.learningsparkexamples.mini.scala.WordCount \
  ./target/... (as above) \
  ./README.md ./wordcounts


Example 2-15. Maven build and run
mvn clean && mvn compile && mvn package
$SPARK_HOME/bin/spark-submit \
  --class com.oreilly.learningsparkexamples.mini.java.WordCount \
  ./target/learning-spark-mini-example-0.0.1.jar \
  ./README.md ./wordcounts


For more detailed examples of connecting applications to a cluster, refer to the Quick Start Guide in the official Spark documentation. Chapter 7 covers packaging applications in more detail.

