Apache Spark Learning: Building a Spark Integrated Development Environment with Eclipse

The previous article, "Apache Spark Learning: Deploying Spark to Hadoop 2.2.0", described how to use Maven to build a Spark jar package that runs directly on Hadoop 2.2.0. Building on that, this article describes how to set up a Spark integrated development environment with Eclipse. I do not recommend using Eclipse to develop Spark programs or to read the Spark source code; IntelliJ IDEA is the better choice, as described in "Apache Spark Quest: Build the development environment with IntelliJ IDEA".

(1) Preparatory work

Before getting started, prepare the following software and hardware:

Software Preparation:

Eclipse Juno (version 4.2), click here to download: Eclipse 4.2

Scala 2.9.3; the Windows installer can be downloaded directly here: Scala 2.9.3

Eclipse Scala IDE plugin, click here to download: Scala IDE (for Scala 2.9.x and Eclipse Juno)

Hardware Preparation:

A machine running a Linux or Windows operating system

(2) Building the Spark integrated development environment

I work on Windows, and the process is as follows:

Step 1: Install Scala 2.9.3 by running the installer directly.

Step 2: Copy all files from the features and plugins directories of the Eclipse Scala IDE plugin into the corresponding directories of the unpacked Eclipse installation.

Step 3: Restart Eclipse, click the perspective button in the upper-right corner of the Eclipse window, then click "Other ..." and check whether a "Scala" item appears. If it does, click it to open the Scala perspective; otherwise proceed to Step 4.

Step 4: In Eclipse, select "Help" –> "Install New Software ...", enter http://download.scala-ide.org/sdk/e38/scala29/stable/site in the dialog that opens, and press Enter. In the list that appears, select the first two items and install them. (Since the jar packages were already copied into Eclipse in Step 2, the installation completes quickly; it is little more than a formality.) After installation, repeat Step 3.

(3) Developing Spark programs in Scala

In Eclipse, select "File" –> "New" –> "Other ..." –> "Scala Wizard" –> "Scala Project" to create a Scala project named "Sparkscala".

Right-click the "Sparkscala" project and select "Properties". In the dialog that pops up, select "Java Build Path" –> "Libraries" –> "Add External JARs ...", and import spark-assembly-0.8.1-incubating-hadoop2.2.0.jar from the assembly/target/scala-2.9.3/ directory produced in the article "Apache Spark Learning: Deploying Spark to Hadoop 2.2.0". This jar package can also be generated by compiling Spark yourself; it is placed in the assembly/target/scala-2.9.3/ directory under the Spark source tree.

In the same way as creating the Scala project, add a Scala class to the project named WordCount.


WordCount is the classic word-frequency program: it counts the total number of occurrences of each word in the files under an input directory. The Scala code is as follows:

import org.apache.spark._
import SparkContext._

object WordCount {
  def main(args: Array[String]) {
    if (args.length != 3) {
      println("Usage is org.test.WordCount <master> <input> <output>")
      return
    }
    val sc = new SparkContext(args(0), "WordCount",
      System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))
    val textFile = sc.textFile(args(1))
    val result = textFile.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    result.saveAsTextFile(args(2))
  }
}
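Before packaging the jar and submitting to YARN, it can help to sanity-check the logic in local mode from inside Eclipse. The following is a minimal sketch of my own, not part of the original article; the object name WordCountLocal, the "local[2]" master, and the file paths are assumptions:

import org.apache.spark._
import SparkContext._

// Quick local-mode check: run the same flatMap/map/reduceByKey pipeline
// inside the local JVM, with no YARN cluster needed.
object WordCountLocal {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[2]", "WordCountLocal")  // assumed master and app name
    val counts = sc.textFile("input.txt")                    // any local text file
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("output")                          // directory must not already exist
    sc.stop()
  }
}

Run it as an ordinary Scala application; the counts are written as part-* files under the output directory, in the same format the YARN job produces.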

In the Scala project, right-click "WordCount.scala", select "Export", and in the pop-up dialog choose "Java" –> "JAR file" to package the program into a jar, which can be named "spark-wordcount-in-scala.jar". The download address of the jar package I exported is spark-wordcount-in-scala.jar.

The WordCount program takes three parameters: the master location, the HDFS input directory, and the HDFS output directory. You can write a run_spark_wordcount.sh script for it:

# Set to the directory holding the YARN configuration files
export YARN_CONF_DIR=/opt/hadoop/yarn-client/etc/hadoop/

SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.2.0.jar \
./spark-class org.apache.spark.deploy.yarn.Client \
  --jar spark-wordcount-in-scala.jar \
  --class WordCount \
  --args yarn-standalone \
  --args hdfs://hadoop-test/tmp/input \
  --args hdfs://hadoop-test/tmp/output \
  --num-workers 1 \
  --master-memory 2g \
  --worker-memory 2g \
  --worker-cores 2

A few things to note: the WordCount program's input parameters are passed with "--args", one parameter per flag. The second parameter is the input directory on HDFS; it must be created in advance, with a few text files uploaded into it so there is something to count. The third parameter is the output directory on HDFS; it is created dynamically and must not exist before the job runs.

Run the run_spark_wordcount.sh script directly to get the results.

During the run, I found a bug: org.apache.spark.deploy.yarn.Client has a "--name" parameter that can be used to specify the application name:

However, in practice this parameter causes the application to block. Looking at the source code, it turned out to be a bug, which has been submitted to the Spark JIRA:

// Location: new-yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala
case ("--queue") :: value :: tail =>
  amQueue = value
  args = tail

case ("--name") :: value :: tail =>
  appName = value
  args = tail  // this line was missing, causing the program to block

case ("--addJars") :: value :: tail =>
  addJars = value
  args = tail

Therefore, either avoid using the "--name" parameter, or fix the bug and recompile Spark.

(4) Developing Spark programs in Java

As with ordinary Java program development, simply add the Spark development package spark-assembly-0.8.1-incubating-hadoop2.2.0.jar as a third-party dependency library.

(5) Summary

My preliminary trial of Spark on YARN found quite a few problems: it is still inconvenient to use and the barrier to entry remains high, far less mature than Spark on Mesos.

This is an original article; if you repost it, please credit: reprinted from Dong's Blog.

Link to this article: http://dongxicheng.org/framework-on-yarn/spark-eclipse-ide/
