This article explains how to build a Spark development environment with Maven in IntelliJ IDEA, step by step, and how to write a simple Scala-based WordCount example for Spark.
1. Preparatory work
First you need to install the JDK, Scala, and the IntelliJ IDEA development tool on your computer. This article uses Windows 7, with the environment configured as follows:
JDK 1.7.0_15
Scala 2.10.4
Scala official download address: http://www.scala-lang.org/download/
On Windows, download the MSI installer.
Both the JDK and Scala installers can be downloaded from their official websites and installed by simply double-clicking them. Note: if you write your Spark code locally and then submit it to a Spark cluster, be sure to keep the development environment consistent with the cluster, or you will run into many errors.
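To confirm that the local versions really match what the cluster expects, you can also print them from code. The sketch below uses only standard JVM and Scala properties; the object name VersionCheck is made up for illustration.

// Minimal version check: prints the JVM and Scala versions of the local environment.
object VersionCheck {
  def main(args: Array[String]): Unit = {
    println("Java version:  " + System.getProperty("java.version"))  // e.g. 1.7.0_15
    println("Scala version: " + scala.util.Properties.versionString) // e.g. version 2.10.4
  }
}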
IntelliJ IDEA
Download the Community edition from the lower right corner of the official website: https://www.jetbrains.com/idea/download/#section=windows
2. Install the Scala plugin in IntelliJ IDEA
Install IntelliJ IDEA and open the main welcome screen.
(1) Find the Configure option in the lower right corner and open Plugins.
(2) Click Browse repositories... in the lower left corner.
(3) Type Scala into the search box and the Scala plugin will appear. Mine is already installed; if yours is not, an Install button and the matching version are shown. Installing the plugin online is not recommended here. Instead, download the offline Scala plugin that matches your IDEA build by its update date; for example, the IDEA used in this article was updated on 2014/12/18, and the corresponding plugin version is 1.2.1, so that is the one to download. The offline download address for the Scala plugin is below.
Scala plugin offline download address: https://plugins.jetbrains.com/plugin/1347-scala
Find the Scala plugin that corresponds to your IntelliJ IDEA build by its update date. Different IDEA versions require different Scala plugin versions, so be sure to download the matching one, otherwise IDEA will not recognize it.
(4) After the offline plugin download completes, add it to IDEA as follows: click Install plugin from disk..., locate the plugin's zip file on your local disk, and click OK.
At this point, the Scala plugin installation in IntelliJ IDEA is complete. Next, use IDEA to create a Maven project and build the Spark development environment.
3. IntelliJ IDEA builds a Spark environment with Maven
(1) Open IDEA and create a new Maven project, as shown below:
Note: follow my steps in order.
Note: if this is your first time using Maven to build a Scala Spark development environment, you will be asked to select a Scala SDK and a Module SDK; point them to the path where Scala is installed and to the JDK path, respectively.
(2) Fill in the GroupId and ArtifactId. I simply made up a name here, as shown below, then click Next.
(3) The third step is important. First, your IntelliJ IDEA needs Maven. Newer versions ship with a bundled Maven, whose directory can be found under plugins in the IDEA installation path; fill the Maven home directory field with that path. The IDEA version used in this article is older, so Maven was installed separately (if you do not know how, just search for it, it is simple; using a newer IDEA is recommended so you do not need to download Maven yourself). The User settings file is the settings.xml under the conf directory of your Maven path; check Override next to it. The Local repository path can be left at the default, or you can point it to a new directory. Click Next.
Note: I forgot this when taking the screenshot: also check Override in front of Local repository, otherwise the build will fail, at least it did for me.
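For reference, the local repository location lives in that same settings.xml; a minimal sketch of the relevant fragment is shown below, and the path in it is only an example, not a required value.

<!-- conf/settings.xml fragment; the repository path is only an example -->
<settings>
    <localRepository>D:/maven/repository</localRepository>
</settings>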
(4) Fill in your project name; any name will do. Click Finish.
(5) The setup is now complete, and the following interface is displayed when it finishes:
Click the Import prompt in the upper right corner.
(6) Next, add the dependencies required by the Spark environment to the pom.xml file. They are given as code below so they are easy to copy.
Here is my pom file; trim or add dependencies according to your own needs.
Note: the versions must correspond. The Spark version here is 1.6.0, which matches Scala 2.10, because the Spark dependency is resolved through spark-core_${scala.version}. A colleague followed this guide a few days ago and the Spark dependency kept failing to resolve because of a version mismatch, so please double-check your own versions.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.xudong</groupId>
    <artifactId>xudong</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <spark.version>1.6.0</spark.version>
        <scala.version>2.10</scala.version>
    </properties>
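The pom above is truncated before the dependencies section. Based on the spark-core_${scala.version} coordinate mentioned in the note, a minimal sketch of that section (for Spark 1.6.0 and Scala 2.10; trim or extend it to your own needs) would look roughly like this:

    <dependencies>
        <!-- Spark core; the artifactId suffix must match the Scala version (2.10 here) -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>
</project>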
Here are a few small issues to keep in mind:
The project needs src/main/scala and src/test/scala. Create these two folder paths in the corresponding project directory; if they are missing, you will get an error (see the layout sketch below).
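For clarity, the expected layout (pom.xml at the project root, next to src) looks roughly like this:

project-root/
├── pom.xml
└── src/
    ├── main/
    │   └── scala/
    └── test/
        └── scala/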
At this point, the Scala-based Spark development environment is essentially done. Next, use Scala to write a simple Spark example, the WordCount program; anyone who has written MapReduce will find it very familiar.
4. Spark Simple Example: WordCount
In the src/main/scala folder, right-click, choose New > Package, and enter the package name (com.xudong here). Then create a new Scala class, enter its name, and change the kind to Object, as shown below:
Addendum:
If the Scala SDK was not added to the project at first, there will be no option to create a new Scala class. In that case, create a new file with any name and change its suffix to .scala. When you click OK, a message in the editor will say that no Scala SDK is configured; click that message to add the locally installed Scala SDK (Scala is already installed on your PC, so it is detected automatically). After that, the New Scala Class option appears and you can create the class directly.
After creating it, write the WordCount code as follows (with explanatory comments):
package com.xudong

import org.apache.spark.{SparkContext, SparkConf}

/**
 * Created by Administrator on 2017/4/20.
 * Xudong
 */
object WordCountLocal {
  def main(args: Array[String]) {
    /**
     * Initializing a SparkContext requires a SparkConf object.
     * SparkConf holds the various configuration parameters of the Spark cluster.
     */
    val conf = new SparkConf()
      .setMaster("local")    // start a local computation
      .setAppName("testRdd") // set the application name
    // A Spark program starts with a SparkContext
    val sc = new SparkContext(conf)
    // The statement above is equivalent to: val sc = new SparkContext("local", "testRdd")
    val data = sc.textFile("e://hello.txt") // read a local file
    data.flatMap(_.split(" ")) // the underscore is a placeholder; flatMap splits each line that was read in
      .map((_, 1))             // turn each word into a key-value pair: the word is the key, the value is 1
      .reduceByKey(_ + _)      // combine pairs that have the same key
      .collect()               // bring the distributed RDD back to the driver as a Scala array
      .foreach(println)        // print each result
  }
}
Create a test data set hello.txt; its content is shown below:
Start the local Spark program; when it finishes, the results are printed and can be viewed in the console:
If the results print correctly, the Spark example ran successfully.
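If there is no hello.txt on disk yet, the same pipeline can also be fed from an in-memory collection to verify the environment. The sketch below reuses the sc from the example above, and the two input lines are made-up sample data, purely for illustration.

// Same WordCount pipeline, fed from an in-memory collection instead of a local file.
val lines = sc.parallelize(Seq("hello spark", "hello scala"))
lines.flatMap(_.split(" "))
  .map((_, 1))          // (word, 1) pairs
  .reduceByKey(_ + _)   // sum the counts per word
  .collect()
  .foreach(println)     // prints pairs such as (hello,2), (spark,1), (scala,1)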
At this point, building the Spark development environment with Maven in IntelliJ IDEA is completely finished. If you have questions, or if something in this article is wrong, please point it out; it would be much appreciated.
As for how to package a locally written Spark program, submit it to a Spark cluster, and run it there, a later blog post will cover that.