Because Spark is written in Scala, Scala is its native language, so this introduction to setting up a Spark environment is based on Scala. It consists of four steps: installing the JDK, installing Scala, installing Spark, and downloading and configuring Hadoop. To live up to the "from scratch" promise in the title, the steps below are somewhat verbose; experienced readers need not read everything and can simply skip ahead.
One. Installing the JDK and setting environment variables
1.1 JDK installation
The JDK (full name: Java Platform, Standard Edition Development Kit) is downloaded from the Java SE Downloads page. The page shows the latest JDK version by default; as shown, the current version is JDK 8, and the specific download page is Java SE Development Kit 8 Downloads:
The two places marked in red are clickable; clicking them shows more detailed information about the latest version, as shown:
First of all, there are two versions, 8u101 and 8u102. The official explanation is:
"Java SE 8u101 includes important security fixes. Oracle strongly recommends that all Java SE 8 users upgrade to this release. Java SE 8u102 is a patch-set update, including all of 8u101 plus additional features (described in the release notes)."
In other words, Oracle recommends that all developers upgrade from earlier versions to JDK 8u101, while JDK 8u102 contains everything in 8u101 plus some additional features. Either version is fine; for ordinary developers there is no significant difference. I am using JDK 8u101.
Select the 8u101 version and then the build for your platform. My machine is 64-bit, so I chose the Windows 64-bit version. Remember that before downloading you must accept the license agreement above the download list (circled in red).
Besides the latest version, historical JDK versions can be downloaded from the Oracle Java Archive, but Oracle recommends them for testing purposes only.
Installing the JDK on Windows is simple: double-click the downloaded EXE file as with any other installer and choose your installation directory (you will need this directory when setting the environment variables).
1.2 Setting of environment variables
Next, set the environment variables. Right-click "Computer" on the desktop and choose "Properties", then "Advanced system settings"; in the System Properties dialog, select the "Advanced" tab and click "Environment Variables". Find the "Path" variable under System variables and click "Edit". In the dialog box that appears, append the path of the bin folder under the JDK directory installed in the previous step. Mine is F:\Program Files\Java\jdk1.8.0_101\bin, so I add that to Path, separated from the existing entries by an English (half-width) semicolon ";". Once this is set, open a cmd command-line window in any directory and run
java -version
If it prints the Java version information, the JDK installation is complete.
The whole process is shown below (the system variables for the software installed later are set in the same way):
1.3 A digression
A few off-topic words follow; readers who are not interested can skip this section without affecting the later installation steps.
Anyone who installs software has run into environment variables and system variables, so let's dig into what the headache-inducing PATH, CLASSPATH, and JAVA_HOME actually mean.
1.3.1 Environment variables, system variables, and user variables
- Environment variables comprise system variables and user variables.
- System variable settings apply to all users of the operating system.
- User variable settings apply only to the current user.
If you are not familiar with these concepts, read the following subsections first and then come back to these three terms.
1.3.2 PATH
This is the system variable set in the previous step. It tells the operating system where to find java.exe. When you suddenly type the following command into the command-line window,
java -version
the operating system is baffled at first: what on earth does "java" mean? Grumbling aside, the work still has to be done, so it calmly recalls three rules from Papa Gates:
- When you meet a command in the command-line window that you do not recognize, first look in the current directory for an .exe program with that name. If it is there, run it.
- If not, don't give up: look through the directories listed in the PATH system variable. If you find it there, run it.
- If it is in neither place, give in gracefully and report an error.
So by adding the bin folder under the JDK installation directory to the PATH system variable, we tell the operating system: if you cannot find java.exe in the current directory, go through the paths in the PATH system variable one by one until you find it. Why add the bin folder rather than the root of the JDK installation? Because java.exe is not in the root directory; it lives only in the bin folder.
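The three rules above can be sketched in a few lines of Python. This is only a simplified model, not how cmd is actually implemented: the function name resolve_command is made up for illustration, and only a ".exe" file is looked for.

```python
import os
import tempfile


def resolve_command(name, current_dir, path_dirs):
    """Return the full path of name + '.exe', or None if it is not found.

    Simplified model: the current directory is checked first, then each
    PATH entry from left to right.
    """
    exe_name = name + ".exe"
    for directory in [current_dir] + list(path_dirs):
        candidate = os.path.join(directory, exe_name)
        if os.path.isfile(candidate):
            return candidate
    return None  # this is the case where cmd would report an error


if __name__ == "__main__":
    # Throwaway layout that mimics a JDK bin folder containing java.exe.
    with tempfile.TemporaryDirectory() as root:
        work_dir = os.path.join(root, "work")
        jdk_bin = os.path.join(root, "jdk", "bin")
        os.makedirs(work_dir)
        os.makedirs(jdk_bin)
        open(os.path.join(jdk_bin, "java.exe"), "w").close()

        # bin is not on the search path yet, so nothing is found
        print(resolve_command("java", work_dir, []))
        # once bin is on the search path, java.exe is found there
        print(resolve_command("java", work_dir, [jdk_bin]))
```

The first call prints None, which corresponds to the error case; the second finds java.exe because the bin folder is on the search path.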
If you only want to run the java command in the command-line window, you can also leave the system variable unset, but then every time you run java you have to type the full path of java.exe, as shown below.
C:\Users\weizierxu>F:\Program Files\Java\jdk1.8.0_101\bin\java.exe -version
'F:\Program' is not recognized as an internal or external command, operable program or batch file.
Note: the error here does not mean that specifying the full path of java.exe is wrong. The command line cannot parse a path that contains spaces, so the part with spaces must be wrapped in double quotes, as follows:
C:\Users\weizierxu>"F:\Program Files"\Java\jdk1.8.0_101\bin\java.exe -version
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
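If you build such command lines in a script, the quoting rule can be captured in a tiny helper. This is a minimal sketch: quote_for_cmd is a hypothetical name, and it handles only the space case discussed here, not every special character that cmd treats specially.

```python
def quote_for_cmd(path):
    """Wrap a path in double quotes when it contains a space."""
    return '"{}"'.format(path) if " " in path else path


if __name__ == "__main__":
    java_exe = r"F:\Program Files\Java\jdk1.8.0_101\bin\java.exe"
    # The path contains a space, so the whole path is quoted.
    print(quote_for_cmd(java_exe) + " -version")
    # No space in this path, so it is returned unchanged.
    print(quote_for_cmd(r"D:\Spark\bin"))
```

Quoting the whole path, as done here, works the same way as quoting only the "F:\Program Files" part.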
1.3.3 CLASSPATH
CLASSPATH tells Java which directories to search for class files when it runs a compiled program. For example, if your program uses a jar package (a jar is an archive of compiled class files), Java needs to find that jar at run time. Where does it look? In the directories listed in CLASSPATH, searched from left to right (entries are separated by semicolons) until a class file with the requested name is found; if none is found, an error is reported. A small experiment makes this concrete.
First, in the F:\Program Files\Java directory, I used Windows Notepad to write a Hello World-like program and saved it as testClassPath.java (note that the extension must be changed to .java), as follows:
public class testClassPath {
    public static void main(String[] args) {
        // Print a fixed message so a successful run is easy to spot.
        System.out.println("Hello, this is a test on CLASSPATH!");
    }
}
I then switched cmd's current directory to F:\Program Files\Java (with the cd command) and compiled the .java file with the javac command, as shown:
As you can see, the javac command works normally (no output means the compilation succeeded). This is because javac.exe also lives in the bin directory under the JDK installation path, and that directory has already been added to the PATH system variable, so cmd recognizes the command. A testClassPath.class file now appears in the F:\Program Files\Java directory. But when I run this class file, I get an error. This is where CLASSPATH comes in: in the same way the PATH system variable was set in section 1.2, append ;. to the end of CLASSPATH (if CLASSPATH is not in the list of system variables, click New and add it). The English semicolon separates the new entry from the preceding path, and the dot . stands for the current directory.
Remember to open a new cmd window this time, then use the cd command to switch to the directory containing testClassPath.class and run it again; now the result is printed successfully.
Hello, this is a test on CLASSPATH!
So, unlike with the PATH variable, once CLASSPATH is set Java does not look in the current directory by default when running a class file; it searches only the directories listed in CLASSPATH (the current directory is the default only when no class path is set at all). If one of those directories contains the class file, it is executed; otherwise an error is reported. Here Java does look in the current directory, because it was added to the CLASSPATH system variable through the . entry.
The method above writes CLASSPATH directly into the system variables, which can cause interference. For example, suppose class files with the same name exist in several paths and all of those paths are in the CLASSPATH system variable. Because Java scans the CLASSPATH entries from left to right, running java testClassPath executes the class file found in the leftmost matching path, which may not be the one we want. That is why IDEs such as Eclipse do not require you to set the CLASSPATH system variable manually; they set a class path specific to the current program, so other programs are not affected.
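The left-to-right scan, and the way a leftmost entry shadows later ones, can be modeled with a short Python sketch. It is only an approximation of what the JVM does: find_class_file is a made-up name, only directory entries are handled (no jars), and packages are ignored.

```python
import os
import tempfile


def find_class_file(class_name, classpath, separator=";"):
    """Return the first matching .class file, scanning entries left to right."""
    for entry in classpath.split(separator):
        if not entry:
            continue  # skip empty entries such as a trailing separator
        candidate = os.path.join(entry, class_name + ".class")
        if os.path.isfile(candidate):
            return candidate
    return None  # this is the case where Java would report an error


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        first = os.path.join(root, "first")
        second = os.path.join(root, "second")
        for directory in (first, second):
            os.makedirs(directory)
            open(os.path.join(directory, "testClassPath.class"), "w").close()

        # Both directories hold a class with the same name; the leftmost wins.
        print(find_class_file("testClassPath", first + ";" + second))
        print(find_class_file("testClassPath", second + ";" + first))
```

Swapping the order of the two entries changes which file is picked, which is exactly the interference described above.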
1.3.4 JAVA_HOME
JAVA_HOME is not required by Java itself; third-party tools use it to configure themselves. Its only purpose is to tell such software: my JDK is installed in this directory, so if you need my Java programs, come and look here. JAVA_HOME is therefore simply the JDK installation path. For example, my JDK is installed in F:\Program Files\Java\jdk1.8.0_101 (the bin directory under it is the value added to the PATH system variable, as discussed in section 1.3.2), so the value of JAVA_HOME is F:\Program Files\Java\jdk1.8.0_101. Similar system variables whose names end in HOME will appear later; each holds the installation directory of the corresponding software.
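How a third-party tool might use JAVA_HOME can be sketched as follows. This is an illustration under assumptions, not the logic of any particular tool: java_from_home is a hypothetical helper, and the environment is passed in as a plain dictionary so that the example runs on any platform.

```python
import ntpath  # Windows path rules, usable on any platform


def java_from_home(env):
    """Return the java.exe path a tool could derive from JAVA_HOME, or None."""
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return None  # the tool would have to fall back to PATH or give up
    return ntpath.join(java_home, "bin", "java.exe")


if __name__ == "__main__":
    env = {"JAVA_HOME": r"F:\Program Files\Java\jdk1.8.0_101"}
    print(java_from_home(env))
    # prints F:\Program Files\Java\jdk1.8.0_101\bin\java.exe
```

The tool only needs the installation root; it appends bin itself, which is why JAVA_HOME points at the JDK directory rather than at its bin folder.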
Two. Scala's installation
First download the appropriate version from the Download Previous Versions page. Note that each Spark version requires a matching Scala version: the Spark 1.6.2 used here works only with Scala 2.10.x, while the newer Spark 2.0 works only with Scala 2.11.x, so pay attention to this correspondence when downloading. I am using Scala 2.10.6, which suits Spark versions from 1.3.0 to 1.6.2. On the Download Previous Versions page, select the version you need to go to its download page, as shown; remember to download the binary version of Scala by clicking where the arrow points in the figure:
After downloading the Scala MSI file, double-click it to install. On successful installation, Scala's bin directory is added to the PATH system variable by default (if it is not, add the bin directory under the Scala installation directory to PATH, as in the JDK installation step). To verify the installation, open a new cmd window, type scala, and press Enter; if you enter Scala's interactive command environment, the installation succeeded, as shown below:
If no version information is displayed and Scala's interactive command line does not start, there are usually two possible causes:
- The path of the bin folder under the Scala installation directory was not added correctly to the PATH system variable; add it as described in the JDK installation.
- Scala was not installed correctly; repeat the steps above.
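The Spark/Scala pairing mentioned at the start of this section can be written down as a small lookup. The sketch encodes only the pairs stated in this article (Spark 1.3.0 to 1.6.2 with Scala 2.10, Spark 2.0 with Scala 2.11) and returns None for everything else; scala_line_for_spark is a made-up name, and other Spark releases should be checked against their own documentation.

```python
def scala_line_for_spark(spark_version):
    """Return the Scala release line this article pairs with a Spark version."""
    parts = tuple(int(p) for p in spark_version.split("."))
    if (1, 3, 0) <= parts <= (1, 6, 2):
        return "2.10"
    if parts[:2] == (2, 0):
        return "2.11"
    return None  # not covered by this article


if __name__ == "__main__":
    for version in ("1.6.2", "2.0.0", "1.2.0"):
        print(version, "->", scala_line_for_spark(version))
```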
Three. Installation of Spark
Installing Spark is simple: go directly to the Download Apache Spark page. There are two steps:
- Select the Spark version and the package type that matches your Hadoop version, as shown;
- Then click the spark-1.6.2-bin-hadoop2.6.tgz link indicated by the arrow and wait for the download to finish.
This is a pre-built version, meaning it is already compiled and can be used directly after download. Spark's source code can also be downloaded, but it must be compiled manually before use. After the download completes, extract the file (you may need to extract twice, since a .tgz is a gzip-compressed tar archive). It is best to extract to the root of a drive and rename the folder to Spark, which is simple and less error-prone. Also note that Spark's directory path must not contain spaces; folder names like "Program Files" are not allowed.
After extraction, Spark can basically be run from the cmd command line. At this point, however, every time you run spark-shell (Spark's interactive command-line shell) you first have to cd into Spark's installation directory, which is tedious, so add Spark's bin directory to the PATH system variable. For example, my Spark bin directory is D:\Spark\bin, so I add that path to PATH in the same way as for the JDK. Once the system variable is set, running the spark-shell command from cmd in any directory opens Spark's interactive command-line mode.
Four. Hadoop download
With the system variable set, you can run spark-shell from cmd in any directory, but at this point you may run into various errors, mainly because Spark depends on Hadoop, so a Hadoop runtime environment must be configured as well. The historical versions of Hadoop are listed on the Hadoop Releases page. Because the downloaded Spark is built for Hadoop 2.6 (in the first step of the Spark installation we chose Pre-built for Hadoop 2.6), I chose version 2.6.4 here. Select the appropriate version and click through to its download page, as shown, then download the file marked in red. The src package is the source code, for those who need to modify Hadoop or want to compile it themselves; I downloaded the compiled version, the hadoop-2.6.4.tar.gz file in the figure.
Download and extract it to a directory of your choice. Then, in the environment variables dialog, set HADOOP_HOME to the directory Hadoop was extracted to (mine is F:\Program Files\hadoop) and add the bin directory under it to the PATH system variable (mine is F:\Program Files\hadoop\bin). If you have already added the HADOOP_HOME system variable, you can also write %HADOOP_HOME%\bin for the bin folder path. Once both system variables are set, open a new cmd window and enter the spark-shell command directly.
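To see what a PATH entry such as %HADOOP_HOME%\bin turns into, here is a rough Python model of the substitution. It is a sketch only: expand_percent_vars is a hypothetical helper, variable names are matched case-sensitively here (Windows itself ignores case), and unknown names are simply left as they are.

```python
import re


def expand_percent_vars(text, env):
    """Replace %NAME% with env[NAME]; unknown names are left untouched."""
    return re.sub(r"%([^%]+)%", lambda m: env.get(m.group(1), m.group(0)), text)


if __name__ == "__main__":
    env = {"HADOOP_HOME": r"F:\Program Files\hadoop"}
    print(expand_percent_vars(r"%HADOOP_HOME%\bin", env))
    # prints F:\Program Files\hadoop\bin
```

The benefit of the %HADOOP_HOME%\bin form is that moving Hadoop later only requires changing HADOOP_HOME, not PATH.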
Normally this runs successfully and enters Spark's command-line environment, but some users may hit a null pointer error (NullPointerException). The main cause is that Hadoop's bin directory does not contain the winutils.exe file. The fix is:
- Go to https://github.com/steveloughran/winutils, select the Hadoop version you installed, enter its bin directory, and find the winutils.exe file. To download it, click the winutils.exe file and then click the Download button in the upper right part of the page.
- After downloading winutils.exe, put the file into Hadoop's bin directory; mine is F:\Program Files\hadoop\bin.
- Enter the following in an open cmd window:
"F:\Program Files\hadoop\bin\winutils.exe" chmod 777 /tmp/hive
This command modifies the permissions. Because the path contains a space, it is wrapped in double quotes. Note that F:\Program Files\hadoop\bin must be replaced with the actual location of the bin directory in your installation.
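Before starting spark-shell, a quick script can confirm that winutils.exe really is where Hadoop expects it. This is a minimal sketch with made-up helper names (winutils_path, has_winutils); it checks only that the file exists, not that it matches your Hadoop version.

```python
import os
import tempfile


def winutils_path(hadoop_home):
    """Return the expected location of winutils.exe under a Hadoop directory."""
    return os.path.join(hadoop_home, "bin", "winutils.exe")


def has_winutils(hadoop_home):
    """True when winutils.exe is present in the bin folder of hadoop_home."""
    return os.path.isfile(winutils_path(hadoop_home))


if __name__ == "__main__":
    # Throwaway directory standing in for the Hadoop installation.
    with tempfile.TemporaryDirectory() as hadoop_home:
        os.makedirs(os.path.join(hadoop_home, "bin"))
        print(has_winutils(hadoop_home))  # False: nothing has been copied yet
        open(winutils_path(hadoop_home), "w").close()
        print(has_winutils(hadoop_home))  # True: the file is now in bin
```

In real use you would pass the value of your HADOOP_HOME variable instead of the temporary directory.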
After these steps, open a new cmd window again; if all is well, you should be able to start Spark by typing spark-shell directly.
The normal operating interface should look like the following:
As you can see, after the spark-shell command is entered, Spark starts and prints some log output, most of which can be ignored. Two lines deserve attention:
Spark context available as sc.
SQL context available as sqlContext.
The difference between the Spark context and the SQL context will be covered later. For now, just remember that Spark has really started successfully only when both of these lines appear.
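If you capture the startup output to a file, the success check can be automated. The sketch below assumes the two messages read "Spark context available as sc." and "SQL context available as sqlContext.", which is the wording printed by the Spark 1.6 shell; spark_shell_started is a hypothetical helper name, and other Spark versions word these messages differently.

```python
# Startup messages assumed for the Spark 1.6 shell.
EXPECTED_LINES = (
    "Spark context available as sc.",
    "SQL context available as sqlContext.",
)


def spark_shell_started(log_text):
    """True only when both startup messages appear in the captured output."""
    return all(line in log_text for line in EXPECTED_LINES)


if __name__ == "__main__":
    sample = "\n".join([
        "Spark context available as sc.",
        "SQL context available as sqlContext.",
    ])
    print(spark_shell_started(sample))                            # True
    print(spark_shell_started("Spark context available as sc."))  # False
```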
Five. PySpark under Python
Python also has a counterpart to Scala's spark-shell: pyspark, likewise an interactive command-line tool for simple debugging and testing on Spark. If you need to install Python, I recommend Python(x,y): it bundles most common toolkits, so they can be imported directly without separate downloads, and it spares you the tedious environment variable configuration. The download page is Python(x,y) - Downloads; when the download completes, double-click the installer to run it. Because this tutorial focuses on Scala, Python is not covered in much detail.
The pyspark executable is in the same directory as spark-shell, so after extracting Spark as described above, you can run the pyspark command directly in a cmd window to start the Python debugging environment.
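However, to use PySpark from plain Python or from an IDE such as IntelliJ IDEA or PyCharm, you need to create a new system variable named PYTHONPATH with the following value (this assumes a SPARK_HOME system variable that points to the Spark installation directory; the py4j file name must match the one in your Spark's python\lib folder):
PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-0.10.4-src.zip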
Once this is set, I recommend PyCharm as the IDE (IntelliJ IDEA's settings are tedious, and I had no patience for configuring a pile of parameters, haha).
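Since the py4j version in the file name differs between Spark releases, the PYTHONPATH value can be derived from the Spark directory instead of typed by hand. The following is a sketch under assumptions: build_pythonpath is a made-up helper, and it expects the py4j source zip to sit in the python\lib folder of the Spark directory, as in the value shown above.

```python
import glob
import os
import tempfile


def build_pythonpath(spark_home, separator=";"):
    """Return a PYTHONPATH value for PySpark, or None if no py4j zip is found."""
    python_dir = os.path.join(spark_home, "python")
    matches = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    if not matches:
        return None
    return separator.join([python_dir, matches[0]])


if __name__ == "__main__":
    # Throwaway directory standing in for the Spark installation.
    with tempfile.TemporaryDirectory() as spark_home:
        lib_dir = os.path.join(spark_home, "python", "lib")
        os.makedirs(lib_dir)
        open(os.path.join(lib_dir, "py4j-0.10.4-src.zip"), "w").close()
        print(build_pythonpath(spark_home))
```

In real use you would pass your Spark installation directory and copy the printed value into the PYTHONPATH system variable.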
Six. Summary
At this point you have a basic local Spark debugging environment, which is sufficient for getting started with Spark. For real Spark development, however, this setup is not enough; a more capable IDE is needed to support the development process. The next installment will cover the configuration of IntelliJ IDEA and Maven.
Seven. Tips
- A lesson learned the hard way: never leave any spaces in a software installation path.
- If you cannot find winutils.exe for Hadoop 2.7.2 online, the winutils.exe for 2.7.1 can be used directly and still works.
(Updated 2017.06.14)
Reference
- PATH and CLASSPATH (Oracle's official explanation of PATH and CLASSPATH; recommended)
- Difference among JAVA_HOME, JRE_HOME, CLASSPATH and PATH
- The roles of CLASSPATH, PATH, and JAVA_HOME in Java
- Why does starting spark-shell fail with NullPointerException on Windows? (on resolving the NullPointerException encountered when starting spark-shell)
Spark is built under Windows environment