Update apt
After logging in as the hadoop user, we first update apt. We will use apt to install software later, and some packages may fail to install if the package index is out of date. Press Ctrl+Alt+T to open a terminal window and execute the following command:
sudo apt-get update
If the following "hash check and inconsistent" prompt, you can change the software source to resolve. If you do not have the problem, you do not need to change it. In the process of downloading some software from a software source, it is recommended that you change the source of the software because of the inability to download it for network reasons. During the course of learning Hadoop, the installation of Hadoop is not affected even if the "hash check and mismatch" prompt appears.
Install Vim
Later steps require editing several configuration files, so we recommend installing Vim (you can also use gedit; wherever this tutorial uses vim, simply substitute gedit to edit the file with a graphical text editor, and close the entire gedit program each time you finish editing, otherwise it will keep occupying the terminal):
sudo apt-get install vim
If you are asked for confirmation during installation, enter Y at the prompt.
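If you prefer not to answer the prompt each time, apt-get also accepts a -y option that confirms automatically, for example:
sudo apt-get install -y vim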
Vim Simple Operation Guide
Vim has several modes: command mode, insert mode, visual mode, and normal mode. In this tutorial you only need normal mode and insert mode; switching between the two is enough to complete this guide.
Normal mode: used mainly for browsing the text. Vim starts in normal mode, and pressing Esc in any mode returns you to it.
Insert mode: used to add content to the text. In normal mode, press i to enter insert mode.
Exiting Vim: if you have modified any text with Vim, remember to save it. Press Esc to return to normal mode, then type :wq to save the file and exit Vim.
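A typical editing session for this tutorial therefore looks like this (using ~/.bashrc, a file we edit later, as the example):
vim ~/.bashrc     # vim opens the file in normal mode
# press i to enter insert mode and make your changes
# press Esc to return to normal mode, then type :wq and press Enter to save and quit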
Install SSH and configure passwordless SSH login
Both cluster and single-node modes require SSH login (similar to remote login: you can log in to a Linux host and run commands on it). Ubuntu has the SSH client installed by default; we also need to install the SSH server:
sudo apt-get install openssh-server
After installation, you can use the following command to log on to the machine:
ssh localhost
At this point you will see a prompt (the SSH first-login confirmation); enter yes. Then enter the hadoop user's password as prompted, and you will be logged in to this machine.
However, logging in this way requires entering the password every time, so it is more convenient to configure passwordless SSH login.
First exit the SSH session we just opened and return to the original terminal window, then use ssh-keygen to generate a key and add it to the authorized keys:
exit                           # exit the ssh localhost session we just opened
cd ~/.ssh/                     # if this directory does not exist, run ssh localhost once first
ssh-keygen -t rsa              # there will be prompts; just press Enter for each
cat ./id_rsa.pub >> ./authorized_keys   # add the key to the authorized keys
The ~ in Linux represents the user's home folder, i.e. the /home/username directory; for example, if your username is hadoop, then ~ represents /home/hadoop/. In addition, the text after # in a command is a comment; you only need to enter the part before it.
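For example, the following commands illustrate both points (safe to try in any terminal):
echo ~             # prints /home/hadoop when you are logged in as the hadoop user
cd ~/.ssh/         # equivalent to: cd /home/hadoop/.ssh/   (the text after # is ignored)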
At this point, running the ssh localhost command again will log you in directly, without entering a password.
Installing the Java Environment
For the Java environment, either Oracle's JDK or OpenJDK can be used; recent Hadoop versions work fine with OpenJDK 1.7 or later. Install OpenJDK 8 directly with the command below:
sudo apt-get install openjdk-8-jre openjdk-8-jdk
After installing the OpenJDK, you need to locate the appropriate installation path, which is used to configure the JAVA_HOME environment variable. Execute the following command:
dpkg -L openjdk-8-jdk | grep '/bin/javac'
The command outputs a path; remove the trailing "/bin/javac" and what remains is the path we need. For example, if the output is /usr/lib/jvm/java-8-openjdk-amd64/bin/javac, then the path we need is /usr/lib/jvm/java-8-openjdk-amd64.
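If you prefer, the same path can usually be obtained in a single step; this is a sketch that assumes javac was installed by the package above and is reachable through the standard /usr/bin/javac alternatives symlink:
readlink -f /usr/bin/javac | sed "s:/bin/javac::"    # resolve the javac symlink and strip the trailing /bin/javac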
Next we configure the JAVA_HOME environment variable; for convenience, we set it in ~/.bashrc:
vim ~/.bashrc
Add the following as a single line at the very beginning of the file (note that there must be no spaces around the = sign), replacing the path with the JDK installation path obtained from the command above, and save the file:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Then make the environment variable take effect by executing the following command:
source ~/.bashrc    # make the variable setting take effect
When set up, let's check to see if it's set correctly:
echo $JAVA_HOME                 # check the value of the variable
java -version
$JAVA_HOME/bin/java -version    # same as running java -version directly
If everything is set correctly, $JAVA_HOME/bin/java -version will print the Java version information, and its output will be the same as that of java -version.
In this way, the Java runtime environment required for Hadoop is installed.
Installing Hadoop 2
Hadoop 2 can be downloaded from http://mirror.bit.edu.cn/apache/hadoop/common/ or http://mirrors.cnnic.cn/apache/hadoop/common/. Generally choose the latest stable release, i.e. the file named hadoop-2.x.y.tar.gz under "stable", which is already compiled; the other file whose name contains src is the Hadoop source code, which must be compiled before use.
As of December 9, 2015, the Hadoop official site had been updated to version 2.7.1. You can still follow this tutorial with Hadoop 2.6.0 or later, so feel free to download the latest version from the official site. If you installed Ubuntu in a virtual machine, open this guide in a browser inside the Ubuntu virtual machine and click the link below to download the Hadoop file into Ubuntu. Do not download it with a browser on the Windows host: the file would end up on the Windows system, and Ubuntu inside the virtual machine cannot access files on the host Windows system, which causes unnecessary trouble. If you installed Ubuntu as a dual-boot system, switch to Ubuntu, open this guide in Firefox on Ubuntu, and click the link below to download: hadoop-2.7.1
After downloading the Hadoop file, it can usually be used directly. However, if the network is poor the download may be incomplete, so you can verify the file's integrity with MD5 or a similar checksum tool. Download hadoop-2.x.y.tar.gz.mds from the official site; this file contains the checksum used to verify the integrity of hadoop-2.x.y.tar.gz. If the archive is damaged or incomplete, Hadoop will not run properly. The files in this tutorial are downloaded through a browser and saved by default in the Downloads directory (otherwise, adjust the directory used in the tar commands yourself). In addition, this tutorial uses version 2.6.0; if you are using a different version, change every occurrence of 2.6.0 in the commands to your version.
cat ~/Downloads/hadoop-2.6.0.tar.gz.mds | grep 'MD5'    # list the MD5 checksum
head -n 6 ~/Downloads/hadoop-2.7.1.tar.gz.mds           # the 2.7.1 .mds file format changed; output it this way instead
md5sum ~/Downloads/hadoop-2.6.0.tar.gz | tr "a-z" "A-Z" # compute the MD5 value and convert it to uppercase for easy comparison
If the file is incomplete, the two values will usually differ greatly; a quick comparison of the first few and last few characters is enough. If the two values do not match, be sure to download the file again.
We choose to install Hadoop into /usr/local/:
sudo tar -zxf ~/Downloads/hadoop-2.6.0.tar.gz -C /usr/local   # extract into /usr/local
cd /usr/local/
sudo mv ./hadoop-2.6.0/ ./hadoop       # rename the folder to hadoop
sudo chown -R hadoop ./hadoop          # change the file ownership to the hadoop user
Hadoop can be used after decompression. Enter the following command to check if Hadoop is available, and success will display the Hadoop version information:
cd /usr/local/hadoop
./bin/hadoop version
Relative and absolute paths: be sure to note the difference between relative and absolute paths in the commands that follow. Paths such as ./bin/... and ./etc/... that start with ./ are relative paths, with /usr/local/hadoop as the current directory. For example, executing ./bin/hadoop version in the /usr/local/hadoop directory is equivalent to executing /usr/local/hadoop/bin/hadoop version. You can change a relative path to an absolute path, but if you run ./bin/hadoop version in the home folder ~, what actually gets executed is /home/hadoop/bin/hadoop version, which is not what we want.
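For example, the following two commands produce the same result only because the first one is executed from /usr/local/hadoop:
cd /usr/local/hadoop
./bin/hadoop version                    # relative path, resolved against the current directory
/usr/local/hadoop/bin/hadoop version    # absolute path, works from any directory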
Hadoop stand-alone configuration (non-distributed)
Hadoop's default mode is non-distributed mode (local mode), which runs without any additional configuration. In non-distributed mode Hadoop runs as a single Java process, which is convenient for debugging.
Now we can run an example to get a feel for Hadoop. Hadoop ships with a rich set of examples (run ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar to see them all), including wordcount, terasort, join, grep, and so on.
Here we choose to run the grep example: we use all the files in the input folder as input, count the number of occurrences of words matching the regular expression dfs[a-z.]+, and finally write the results to the output folder.
cd /usr/local/hadoop
mkdir ./input
cp ./etc/hadoop/*.xml ./input   # use the configuration files as input
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep ./input ./output 'dfs[a-z.]+'
cat ./output/*                  # view the results
After successful execution, job information is printed, and the output shows that the word dfsadmin, which matches the regular expression, appears 1 time.
Note that Hadoop does not overwrite result files by default, so running the example again will report an error; the ./output directory must be removed first:
rm -r ./output
Install Spark
Visit the official Spark website, download the package, and unpack it as follows.
sudo tar -zxf ~/Downloads/spark-1.6.2-bin-without-hadoop.tgz -C /usr/local/
cd /usr/local
sudo mv ./spark-1.6.2-bin-without-hadoop/ ./spark
sudo chown -R hadoop:hadoop ./spark       # here hadoop is your username
After installation, you also need to modify Spark's configuration file spark-env.sh:
cd /usr/local/spark
cp ./conf/spark-env.sh.template ./conf/spark-env.sh
Edit the spark-env.sh file:
vim ./conf/spark-env.sh
In its first line, add the following configuration information:
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
Once configured, Spark can be used directly; unlike Hadoop, there is no need to run startup commands first.
Verify that Spark was installed successfully by running one of the examples that ship with Spark.
cd /usr/local/spark
bin/run-example SparkPi
Running it prints a lot of log information, so the result is not easy to spot; it can be filtered with grep (the 2>&1 in the command redirects all output, including the log messages, to stdout so it can be piped; otherwise, because of how the logs are written, part of the output would still go straight to the screen):
2>&1"Piis"
Writing code with the Spark Shell
To learn Spark program development, we recommend using spark-shell for interactive exploration to deepen your understanding of how Spark programs are developed.
This section describes the basic use of the Spark Shell. The Spark shell provides an easy way to learn the API and also provides an interactive way to analyze the data.
The Spark shell supports Scala and Python; this section of the tutorial uses Scala for the introduction.
Start the Spark Shell
bin/spark-shell
When spark-shell starts, it automatically creates a Spark context object named sc and an SQL context object named sqlContext.
Load Text file
The sc object created by Spark can load both local files and HDFS files to create an RDD. Here we test it with the local README.md file that ships with Spark.
val textFile = sc.textFile("file:///usr/local/spark/README.md")
Both HDFS files and local files are loaded with textFile; the difference is the prefix (hdfs:// or file://) used to identify them.
Simple RDD Operation
// get the first item (the first line) of the RDD textFile
textFile.first()
// count the number of items (lines) in the RDD textFile
textFile.count()
// extract the lines that contain "Spark" and return a new RDD
val lineWithSpark = textFile.filter(line => line.contains("Spark"))
// count the number of lines in the new RDD
lineWithSpark.count()
RDD operations can be combined to carry out simple MapReduce-style computations:
// find the largest number of words on a single line of the text
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
Exit the Spark Shell
Exit the Spark shell by entering the exit command:
exit
Standalone application Programming
Next we use a simple application, SimpleApp, to demonstrate how to write a standalone application with the Spark API. Programs written in Scala are compiled and packaged with sbt, Java programs are packaged with Maven, and Python programs are submitted directly via spark-submit.
Installing SBT
SBT is the tool Spark uses to package Scala programs. Here we briefly describe the SBT installation process; interested readers can refer to the official SBT website for more information.
Spark does not ship with SBT, so here we use sbt-launch.jar directly; click to download it.
We choose to install it in /usr/local/sbt:
sudo mkdir /usr/local/sbt
sudo chown -R hadoop /usr/local/sbt      # here hadoop is your username
cd /usr/local/sbt
After downloading, execute the following command to copy it to /usr/local/sbt:
cp ~/Downloads/sbt-launch.jar .
Next, create an sbt script in /usr/local/sbt (vim ./sbt) and add the following content:
#!/bin/bash
SBT_OPTS="-Xms512M -Xmx1536M -Xss1M -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=256M"
java $SBT_OPTS -jar `dirname $0`/sbt-launch.jar "$@"
After saving, add execute permission to the ./sbt script:
chmod u+x ./sbt
Finally, run the following command to check that SBT is usable (make sure the computer is connected to the network; the first run will show a download status such as "Getting org.scala-sbt sbt 0.13.11 ...", so please wait patiently. For the author, it took 7 minutes before the first download message even appeared):
./sbt sbt-version
As long as the version information is displayed, everything is fine.
Scala application code
In the terminal, execute the following commands to create a folder sparkapp as the application root directory:
cd ~                                    # enter the user's home folder
mkdir ./sparkapp                        # create the application root directory
mkdir -p ./sparkapp/src/main/scala      # create the required folder structure
Under ./sparkapp/src/main/scala, create a file named SimpleApp.scala (vim ./sparkapp/src/main/scala/SimpleApp.scala) and add the following code:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "file:///usr/local/spark/README.md"  // should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
The program counts the number of lines in the /usr/local/spark/README.md file that contain the letter "a" and the number of lines that contain the letter "b". The /usr/local/spark in line 8 of the code is the Spark installation directory; if yours is different, change it accordingly. Unlike in the Spark shell, a standalone application must initialize its own SparkContext with val sc = new SparkContext(conf), where the SparkContext is constructed from a SparkConf that contains information about the application.
The program depends on the Spark API, so it needs to be compiled and packaged with SBT. Create a new file simple.sbt under ./sparkapp (vim ./sparkapp/simple.sbt) and add the following content, which declares the standalone application's information and its dependency on Spark:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.5"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2"
The simple.sbt file needs to specify the versions of Spark and Scala. In the configuration above, scalaVersion specifies the Scala version and spark-core specifies the Spark version; both version numbers can be found in the information printed on the screen when the Spark shell was started earlier. That is where the author found the version numbers used above (note: the screen output is long, so you may need to scroll back and search carefully).
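If you would rather not scroll back through the shell output, the version banner can also be printed directly; a minimal sketch, assuming the --version flag of spark-submit (which prints the Spark version and the Scala version it was built with):
/usr/local/spark/bin/spark-submit --version 2>&1 | grep -i version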
Use SBT to package Scala programs
To ensure that SBT runs correctly, first check the file structure of the entire application by executing the following command:
cd ~/sparkapp
find .
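If the files were created as described above, the output should look roughly like the following (traversal order may differ):
.
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala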
Next, we can package the entire application into a JAR with the following command (the first run also needs to download dependency packages):
/usr/local/sbt/sbt package
If packaged successfully, the location of the generated jar package is output.
Running the program through Spark-submit
Finally, we can submit the generated JAR package to Spark through spark-submit and run it with the following commands:
/usr/local/spark/bin/spark-submit --class "SimpleApp" ~/sparkapp/target/scala-2.10/simple-project_2.10-1.0.jar
# the command above prints too much information; instead, you can use the following command to see only the desired result
/usr/local/spark/bin/spark-submit --class "SimpleApp" ~/sparkapp/target/scala-2.10/simple-project_2.10-1.0.jar 2>&1 | grep "Lines with a:"
The final result is a single line of the form "Lines with a: ..., Lines with b: ...".
At this point, your first Spark application is complete.