Reprinted from: http://www.cnblogs.com/spark-china/p/3941878.html
- Prepare a second and a third machine running Ubuntu in VMware;
Building the second and third Ubuntu machines in VMware is exactly the same as building the first machine, so the steps are not repeated here.
The only differences from installing the first Ubuntu machine are:
First: we name the second and third Ubuntu machines Slave1 and Slave2, as shown below:
There are now three virtual machines in VMware:
Second: to simplify the Hadoop configuration and keep the cluster minimal, we log in to the second and third machines with the same root superuser that we used when building the first machine.
2. Configure the newly created Ubuntu machines in the same way as the pseudo-distributed mode;
Configuring the newly created Ubuntu machines is exactly the same as configuring the first machine in pseudo-distributed mode; after the author finished the installation it looks like this:
3. Configuring the Hadoop distributed cluster environment
Based on the previous configuration, we now have three machines running Ubuntu in VMware: Master, Slave1 and Slave2;
Below we start configuring the Hadoop distributed cluster environment:
Step 1: Modify the hostname in /etc/hostname and configure the mapping between hostname and IP address in /etc/hosts:
We use the Master machine as the main node of Hadoop. First, look at the Master machine's IP address:
You can see that the current host's IP address is "192.168.184.133".
We modify the hostname in /etc/hostname:
Enter the configuration file:
The machine name in the configuration file is "rocky-virtual-machine", the default name given when the Ubuntu system was installed. We change "rocky-virtual-machine" to "Master", which will be the main node of our Hadoop distributed cluster environment:
Save and exit. Now use the following command to view the current hostname:
We find that the modified hostname has not taken effect yet. To make it take effect, we reboot the system and look at the hostname again:
The hostname has become "Master", which shows that the modification was successful.
Open the /etc/hosts file:
At this point the file contains only the Ubuntu system's original IP address (127.0.0.1) and hostname (localhost):
We configure the mapping between hostname and IP address in /etc/hosts:
Save and exit after the modification.
Next we use the "ping" command to check whether the hostname-to-IP mapping is correct:
We can see that our host "Master" maps to the IP address "192.168.184.133", which shows that our configuration and operations are correct.
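As a rough sketch, the operations on Master amount to the following commands (assuming vim as the editor and the IP address shown above):

    # check the machine's IP address
    ifconfig
    # edit the hostname file and replace its single line with "Master"
    vim /etc/hostname
    # check the current hostname (the change takes effect after a reboot)
    hostname
    # edit /etc/hosts and add the mapping for this host, for example:
    #   192.168.184.133 Master
    vim /etc/hosts
    # verify that the name resolves correctly
    ping Master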
Go to the second machine and look at its IP address:
You can see that this host's IP address is "192.168.184.131".
We change the hostname in /etc/hostname to "Slave1":
Save and exit.
For the change to take effect, we reboot the machine and look at the hostname:
The change has taken effect.
Go to the third machine and look at its IP address:
You can see that this host's IP address is "192.168.184.132".
We change the hostname in /etc/hostname to "Slave2":
Save and exit.
For the change to take effect, we reboot the machine and look at the hostname:
The change has taken effect.
Now configure the mapping between hostnames and IP addresses in /etc/hosts on Slave1. After opening the file:
We modify the configuration file to:
That is, we add the hostname-to-IP mappings for "Master", "Slave1" and "Slave2". Save and exit.
We then ping Master from this node and find that network access works without problems:
Next, configure the hostnames and IP addresses in /etc/hosts on Slave2 in the same way:
Save and exit.
Now we ping Master and Slave1 and find that both can be reached.
Finally, configure the hostname-to-IP mappings in /etc/hosts on Master. The completed configuration is as follows:
Now use the ping command on Master to communicate with both Slave1 and Slave2:
Master can ping both slave nodes.
Finally, we test that Slave1 can communicate with Master and Slave2:
At this point Master, Slave1 and Slave2 can all communicate with each other!
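As a sketch, the /etc/hosts file on each of the three machines ends up containing the same three mappings (using the IP addresses observed above), which can then be verified with ping from any node:

    # /etc/hosts on Master, Slave1 and Slave2 (in addition to the 127.0.0.1 localhost line)
    192.168.184.133 Master
    192.168.184.131 Slave1
    192.168.184.132 Slave2

    # for example, verify from Master
    ping Slave1
    ping Slave2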
Step 2: Configure passwordless SSH login
First, let's see what happens when Master accesses Slave1 over SSH before any configuration:
We find that a password is required. We do not log in, and exit directly.
How do we enable passwordless SSH within the cluster?
From the earlier configuration, each of the three machines Master, Slave1 and Slave2 already has a private key id_rsa and a public key id_rsa.pub in its /root/.ssh/ directory.
Now copy Slave1's id_rsa.pub to Master, as shown below:
Also copy Slave2's id_rsa.pub to Master, as shown below:
Check on Master whether the files have been copied over:
We find that the public keys of the Slave1 and Slave2 nodes have arrived.
Combine all the public keys into authorized_keys on the Master node:
Copy Master's authorized_keys file to the .ssh directory on Slave1 and Slave2:
Log in to Slave1 and Slave2 again via SSH:
Now Master can log in to Slave1 and Slave2 over SSH without a password, and likewise Slave1 or Slave2 can log in to the other two machines over SSH without a password.
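A sketch of the key exchange, assuming the key pairs already exist in /root/.ssh on every machine as described above (the temporary file names for the copied keys are illustrative):

    # on Slave1: send its public key to Master under a distinct name
    scp /root/.ssh/id_rsa.pub root@Master:/root/.ssh/id_rsa.pub.Slave1
    # on Slave2: likewise
    scp /root/.ssh/id_rsa.pub root@Master:/root/.ssh/id_rsa.pub.Slave2

    # on Master: merge all public keys into authorized_keys
    cd /root/.ssh
    cat id_rsa.pub id_rsa.pub.Slave1 id_rsa.pub.Slave2 > authorized_keys

    # on Master: push the combined authorized_keys back to both slaves
    scp /root/.ssh/authorized_keys root@Slave1:/root/.ssh/
    scp /root/.ssh/authorized_keys root@Slave2:/root/.ssh/

    # test: this should now log in without asking for a password
    ssh Slave1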
Step 3: Modify the configuration files on Master, Slave1 and Slave2
First modify Master's core-site.xml file. Its current content is:
We change the "localhost" domain name to "Master":
Do the same on the Slave1 and Slave2 nodes: open core-site.xml and change the "localhost" domain name to "Master".
Second, modify the mapred-site.xml files on Master, Slave1 and Slave2.
Open the mapred-site.xml file on the Master node and change the "localhost" domain name to "Master", then save and exit.
Similarly, open mapred-site.xml on the Slave1 and Slave2 nodes, change the "localhost" domain name to "Master", then save and exit.
Finally, modify the hdfs-site.xml files on Master, Slave1 and Slave2:
We change the value of "dfs.replication" from 1 to 3 on all three machines, so that our data is stored in 3 copies:
Save and exit.
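As a sketch for Hadoop 1.2.1, the relevant properties end up looking roughly like this on all three machines (the port numbers follow the common 9000/9001 convention from the earlier pseudo-distributed setup and are assumptions here):

    <!-- core-site.xml -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://Master:9000</value>
    </property>

    <!-- mapred-site.xml -->
    <property>
      <name>mapred.job.tracker</name>
      <value>Master:9001</value>
    </property>

    <!-- hdfs-site.xml -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>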
Step 4: Modify the two files masters and slaves in the Hadoop configuration
First modify Master's masters file:
Enter the file:
Change "localhost" to "Master":
Save and exit.
Next modify Master's slaves file.
Enter the file:
The specific changes are:
Save and exit.
From this configuration we can see that Master serves both as the master node and as a data-processing node; this is a concession to wanting 3 copies of our data while having only a limited number of machines.
Copy the masters and slaves files configured on Master to the conf folder under the Hadoop installation directory on Slave1 and Slave2, respectively:
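A sketch of the two files and the copy commands (the installation path follows the /usr/local/hadoop/hadoop-1.2.1 layout mentioned later in this article):

    # conf/masters on Master contains a single line:
    Master

    # conf/slaves on Master lists all data-processing machines:
    Master
    Slave1
    Slave2

    # copy both files to the slaves
    scp /usr/local/hadoop/hadoop-1.2.1/conf/masters /usr/local/hadoop/hadoop-1.2.1/conf/slaves root@Slave1:/usr/local/hadoop/hadoop-1.2.1/conf/
    scp /usr/local/hadoop/hadoop-1.2.1/conf/masters /usr/local/hadoop/hadoop-1.2.1/conf/slaves root@Slave2:/usr/local/hadoop/hadoop-1.2.1/conf/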
Go to the Slave1 or Slave2 node and check the contents of the masters and slaves files:
The copies are exactly correct.
The Hadoop cluster environment configuration is finally complete!
4. Testing the Hadoop distributed cluster environment;
First, format the cluster's file system from the Master node:
Enter "Y" to complete the formatting:
After formatting is complete, we start the Hadoop cluster:
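A sketch of the format and start commands, assuming Hadoop 1.2.1 is installed under /usr/local/hadoop/hadoop-1.2.1:

    cd /usr/local/hadoop/hadoop-1.2.1
    # format the HDFS file system (answer "Y" when prompted)
    bin/hadoop namenode -format
    # start all Hadoop daemons on the cluster
    bin/start-all.sh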
Now let's try to stop the Hadoop cluster:
A "no datanode to stop" error appears. The cause of this error is as follows:
Each time the file system is formatted with "hadoop namenode -format", a new namespaceID is generated, and when we built the stand-alone pseudo-distributed version of Hadoop we placed the data under our own tmp directory. We therefore need to empty the tmp directory and its subdirectories under "/usr/local/hadoop/hadoop-1.2.1/" on every machine, empty the Hadoop-related content under the "/tmp" directory, and finally clear the data and name folders in our custom HDFS directory:
Remove the same content on Slave1 and Slave2.
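A sketch of the cleanup, assuming the directories live where described above (the exact HDFS data/name paths depend on your earlier pseudo-distributed setup and are illustrative here):

    # run on Master, Slave1 and Slave2
    rm -rf /usr/local/hadoop/hadoop-1.2.1/tmp/*
    # Hadoop-related content under /tmp (the name pattern is an assumption)
    rm -rf /tmp/hadoop-root*
    # clear the custom HDFS data and name folders (illustrative paths)
    rm -rf /usr/local/hadoop/hadoop-1.2.1/hdfs/data/* /usr/local/hadoop/hadoop-1.2.1/hdfs/name/*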
Reformat the file system, restart the cluster, and open Master's web console:
Now we can see that there are three live nodes, which is exactly what we expected, because we set Master, Slave1 and Slave2 all to be DataNodes; Master is of course also the NameNode.
Now we look at the process information on the three machines with the jps command:
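With this layout (Master acting as NameNode, SecondaryNameNode and JobTracker while also being a DataNode and TaskTracker), jps would typically show something like the following; the process IDs will of course differ:

    # on Master
    jps
    #   NameNode
    #   SecondaryNameNode
    #   JobTracker
    #   DataNode
    #   TaskTracker

    # on Slave1 / Slave2
    jps
    #   DataNode
    #   TaskTracker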
All the services of the Hadoop cluster have started normally.
At this point, the Hadoop cluster is built.
Now we build a Spark cluster on top of the Hadoop cluster that was built from scratch in parts 1 and 2. Here we use Spark 1.0.0, released on May 30, 2014 and the latest version of Spark at the time of writing. The required software is as follows:
1. Spark 1.0.0. The author uses spark-1.0.0-bin-hadoop1.tgz, available at http://d3kbcqa49mib13.cloudfront.net/spark-1.0.0-bin-hadoop1.tgz
As shown below:
The author saves it on the Master node in the location shown:
2. The Scala version corresponding to Spark 1.0.0; the official requirement is Scala 2.10.x:
The author downloads "Scala 2.10.4" from http://www.scala-lang.org/download/2.10.4.html and, after downloading, saves it on the Master node as:
Step 2: Install each piece of software
Install Scala
1. Open a terminal and create a new directory "/usr/lib/scala", as shown below:
2. Unzip the Scala archive, as shown below:
Move the unpacked Scala directory into the newly created "/usr/lib/scala", as shown below:
3. Modify the environment variables:
Enter the configuration file, as shown:
Press "i" to enter insert mode and add Scala's environment information to it, as shown below:
As you can see from the configuration file, we set SCALA_HOME and add Scala's bin directory to PATH.
Press the "Esc" key to return to normal mode, then save and exit the configuration file:
Execute the following command to make the changes to the configuration file take effect:
4. Display the version of Scala we just installed in the terminal, as shown below:
The version is "2.10.4", which is exactly what we expected.
When we enter the "scala" command, we go straight into the Scala command-line interface:
Now we enter the expression "9*9":
We find that Scala correctly computes the result for us.
At this point we have completed the installation of Scala on Master;
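A sketch of the whole Scala installation on Master, assuming the downloaded archive is scala-2.10.4.tgz in root's home directory (the download location is an assumption):

    # create the target directory
    mkdir /usr/lib/scala
    # unpack the archive and move it into place
    cd ~
    tar -zxvf scala-2.10.4.tgz
    mv scala-2.10.4 /usr/lib/scala
    # add SCALA_HOME to ~/.bashrc and put its bin directory on the PATH
    echo 'export SCALA_HOME=/usr/lib/scala/scala-2.10.4' >> ~/.bashrc
    echo 'export PATH=$SCALA_HOME/bin:$PATH' >> ~/.bashrc
    source ~/.bashrc
    # verify the installation
    scala -version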
Since Spark runs on the three machines Master, Slave1 and Slave2, we need the same Scala installation on Slave1 and Slave2. Use the scp command to copy the Scala directory and "~/.bashrc" to the same locations on Slave1 and Slave2; of course, you can also install them manually on Slave1 and Slave2 in the same way as on the Master node.
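A sketch of the scp approach, run on Master:

    # copy the Scala installation and the environment settings to both slaves
    scp -r /usr/lib/scala root@Slave1:/usr/lib/
    scp -r /usr/lib/scala root@Slave2:/usr/lib/
    scp ~/.bashrc root@Slave1:~/.bashrc
    scp ~/.bashrc root@Slave2:~/.bashrc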
After installing Scala on Slave1, the test result is as follows:
After installing Scala on Slave2, the test result is as follows:
At this point, we have successfully deployed Scala on the three machines Master, Slave1 and Slave2.
Install Spark
Spark needs to be installed on all three machines: Master, Slave1 and Slave2.
First install Spark on Master, with the following steps:
Step 1: Unpack Spark on Master:
We extract it directly into the current directory:
Then we create the directory "/usr/local/spark" for Spark:
Copy the extracted "spark-1.0.0-bin-hadoop1" into "/usr/local/spark":
Step 2: Configure environment variables
Enter the configuration file:
Add "SPARK_HOME" to the configuration file and add Spark's bin directory to PATH:
After configuring, save and exit, then make the configuration take effect:
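A sketch of these steps, assuming spark-1.0.0-bin-hadoop1.tgz was saved in root's home directory:

    # unpack Spark in the current directory
    cd ~
    tar -zxvf spark-1.0.0-bin-hadoop1.tgz
    # create the Spark directory and copy the unpacked release into it
    mkdir /usr/local/spark
    cp -rf spark-1.0.0-bin-hadoop1 /usr/local/spark
    # environment variables in ~/.bashrc
    echo 'export SPARK_HOME=/usr/local/spark/spark-1.0.0-bin-hadoop1' >> ~/.bashrc
    echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.bashrc
    source ~/.bashrc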
Step 3: Configure Spark
Enter Spark's conf directory:
Add "Spark_home" to the configuration file and add the SPARK's Bin directory to path:
Copy spark-env.sh.template to spark-env.sh:
Open spark-env.sh with vim:
Add the following configuration information to the configuration file:
where
JAVA_HOME: specifies the Java installation directory;
SCALA_HOME: specifies the Scala installation directory;
SPARK_MASTER_IP: specifies the IP address of the master node of the Spark cluster;
SPARK_WORKER_MEMORY: specifies the maximum amount of memory a worker node may allocate to executors; since each of our three machines has 2 GB of memory, we set it to 2g to make the most of that memory;
HADOOP_CONF_DIR: specifies the configuration directory of our existing Hadoop cluster;
Save and exit.
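A sketch of the resulting spark-env.sh; the JAVA_HOME path is an assumption and should match your own JDK installation:

    # spark-env.sh (values are illustrative; adjust paths to your installation)
    export JAVA_HOME=/usr/lib/java/jdk1.7.0_60
    export SCALA_HOME=/usr/lib/scala/scala-2.10.4
    export SPARK_MASTER_IP=192.168.184.133
    export SPARK_WORKER_MEMORY=2g
    export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-1.2.1/conf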
Next, configure the slaves file under Spark's conf directory to add the worker nodes:
The content of the file after opening it:
We change the content to:
We can see that all three machines are set up as worker nodes, i.e. our master node is both the master and a worker node.
Save and exit.
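As a sketch, the slaves file ends up simply listing the three hostnames, one per line:

    # conf/slaves
    Master
    Slave1
    Slave2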
This completes the installation of Spark on Master.
Start the cluster and view its status
Step 1: Start the Hadoop cluster. This was explained in detail in the second part and is not repeated here:
After starting it, the jps command on the Master machine shows the following process information:
Using jps on Slave1 and Slave2 shows the following process information:
Step 2: Start the Spark cluster
With the Hadoop cluster successfully started, starting the Spark cluster requires the "start-all.sh" script in Spark's sbin directory:
Next we use "start-all.sh" to start the Spark cluster!
Readers should note that it must be written as "./start-all.sh" to indicate the "start-all.sh" in the current directory, because there is also a "start-all.sh" file in the bin directory of our Hadoop configuration!
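A sketch of the command, run from Spark's sbin directory on Master:

    cd /usr/local/spark/spark-1.0.0-bin-hadoop1/sbin
    # the leading "./" ensures Spark's script is used, not Hadoop's start-all.sh
    ./start-all.sh
    # check the newly started processes
    jps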
Now jps shows, as expected, two new processes on the master node: "Master" and "Worker"!
New "Worker" processes also appear on Slave1 and Slave2 at this time:
Now we can open the Spark cluster's web page at "http://Master:8080", as shown below:
From the page we can see that we have three worker nodes, together with information about each of them.
Now we enter Spark's bin directory and start the "spark-shell" console:
We are now in Spark's shell. According to the output, we can view the Spark UI at "http://Master:4040", as shown below:
Of course, you can also look at other information, such as Environment:
We can also look at Executors:
As you can see, for our shell the driver is master:50777.
At this point, our Spark cluster has been built successfully. Congratulations!
Step one: Test Spark using the Spark shell
STEP 1: Start the Spark cluster. This was covered in great detail in the third part; after starting, the Web UI looks like this:
STEP 2: Start the Spark shell:
You can now view the shell's status through the following web console:
STEP 3: Copy "README.md" from the Spark installation directory to the HDFS file system
Start a new terminal on the Master node and go to the Spark installation directory:
We copy the file to the root folder of HDFS:
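A sketch of the upload, assuming the Hadoop commands are on the PATH (otherwise use their full path under /usr/local/hadoop/hadoop-1.2.1/bin):

    # upload Spark's README.md to the root of HDFS
    hadoop fs -put README.md /
    # check that it is there
    hadoop fs -ls /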
Looking at the web console, we find that the file has been successfully uploaded to HDFS:
STEP 4: Use the Spark shell to write code that operates on the "README.md" we just uploaded:
First, let's look at "sc" in the shell environment, an environment variable that the shell produces for us automatically:
We can see that sc is an instance of SparkContext, which the system generates automatically when the Spark shell starts. SparkContext is the channel through which code is submitted to the cluster or run locally; whether we run locally or on a cluster, any Spark code we write needs an instance of SparkContext.
Next, we read the file "README.md":
We save what we read in the value "file"; file is in fact a MappedRDD. In Spark code, everything is based on operations on RDDs;
Next, from the file we read, we filter out all the lines containing the word "Spark".
This produces a FilteredRDD;
Next, let's count how many times "Spark" appears:
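Putting the shell input for this step together as a sketch (the variable names are illustrative, and the HDFS URI follows the fs.default.name assumed earlier):

    // inside spark-shell
    val file = sc.textFile("hdfs://Master:9000/README.md")    // read the uploaded file as an RDD
    val sparks = file.filter(line => line.contains("Spark"))  // keep only the lines that mention "Spark"
    sparks.count                                              // count those lines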
From the result we find that the word "Spark" appears 15 times in total.
Now let's look at the Spark shell's web console:
The console shows that we submitted one task and completed it successfully; click on the task to see its execution details:
So how do we verify that the Spark shell is correct about the 15 occurrences of "Spark" in this README.md file? The method is very simple: we can count it with the wc command that comes with Ubuntu, for example:
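For instance, counting the lines of README.md that mention "Spark" with standard command-line tools (the exact command used originally is not shown, so this is a sketch):

    grep Spark README.md | wc -l
    # 15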
The result is also 15, the same as the count from the Spark shell.
Step two: Use Spark's cache mechanism and observe the efficiency gain
Building on the above, we execute the following statement again:
We find the same result: 15.
Now let's look at the web console:
The console clearly shows that we performed the "count" operation twice.
Now let's apply the "cache" operation to the "sparks" variable:
Then perform the count operation again and look at the web console:
We find that the three count operations we have run so far took 0.7 s, 0.3 s and 0.5 s.
Now we run the count operation a fourth time and look at the effect in the web console:
The console clearly shows that the fourth operation took only 17 ms, about 30 times faster than the previous three operations. This is the huge speedup that caching brings, and caching is one of the cores of Spark's computation!
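Putting this step's shell input together as a sketch, reusing the "sparks" RDD from the previous step:

    // inside spark-shell, continuing from the previous step
    sparks.count    // second count, still computed from scratch
    sparks.cache    // mark the RDD to be kept in memory once it is next computed
    sparks.count    // third count: this run populates the cache
    sparks.count    // fourth count: served from memory, dramatically faster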
Step three: Build the IDE development environment for Spark
Step 1: Currently the world's leading IDE development tool for Spark is IntelliJ IDEA. We download IntelliJ IDEA:
Here we download the latest version, 13.1.4:
As for which edition to choose, the official site offers the following options:
We choose the "Community Edition FREE" version for Linux, which fully meets the needs of arbitrarily complex Scala development.
After the author's download completes, it is saved locally in the following location:
Step 2: Install IDEA and configure IDEA's system environment variables
Create the "/usr/local/idea" directory:
Unpack the IDEA tarball we downloaded into this directory:
After the installation is complete, to make the commands in its bin directory convenient to use, we configure them in "~/.bashrc":
After the configuration is complete, save and exit, and execute the source command to make the configuration file take effect.
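A sketch of the installation steps, assuming the downloaded archive is named ideaIC-13.1.4.tar.gz and sits in root's home directory (the archive name and the unpacked directory name are assumptions):

    mkdir /usr/local/idea
    cd ~
    tar -zxvf ideaIC-13.1.4.tar.gz -C /usr/local/idea
    # put IDEA's bin directory on the PATH via ~/.bashrc (the unpacked directory name is illustrative)
    echo 'export IDEA_HOME=/usr/local/idea/idea-IC-135.1230' >> ~/.bashrc
    echo 'export PATH=$IDEA_HOME/bin:$PATH' >> ~/.bashrc
    source ~/.bashrc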
Step 3: Run IDEA and install and configure IDEA's Scala development plugin:
The official documentation states:
We enter IDEA's bin directory:
Running "idea.sh" brings up the following interface:
Now select "Configure" to enter IDEA's configuration page:
Select "Plugins" to enter the plugin installation interface:
Then click the "Install JetBrains plugin" option in the lower left corner to enter the following page:
In the input box at the upper left, type "scala" to find the Scala plugin:
Now click "Install plugin" on the right:
Select "Yes" to start the automatic installation of the Scala plugin in IDEA.
The download and installation take about 2 minutes; of course, the time will vary with your download speed:
Now restart IDEA:
After the restart, the following interface appears:
Step 4: Write Scala code in IDEA:
First select "Create New Project" on the screen we reached in the previous step:
Then select the "Scala" option in the list on the left:
To make future development work easier, we select the "SBT" option on the right:
Click "Next" to go to the next step and set the Scala project's name and directory:
Click "Finish" to complete the project creation:
Since we chose the "SBT" option, IDEA now intelligently sets up the SBT tool for us:
We click on the project name "HelloScala":
IDEA's automatic completion of the SBT setup takes a while; for the author it took about 5 minutes. SBT automatically creates some of the directories for us:
Now right-click the scala folder under src/main and, from "New" in the popup menu, select "Scala Class":
Enter the file name:
Select "Object" as the Kind:
Click "OK" to finish:
Now we change the source code of our "FirstScalaApp" to the following content:
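A minimal sketch of what FirstScalaApp looks like, based on the "Hello Scala!" output described below:

    object FirstScalaApp {
      def main(args: Array[String]): Unit = {
        println("Hello Scala!")
      }
    }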
Now we right-click "FirstScalaApp" and select "Run Scala Console"; the following prompt appears:
This is because we have not set the Java JDK path. Click "OK", and the following view appears:
We select the "Project" option on the far left:
Then, next to the "No SDK" entry, we click "New", as follows:
Click the "JDK" option:
Select the directory of the JDK we installed earlier:
Click "OK".
Click the confirm button:
The following view appears directly in the code area:
We choose "Run 'FirstScalaApp'" to run the program:
The first Scala run will be somewhat slow; the running result is shown below:
The string "Hello Scala!" is printed successfully.
This shows that the program runs correctly.
Step 5: If we want to use a cool black background, we can go to
File -> Settings -> Appearance -> Theme and select Darcula:
Step four: Build and test the Spark development environment through the IDE
Step 1: Import the jar corresponding to spark-hadoop. Select "File" -> "Project Structure" -> "Libraries", then click "+" to import the spark-hadoop jar:
Click "OK" to confirm:
Click "OK":
When IDEA finishes processing, we find that Spark's jar package has been imported into our project:
Step 2: Develop the first Spark program. Open the examples directory that comes with Spark:
There are many files inside; these are the examples that Spark provides for us.
In our first Scala project, under src, create a Scala object named SparkPi:
Now open the SparkPi file under Spark's own examples:
We copy its contents directly into the SparkPi object we created in IDEA:
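The bundled example is not reproduced verbatim here; the sketch below implements the same Monte Carlo estimate of Pi and, to match the run configuration described next, takes the master URL (for example "local") as its first program argument, which is an adaptation rather than the exact shipped code:

    import scala.math.random
    import org.apache.spark.{SparkConf, SparkContext}

    object SparkPi {
      def main(args: Array[String]) {
        // the master URL (e.g. "local") is expected as the first program argument
        val master = if (args.length > 0) args(0)
                     else sys.error("usage: SparkPi <master>, e.g. SparkPi local")
        val conf = new SparkConf().setAppName("SparkPi").setMaster(master)
        val spark = new SparkContext(conf)
        val slices = 2
        val n = 100000 * slices
        // throw random points at the unit square and count those inside the unit circle
        val count = spark.parallelize(1 to n, slices).map { _ =>
          val x = random * 2 - 1
          val y = random * 2 - 1
          if (x * x + y * y < 1) 1 else 0
        }.reduce(_ + _)
        println("Pi is roughly " + 4.0 * count / n)
        spark.stop()
      }
    }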
Step five: Test the Spark IDE development environment
Now if we simply select SparkPi and run it, the following error message appears:
From the message we can see that the master on which the Spark program should run cannot be found.
We need to configure SparkPi's execution environment:
Select "Edit Configurations" to enter the configuration interface:
We enter "local" in Program arguments:
This configuration means that our program runs locally in local mode; save the configuration.
Now run the program again.