Install Hadoop on CentOS and connect it to Eclipse
Setting up Hadoop had been planned for a long time and only recently made it onto my agenda. Getting Hadoop built under CentOS took a while, and the setbacks along the way could easily fill a tear-stained essay of several thousand characters. There were online tutorials, and there was the warm support of super senior brother from the teaching and research section and of bubble brother from my internship company. Today I can finally sit down and talk about how to install Hadoop on CentOS and connect it to Eclipse.
First, let's talk about the software and information to be prepared:
VMware-workstation;
CentOS-6.0-i386-bin-DVD;
Eclipse-jee-luna-SR1-win32;
Hadoop-0.20.2;
Jdk-6u27-linux-i586;
(Hadoop is picky about versions, so do not change them casually. The software listed here consists of stable releases that are easy to find online.)
The tutorial is divided into five parts: 1) install the VMware virtual machine under Windows and create a new virtual machine to install the CentOS system; 2) set up password-less SSH login in CentOS; 3) install the JDK in CentOS and configure environment variables; 4) install Hadoop in CentOS and configure its files; 5) install the JDK and Eclipse in Windows and connect Eclipse to Hadoop in CentOS. All five parts matter, especially step 1. Below, each step is described in detail.
Step 0: create an ordinary user in Windows named hadoop. All of our software will be installed under this user's directory. It is best to use hadoop as the user name because it should match the user names that appear many times below; keeping everything as hadoop makes setup simpler.
1) Install the VMware virtual machine under Windows and create a new virtual machine to install the CentOS system
First, download and install VMware Workstation. This step is the same as installing any other Windows software, so we will skip the details and save space for the more important steps.
Then create a new virtual machine from the VMware home page, as shown in the figure:
Click Next until you reach the step where you select the system image path; choose the CentOS image and click Next. You will then be asked to enter the Linux user name. This is important: enter hadoop, because this name will be used many times later!
Keep clicking Next until you reach the memory setting of the virtual machine; 1024 MB is recommended, as shown in the figure. Next, choose the settings related to the virtual machine's network type; Network Address Translation (NAT) is recommended, as shown in the figure. In this step I chose the auto-bridging option instead and spent a whole night tracking down the resulting error... that time is gone for good.
The remaining steps can almost all be left at their recommended settings. Create the new CentOS machine, wait a few minutes, and you will be looking at the CentOS interface. Weren't you tempted by that touch of technology blue? Haha, you have taken the first step!
2) Set up password-less SSH login in CentOS
Right-click the desktop and select Open in Terminal; this is the Linux terminal. We hope you have some basic Linux experience so you can get started faster, but if you have none, it doesn't matter: this is a beginner's tutorial.
2.1. First, enter su on the Linux command line; when prompted for the password, enter your own password. Your subsequent operations will then run with the highest permission under Linux, the root permission.
2.2. Before setting up password-less SSH login, you must first disable SELinux, because CentOS will otherwise automatically prevent you from modifying the SSH service; the modification only takes effect after a reboot. How to do this is as follows:
Modify the /etc/selinux/config file.
Change SELINUX=enforcing to SELINUX=disabled.
Restart the machine.
(Note: to modify a file in Linux, run the vi command; the file opens in a window. Press i to enter insert mode; after finishing your changes, press Esc to leave insert mode, then type :wq! to save and exit. Thanks to bubble brother for this; I fumbled with it for half a day before he set me straight.)
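If you prefer not to edit the file by hand, the same change can be made with a one-line sed, a sketch that assumes the stock config file and is run as root, followed by a reboot:
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
reboot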
2.3. Enter ssh-keygen -t rsa on the Linux command line and press Enter.
root@hadoopName-desktop:~$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/zhangtao/.ssh/id_rsa):   // key storage location; press Enter to keep the default
Created directory '/home/zhangtao/.ssh'.
Enter passphrase (empty for no passphrase):   // set a key passphrase; press Enter for none
Enter same passphrase again:   // confirm the passphrase set in the previous step
Go to the /root/.ssh/ directory and you will see two files: id_rsa.pub and id_rsa.
Then run cp id_rsa.pub authorized_keys.
Then run ssh localhost to verify that it works. The first time you will be asked to type yes; after that it is no longer needed.
As shown in the figure, I had already verified once before, so I was not prompted again; on your first verification you will need to type yes.
Now password-less login for the SSH service is set up!
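Putting section 2 together, the whole sequence is roughly the following sketch (run as root; the paths assume root's home directory):
su -
ssh-keygen -t rsa              # press Enter at each prompt: default location, empty passphrase
cd /root/.ssh
cp id_rsa.pub authorized_keys
ssh localhost                  # should now log in without asking for a password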
3) Install the JDK under CentOS and configure environment variables
This part has two steps: installing the JDK and configuring the JDK environment variables.
3.1. Step 1: log in as root and run mkdir /usr/program to create the directory /usr/program. Download the JDK installation package jdk-6u27-linux-i586.bin, copy it into /usr/program, cd into that directory, and run ./jdk-6u27-linux-i586.bin. When the command finishes, the directory jdk1.6.0_27 is generated there; the JDK is now installed under /usr/program/jdk1.6.0_27.
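As a command-line sketch of 3.1 (assuming the installer was downloaded into the current directory; the chmod is only needed if the file is not already executable):
su -
mkdir /usr/program
cp jdk-6u27-linux-i586.bin /usr/program
cd /usr/program
chmod +x jdk-6u27-linux-i586.bin
./jdk-6u27-linux-i586.bin      # generates /usr/program/jdk1.6.0_27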
3.2. Log in as root, run vi /etc/profile on the command line, and add the following content to configure the environment variables (note: the /etc/profile file is very important; the Hadoop configuration later will also use it).
# set java environment
export JAVA_HOME=/usr/program/jdk1.6.0_27
export JRE_HOME=/usr/program/jdk1.6.0_27/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH
Add the above content in the vi editor, save and exit, then run the following commands to make the configuration take effect:
chmod +x /etc/profile    # add execute permission
source /etc/profile      # make the configuration take effect
After the configuration is done, enter java -version on the command line; the JDK installation information should be displayed.
At this point, installing the JDK and configuring its environment variables have both succeeded.
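A quick check (the exact runtime and build lines of the output vary; what matters is that the version matches the JDK you installed):
java -version
# the first line of the output should read something like:
# java version "1.6.0_27"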
4) Install Hadoop in CentOS and configure its files
4.1. Before installing Hadoop, find out your CentOS machine's IP address: enter ifconfig in the terminal to view it (mine is 192.168.154.129).
4.2. Download hadoop-0.20.2.tar.gz, copy it to the directory /usr/local/hadoop, and unpack it there; this generates the directory /usr/local/hadoop/hadoop-0.20.2 (that is, Hadoop is installed in the hadoop-0.20.2 folder under /usr/local/hadoop).
The command is: tar -zxvf hadoop-0.20.2.tar.gz. Unpacking is the whole installation step!
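As a sketch (assuming the tarball was downloaded into the current directory and /usr/local/hadoop does not exist yet):
mkdir -p /usr/local/hadoop
cp hadoop-0.20.2.tar.gz /usr/local/hadoop
cd /usr/local/hadoop
tar -zxvf hadoop-0.20.2.tar.gz   # produces /usr/local/hadoop/hadoop-0.20.2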
4.3. First configure the Hadoop environment variables.
Command: vi /etc/profile
# set hadoop
export HADOOP_HOME=/usr/local/hadoop/hadoop-0.20.2
export PATH=$HADOOP_HOME/bin:$PATH
Command: source /etc/profile to make the configuration file take effect.
Go to /usr/local/hadoop/hadoop-0.20.2/conf to edit the Hadoop configuration files.
4.4. Configure the hadoop-env.sh file
Open the file with: vi hadoop-env.sh
Add:
# set java environment
export JAVA_HOME=/usr/program/jdk1.6.0_27
Save and exit after editing (reminder: type :wq!). In fact, if you look carefully you will find that hadoop-env.sh already contains a JAVA_HOME line; you only need to remove the leading # that comments it out and correct the path. As shown in the figure:
4.5. Configure core-site.xml
[root@master conf]# vi core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.154.129:9000/</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/hadoop-0.20.2/hadooptmp</value>
  </property>
</configuration>
(Note: the IP address after hdfs:// must be your CentOS machine's IP address; this is why we ran ifconfig first. The localhost used in some tutorials is wrong here: Eclipse will not be able to connect later!! That mistake cost me another night...)
Note: the Hadoop distributed file system has two important directory structures: one is where the namenode stores its data, the other is where the datanode's data blocks are stored; there are a few other storage locations as well, and all of them are based on hadoop.tmp.dir. For example, the namenode's namespace lives in ${hadoop.tmp.dir}/dfs/name and the datanode's data blocks in ${hadoop.tmp.dir}/dfs/data. Once hadoop.tmp.dir is set, the other important directories all sit under it; it acts as a root directory. I set it to /usr/local/hadoop/hadoop-0.20.2/hadooptmp, and of course that directory must exist.
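Since that directory must exist before Hadoop is formatted and started, create it now:
mkdir -p /usr/local/hadoop/hadoop-0.20.2/hadooptmp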
4.6. Configure hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
Hdfs-site.xml "15L, 331C
(Note: dfs.replication is set to 1 because we are configuring a standalone, pseudo-distributed setup with only one slave. The dfs.permissions setting below it relaxes permission checks for users.)
4.7. Configure mapred-site.xml
[root@master conf]# vi mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.154.129:9001</value>
  </property>
</configuration>
4.8. The masters file and the slaves file (generally the default content of these two files is as follows and needs no reconfiguration)
[root@hadoop conf]# vi masters
192.168.154.129
[root@hadoop conf]# vi slaves
192.168.154.129
Note: in pseudo-distributed mode, the namenode acting as master and the datanode acting as slave are the same server, so the IP address in both configuration files is the same.
4.9. Host name and IP address resolution configuration (this step is very important!!!)
First, [root@hadoop ~]# vi /etc/hosts;
then [root@hadoop ~]# vi /etc/hostname;
last, [root@hadoop ~]# vi /etc/sysconfig/network.
Note: the configuration in these three places must be consistent so that Hadoop can work normally! The host name configuration is very important!
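As a hedged sketch of what consistent contents could look like, assuming the host name is hadoop and the IP found with ifconfig is 192.168.154.129:
# /etc/hosts
127.0.0.1        localhost
192.168.154.129  hadoop

# /etc/hostname
hadoop

# /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=hadoop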
4.10. Start Hadoop
Enter the /usr/local/hadoop/hadoop-0.20.2/bin directory and run hadoop namenode -format to format the namenode.
Start all Hadoop processes by running start-all.sh:
Verify that Hadoop is running by entering jps:
If TaskTracker, JobTracker, DataNode, and NameNode all appear (circled in red in the figure), your Hadoop installation succeeded!
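Condensed into commands, the start-up sequence of 4.10 looks like this:
cd /usr/local/hadoop/hadoop-0.20.2/bin
./hadoop namenode -format      # format the namenode (only needed the first time)
./start-all.sh                 # starts NameNode, DataNode, JobTracker, TaskTracker, SecondaryNameNode
jps                            # lists the running Java processes for verification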
Note: 1. The SecondaryNameNode is a backup of the namenode; it also stores the mapping from the namespace and files to the file blocks. It is recommended to run it on a separate machine, so that if the master dies you can use the machine holding the SecondaryNameNode to recover the namespace and file-block mapping data and restore the namenode.
2. After startup, a data directory is generated in the dfs folder under /usr/local/hadoop/hadoop-0.20.2/hadooptmp; this is where the data blocks of the datanode are stored. Because I use a single machine, name and data are on the same machine. In a cluster, only the name folder exists on the namenode machine and only the data folder on the datanodes.
5) Install the JDK and Eclipse in Windows, and connect Eclipse to Hadoop in CentOS
Installing the JDK in Windows is just a matter of downloading and running the Windows JDK installer, and Eclipse simply unpacks from its archive, so we will focus on how to connect Eclipse to Hadoop.
5.1. First, disable the Linux firewall.
Shut down the Linux firewall before connecting; otherwise the Eclipse project will keep showing "listing folder content..." and the connection will fail. To disable the firewall:
Run chkconfig iptables off, then reboot.
After the restart, run /etc/init.d/iptables status to check the firewall state; it should report "iptables: Firewall is not running."
5.2. Install the plug-in and configure the Eclipse parameters
Download the plug-in hadoop-eclipse-plugin-0.20.3-SNAPSHOT, put it in the plugins folder under Eclipse, and restart Eclipse; you should then find DFS Locations in the Project Explorer column, as shown in the figure:
Under Window > Preferences there is now an extra Hadoop Map/Reduce option. Select it and point it at the root directory of the Hadoop distribution you downloaded.
Open the Map/Reduce Locations view and you will find a yellow elephant icon in the Locations area below. Right-click in the blank Location area and choose New Hadoop Location...
Configure the parameters on the General tab as shown in the figure:
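The screenshot is not reproduced here, but based on the configuration files above the General tab would be filled in roughly as follows (the location name itself is arbitrary):
Location name:     centos-hadoop (any name you like)
Map/Reduce Master: Host 192.168.154.129, Port 9001   (matches mapred.job.tracker)
DFS Master:        Host 192.168.154.129, Port 9000   (matches fs.default.name)
User name:         hadoop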
Click Finish, close Eclipse, and restart it. A purple elephant now appears in the Location area; right-click it and configure the parameters on the Advanced parameters tab. (Note: the close-and-restart of Eclipse really must be done, otherwise some parameters will not show up on the Advanced parameters page.)
Setting the parameters on the Advanced parameters page is the most time-consuming part. Three parameters need to be changed in total, and at first some of them could not even be found on the page:
The first parameter is hadoop.tmp.dir. Its default is /tmp/hadoop-${user.name}; because we set hadoop.tmp.dir to /usr/local/hadoop/hadoop-0.20.2/hadooptmp in core-site.xml, change it to that same value here. The other directory properties based on it will be adjusted automatically.
The second parameter is dfs.replication. The default here is 3; because we set it to 1 in hdfs-site.xml, set it to 1 here as well.
The third parameter is hadoop.job.ugi. Enter hadoop,Tardis: before the comma is the Hadoop user you connect as, followed by Tardis.
(Note: the first parameter, hadoop.tmp.dir, is easy: after the restart of Eclipse described above, it can be found and modified on the Advanced parameters page. The last two parameters are much harder to make appear; of those, hadoop.job.ugi only shows up if the Linux user name and the Windows user name are the same, otherwise it cannot be found. To this day I do not know why these two parameters sometimes fail to appear; all I could do was close and restart Eclipse a few more times and try again. The online tutorials do not cover this problem.)
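To recap, the three Advanced parameters and the values used in this setup are:
hadoop.tmp.dir  = /usr/local/hadoop/hadoop-0.20.2/hadooptmp   (matches core-site.xml)
dfs.replication = 1                                           (matches hdfs-site.xml)
hadoop.job.ugi  = hadoop,Tardis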
5.3. The Hadoop HDFS directory appears under the project
After the settings above, you will find that the all-important HDFS directory of Hadoop now shows up in the project, as shown in the figure:
At this point Hadoop and Eclipse are connected, and this tutorial is complete. Next time we will talk about how to run the WordCount program on Hadoop from Eclipse.