A Hadoop cluster supports three modes of operation: standalone mode, pseudo-distributed mode, and fully distributed mode. Below is an introduction to deploying each of them on Ubuntu.
(1) Standalone mode. By default Hadoop is configured to run as a single Java process in non-distributed mode, which is convenient for debugging at the start. Development in Eclipse uses standalone mode and does not involve HDFS. If the JDK is not installed yet, the installation steps are as follows: first download the Linux version of the JDK from the official website, then unpack the download into a suitable directory; with that the JDK is installed. Next, configure the environment variables.
Add the following code to /etc/profile.
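A minimal sketch of the lines to append, assuming the JDK was unpacked to /usr/lib/jvm/jdk1.7.0 (this path is only an example):

export JAVA_HOME=/usr/lib/jvm/jdk1.7.0    # example path, replace with your own JDK directory
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH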
The JAVA_HOME path is the JDK path on your own machine. After saving, you need to log out of the current user and log back in for the change to take effect.
Then open a console and enter java -version; if the Java version and related information is displayed, the configuration was successful. Unzip the downloaded Hadoop archive and rename the directory to hadoop (purely for convenience later). Go into the conf folder and edit hadoop-env.sh: around line 9 there is a line of the form #export JAVA_HOME=*******. Remove the leading # (which marks the line as a comment) and set the value of JAVA_HOME to the JDK path on your machine, the same value as in /etc/profile. You can now run a Hadoop program in standalone mode; make sure the current path is the hadoop folder:
bin/hadoop jar hadoop-examples-*.jar wordcount conf output
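Since standalone mode writes its output to the local filesystem, the word counts can be checked directly once the job finishes (a quick optional check):

cat output/*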
conf is the input folder and output is the output folder, so make sure the conf folder exists and contains files. (2) Pseudo-distributed mode. Pseudo-distributed mode is a single-node mode of operation: all processes (NameNode, Secondary NameNode, JobTracker, DataNode, TaskTracker) run on the one and only node, and HDFS is required. First configure three XML files; they are located in the conf folder under the Hadoop directory.
core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/****/hadoop/logs</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
Next you need to install SSH. Enter the following directly in the console under Ubuntu:
sudo apt-get install openssh-server
If you are prompted that the package source cannot be found, update the software sources by entering:
sudo apt-get update
After installing SSH, you need to set up a login key. In the console, enter:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
ssh-keygen generates the key; -t (note that options are case sensitive) specifies the type of key to generate, and dsa means DSA key authentication, i.e. the key type; -P supplies the passphrase (empty here); -f specifies the file the key is written to. cat appends id_dsa.pub (the public key) to authorized_keys. After this, you can enter ssh localhost in the console without a password. The run steps are as follows (the path is the Hadoop path): (1) Format the NameNode and start the Hadoop processes:
bin/hadoop namenode -format
bin/start-all.sh
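After bin/start-all.sh finishes, a quick way to verify that the daemons are up is jps (the five process names are the ones listed above; jps also lists itself):

jps    # expect NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker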
If all 5 related processes appear, Hadoop has started successfully. (2) Create an input folder on HDFS and upload the data:
bin/hadoop fs -mkdir input
bin/hadoop fs -put data input
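To confirm that the upload worked, the directory can be listed (an optional check):

bin/hadoop fs -ls input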
input is the name of the folder being created, and data is the local data file (the default path is under the Hadoop directory); the available commands can be viewed with bin/hadoop fs -help. (3) Run WordCount and view the results:
bin/hadoop jar hadoop-examples-*.jar wordcount input output
bin/hadoop fs -cat output/* >> result.txt
If the jar package was exported by yourself, you can run bin/hadoop jar own.jar input output directly. The results are placed in the result.txt file (in the local Hadoop directory); if you leave out >> result.txt, the results are printed in the console instead. The running job can be viewed in a browser at http://machine-name:50030. (4) Stop the Hadoop processes:
bin/stop-all.sh
(3) Fully distributed mode. In Hadoop, different subsystems partition nodes differently: from the point of view of HDFS, nodes are divided into a NameNode and DataNodes, of which there can be several DataNodes; from the point of view of MapReduce, nodes are divided into a JobTracker and TaskTrackers, of which there can be several TaskTrackers. Fully distributed deployment is almost the same as pseudo-distributed deployment. You need two or more machines with the same user name (this is required), IP addresses in the same network segment, and the ability to ping each other.
192.168.6.30 master
192.168.6.31 node1
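A quick way to confirm the machines can reach each other, using the addresses above (run from the master, for example):

ping -c 3 192.168.6.31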
(1) The master generates an SSH key and distributes it:
ssh-keygen -t rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub username@192.168.6.31
If the distribution fails, you can copy the public key manually instead. From the master:

scp ~/.ssh/id_rsa.pub username@192.168.6.31:~/mas_key

And then on the remote machine:

mkdir ~/.ssh
chmod 700 ~/.ssh
mv ~/mas_key ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
Using ssh-copy-id not only appends the public key to authorized_keys, but also sets the correct permissions (700 for the .ssh folder, 600 for authorized_keys).
Reference article: http://www.thegeekstuff.com/2008/11/3-steps-to-perform-ssh-login-without-password-using-ssh-keygen-ssh-copy-id/
For the principle behind password-free SSH login, see: http://www.ruanyifeng.com/blog/2011/12/ssh_remote_login.html
After this, you should not need to enter a password when running ssh 192.168.6.31 on the master host.
If you run into the error Agent admitted failure to sign using the key, the solution is to use the ssh-add command to add the private key:
ssh-add ~/.ssh/id_rsa
(2) Configure the hosts file:
sudo gedit /etc/hosts

127.0.0.1 localhost
#127.0.0.1 machine-name
192.168.6.38 master
192.168.6.31 node1
* The second line (127.0.0.1 machine name) must be commented out. The hosts file is then distributed to the slave via scp and moved into place on the slave machine:
scp /etc/hosts username@target-machine-ip:~/hosts
sudo mv ~/hosts /etc/
(3) Modify the Hadoop configuration files. A total of 5 files need to be configured. Three of them are the same as in pseudo-distributed mode, except that the localhost fields must be changed to the master machine's name (here master; an IP address also works). The other two files are masters and slaves:
masters (file contents):
master

slaves (file contents):
master
node1
There are two data nodes in this Hadoop cluster: master and node1 (the slave). * The 5 configuration files also need to be distributed to the slave (that is, the Hadoop configuration on the master and slave machines in the cluster is identical). With that, the Hadoop configuration is complete, and the ssh node1 command lets you log in directly. The next steps for running WordCount are the same as in pseudo-distributed mode. Note: empty the logs directory on the cluster machines before each run.
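For reference, the localhost-to-master change described in step (3) amounts to values like the following in core-site.xml and mapred-site.xml (a sketch; substitute your own master name or IP):

<property> <name>fs.default.name</name> <value>hdfs://master:9000</value> </property>
<property> <name>mapred.job.tracker</name> <value>master:9001</value> </property>

One way to push the configuration files to the slave, assuming Hadoop is installed at ~/hadoop on both machines (adjust the path as needed):

scp conf/* username@node1:~/hadoop/conf/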