Hadoop Learning Notes



Author: wayne1017

I. A Brief Introduction

Here is a general introduction to Hadoop.
Most of this article comes from the official Hadoop website, in particular a PDF document introducing HDFS, which gives a fairly comprehensive overview of Hadoop. This series of Hadoop learning notes works through that material step by step, while also drawing on many articles on the web and summarizing the problems I ran into while learning Hadoop.
Let's start with where Hadoop came from. When talking about Hadoop, you have to mention Lucene and Nutch. Lucene is not an application but a pure-Java, high-performance full-text indexing engine toolkit that can easily be embedded in all kinds of applications to provide full-text search and indexing. Nutch is an application: a search engine built on Lucene. Lucene provides Nutch with its text search and indexing API, while Nutch adds data crawling on top of search. Before version 0.8.0, what is now Hadoop was part of Nutch; starting with Nutch 0.8.0, the NDFS and MapReduce code implemented inside it was split out into a new open-source project, Hadoop, and Nutch 0.8.0 was fundamentally re-architected to build entirely on top of Hadoop. Hadoop implements Google's GFS and MapReduce designs, which makes Hadoop a distributed computing platform.
In fact, Hadoop is not just a distributed file system for storage; it is a framework designed to run distributed applications on large clusters of general-purpose computing hardware.

Hadoop contains two parts:

1. HDFS

Hadoop Distributed File System (HDFS)
HDFS is highly fault tolerant and can be deployed on low-cost hardware. It is well suited to applications with large data sets and provides high throughput for reading and writing data. HDFS has a master/slave structure: in a normal deployment, a single NameNode runs on the master and one DataNode runs on each slave.
HDFS supports a traditional hierarchical file organization, similar to existing file systems: you can create and delete files, move a file from one directory to another, rename files, and so on. The NameNode manages the entire distributed file system, and file system operations (such as creating or deleting files and folders) are controlled by the NameNode.
The following diagram shows the structure of HDFS (the original figure is not reproduced here):


As the diagram shows, communication among the NameNode, the DataNodes, and clients is based on TCP/IP. When a client wants to perform a write, the data is not sent to the NameNode immediately; the client first caches the data in a temporary file on its local machine. When the data in that temporary file reaches the configured block size (64 MB by default), the client notifies the NameNode. The NameNode responds to the client's RPC request by inserting the file name into the file system hierarchy and allocating a block on a DataNode; it then tells the client which DataNode and which data block to use, and the client writes the block from its local temporary file to the specified DataNode.
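
The staging and block allocation just described are hidden behind an ordinary stream API. As a small illustration (my addition, not part of the original article), the following is a minimal Java sketch of writing a file through Hadoop's FileSystem interface; the class and method names follow the long-standing org.apache.hadoop.fs API and may differ slightly in very old releases, and the path is only a hypothetical example. It assumes a configuration whose fs.default.name points at the NameNode.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        // Picks up fs.default.name from the Hadoop configuration files on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The client-side buffering and block allocation described above
        // happen behind this stream; the caller just writes bytes.
        Path file = new Path("/user/dbrg/hello.txt");   // hypothetical example path
        FSDataOutputStream out = fs.create(file);
        out.write("hello hdfs".getBytes());
        out.close();
        fs.close();
    }
}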
HDFS uses a replica strategy to improve the reliability and availability of the system. The replica placement policy is three replicas: one on the local node, one on another node in the same rack, and one on a node in a different rack. (At the time of writing, Hadoop 0.12.0 had not yet fully implemented this policy, but the work was in progress and I expect it will land soon.)

2. MapReduce

MapReduce is an important technology from Google: a programming model for computing over large volumes of data. The usual way to handle computation on very large data sets is parallel computing, but for many developers, at least at this stage, parallel computing is still a fairly remote subject. MapReduce is a programming model that simplifies parallel computing, allowing developers with little parallel-programming experience to build parallel applications.
The name MapReduce comes from the two core operations in this model: map and reduce. Readers familiar with functional programming will find these two words familiar. In short, map transforms one data set into another, element by element, according to a rule specified by a function; for example, mapping [1, 2, 3, 4] with "multiply by 2" gives [2, 4, 6, 8]. Reduce folds a data set into a single value according to a rule specified by a function; for example, reducing [1, 2, 3, 4] by summation gives 10, and by multiplication gives 24.
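
To make the [1, 2, 3, 4] example concrete, here is a tiny plain-Java sketch (my addition; it uses ordinary loops, not the Hadoop API) of a map that multiplies every element by 2 and a reduce that folds the list into a sum:

public class MapReduceIdea {
    // map: apply the same rule to every element, producing a new data set
    static int[] mapTimesTwo(int[] input) {
        int[] output = new int[input.length];
        for (int i = 0; i < input.length; i++) {
            output[i] = input[i] * 2;          // rule: multiply by 2
        }
        return output;
    }

    // reduce: fold all elements into a single value, here by summation
    static int reduceSum(int[] input) {
        int result = 0;
        for (int v : input) {
            result += v;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        // mapTimesTwo(data) gives [2, 4, 6, 8]; reduceSum(data) gives 10
        System.out.println(java.util.Arrays.toString(mapTimesTwo(data)));
        System.out.println(reduceSum(data));
    }
}

Replacing the sum in reduceSum with a running product would give 24, the other reduction mentioned above.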
For more on MapReduce, I suggest reading Meng Yan's article "MapReduce: The Free Lunch Is Not Over!".

That is about it for the first part of this series. I am only just getting started with Hadoop myself; next I will talk about deploying Hadoop and the problems I ran into while doing so, as a reference so that others can avoid a few detours.

II. Installation and Deployment

This part uses the installation and use of hadoop-0.12.0 as an example, pointing out problems that are easy to run into when deploying Hadoop and how to solve them.


Hardware Environment
There are three machines in total, all running FC5, with JDK 1.6.0 as the Java environment. The IP configuration is as follows:
dbrg-1:202.197.18.72
dbrg-2:202.197.18.73
dbrg-3:202.197.18.74

One thing to emphasize here: make sure that every machine's hostname and IP address resolve correctly.

A very simple test is to ping the hostname, for example ping dbrg-2 from dbrg-1; if the ping succeeds, name resolution is OK. If it does not resolve correctly, modify the /etc/hosts file. On the machine used as the NameNode, add the IP addresses and corresponding hostnames of all machines in the cluster to the hosts file; on a DataNode, you only need to add its own IP address and the IP address of the NameNode machine.

For example, the /etc/hosts file on dbrg-1 should look like this:
127.0.0.1 localhost
202.197.18.72 dbrg-1 dbrg-1
202.197.18.73 dbrg-2 dbrg-2
202.197.18.74 dbrg-3 dbrg-3

The /etc/hosts file on dbrg-2 should look like this:
127.0.0.1 localhost
202.197.18.72 dbrg-1 dbrg-1
202.197.18.73 dbrg-2 dbrg-2

As mentioned in the previous note, from HDFS's point of view nodes are divided into the NameNode and DataNodes, of which there is only one NameNode while there can be many DataNodes; from MapReduce's point of view, nodes are divided into the JobTracker and TaskTrackers, of which there is only one JobTracker while there can be many TaskTrackers.
I deploy the NameNode and JobTracker on dbrg-1, with dbrg-2 and dbrg-3 as DataNodes and TaskTrackers. Of course, you can also deploy the NameNode, DataNode, JobTracker and TaskTracker all on a single machine.


Directory Structure
Hadoop requires that the Hadoop deployment directory structure be the same on all machines, and that every machine has an account with the same user name.
On my three machines there is a dbrg account on each, with home directory /home/dbrg.
The Hadoop deployment directory structure is as follows: /home/dbrg/HadoopInstall; all versions of Hadoop are placed in this directory.
Extract the hadoop-0.12.0 archive into HadoopInstall. To make later upgrades easier, it is recommended to create a link pointing to the Hadoop version you want to use; you might as well call it hadoop:
[dbrg@dbrg-1:HadoopInstall]$ ln -s hadoop-0.12.0 hadoop
This way, all configuration files are under hadoop/conf/ and all executables under hadoop/bin/.
However, because the Hadoop configuration files in the directory above live inside the Hadoop installation directory, they would all be overwritten when you later upgrade to a new Hadoop version. It is therefore recommended to separate the configuration files from the installation directory. A better approach is to create a directory for the configuration files, /home/dbrg/HadoopInstall/hadoop-conf/, and copy hadoop-site.xml, slaves, and hadoop-env.sh from hadoop/conf/ into hadoop-conf/. (Strangely, the Getting Started page on the official Hadoop site says only these three files need to be copied to the directory you created, but in practice I found I also had to copy the masters file into hadoop-conf/; otherwise Hadoop complains on startup that it cannot find the masters file.) Then set the environment variable $HADOOP_CONF_DIR to point to that directory. The environment variable is set in /home/dbrg/.bashrc and /etc/profile.
To sum up: to make later upgrades easier we separate the configuration files from the installation directory, and by pointing a link at the Hadoop version we want to use we reduce our maintenance of the configuration files. In the following sections you will see the benefits of this separation and of the link.


SSH Settings
When Hadoop starts, the NameNode uses SSH (Secure Shell) to start and stop the various daemons on each node, which requires that no password be entered when commands are executed between nodes. We therefore need to configure SSH to use passwordless public-key authentication.
First make sure that an SSH server is installed and running normally on every machine. Here we use OpenSSH, a free open-source implementation of the SSH protocol. The OpenSSH version installed by default in FC5 is OpenSSH 4.3p2.
Taking the three machines in this article as an example: dbrg-1 is the master node, and it needs to initiate SSH connections to dbrg-2 and dbrg-3. For those connections dbrg-1 is the SSH client, while dbrg-2 and dbrg-3 are SSH servers, so make sure the sshd service is running on dbrg-2 and dbrg-3. Simply put, a key pair, that is a private key and a public key, needs to be generated on dbrg-1, and the public key copied to dbrg-2 and dbrg-3. Then, when dbrg-1 initiates an SSH connection to dbrg-2, dbrg-2 generates a random number, encrypts it with dbrg-1's public key, and sends it to dbrg-1; dbrg-1 decrypts it with its private key and sends the decrypted number back to dbrg-2; once dbrg-2 verifies that the number is correct, it allows dbrg-1 to connect. This completes one public-key authentication exchange.

For the three machines in this article, first generate the key pair on dbrg-1:
[dbrg@dbrg-1:~]$ ssh-keygen -t rsa
This command generates a key pair for user dbrg on dbrg-1. Press Enter to accept the default path when asked where to save the keys, and press Enter again when prompted for a passphrase, i.e. leave the passphrase empty. The generated key pair id_rsa, id_rsa.pub is stored in /home/dbrg/.ssh by default. Next, append the contents of id_rsa.pub to the /home/dbrg/.ssh/authorized_keys file on every machine (including dbrg-1 itself). If a machine already has an authorized_keys file, append the contents of id_rsa.pub to the end of it; if it does not, simply cp or scp the file. The steps below assume that none of the machines has an authorized_keys file yet.

For dbrg-1
[dbrg@dbrg-1:.ssh]$ cp id_rsa.pub authorized_keys

For dbrg-2 (dbrg-3 is handled the same way as dbrg-2)
[dbrg@dbrg-2:~]$ mkdir .ssh
[dbrg@dbrg-1:.ssh]$ scp authorized_keys dbrg-2:/home/dbrg/.ssh/
The scp here is a remote copy over SSH. You will need to enter the password of the remote host, i.e. the password of the dbrg account on dbrg-2; you can of course also copy the authorized_keys file to the other machines by any other means.

[dbrg@dbrg-2:.ssh]$ chmod 644 authorized_keys
This step is critical: you must make sure authorized_keys is readable and writable only by its owner, and that nobody else has write permission, otherwise SSH will not work. I was stuck on this for quite a while when configuring SSH.

[dbrg@dbrg-2:.ssh]$ ls -la
drwx------  2 dbrg dbrg .
drwx------  3 dbrg dbrg ..
-rw-r--r--  1 dbrg dbrg authorized_keys
Note that the ls -la output for the .ssh directory on each machine should look like the above.

Then the sshd service needs to be configured on the three machines (strictly speaking this can be skipped; SSH already works after the steps above). On each of the three machines, modify the file /etc/ssh/sshd_config:
# Disable password authentication
PasswordAuthentication no
AuthorizedKeysFile .ssh/authorized_keys

The SSH configuration on each machine is now complete and can be tested, for example by connecting from dbrg-1 to dbrg-2:
[dbrg@dbrg-1:~]$ ssh dbrg-2
If the SSH configuration is OK, a message like the following will appear:
The authenticity of host 'dbrg-2' can't be established.
Key fingerprint is 1024 5f:a0:0b:65:d3:82:df:ab:44:62:6d:98:9c:fe:e9:52.
Are you sure you want to continue connecting (yes/no)?
OpenSSH is telling you that it does not know this host, but there is nothing to worry about, since this is the first time you are logging into it. Type "yes". This adds the host's identification to the ~/.ssh/known_hosts file, and the message will not be shown again the next time you connect to this host.
Then you will find that no password is needed to establish the SSH connection. Congratulations, the configuration succeeded.
But do not forget to test SSH to the local machine as well: ssh dbrg-1.


Hadoop Environment Variables
Set the environment variables Hadoop needs in hadoop-env.sh in the /home/dbrg/HadoopInstall/hadoop-conf directory. JAVA_HOME must be set; HADOOP_HOME may be set or left unset. If it is not set, HADOOP_HOME defaults to the parent directory of the bin directory, i.e. /home/dbrg/HadoopInstall/hadoop in this article. That is how I set it up:
export HADOOP_HOME=/home/dbrg/HadoopInstall/hadoop
export JAVA_HOME=/usr/java/jdk1.6.0
Here you can see the advantage of the hadoop link to hadoop-0.12.0 created earlier: when you upgrade to a newer Hadoop version later, you do not need to change the configuration file, only the link.


Hadoop Configuration Files
As mentioned earlier, open the slaves file in the hadoop-conf/ directory. This file specifies all the slave nodes, one hostname per line. In this article those are dbrg-2 and dbrg-3, so the slaves file should look like this:
dbrg-2
dbrg-3
The hadoop-default.xml file in the conf/ directory contains all of Hadoop's configuration items, but it must not be edited directly. Instead, define the items you need in hadoop-site.xml in the hadoop-conf/ directory; their values override the defaults in hadoop-default.xml and can be customized to your actual needs. My configuration file is as follows (a short Java sketch after the file shows how these values are read at run time):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>dbrg-1:9000</value>
<description>the name of the default file system. Either the literal string "local" or a host:port for dfs.</description>
</property>
<property>
<name>mapred.job.tracker</name>
<value>dbrg-1:9001</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/dbrg/HadoopInstall/tmp</value>
<description>a base for other temporary directories.</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/dbrg/HadoopInstall/filesystem/name</value>
<description>determines where on the "local filesystem" the DFS name node should store the name table. If This is a comma-delimited list of directories then the name table is replicated in all of the directories for Redundan Cy. </description>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/dbrg/HadoopInstall/filesystem/data</value>
<description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
</property>
</configuration>
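
As a small illustration (my addition, not part of the original configuration), the sketch below shows how Hadoop code reads these properties through the Configuration class, which loads hadoop-default.xml first and hadoop-site.xml afterwards, so the site-specific values win. It assumes both files are on the classpath, for example via $HADOOP_CONF_DIR.

import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
    public static void main(String[] args) {
        // Loads hadoop-default.xml and then hadoop-site.xml;
        // values defined in hadoop-site.xml override the defaults.
        Configuration conf = new Configuration();
        System.out.println("fs.default.name    = " + conf.get("fs.default.name"));
        System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
        // The second argument is a fallback used only if the key is set nowhere.
        System.out.println("dfs.replication    = " + conf.get("dfs.replication", "3"));
    }
}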


Deploying Hadoop
With the Hadoop environment variables and configuration files set up on dbrg-1, it is now time to deploy Hadoop to the other machines and keep the directory structure consistent.
[dbrg@dbrg-1:~]$ scp -r /home/dbrg/HadoopInstall dbrg-2:/home/dbrg/
[dbrg@dbrg-1:~]$ scp -r /home/dbrg/HadoopInstall dbrg-3:/home/dbrg/
At this point Hadoop can be said to be deployed on every machine. Now let's start Hadoop.


Start Hadoop
Before starting, we need to format the NameNode. Enter the ~/HadoopInstall/hadoop directory and execute the following command:
[dbrg@dbrg-1:hadoop]$ bin/hadoop namenode -format
Barring surprises, you should be told that the format succeeded. If it fails, check the log files under hadoop/logs/.
Now it is time to start Hadoop properly. There are quite a few startup scripts under bin/, which can be run according to your needs:
* start-all.sh starts all the Hadoop daemons: the NameNode, DataNodes, JobTracker and TaskTrackers
* stop-all.sh stops all the Hadoop daemons
* start-mapred.sh starts the MapReduce daemons: the JobTracker and TaskTrackers
* stop-mapred.sh stops the MapReduce daemons
* start-dfs.sh starts the HDFS daemons: the NameNode and DataNodes
* stop-dfs.sh stops the HDFS daemons

Here we simply start all the daemons:
[dbrg@dbrg-1:hadoop]$ bin/start-all.sh

Similarly, to stop Hadoop:
[dbrg@dbrg-1:hadoop]$ bin/stop-all.sh


HDFS Operations
Run the hadoop command in the bin/ directory to see all the operations Hadoop supports and how to use them. Here are a few simple examples.

Create a Directory
[dbrg@dbrg-1:hadoop]$ bin/hadoop dfs -mkdir TestDir
This creates a directory named TestDir in HDFS.

Copying files
[dbrg@dbrg-1:hadoop]$ bin/hadoop dfs -put /home/dbrg/large.zip testfile.zip
This copies the local file large.zip into the user's HDFS home directory /user/dbrg/ under the name testfile.zip.

View Existing Files
[dbrg@dbrg-1:hadoop]$ bin/hadoop dfs -ls
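
The same operations are also available programmatically. Below is a minimal Java sketch (my addition, not part of the original article) that mirrors the -mkdir and -put commands above through the FileSystem API; the method names follow the generic org.apache.hadoop.fs interface and may differ slightly in the 0.12.0 release.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsOperationsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: bin/hadoop dfs -mkdir TestDir
        // (a relative path resolves under the user's HDFS home, /user/dbrg)
        fs.mkdirs(new Path("TestDir"));

        // Equivalent of: bin/hadoop dfs -put /home/dbrg/large.zip testfile.zip
        fs.copyFromLocalFile(new Path("/home/dbrg/large.zip"), new Path("testfile.zip"));

        fs.close();
    }
}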

Source:
http://www.cnblogs.com/wayne1017/archive/2007/03/18/668768.html
http://www.cnblogs.com/wayne1017/archive/2007/03/20/678724.html

