Using Hadoop to build a distributed storage and distributed computing cluster


1. List the machines used

Ordinary PCs are sufficient. Requirements:
CPU: 750 MHz - 1 GHz
Memory: > 128 MB
Disk: > 10 GB
Expensive machines are not needed.

Machine Name:
Finewine01
Finewine02
Finewine03

finewine01 is set as the master node; the other machines are slave nodes.

2. Download and build

Check out the source from here; I chose trunk:
http://svn.apache.org/repos/asf/lucene/hadoop/
Build with Ant
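
A minimal sketch of that checkout and build, assuming Subversion and Ant are installed and that trunk sits under the URL above (the local directory name is illustrative):

svn co http://svn.apache.org/repos/asf/lucene/hadoop/trunk hadoop-trunk
cd hadoop-trunk
ant    # the default target compiles Hadoop; output typically lands under build/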

3. Pre-deployment preparation

After the start-all.sh script is executed on the master node, the services on the master node and on all slave nodes are running. That is, the script starts the master node's services, then SSHes to every slave node and starts its services.

The start-all.sh script assumes that Hadoop is installed at the same location on every machine and that every machine stores Hadoop's data under the same paths.

We need to create the same directory structure on each machine:
/hadoop
/hadoop/hadoop-install    installation location for Hadoop (here /hadoop/hadoop-install/hadoop-0.10.0 for version 0.10.0)
/hadoop/filesystem        Hadoop file system root
/hadoop/home              home directory of the hadoop user

Log in to each machine as root and create the hadoop user and the directory structure:
ssh -l root finewine01
mkdir /hadoop
mkdir /hadoop/hadoop-install
mkdir /hadoop/filesystem
mkdir /hadoop/home
groupadd hadoop
useradd -d /hadoop/home -g hadoop hadoop
chown -R hadoop:hadoop /hadoop
passwd hadoop    (set a password, e.g. hadooppassword)

The start-all.sh script starts the services on all machines, which requires password-less SSH login between the machines. So we need to create an SSH key on each machine. In this example, the master node also starts its own services, so the master node needs the password-less SSH login setup as well.

Edit /hadoop/hadoop-install/hadoop-0.10.0/conf/hadoop-env.sh with vi and set the following environment variables:

export HADOOP_HOME=/hadoop/hadoop-install/hadoop-0.10.0
export JAVA_HOME=/usr/java/jdk1.5.0_06
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves

This file also has a number of other variables that affect how Hadoop runs. For example, if you get SSH errors when executing the script, you may need to adjust the HADOOP_SSH_OPTS variable.
Also note that after the initial copy operation you need to set the HADOOP_MASTER variable in hadoop-env.sh so that the code can be synchronized from the master node to all slave nodes via rsync (see section 7).
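
For example, a hedged sketch of adjusting HADOOP_SSH_OPTS in hadoop-env.sh; the specific SSH options shown here are assumptions, not values from the original setup:

export HADOOP_SSH_OPTS="-o ConnectTimeout=5 -o BatchMode=yes"    # fail fast instead of hanging on unreachable slaves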

Create the SSH keys on the master node and copy them to each slave node. These operations must be done as the hadoop user created earlier. Do not su to the hadoop user; start a new shell and log in as the hadoop user to do this.
cd /hadoop/home
ssh-keygen -t rsa    (use empty responses for each prompt)
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /hadoop/home/.ssh/id_rsa.
Your public key has been saved in /hadoop/home/.ssh/id_rsa.pub.
The key fingerprint is:
a6:5c:c3:eb:18:94:0b:06:a1:a6:29:58:fa:80:0a:bc nutch@localhost

On the master node, copy the public key just created to a file named authorized_keys:
cd /hadoop/home/.ssh
cp id_rsa.pub authorized_keys

You only need to run the ssh-keygen program on the master node. After the directory structure has been created on the other nodes, copy the key created on the master node to the same directory on each slave node with scp:
scp /hadoop/home/.ssh/authorized_keys hadoop@finewine02:/hadoop/home/.ssh/authorized_keys
The first time, you need to enter the hadoop user's password. Also, the first time you log in to another machine, SSH asks whether to add the machine to the list of known hosts; answer yes. Once this key file has been copied, logging in from the master node to a slave node as hadoop no longer requires a password.
You can test this from the master node as the hadoop user:
ssh finewine02
A command prompt should appear directly, without asking for a password.

Once the SSH keys have been successfully created on all machines, you can begin deploying Hadoop.

4. Deploy Hadoop to a single machine

First, we deploy Hadoop to a single node (the master node). Make sure it runs correctly there before adding the other slave nodes. All of the following actions are performed while logged in as the hadoop user.
cp -R /path/to/build/* /hadoop/hadoop-install/hadoop-x.x.x

Then make sure that the shell scripts are in UNIX (LF) format and are executable (these files are in the bin and conf directories, respectively).
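
A small sketch of that check, assuming the dos2unix utility is available (sed or tr could strip the carriage returns instead):

cd /hadoop/hadoop-install/hadoop-x.x.x
dos2unix bin/* conf/*.sh    # remove any CRLF (DOS) line endings
chmod +x bin/*              # make the control scripts executable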

A hadoop-site.xml example:

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>finewine01:9000</value>
    <description>The name of the default file system. Either the literal
    string "local" or a host:port for NDFS.</description>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>finewine01:9001</value>
    <description>The host and port that the MapReduce job tracker runs at.
    If "local", then jobs are run in-process as a single map and reduce
    task.</description>
  </property>

  <property>
    <name>mapred.map.tasks</name>
    <value>2</value>
    <description>Define mapred.map.tasks to be the number of slave hosts.</description>
  </property>

  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
    <description>Define mapred.reduce.tasks to be the number of slave hosts.</description>
  </property>

  <property>
    <name>dfs.name.dir</name>
    <value>/hadoop/filesystem/name</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/hadoop/filesystem/data</value>
  </property>

  <property>
    <name>mapred.system.dir</name>
    <value>/hadoop/filesystem/mapreduce/system</value>
  </property>

  <property>
    <name>mapred.local.dir</name>
    <value>/hadoop/filesystem/mapreduce/local</value>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

</configuration>
fs.default.name // the default file system: either "local" or "host:port"

Hadoop contains two components: a distributed file system and MapReduce functionality. The distributed file system lets you store and replicate files across many ordinary machines. MapReduce makes it easy to run parallel program tasks.

The distributed file system consists of a name node and data nodes. When a client wants to operate on a file in the file system, it first contacts the name node, which tells it which data nodes hold the file. The name node is responsible for scheduling, that is, deciding on which machines the data blocks are stored and replicated. The data nodes are the data warehouses that store the actual blocks of file data. The name node and data node services communicate over sockets, whether they run on the same machine or on different machines.

MapReduce is distributed computation, analogous to the distributed file system, except that what is distributed is operations rather than files. The server responsible for MapReduce scheduling is called the MapReduce job tracker. Each node that performs computation runs a daemon called the task tracker, which runs tasks and communicates with the job tracker.

The master node and the slave nodes communicate through continuous heartbeats (every 5 to 10 seconds). If a slave node stops sending heartbeats, the master assumes that slave node has failed and no longer uses it.

mapred.job.tracker // the MapReduce master node: either "local" or "host:port"

mapred.map.tasks and mapred.reduce.tasks set the number of parallel tasks.

dfs.name.dir // where the name node stores tracking and scheduling information about the data nodes

dfs.data.dir // where the data nodes store the actual blocks of data

mapred.system.dir // where the MapReduce task tracker stores its own data; this exists only on the machine the task tracker runs on, not on the MapReduce master

mapred.local.dir // where MapReduce stores its local data on each node. MapReduce uses a large amount of local space to perform its tasks. When the tasks exit, the intermediate files produced by MapReduce are not deleted. This property is the same on every node.

dfs.replication // redundancy: how many machines each file is replicated to. This value cannot be higher than the total number of data nodes; otherwise you will see many error messages when the daemons start.

Before starting the Hadoop services, be sure to format the name node:

bin/hadoop namenode -format

Now it's time to start the Hadoop service.

bin/start-all.sh

To stop the Hadoop service you can use the following command

bin/stop-all.sh

If everything is configured correctly, you will see normal startup output.
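
As a quick sanity check once the services are up, the distributed file system can be exercised from the command line; a hedged sketch using the dfs shell of this Hadoop generation (the paths are only illustrative):

bin/hadoop dfs -mkdir /user/hadoop/input             # create a directory in the DFS
bin/hadoop dfs -put /etc/hosts /user/hadoop/input/   # copy a local file into the DFS
bin/hadoop dfs -ls /user/hadoop/input                # list it back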

5. Deploy Hadoop to multiple machines

Once Hadoop runs successfully on a single machine, you can copy the installation to the other machines, for example:
scp -r /hadoop/hadoop-install/hadoop-x.x.x hadoop@finewine02:/hadoop/hadoop-install/

Perform this operation for each slave machine. Then edit the conf/slaves file and add each slave to it, one per line, as shown below. Also edit the hadoop-site.xml values to adjust the number of map and reduce tasks and the dfs.replication property.
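
For this cluster, the conf/slaves file simply lists the slave machines, one hostname per line; a sketch (whether the master also appears as a worker is a deployment choice, so it is omitted here):

finewine02
finewine03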

6. Distributed Search

In production systems, each index typically holds about 1 million records, and 50 servers can handle more than 20 requests per second.
On multiprocessor, multi-disk systems, each search service can use a separate disk and index, reducing machine cost by about 50% and power consumption by about 75%. A multi-disk machine cannot handle as many queries per second per index as a single-disk machine, but it can serve a larger number of indexes, so on average it is more efficient.

7. Syncing code to the slave nodes

Hadoop provides the ability to synchronize code to the slave nodes. This feature is optional because it slows down service startup, and sometimes you do not want the slave nodes to be synchronized.

Although the slave nodes can be synchronized from the master node, the first time you still need to install the basics on each slave node so that the synchronization scripts can take effect. We already did all of that above, so nothing needs to change.

Synchronization works by the master node SSHing into each slave node and executing the bin/hadoop-daemon.sh script there; that script invokes rsync against the master node. This means each slave node must be able to log in to the master node without a password. Earlier we set up password-less login from the master node to the slave nodes; now we set up the reverse direction.

If problems are caused by the rsync options, see the bin/hadoop-daemon.sh script; the rsync options are set at around line 82.

So the first thing to do is to set the HADOOP_MASTER variable in the conf/hadoop-env.sh file, for example:
export HADOOP_MASTER=finewine01:/hadoop/hadoop-install/hadoop-x.x.x

Then copy it to all slave nodes:
scp /hadoop/hadoop-install/hadoop-x.x.x/conf/hadoop-env.sh hadoop@finewine02:/hadoop/hadoop-install/hadoop-x.x.x/conf/hadoop-env.sh

Finally, you need to log in to each slave node and create an SSH key on each machine. Each public key is then copied back to the master node and appended to the /hadoop/home/.ssh/authorized_keys file there. On each slave node, do the following:

ssh -l hadoop finewine02
cd /hadoop/home/.ssh

ssh-keygen -t rsa    (use empty responses for each prompt)
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /hadoop/home/.ssh/id_rsa.
Your public key has been saved in /hadoop/home/.ssh/id_rsa.pub.
The key fingerprint is:
a6:5c:c3:eb:18:94:0b:06:a1:a6:29:58:fa:80:0a:bc nutch@localhost

scp id_rsa.pub hadoop@finewine01:/hadoop/home/finewine02.pub

After this has been done on each slave node, append all of the copied public keys to the master node's authorized_keys file (on the master node):

cd /hadoop/home
cat finewine*.pub >> .ssh/authorized_keys

Once these operations are complete, every time the bin/start-all.sh script is run, the files are synchronized from the master node to each slave node.
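
A quick manual check that the reverse, password-less login works (run as the hadoop user; the hostnames follow this cluster's naming):

ssh finewine02    # from the master to a slave, as before
ssh finewine01    # then from that slave back to the master; no password prompt should appear
exit              # leave the nested sessions
exit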

8. Viewing cluster status

Port 50070: DFS status

Port 50060: task tracker status

Port 50030: map/reduce administration

Other ports:
dfs.secondary.info.port          50090
dfs.datanode.port                50010
dfs.info.port                    50070
mapred.job.tracker.info.port     50030
mapred.task.tracker.report.port  50050
tasktracker.http.port            50060
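
A hedged way to confirm from the command line that the web interfaces are responding, assuming curl is installed (the hostname is this cluster's master node):

curl -s http://finewine01:50070/ | head    # DFS status page
curl -s http://finewine01:50030/ | head    # map/reduce administration page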

This article is from a CSDN blog. When reproducing it, please credit the source: http://blog.csdn.net/kevin_long/archive/2007/11/08/1872812.aspx
