In fact, the official Hadoop documentation already makes it fairly easy to configure and run the distributed framework, but I want to write a little more, because there are some details that can leave people groping around for half a day. Hadoop can run standalone or as a cluster. Standalone mode needs no further explanation: just follow the demo's running instructions and execute the commands directly. What I mainly want to cover here is the process of configuring and running a cluster.
Environment
Seven ordinary machines, all running Linux. Memory and CPU don't matter much; one of Hadoop's big selling points is that the more machines, the better. The JDK must be 1.5 or above; remember this. The machine names of the 7 machines must all be different, because, as discussed below, machine names have a great impact on MapReduce.
Deployment Considerations
As described above, a Hadoop cluster has two broad categories of roles: master and slave. The former is configured with the Namenode and Jobtracker roles and is responsible for overseeing the distributed data and for breaking down and dispatching tasks; the latter is configured with the Datanode and Tasktracker roles and is responsible for distributed data storage and task execution. I originally wanted to check whether one machine could be configured as both master and slave, but I found a conflict between the machine-name configuration used during Namenode initialization and the one used when the Tasktracker runs (it seems to be a question of whether the line mapping the machine name to its IP or the line mapping localhost to its IP should come first in the hosts file; it may also just be my own mistake, so feedback from your own runs is welcome). In the end I settled on one master and six slaves; later, when comparing more complex applications and test results, I will add more machines to the configuration.
Implementation Steps
Create the same directory on all machines. You can also create the same user on every machine and use that user's home directory as the Hadoop installation path; for example, I created /home/wenchu on all machines. Download Hadoop and extract it on the master first. The version I downloaded is 0.17.1, so the Hadoop installation path is /home/wenchu/hadoop-0.17.1. After extracting, go into the conf directory; the main files that need to be modified are: hadoop-env.sh, hadoop-site.xml, masters, slaves.
Hadoop's underlying configuration file is hadoop-default.xml. From the Hadoop code you can tell that when a job is created, a Config is created by default; it first reads in the hadoop-default.xml configuration and then reads hadoop-site.xml (that file is initially empty). hadoop-site.xml mainly holds the system-level settings from hadoop-default.xml that you need to override, as well as any custom configuration you need in your MapReduce process (for specific usage, see the reference documentation at the end).
The following is a simple hadoop-site.xml configuration:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
  <!-- Your Namenode configuration: machine name (or IP) plus port -->
  <name>fs.default.name</name>
  <value>hdfs://10.2.224.46:54310/</value>
</property>
<property>
  <!-- Your Jobtracker configuration: machine name (or IP) plus port -->
  <name>mapred.job.tracker</name>
  <value>hdfs://10.2.224.46:54311/</value>
</property>
<property>
  <!-- Number of replicas for each data block; the default is 3 -->
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <!-- Hadoop's default temporary path. It is best to configure this: if a Datanode
       inexplicably fails to start after adding a new node or in other situations,
       delete the tmp directory configured here. However, if this directory is deleted
       on the Namenode machine, the Namenode format command must be run again. -->
  <name>hadoop.tmp.dir</name>
  <value>/home/wenchu/hadoop/tmp/</value>
</property>
<property>
  <!-- JVM parameters for child tasks; configure as needed -->
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
<property>
  <!-- Block size in bytes (this will come up again later). It must be a multiple of 512,
       because the CRC file-integrity check uses 512 bytes as the smallest checksum unit. -->
  <name>dfs.block.size</name>
  <value>5120000</value>
  <description>The default block size for new files.</description>
</property>
</configuration>
Only one parameter needs to be modified in the hadoop-env.sh file:
# The Java implementation to use. Required.
export JAVA_HOME=/usr/ali/jdk1.5.0_10
This configures your Java path; remember that it must be version 1.5 or above, otherwise there will be problems.
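As a quick sanity check of the version (assuming the path above), you can run the JDK that JAVA_HOME points to directly:
/usr/ali/jdk1.5.0_10/bin/java -version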
The masters file contains the master's IP or machine name; if a machine name is used, it must be set in hosts. The slaves file contains the slaves' IPs or machine names; likewise, machine names must be set in hosts. In the example below I configured IPs:
Masters:
10.2.224.46
Slaves:
10.2.226.40
10.2.226.39
10.2.226.38
10.2.226.37
10.2.226.41
10.2.224.36
Set up SSH trust certificates from the master to each slave. Since the master starts Hadoop on all the slaves via SSH, a one-way or two-way certificate must be established so that no password needs to be entered when the commands run. Execute ssh-keygen -t rsa on the master and on all slave machines; when running this command, just press Enter at each prompt. The certificate file id_rsa.pub is then generated under /root/.ssh/. Copy the master's file to each slave via scp (remember to rename it), for example: scp root@masterIP:/root/.ssh/id_rsa.pub /root/.ssh/46_rsa.pub, then execute cat /root/.ssh/46_rsa.pub >> /root/.ssh/authorized_keys to build the authorized_keys file. You can open this file and see that each entry is an RSA public key as the key with user@ip as the value. At this point you can try it out: ssh from the master to a slave no longer needs a password. The reverse trust from slave to master is established in the same way. Why the reverse? If Hadoop is always started and shut down from the master there is no need for it, but if you want to shut Hadoop down from a slave you need to create the reverse trust as well. Copy the Hadoop directory on the master to the same directory on each slave via scp, and modify each slave's hadoop-env.sh according to that slave's JAVA_HOME. Then modify /etc/profile on the master:
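To make the SSH setup above concrete, here is a minimal sketch of the commands, assuming everything runs as root and the master's IP is 10.2.224.46 (the file name 46_rsa.pub and the slave IP are just examples; adjust them to your own machines):
# On the master and on every slave: generate an RSA key pair, pressing Enter at each prompt
ssh-keygen -t rsa
# On each slave: pull the master's public key over and append it to authorized_keys
scp root@10.2.224.46:/root/.ssh/id_rsa.pub /root/.ssh/46_rsa.pub
cat /root/.ssh/46_rsa.pub >> /root/.ssh/authorized_keys
# Back on the master: this should now log in to the slave without asking for a password
ssh 10.2.226.40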
Add the following (adjust the specific content to your own installation path; this step is just for convenience):
export HADOOP_HOME=/home/wenchu/hadoop-0.17.1
export PATH=$PATH:$HADOOP_HOME/bin
After the modification, execute source /etc/profile to make it take effect. On the master, run hadoop namenode -format. This is the initialization that needs to be done the first time; you can think of it as formatting. Apart from the case mentioned above where you delete hadoop.tmp.dir on the master, it does not need to be executed again. Then execute start-all.sh on the master; it can be run directly because it was added to the PATH in the previous step. This command starts both HDFS and MapReduce; of course, you can also start HDFS and MapReduce separately with start-dfs.sh and start-mapred.sh in the bin directory. Check the master's logs directory to see whether the Namenode log and the Jobtracker log started properly, and check the slaves' logs directories to see whether the Datanode and Tasktracker logs are normal. To shut down, simply execute stop-all.sh.
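Putting the startup steps together, a minimal sketch of the commands on the master (assuming HADOOP_HOME/bin is already on the PATH as configured above) looks like this:
# One-time initialization of HDFS; only repeat it if hadoop.tmp.dir on the master was deleted
hadoop namenode -format
# Start HDFS and MapReduce together, then check the logs
start-all.sh
ls /home/wenchu/hadoop-0.17.1/logs/
# Shut everything down again when needed
stop-all.sh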
The above steps start Hadoop's distributed environment. Next, on the master machine, go into the Hadoop installation directory and execute hadoop jar hadoop-0.17.1-examples.jar wordcount <input path> <output path> to see the word-count example in action. Both the input path and the output path here refer to paths in HDFS, so you can first create the input path in HDFS by copying a directory from the local file system into HDFS:
hadoop dfs -copyFromLocal /home/wenchu/test-in test-in. Here /home/wenchu/test-in is a local path and test-in is the path that will be created in HDFS. After execution you can see that the test-in directory exists via hadoop dfs -ls, and you can use hadoop dfs -ls test-in to see what is inside. The output path must not already exist in HDFS; when the demo finishes, you can list the output with hadoop dfs -ls <output path>, and view the contents of a specific file with hadoop dfs -cat <file name>.
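As a concrete end-to-end run, a sketch of the whole word-count example from the Hadoop installation directory might look like the following (test-in and test-out are just example names, and the exact name of the output part file may differ):
# Copy a local directory of text files into HDFS and check it
hadoop dfs -copyFromLocal /home/wenchu/test-in test-in
hadoop dfs -ls test-in
# Run the bundled word-count example; test-out must not yet exist in HDFS
hadoop jar hadoop-0.17.1-examples.jar wordcount test-in test-out
# Inspect the result
hadoop dfs -ls test-out
hadoop dfs -cat test-out/part-00000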
Experience Summary and Precautions (this is the part where I spent some time going down detours):
The conf files on the master and the slaves do not need to be fully synchronized. If Hadoop is definitely started and shut down from the master, the configuration on the slave machines does not need to be maintained; but if you want to be able to start and shut down Hadoop from any machine, the configurations need to be kept consistent.
The hosts file on the master and the slave machines must list every machine in the cluster, even if IPs are used in every configuration file. I suffered a lot from this: I originally thought that if I used IPs there was no need to configure hosts, but the result was that reduce tasks always got stuck during execution, unable to get past the copy phase and retrying over and over. In addition, problems also occur if two machines in the cluster have the same machine name.
If there is a problem when adding or removing a node, first delete hadoop.tmp.dir on the slave and restart; if that does not work, simply delete the master's hadoop.tmp.dir as well (which means the data on DFS will be lost too). If you delete the master's hadoop.tmp.dir, you need to run hadoop namenode -format again.
On the number of map tasks and reduce tasks: the earlier discussion of the distributed file system design mentioned that a file placed into the distributed file system is split into multiple blocks placed on the Datanodes. The default dfs.block.size should be 64 MB, which means that if the data you place on HDFS is smaller than 64 MB there will be only one block, placed on a single Datanode. You can see how each node stores the data with the hadoop dfsadmin command, and you can also look at the hadoop.tmp.dir/dfs/data/current directory on a Datanode to see the blocks. The number of blocks directly affects the number of maps. Of course, you can also set the number of map and reduce tasks through configuration. The number of maps usually defaults to the number of blocks HDFS needs to process; you can also configure the number of maps or the minimum split size, in which case the actual number is max(min(block_size, data/#maps), min_split_size). The number of reduces can be calculated with this formula: 0.95 * num_nodes * mapred.tasktracker.tasks.maximum.
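As a rough worked example of those formulas (assuming the stock 64 MB block size and a mapred.tasktracker.tasks.maximum of 2; check the hadoop-default.xml shipped with your version for the real defaults):
512 MB of input / 64 MB per block  ->  8 blocks  ->  about 8 map tasks by default
reduces ≈ 0.95 * num_nodes * mapred.tasktracker.tasks.maximum = 0.95 * 6 * 2 ≈ 11 reduce tasks on this 6-slave cluster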
In general, when there is a problem, or right after starting up, it is best to take a look at the logs; it will put your mind at ease.
Summary of Hadoop Commands
This part can actually be learned from the commands' own help and introductions; I will mainly introduce the few commands I use the most. hadoop dfs is followed by HDFS operations, which are quite similar to Linux file system commands. For example:
hadoop dfs -ls lists the contents of /usr/root (by default, if no path is given, this is the current user's path); hadoop dfs -rmr xxx deletes a directory; there are many more commands, and a look through them makes them easy to pick up. hadoop dfsadmin gives a global view of the Datanode situation; hadoop job adds parameters to the currently running jobs, such as list and kill; and hadoop balancer is the aforementioned command for balancing disk load.
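A few concrete invocations of these commands, as a sketch (the exact options are worth double-checking with the built-in help):
hadoop dfs -ls                 # list the current user's HDFS directory
hadoop dfs -rmr test-out       # recursively remove an HDFS directory
hadoop dfsadmin -report        # global view of the Datanodes
hadoop job -list               # list the running jobs
hadoop balancer                # rebalance data across the Datanodes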
I will not go into the other commands in detail.