In fact, the official Hadoop documentation already makes it fairly easy to set up the distributed framework's runtime environment. Since I am writing this anyway, I will add a bit more and call out some details, because it is exactly these details that can cost you a long time of trial and error. Hadoop can run standalone on a single machine, or it can be configured to run as a cluster. For the single-machine case there is not much to say: just follow the running instructions of the demo and execute the commands directly. Here I will focus on configuring and running a cluster.
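For reference, the standalone demo run mentioned above looks roughly like the example in the official quickstart. This is only a sketch; the jar name and regular expression are illustrative and depend on your Hadoop version:

```
# Standalone quickstart sketch, run from the unpacked Hadoop directory
mkdir input
cp conf/*.xml input
bin/hadoop jar hadoop-0.17.1-examples.jar grep input output 'dfs[a-z.]+'
cat output/*
```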
Environment
Seven ordinary machines, all running Linux. Memory and CPU are not worth dwelling on; one of Hadoop's main characteristics is that it favors many machines over powerful ones. The JDK must be 1.5 or above; keep that in mind. The machine names of the seven machines must all be different; later I will talk about the impact of machine names on MapReduce.
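A minimal sanity check you can run on each of the seven machines (nothing Hadoop-specific is assumed here):

```
java -version    # must report 1.5 or later
hostname         # must be unique across the cluster
```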
Deployment considerations
As described above, a Hadoop cluster can be divided into two kinds of roles: master and slave. The former is configured with the NameNode and JobTracker roles and is responsible for managing the distributed data and dispatching the decomposed tasks; the latter is configured with the DataNode and TaskTracker roles and is responsible for distributed data storage and task execution. I originally intended to check whether a single machine could be configured as the master and also serve as a slave, but I found that the machine-name configuration conflicts during NameNode initialization and TaskTracker execution (NameNode and TaskTracker have some conflicts over the hosts configuration; whether the IP corresponding to the machine name or the IP corresponding to localhost should be put first is a bit of a problem, though it may also just be my own mistake, so feel free to give me feedback based on your own deployments). In the end I settled on one master and six slaves; more machines will be added later to compare results for more complex application development and testing.
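As a hedged illustration of the hosts issue mentioned above, every machine in the cluster ends up with an /etc/hosts along these lines. The host names and IP addresses here are made up, and the ordering question the paragraph above raises (machine-name entry versus localhost entry first) is exactly the part you may need to experiment with:

```
127.0.0.1    localhost
10.0.0.1     hadoop-master    # NameNode + JobTracker
10.0.0.2     hadoop-slave1    # DataNode + TaskTracker
10.0.0.3     hadoop-slave2
# ... one line per remaining slave, replicated on every machine in the cluster
```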
Procedure
After the steps above, the Hadoop distributed environment is up. Then, on the master, enter the Hadoop installation directory and execute hadoop jar hadoop-0.17.1-examples.jar wordcount <input path> <output path> to see the word count statistics. Both the input and output paths here refer to paths in HDFS, so you first need to create an input path in HDFS by copying a directory from the local file system into HDFS:
hadoop dfs -copyFromLocal /home/wenchu/test-in test-in. Here /home/wenchu/test-in is the local path and test-in is the path that will be created in HDFS. After the command completes, you can see via hadoop dfs -ls that the test-in directory now exists, and you can view its contents via hadoop dfs -ls test-in. The output path must not yet exist in HDFS. After the demo has run, you can view the contents of the output path via hadoop dfs -ls <output path>, and the contents of a specific file via hadoop dfs -cat <file name>.
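Putting the whole workflow together, it looks roughly like the sketch below. The input path is the author's example; test-out is a made-up output path, and part-00000 is simply the typical name of a reduce output file:

```
hadoop dfs -copyFromLocal /home/wenchu/test-in test-in             # copy the local directory into HDFS
hadoop dfs -ls                                                     # test-in should now show up
hadoop dfs -ls test-in                                             # inspect its contents
hadoop jar hadoop-0.17.1-examples.jar wordcount test-in test-out   # test-out must not exist yet
hadoop dfs -ls test-out                                            # list the output files
hadoop dfs -cat test-out/part-00000                                # view one result file
```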
Lessons learned and things to watch out for (these are detours that cost me some time during use):
- The conf configuration files on the master and slaves do not need to be fully synchronized. If you are sure you will always start and stop the cluster from the master, the configuration on the slave machines does not need to be maintained. However, if you want to be able to start or stop Hadoop from any machine, the configuration needs to be kept consistent everywhere.
- The /etc/hosts file on the master and slave machines must list every machine in the cluster, even if IP addresses are used in all the configuration files. I suffered quite a bit here: I assumed that if I configured IP addresses I would not need to configure the hosts, but the result was that jobs always got stuck during the reduce phase, unable to proceed during the copy step and retrying endlessly. Also, if two machines in the cluster have the same machine name, problems will occur as well.
- If a problem occurs when adding or removing a node, first delete the slave's hadoop.tmp.dir, then restart and try again. If that still does not work, simply delete the master's hadoop.tmp.dir as well (which means the data on DFS will also be lost). If you delete the master's hadoop.tmp.dir, you then need to re-run namenode -format (a sketch of these steps follows this list).
- Configure the number of map tasks and reduce tasks. As mentioned in the earlier discussion of the distributed file system design, when a file is put into the distributed file system it is divided into multiple blocks that are placed on the datanodes. By default dfs.block.size is 64 MB, which means that if the data you place on HDFS is smaller than 64 MB, there will be only one block, placed on a single datanode. You can run hadoop dfsadmin -report to see how data is stored on each node, or go directly to a datanode and look at the directory hadoop.tmp.dir/dfs/data/current to see the blocks. The number of blocks directly affects the number of maps. Of course, you can also configure the number of map and reduce tasks. By default the number of maps is the same as the number of HDFS blocks to be processed; you can also set the number of maps or a minimum split size, in which case the actual split size is max(min(block_size, data/#maps), min_split_size). The number of reduces can be estimated with the formula 0.95 * num_nodes * mapred.tasktracker.tasks.maximum (a worked example follows this list).
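As a worked example of the formulas above, with all numbers assumed purely for illustration rather than taken from the author's cluster:

```
# Assumptions: dfs.block.size = 64 MB (default), input data ~ 200 MB,
#              6 slave nodes, mapred.tasktracker.tasks.maximum = 2
#
# Maps:    200 MB / 64 MB -> 4 blocks -> about 4 map tasks by default
# Reduces: 0.95 * num_nodes * mapred.tasktracker.tasks.maximum = 0.95 * 6 * 2 ~ 11

hadoop dfsadmin -report                  # see how the blocks are spread over the datanodes
ls <hadoop.tmp.dir>/dfs/data/current     # or look directly on a datanode (substitute your hadoop.tmp.dir)
```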
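The node-reset steps from the list above, as a hedged sketch; <slave> and <hadoop.tmp.dir> are placeholders for your own machine name and the directory configured in hadoop-site.xml:

```
# From the master: stop the cluster, clear the problematic slave's hadoop.tmp.dir, then restart
bin/stop-all.sh
ssh <slave> 'rm -rf <hadoop.tmp.dir>'
bin/start-all.sh

# If the master's hadoop.tmp.dir has to be deleted as well, all data on DFS is lost
# and the namenode must be re-formatted before restarting:
bin/hadoop namenode -format
```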
In general, whenever a problem occurs, or when starting the cluster, it is best to check the logs so you know what is going on.
Summary of commands in hadoop
This part can be understood from the commands' own help output; here I will mainly introduce the few commands I use. The hadoop dfs command, followed by a parameter, performs HDFS operations in a way that is quite similar to Linux commands, for example (a few concrete invocations follow the list below):
- hadoop dfs -ls lists the contents of a directory such as /usr/root; by default, if no path is specified, the current user's path is used;
- hadoop dfs -rmr xxx deletes a directory; many of the commands are easy to pick up;
- hadoop dfsadmin -report gives you a global view of the datanodes;
- hadoop job followed by parameters operates on the currently running jobs, for example list and kill;
- hadoop balancer is the command, mentioned earlier, for balancing disk load across the datanodes.
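A few concrete invocations of the commands above; "test-out" is just an example directory name and the job id is made up for illustration:

```
hadoop dfs -ls                           # list the current user's HDFS directory
hadoop dfs -rmr test-out                 # remove a directory
hadoop dfsadmin -report                  # global status of the datanodes
hadoop job -list                         # list currently running jobs
hadoop job -kill job_200807181717_0001   # kill a job by its id
hadoop balancer                          # rebalance blocks across the datanodes
```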
I will not go into the other commands in detail.
Related reading: Introduction to the Open-Source Distributed Computing Framework Hadoop (Part 1).
About the author: Chen Wenchu is an architect in the platform division of the R&D center of Alibaba Software. His current work involves the design and implementation of the Alibaba Software development platform's service framework (ASF) and of the Service Integration Platform (SIP). He claims no particular specialty; the only thing his work has improved so far is his ability and speed of learning. Personal blog: http://blog.csdn.net/cenwenchu79.
(Source: InfoQ)