There are already many tutorials on the Internet about configuring Hadoop, and with the instructions on the Hadoop homepage you can set up a Hadoop cluster across multiple machines. Here I record the problems I ran into while actually configuring and using Hadoop, some of which are peripheral to Hadoop itself, such as problems (and their solutions) that are easy to hit while configuring SSH, for your reference. I am currently installing Hadoop through Cygwin on Windows XP, and the version is 0.17.2.1.
1. Install Cygwin. Download the network installer at www.cygwin.com. I recommend explicitly selecting the OpenSSH component when choosing components. In some versions of Cygwin the diffutils package does not seem to be installed automatically, so you need to select it manually; otherwise a required package will be missing when you configure SSH.
2. The default prompt in the Cygwin console takes some getting used to. It is much better after running export PS1="\u@\W$ ".
3. sshd configuration is fairly simple; you can refer to this link. When ssh-host-config asks "Should privilege separation be used? (yes/no)", if you answer yes for security reasons you may encounter the error "Privilege Separation user sshd does not exist" when starting sshd; for a solution, refer to this link.
4. It took some time to get SSH to authenticate automatically with a key (certificate). I later suspected this was because two versions of Cygwin were installed on that machine, since the same setup went smoothly on another machine. When prompted for a passphrase after running ssh-keygen -t rsa, just press Enter twice. Copying the public key and the remaining steps are not described here.
5. Running the Hadoop WordCount program in Eclipse threw an exception: "javax.security.auth.login.LoginException: Login failed: CreateProcess: whoami error=2". The solution is to add c:\cygwin\bin to the system's PATH environment variable and then restart Eclipse so that the change takes effect.
6. If the Java heap size is not large enough when running WordCount in Eclipse, add -Xms200m to the run configuration's VM arguments to solve the problem. (Does Hadoop's hello world really need that much memory?)
7. This link discusses the case where the job to be run depends on third-party class libraries. However, I found no solution other than using the hadoop jar command on the command line; for example, in versions 0.17.2 and 0.18.1 I do not see anything like an addJar() method in the JobConf class, and if you separate multiple jar files with commas the files cannot be found. There seem to be two workarounds: a) copy the required third-party jar files into the JRE of each node machine (not tested yet); b) package the third-party jars together with your own classes into a single jar.
Update: I found another method on the Internet, implemented through DistributedCache. The original article appears to be slightly wrong; the correct way is to call the DistributedCache.addArchiveToClassPath() method. Note that the first parameter must be a relative path such as "/test/lib/my.jar", not an absolute path like "hdfs://192.168.0.5:47110/test/lib/my.jar". See the DistributedCache documentation for details.
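Here is a minimal sketch of that approach, assuming the jar has already been uploaded to /test/lib/my.jar on HDFS; the class name MyJob and the rest of the job setup are hypothetical placeholders.

import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyJob {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(MyJob.class);
        // Use a path relative to the default file system; a full
        // hdfs://host:port/... URI does not work here.
        DistributedCache.addArchiveToClassPath(new Path("/test/lib/my.jar"), conf);
        // ... set input/output paths, mapper and reducer classes as usual ...
        JobClient.runJob(conf);
    }
}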
8. How to debug a MapReduce program is clearly described in this link, and it is worth repeating here. If the files are stored in HDFS, you only need to call JobConf#set("mapred.job.tracker", "local"); if the files are also stored locally, you additionally need to call JobConf#set("fs.default.name", "local"). I usually keep the files in HDFS while debugging, because using local files means either changing parameters or changing code, which makes it hard to maintain both environments. The output of System.out.println() calls in the program can be found in the logs/userlogs directory under the Hadoop installation path on the datanode.
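For reference, a minimal sketch of these two settings, assuming your driver builds its own JobConf (the class name LocalDebug is a hypothetical placeholder):

import org.apache.hadoop.mapred.JobConf;

public class LocalDebug {
    public static JobConf localDebugConf() {
        JobConf conf = new JobConf(LocalDebug.class);
        // Run the whole job in-process so it can be stepped through in the IDE.
        conf.set("mapred.job.tracker", "local");
        // Uncomment only if the input lives on the local file system instead of HDFS.
        // conf.set("fs.default.name", "local");
        return conf;
    }
}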
9. When using a custom InputFormat, especially when using an EMF model element as the key, note that the XMI ID value cannot be obtained everywhere in the code. Specifically, it can be obtained in the WritableComparable#write() method (provided the object already belongs to a resource, i.e. eObj.eResource() != null), but it cannot be obtained in WritableComparable#readFields(), nor in the RecordWriter#write() method, because in the latter two cases the EMF element objects have been deserialized and are no longer the original instances in memory.
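A sketch of a key class illustrating this behavior, assuming the element comes from an XMI resource; the class name EmfElementKey and its fields are hypothetical.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;
import org.eclipse.emf.ecore.EObject;
import org.eclipse.emf.ecore.xmi.XMIResource;

public class EmfElementKey implements WritableComparable {
    private EObject eObj;      // set on the map side, before serialization
    private String xmiId = ""; // the only thing that actually crosses the wire

    public void write(DataOutput out) throws IOException {
        // Here the element still belongs to its original resource,
        // so the XMI ID is available (as long as eObj.eResource() != null).
        if (eObj != null && eObj.eResource() instanceof XMIResource) {
            xmiId = ((XMIResource) eObj.eResource()).getID(eObj);
        }
        out.writeUTF(xmiId == null ? "" : xmiId);
    }

    public void readFields(DataInput in) throws IOException {
        // After deserialization this is no longer the original in-memory
        // instance; only the ID written above is available, not eObj itself.
        xmiId = in.readUTF();
        eObj = null;
    }

    public int compareTo(Object o) {
        return xmiId.compareTo(((EmfElementKey) o).xmiId);
    }

    public int hashCode() {
        return xmiId.hashCode();
    }
}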
10. After map reached 100%, the reduce process stopped making progress at a certain value (such as 16%) until the task was forcibly killed by Hadoop. The tasknode log recorded the following:
11:17:06,455 INFO org.apache.hadoop.mapred.TaskTracker: task_200811191041_0015_r_000000_0 0.16666667% reduce > copy (6 of 12 at 0.00 MB/s) >
11:17:09,455 INFO org.apache.hadoop.mapred.TaskTracker: task_200811191041_0015_r_000000_0 0.16666667% reduce > copy (6 of 12 at 0.00 MB/s) >
11:17:15,455 INFO org.apache.hadoop.mapred.TaskTracker: task_200811191041_0015_r_000000_0 0.16666667% reduce > copy (6 of 12 at 0.00 MB/s) >
11:17:18,705 FATAL org.apache.hadoop.mapred.TaskTracker: Task: task_200811191041_0015_r_000000_0 - Killed due to Shuffle Failure: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
11:17:18,705 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: task_200811191041_0015_r_000000_0
11:17:18,705 INFO org.apache.hadoop.mapred.TaskRunner: task_200811191041_0015_r_000000_0 done; removing files.
11:17:18,705 WARN org.apache.hadoop.mapred.TaskTracker: Unknown child task finshed: task_200811191041_0015_r_000000_0. Ignored.
11:17:40,845 INFO org.apache.hadoop.mapred.TaskTracker: Received 'KillJobAction' for job: job_200811191041_0015
11:17:40,845 INFO org.apache.hadoop.mapred.TaskRunner: task_200811191041_0015_m_000011_0 done; removing files.
11:17:40,845 INFO org.apache.hadoop.mapred.TaskRunner: task_200811191041_0015_m_000005_0 done; removing files.
The following appeared in my Java application's console:
08/11/20 11:06:39 INFO mapred.JobClient: map 96% reduce 11%
08/11/20 11:06:40 INFO mapred.JobClient: map 100% reduce 11%
08/11/20 11:06:43 INFO mapred.JobClient: map 100% reduce 13%
08/11/20 11:06:47 INFO mapred.JobClient: map 100% reduce 16%
(stalled here for a long time)
08/11/20 11:17:12 INFO mapred.JobClient: map 100% reduce 0%
08/11/20 11:17:12 INFO mapred.JobClient: Task Id: task_200811191041_0015_r_000000_0, Status: FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
08/11/20 11:17:14 WARN mapred.JobClient: Error reading task outputnode2
08/11/20 11:17:14 WARN mapred.JobClient: Error reading task outputnode2
08/11/20 11:17:25 INFO mapred.JobClient: map 100% reduce 16%
08/11/20 11:17:30 INFO mapred.JobClient: map 100% reduce 25%
08/11/20 11:17:31 INFO mapred.JobClient: map 100% reduce 100%
08/11/20 11:17:32 INFO mapred.JobClient: Job complete: job_200811191041_0015
After some investigation, I found the problem: the parameter dfs.http.address was not configured on the machine where the secondary namenode runs. The default value of this parameter in the hadoop-default.xml is 0.0.0.0:50070, and it should be changed to the address of the machine where the namenode runs. Reference
11. Some Reference Links.
http://hayesdavis.net/2008/06/14/running-hadoop-on-windows/
http://hi.baidu.com/shirdrn/blog/category/Hadoop
http://www.ibm.com/developerworks/cn/opensource/os-cn-hadoop1/index.html
http://blog.ring.idv.tw/comment.ser?i=231