Import the Nutch source code into an Eclipse project and customize the Crawl task.
Download Source:
http://svn.apache.org/repos/asf/nutch/
Check out the desired Nutch source from SVN; nutch-1.1 is used here.
Compile the source code:
Compile the source with Ant (run ant in the source root). After a successful build you will see a new build directory, which contains the plugins directory and the nutch-1.1.job file.
New Web Project
Create a new Web project org.apache.nutch.web and do the following:
1. Copy the src/java directory of the Nutch source into the src directory of the Web project.
2. Copy the conf directory of the Nutch source into the src directory of the Web project.
3. Copy the lib directory of the Nutch source into the WEB-INF/lib directory of the Web project.
4. Copy the compiled plugins directory into the src directory of the Web project.
5. Create a new job directory under the Web project's src and copy the compiled nutch-1.1.job file into src/job.
6. Create a new test directory under the Web project's src and add a test class that calls Crawl's main():
package org.apache.nutch;

import org.apache.nutch.crawl.Crawl;

public class Main {
    public static void main(String[] args) {
        // Arguments: seed URL file, output directory (-dir), crawl depth
        // (-depth), and pages per level (-topN). The topN value was garbled
        // in the original; 50 is only a placeholder.
        String[] arg = {"/urls/url.txt", "-dir", "crawled", "-depth", "10", "-topN", "50"};
        try {
            Crawl.main(arg);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
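Run this class as a Java Application in Eclipse: -dir names the crawl output directory, -depth limits how many link levels are followed from the seed URLs, and -topN caps the number of pages fetched at each level.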
Note:
1. Nutch uses Hadoop to schedule its tasks, so before running it, edit the Hadoop configuration files in the conf directory: core-site.xml, hdfs-site.xml, mapred-site.xml, and so on (a sketch of the key properties follows these notes).
2. nutch-1.1 does not ship an HBase jar file; download it and set up the HBase configuration file separately. hbase-0.94.jar is used here.
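As a rough guide to what those configuration files carry, the sketch below sets the equivalent properties programmatically; the host and port values are assumptions for a single-node setup and must be replaced with your cluster's actual addresses.

import org.apache.hadoop.conf.Configuration;

public class ConfSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // core-site.xml: the NameNode address (assumed single-node value)
        conf.set("fs.default.name", "hdfs://localhost:9000");
        // hdfs-site.xml: replication factor (1 only makes sense on one node)
        conf.set("dfs.replication", "1");
        // mapred-site.xml: the JobTracker address (assumed value)
        conf.set("mapred.job.tracker", "localhost:9001");
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
    }
}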
FAQ:
The following is a collection of Hadoop, HBase, and ZooKeeper error logs and partial solutions gathered from the web, kept as a reference for when these problems come up later.
1. hadoop-0.20.2 & hbase-0.90.4: troubleshooting cluster startup errors.
The problem is as follows:

org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch. (client = ..., server = 41)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:364)
    at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:113)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:215)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:177)
This is caused by a version mismatch between hadoop-0.20.2 and hbase-0.90.4; the fix is to make client and server use the same hadoop-core jar. Here, hadoop-0.20.2-core.jar is replaced with the hadoop jar shipped under hbase\lib.
2. org.apache.hadoop.security.AccessControlException: Permission denied: user=pc2000, access=WRITE
Eclipse submits jobs through the Hadoop plugin, and by default it writes the job into HDFS as pc2000 (the computer name), which corresponds to /user/xxx on HDFS. Since the pc2000 user has no write permission on that Hadoop directory, the exception is thrown (one common workaround is sketched below).
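A minimal sketch of one common workaround on pre-security Hadoop 0.20.x: tell the client which identity to act as via the hadoop.job.ugi property. The user/group pair hadoop,supergroup is an assumption; use the account that actually owns the target HDFS directory. Alternatively, relax the directory's permissions with hadoop fs -chmod.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UgiWorkaround {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hadoop 0.20.x reads the acting user and its group from
        // hadoop.job.ugi; "hadoop,supergroup" is an assumed account.
        conf.set("hadoop.job.ugi", "hadoop,supergroup");
        FileSystem fs = FileSystem.get(conf);
        // Verify the home directory is now visible to this identity.
        System.out.println(fs.exists(new Path("/user/hadoop")));
    }
}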