1. Prepare the environment
The first step is to set up the development environment, which is not described in detail here.
The required environment is JDK 1.7, MyEclipse, SVN, and Ant, plus two MyEclipse plugins: Subclipse (http://subclipse.tigris.org/update_1.8.x) and IvyDE (http://www.apache.org/dist/ant/ivyde/updatesite).
2. Check out the project from SVN
Click Next through the remaining wizard pages.
Click Finish to complete the import.
3. Modify ivysettings.xml in the ivy directory: change the repository address to http://mirrors.ibiblio.org/maven2/ (this address is reachable; the other addresses I tried were not)
4. Modify ivy.xml in the ivy directory (add the Java packages needed for MySQL access)
Change the gora-core version to 0.2.1 and uncomment the gora-sql and mysql-connector-java dependencies.
5. cd to the project directory and run ant eclipse (running the Ant build directly inside Eclipse seems to cause problems)
6. Go back to the Eclipse project and refresh it; you will find that the directory structure has changed
7. There is still an error at this point, caused by the file encoding: right-click the project, then Properties -> Resource, and set the text file encoding to UTF-8
8. Right-click the project, then Build Path -> Configure Build Path -> Order and Export, select the conf folder and move it to the top
9. Modify gora.properties in the conf folder to configure MySQL
###############################
# Default MySQL properties
###############################
#gora.datastore.default=org.apache.gora.sql.store.SqlStore
gora.datastore.autocreateschema=true
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=utf8&autoReconnect=true&zeroDateTimeBehavior=convertToNull
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=123456
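A typo in a key name here only shows up later as a connection failure, so it can help to check the file before running the crawler. The following is a minimal sketch, not part of Nutch or Gora: the class name GoraPropertiesCheck and the helper missingKeys are hypothetical, and the list of required keys is simply the four JDBC keys from the properties above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class GoraPropertiesCheck {
    // The four JDBC keys used in the gora.properties of this step.
    static final String[] REQUIRED_KEYS = {
        "gora.sqlstore.jdbc.driver",
        "gora.sqlstore.jdbc.url",
        "gora.sqlstore.jdbc.user",
        "gora.sqlstore.jdbc.password"
    };

    // Hypothetical helper: returns the required keys that are absent or empty.
    static List<String> missingKeys(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // Reading from an in-memory string is not expected to fail.
            throw new IllegalStateException(e);
        }
        List<String> missing = new ArrayList<String>();
        for (String key : REQUIRED_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: run from the project root unless a path is given.
        String path = args.length > 0 ? args[0] : "conf/gora.properties";
        String text = new String(Files.readAllBytes(Paths.get(path)),
                StandardCharsets.ISO_8859_1);
        List<String> missing = missingKeys(text);
        System.out.println(missing.isEmpty()
                ? "All JDBC keys are set"
                : "Missing keys: " + missing);
    }
}
```

Lines that start with # are treated as comments by java.util.Properties, so a key that was left commented out is reported as missing.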
10. Create a new folder named urls under the project directory, create a new file named url inside it, and enter the root URL to crawl, such as http://www.qq.com
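If you prefer to create the seed file from code rather than by hand, a small sketch like the one below would do it. The class name SeedFileWriter and the helper writeSeeds are hypothetical names for illustration; only the urls folder, the url file name, and the example address come from the step above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class SeedFileWriter {
    // Hypothetical helper: writes one seed URL per line to <projectDir>/urls/url.
    static Path writeSeeds(Path projectDir, List<String> seedUrls) throws IOException {
        Path urlsDir = projectDir.resolve("urls");
        Files.createDirectories(urlsDir); // no error if the folder already exists
        Path seedFile = urlsDir.resolve("url");
        Files.write(seedFile, seedUrls, StandardCharsets.UTF_8);
        return seedFile;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the project root is the current directory unless given.
        Path projectDir = Paths.get(args.length > 0 ? args[0] : ".");
        Path seedFile = writeSeeds(projectDir, Arrays.asList("http://www.qq.com"));
        System.out.println("Wrote " + seedFile.toAbsolutePath());
    }
}
```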
11. Configure nutch-site.xml in the conf directory
<property>
  <name>http.agent.name</name>
  <value>YourNutchSpider</value>
</property>
<property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us,en-gb,en,zh-cn,zh-tw;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field. This allows selecting non-English language as the default one to retrieve. It's a useful setting for search engines build for certain national group.</description>
</property>
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other information is available</description>
</property>
<property>
  <name>plugin.folders</name>
  <value>src/plugin</value>
  <description>Directories where nutch plugins are located. Each element may be a relative or absolute path. If absolute, it is used as is. If relative, it is searched for on the classpath.</description>
</property>
<!-- to solve the null pointer problem in the Utf8 class -->
<property>
  <name>generate.batch.id</name>
  <value>*</value>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>The Gora DataStore class for storing and retrieving data. Currently the following stores are available: ....</description>
</property>
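Because a broken tag in nutch-site.xml is easy to miss, it can be worth parsing the file once and printing the values that matter for this setup. The sketch below uses only the JDK's DOM parser; the class NutchSiteReader and its helpers are hypothetical names, and it assumes the properties sit inside the file's usual configuration root element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class NutchSiteReader {
    // Hypothetical helper: collects <property> entries into a name -> value map.
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> result = new LinkedHashMap<String, String>();
            NodeList properties = doc.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                String name = childText(property, "name");
                if (name != null) {
                    result.put(name, childText(property, "value"));
                }
            }
            return result;
        } catch (Exception e) {
            // Malformed XML (for example a split or unclosed tag) ends up here.
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    // Returns the trimmed text of the first child element with the given tag, or null.
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: run from the project root unless a path is given.
        String path = args.length > 0 ? args[0] : "conf/nutch-site.xml";
        String xml = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
        Map<String, String> props = readProperties(xml);
        String[] keys = {"http.agent.name", "plugin.folders",
                "generate.batch.id", "storage.data.store.class"};
        for (String key : keys) {
            System.out.println(key + " = " + props.get(key));
        }
    }
}
```

A value printed as null means the property was not found, which usually points to a misspelled name or a damaged tag.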
12. After completing the steps above, create a run configuration: select org.apache.nutch.crawl.Crawler as the main class, set the program arguments to urls -depth 3 -topN 5, and set the VM arguments to -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
At this point the run fails with the error: Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-Administrator\mapred\staging\Administrator606301699\.staging to 0700
13. The error above generally occurs only under Windows. The usual fix is to find FileUtil.java under org.apache.hadoop.fs in the hadoop-core-1.2.0 source and modify the checkReturnValue method by commenting out its contents
private static void checkReturnValue(boolean rv, File p,
                                     FsPermission permission
                                     ) throws IOException {
  // if (!rv) {
  //   throw new IOException("Failed to set permissions of path: " + p +
  //                         " to " +
  //                         String.format("%04o", permission.toShort()));
  // }
}
Then compile the source into a jar and use it to replace hadoop-core-1.2.0.jar under the project's build/lib.
Another way is to take the class files compiled from FileUtil.java and replace the corresponding class files inside the jar, FileUtil.class and FileUtil$CygPathCommand.class (opening the jar with an archive tool is enough).
The modified, compiled file is attached: http://files.cnblogs.com/e-life/hadoop-core-1.2.0.rar
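The second approach, swapping the class files inside the jar, can also be scripted instead of done with an archive tool. The sketch below uses the JDK's zip file system provider (available since Java 7). JarClassReplacer and replaceEntry are hypothetical names, the directory holding the recompiled classes is an assumption passed as the first argument, and the entry paths are derived from the package org.apache.hadoop.fs mentioned above. Back up the original jar first.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Collections;

class JarClassReplacer {
    // Hypothetical helper: overwrites an existing jar entry with a file from disk.
    static void replaceEntry(Path jar, String entryName, Path replacement)
            throws IOException {
        URI jarUri = URI.create("jar:" + jar.toUri());
        // The jar is opened as a file system; closing it writes the changes back.
        try (FileSystem zipFs = FileSystems.newFileSystem(
                jarUri, Collections.<String, Object>emptyMap())) {
            Path target = zipFs.getPath("/" + entryName);
            if (!Files.exists(target)) {
                // Guard against typos: this sketch only replaces, it never adds.
                throw new NoSuchFileException(entryName);
            }
            Files.copy(replacement, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is the directory with the two recompiled class files.
        Path classesDir = Paths.get(args[0]);
        Path jar = Paths.get("build/lib/hadoop-core-1.2.0.jar");
        String pkg = "org/apache/hadoop/fs/";
        String[] names = {"FileUtil.class", "FileUtil$CygPathCommand.class"};
        for (String name : names) {
            replaceEntry(jar, pkg + name, classesDir.resolve(name));
            System.out.println("Replaced " + pkg + name);
        }
    }
}
```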
14. The next run should complete without problems