Basic environment: CentOS 6.5 (Linux), Nutch 2.2.1 source package, MySQL 5.5, Elasticsearch 1.1.1, JDK 1.7
1. Download the Nutch 2.2.1 source package from http://mirror.bjtu.edu.cn/apache/nutch/2.2.1/ and decompress it
2. Change the data store to MySQL
Edit ivy/ivy.xml under the Nutch root directory and uncomment the SQL and MySQL dependencies, which are commented out by default. Note that gora-core must be set to 0.2.1:
    <dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
    <!-- This gora-sql 0.1.1-incubating artifact is not compatible with gora-core 0.3;
         downgrade to gora-core 0.2.1 in order to use SQL as a backend. -->
    <dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default"/>
    <!-- Use MySQL as database with SQL as Gora store. -->
    <dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
3. Set the database connection URL, user name, and password in conf/gora.properties under the Nutch root directory, and comment out the original default settings
    #gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
    #gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
    #gora.sqlstore.jdbc.user=sa
    #gora.sqlstore.jdbc.password=

    ###############################
    # MySQL properties            #
    ###############################
    gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
    gora.sqlstore.jdbc.url=jdbc:mysql://ip:3306/nutch?useUnicode=true&characterEncoding=utf8&autoReconnect=true&zeroDateTimeBehavior=convertToNull
    gora.sqlstore.jdbc.user=user
    gora.sqlstore.jdbc.password=pwd
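Because commented-out and active settings sit side by side in gora.properties, it can help to list only the lines that are actually in effect. The following is a minimal sketch, assuming a POSIX shell with grep; the function name active_sqlstore_props is made up for illustration and is not part of Nutch or Gora.

```shell
#!/bin/sh
# Print only the active (uncommented) gora.sqlstore.* settings of a properties file.
# active_sqlstore_props is a hypothetical helper name used for this sketch.
active_sqlstore_props() {
    # Comment lines begin with '#', so anchoring the pattern at the key name
    # (after optional leading whitespace) skips them.
    grep -E '^[[:space:]]*gora\.sqlstore\.' "$1"
}

# Usage (path taken from step 3):
#   active_sqlstore_props conf/gora.properties
```

With the settings from step 3, only the four MySQL lines (driver, url, user, password) should be printed; if an HSQLDB line shows up as well, it has not been commented out.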
4. Edit conf/nutch-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>my spider</value>
      </property>
      <property>
        <name>http.accept.language</name>
        <value>ja-jp,zh-cn,en-us,en-gb,en;q=0.7,*;q=0.3</value>
      </property>
      <property>
        <name>parser.character.encoding.default</name>
        <value>utf-8</value>
        <description>The character encoding to fall back to when no other information is available</description>
      </property>
      <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.sql.store.SqlStore</value>
      </property>
      <property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
      </property>
    </configuration>
5. Compile the source code with ant
Run ant in the Nutch root directory. The end of the build output looks like this:
    job:
          [jar] Building jar: /home/hadoop/nutch221/build/apache-nutch-2.2.1.job

    runtime:
        [mkdir] Created dir: /home/hadoop/nutch221/runtime
        [mkdir] Created dir: /home/hadoop/nutch221/runtime/local
        [mkdir] Created dir: /home/hadoop/nutch221/runtime/deploy
         [copy] Copying 1 file to /home/hadoop/nutch221/runtime/deploy
         [copy] Copying 2 files to /home/hadoop/nutch221/runtime/deploy/bin
         [copy] Copying 1 file to /home/hadoop/nutch221/runtime/local/lib
         [copy] Copying 1 file to /home/hadoop/nutch221/runtime/local/lib/native
         [copy] Copying ... files to /home/hadoop/nutch221/runtime/local/conf
         [copy] Copying 2 files to /home/hadoop/nutch221/runtime/local/bin
         [copy] Copying ... files to /home/hadoop/nutch221/runtime/local/lib
         [copy] Copying 106 files to /home/hadoop/nutch221/runtime/local/plugins
         [copy] Copied 2 empty directories to 2 empty directories under /home/hadoop/nutch221/runtime/local/test

    BUILD SUCCESSFUL
    Total time: ... seconds

The build completed successfully.
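The build log above shows which directories ant creates under runtime/local. As a quick sanity check before crawling, one could confirm that they exist. This is only a sketch: the function name check_runtime_layout is hypothetical, and the list of directories is taken from the log rather than from any official checklist.

```shell
#!/bin/sh
# Check that the runtime/local tree produced by the build looks complete.
# check_runtime_layout is a made-up helper name for this sketch.
check_runtime_layout() {
    root=$1
    # The launcher script used in the later steps.
    [ -f "$root/runtime/local/bin/nutch" ] || { echo "missing bin/nutch"; return 1; }
    # Directories that appear in the [copy] lines of the build log.
    for sub in conf lib plugins; do
        [ -d "$root/runtime/local/$sub" ] || { echo "missing $sub"; return 1; }
    done
    echo "runtime/local looks complete"
}

# Usage (Nutch root directory taken from the build log):
#   check_runtime_layout /home/hadoop/nutch221
```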
6. Create the database and the webpage table
    CREATE DATABASE nutch DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;

    USE nutch;

    CREATE TABLE `webpage` (
      `id` varchar(767) CHARACTER SET latin1 NOT NULL,
      `headers` blob,
      `text` mediumtext DEFAULT NULL,
      `status` int(11) DEFAULT NULL,
      `markers` blob,
      `parseStatus` blob,
      `modifiedTime` bigint(20) DEFAULT NULL,
      `score` float DEFAULT NULL,
      `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
      `baseUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
      `content` mediumblob,
      `title` varchar(2048) DEFAULT NULL,
      `reprUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
      `fetchInterval` int(11) DEFAULT NULL,
      `prevFetchTime` bigint(20) DEFAULT NULL,
      `inlinks` mediumblob,
      `prevSignature` blob,
      `outlinks` mediumblob,
      `fetchTime` bigint(20) DEFAULT NULL,
      `retriesSinceFetch` int(11) DEFAULT NULL,
      `protocolStatus` blob,
      `signature` blob,
      `metadata` blob,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
7. Run the crawl:
    bin/nutch crawl urls -depth 3
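The urls argument of the crawl command is a seed directory holding text files with one URL per line, so it has to exist before the crawl starts. A minimal sketch of preparing it follows; the helper name make_seed_dir, the file name seed.txt, and the example URL are all illustrative choices, not something the original steps prescribe.

```shell
#!/bin/sh
# Create a seed directory containing one URL per line.
# make_seed_dir and the file name seed.txt are hypothetical choices for this sketch.
make_seed_dir() {
    dir=$1
    shift
    mkdir -p "$dir" || return 1
    # Write each remaining argument as one line of the seed file.
    printf '%s\n' "$@" > "$dir/seed.txt"
}

# Usage, assuming the commands are run from runtime/local (the URL is a placeholder):
#   make_seed_dir urls http://nutch.apache.org/
#   bin/nutch crawl urls -depth 3
```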
After the crawl finishes, you can view the crawled content in MySQL.
8. Perform the index operation:
    bin/nutch elasticindex clustername -all
Problem encountered: an exception occurred while performing step 7:
    $ nutch crawl urls -depth 3
    Exception in thread "main" java.lang.ClassNotFoundException: org.apache.gora.sql.store.SqlStore
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:190)
        at org.apache.nutch.storage.StorageUtils.getDataStoreClass(StorageUtils.java:89)
        at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:73)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:221)
        at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
        at org.apache.nutch.crawl.Crawler.run(Crawler.java:136)
        at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
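A ClassNotFoundException for org.apache.gora.sql.store.SqlStore means that the class, which ships in the gora-sql artifact, is not on the classpath at run time. One thing worth checking is whether a gora-sql jar actually reached the runtime lib directory after the build. The sketch below does only that; has_gora_sql_jar is a hypothetical helper, and the file name pattern assumes that the artifact name gora-sql appears in the jar's file name. It is a diagnostic aid, not the fix described in the references.

```shell
#!/bin/sh
# Report whether a directory contains a gora-sql jar.
# has_gora_sql_jar is a made-up helper name; the gora-sql*.jar pattern is an assumption.
has_gora_sql_jar() {
    for jar in "$1"/gora-sql*.jar; do
        # If the glob matches nothing, the loop sees the literal pattern,
        # which does not exist as a file, so the test below fails.
        [ -e "$jar" ] && return 0
    done
    return 1
}

# Usage (directory taken from the build output in step 5):
#   if has_gora_sql_jar runtime/local/lib; then
#       echo "gora-sql jar present"
#   else
#       echo "gora-sql jar missing: recheck ivy/ivy.xml and rebuild"
#   fi
```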
An online article (http://blog.sina.com.cn/s/blog_3c9872d00101p4f0.html) was consulted, but it did not solve the problem.
Official solution:
http://mail-archives.apache.org/mod_mbox/nutch-user/201307.mbox/%[email protected].com%3e
Reference articles:
Official information: http://nlp.solutions.asia/?p=362
https://issues.apache.org/jira/browse/NUTCH-1473
Nutch 2.2.1 + MySQL: fetching data