Nutch 2.2.1 + MySQL: Fetching Data

Source: Internet
Author: User

Basic environment: Linux (CentOS 6.5), Nutch 2.2.1 source package, MySQL 5.5, Elasticsearch 1.1.1, JDK 1.7

1. Download the Nutch 2.2.1 source package from http://mirror.bjtu.edu.cn/apache/nutch/2.2.1/ and decompress it

2. Change the data store to MySQL

Edit ivy/ivy.xml under the Nutch root directory. The MySQL-related dependencies are commented out by default; uncomment them, and downgrade gora-core to 0.2.1:

    <dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
    <!-- The gora-sql 0.1.1-incubating artifact is not compatible with gora-core 0.3;
         downgrade to gora-core 0.2.1 in order to use SQL as a backend. -->
    <dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default"/>
    <!-- Use MySQL as database with SQL as Gora store. -->
    <dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

3. Set the database connection URL, user name, and password in conf/gora.properties under the Nutch root directory, and comment out the original HSQLDB settings:
    #gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
    #gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
    #gora.sqlstore.jdbc.user=sa
    #gora.sqlstore.jdbc.password=

    ###############################
    # MySQL properties            #
    ###############################
    gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
    gora.sqlstore.jdbc.url=jdbc:mysql://ip:3306/nutch?useUnicode=true&characterEncoding=utf8&autoReconnect=true&zeroDateTimeBehavior=convertToNull
    gora.sqlstore.jdbc.user=user
    gora.sqlstore.jdbc.password=pwd
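
A quick sanity check can catch a MySQL setting that was left commented out in gora.properties. The following is only a sketch: the function name check_gora_props is hypothetical, and it assumes the four gora.sqlstore.jdbc.* keys listed in this step are the ones that need an active line.

```shell
# Hypothetical helper: report every gora.sqlstore.jdbc.* key that has no
# active (uncommented) line in the given properties file.
check_gora_props() {
    props_file="$1"
    missing=0
    for key in driver url user password; do
        # An active line starts with the key itself, not with '#'.
        if ! grep -q "^gora\.sqlstore\.jdbc\.${key}=" "$props_file"; then
            echo "missing or commented out: gora.sqlstore.jdbc.${key}"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (path is relative to the Nutch root directory):
# check_gora_props conf/gora.properties && echo "gora.properties looks complete"
```

The function returns 0 only when all four keys have an uncommented line; it does not verify that the values can actually reach the database.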

4. Edit conf/nutch-site.xml

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>my spider</value>
      </property>
      <property>
        <name>http.accept.language</name>
        <value>ja-jp,zh-cn,en-us,en-gb,en;q=0.7,*;q=0.3</value>
      </property>
      <property>
        <name>parser.character.encoding.default</name>
        <value>utf-8</value>
        <description>The character encoding to fall back to when no other information is available</description>
      </property>
      <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.sql.store.SqlStore</value>
      </property>
      <property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
      </property>
    </configuration>

5. Compile the source code with ant

Run ant in the Nutch root directory. The end of the build output looks like this:

    job:
          [jar] Building jar: /home/hadoop/nutch221/build/apache-nutch-2.2.1.job

    runtime:
        [mkdir] Created dir: /home/hadoop/nutch221/runtime
        [mkdir] Created dir: /home/hadoop/nutch221/runtime/local
        [mkdir] Created dir: /home/hadoop/nutch221/runtime/deploy
         [copy] Copying 1 file to /home/hadoop/nutch221/runtime/deploy
         [copy] Copying 2 files to /home/hadoop/nutch221/runtime/deploy/bin
         [copy] Copying 1 file to /home/hadoop/nutch221/runtime/local/lib
         [copy] Copying 1 file to /home/hadoop/nutch221/runtime/local/lib/native
         [copy] Copying ... files to /home/hadoop/nutch221/runtime/local/conf
         [copy] Copying 2 files to /home/hadoop/nutch221/runtime/local/bin
         [copy] Copying ... files to /home/hadoop/nutch221/runtime/local/lib
         [copy] Copying 106 files to /home/hadoop/nutch221/runtime/local/plugins
         [copy] Copied 2 empty directories to 2 empty directories under /home/hadoop/nutch221/runtime/local/test

    BUILD SUCCESSFUL
    Total time: ... seconds

The build completed successfully.

6. Create the database

    CREATE DATABASE nutch DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;

    -- Select the new database so the table is created inside it.
    USE nutch;

    CREATE TABLE `webpage` (
      `id` varchar(767) CHARACTER SET latin1 NOT NULL,
      `headers` blob,
      `text` mediumtext DEFAULT NULL,
      `status` int(11) DEFAULT NULL,
      `markers` blob,
      `parseStatus` blob,
      `modifiedTime` bigint(20) DEFAULT NULL,
      `score` float DEFAULT NULL,
      `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
      `baseUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
      `content` mediumblob,
      `title` varchar(2048) DEFAULT NULL,
      `reprUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
      `fetchInterval` int(11) DEFAULT NULL,
      `prevFetchTime` bigint(20) DEFAULT NULL,
      `inlinks` mediumblob,
      `prevSignature` blob,
      `outlinks` mediumblob,
      `fetchTime` bigint(20) DEFAULT NULL,
      `retriesSinceFetch` int(11) DEFAULT NULL,
      `protocolStatus` blob,
      `signature` blob,
      `metadata` blob,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

7. Run the crawl: bin/nutch crawl urls -depth 3
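
The urls argument in the crawl command names a directory of seed lists, which this walkthrough does not show being created. A minimal sketch, assuming a single seed file named seed.txt (a hypothetical name) and using http://nutch.apache.org/ purely as a placeholder URL:

```shell
# Create the seed directory that "bin/nutch crawl urls ..." reads from.
# Run this from the directory where the crawl command will be run.
mkdir -p urls

# One URL per line; the file name seed.txt and the URL are placeholders.
printf '%s\n' 'http://nutch.apache.org/' > urls/seed.txt

# Show what will be injected.
cat urls/seed.txt
```
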
After execution, you can view the crawled content in MySQL.

8. Run the index operation: bin/nutch elasticindex clustername -all

Problem encountered: an exception occurred while performing step 7:
    $ nutch crawl urls -depth 3
    Exception in thread "main" java.lang.ClassNotFoundException: org.apache.gora.sql.store.SqlStore
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:190)
        at org.apache.nutch.storage.StorageUtils.getDataStoreClass(StorageUtils.java:89)
        at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:73)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:221)
        at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
        at org.apache.nutch.crawl.Crawler.run(Crawler.java:136)
        at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
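
A ClassNotFoundException for org.apache.gora.sql.store.SqlStore means the class was not on the classpath when the job started. One thing worth checking is whether the build actually placed a gora-sql jar in the runtime's lib directory. The sketch below assumes the runtime/local/lib layout shown in the build output and that the jar's file name starts with gora-sql; the function name has_gora_sql_jar is hypothetical, and a missing jar is only one possible cause.

```shell
# Hypothetical helper: succeed if at least one gora-sql jar exists in the
# given lib directory, and print the matching file names.
has_gora_sql_jar() {
    lib_dir="$1"
    found=1
    for jar in "$lib_dir"/gora-sql*.jar; do
        # If the glob matched nothing, the pattern is left unexpanded
        # and the -e test fails.
        if [ -e "$jar" ]; then
            echo "$jar"
            found=0
        fi
    done
    return "$found"
}

# Usage (path taken from the build output in step 5):
# has_gora_sql_jar /home/hadoop/nutch221/runtime/local/lib || echo "gora-sql jar not found"
```

If the jar is absent, re-checking the ivy/ivy.xml edits from step 2 and rebuilding with ant is a reasonable next step.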

I consulted the following online material, but it did not solve the problem: http://blog.sina.com.cn/s/blog_3c9872d00101p4f0.html

Official solution:

http://mail-archives.apache.org/mod_mbox/nutch-user/201307.mbox/%[email protected].com%3e

Article reference:

Official information: http://nlp.solutions.asia/?p=362

https://issues.apache.org/jira/browse/NUTCH-1473
