Mac Nutch + MySQL Integration Notes

Source: Internet
Author: User
Tags java se

Purpose: automatically store the data crawled by the Nutch crawler engine in MySQL.

Part of the series: Nutch + Hadoop + HBase (MySQL) + Elasticsearch + PHP in practice

Mac MySQL Installation

No configuration is required: click Next through the installer and remember the password shown in the popup window.

Download: http://dev.mysql.com/downloads/mysql/

Nutch installation, configuration, and use

1. Download Nutch 2.3.1 from http://nutch.apache.org/downloads.html and extract it to a local installation directory, referred to below as ${nutch_home}.

2. Configure Nutch to support MySQL by editing the ${nutch_home}/ivy/ivy.xml file as follows:

1) Locate the following line and uncomment it

<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

2) Modify the following line

The default is

<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>

Change it to

<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>

3) Uncomment the following line

<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default"/>

Note: if the changes in 2) and 3) are skipped, the following exception is thrown:

Exception in thread "main" java.lang.ClassNotFoundException: org.apache.gora.sql.store.SqlStore
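Before building, the three ivy.xml edits can be double-checked with a small shell helper. The following is only a sketch under stated assumptions: the function names strip_xml_comments and check_ivy are made up for this illustration (they are not part of Nutch), and the patterns assume the attributes appear in the order shown above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Nutch.

# Print a file with XML comments removed, including comments that
# span several lines.
strip_xml_comments() {
  awk '
    {
      line = $0; out = ""
      while (length(line) > 0) {
        if (in_comment) {
          endpos = index(line, "-->")
          if (endpos == 0) { line = "" }
          else { line = substr(line, endpos + 3); in_comment = 0 }
        } else {
          startpos = index(line, "<!--")
          if (startpos == 0) { out = out line; line = "" }
          else {
            out = out substr(line, 1, startpos - 1)
            line = substr(line, startpos + 4)
            in_comment = 1
          }
        }
      }
      print out
    }' "$1"
}

# Report whether each of the three dependencies is active (uncommented).
# Returns non-zero if any of them is missing.
check_ivy() {
  active=$(strip_xml_comments "$1")
  status=0
  for pattern in \
    'name="mysql-connector-java"' \
    'name="gora-core" +rev="0\.2\.1"' \
    'name="gora-sql"'
  do
    if printf '%s\n' "$active" | grep -Eq "$pattern"; then
      echo "active:  $pattern"
    else
      echo "MISSING: $pattern"
      status=1
    fi
  done
  return "$status"
}
```

Assuming the functions are saved in a file and sourced, check_ivy "${nutch_home}/ivy/ivy.xml" prints one line per dependency, which makes a forgotten uncomment visible before the long build starts.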

3. Database connection configuration

Edit the ${nutch_home}/conf/gora.properties file, comment out the default database connection configuration, and add the following:

###############################
# MySQL properties            #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://192.168.58.1:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=

Fill in the address, user name, and password of the database you want to connect to.
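A quick way to confirm which connection settings are actually active (and not still commented out) is to read them back from the file. This is a minimal sketch; gora_prop and check_gora_props are hypothetical helper names, and only simple key=value lines are handled, not the full Java properties syntax.

```shell
#!/bin/sh
# Hypothetical helper: print the value of an active (uncommented) key in
# a .properties file. If the key occurs more than once, the last one wins.
# Exits non-zero when the key is not set.
gora_prop() {
  awk -v key="$2" '
    /^[ \t]*#/ { next }                 # skip comment lines
    {
      pos = index($0, "=")
      if (pos == 0) next                # not a key=value line
      k = substr($0, 1, pos - 1)
      gsub(/^[ \t]+|[ \t]+$/, "", k)    # trim the key
      if (k == key) { value = substr($0, pos + 1); found = 1 }
    }
    END { if (found) print value; else exit 1 }' "$1"
}

# Print the driver, URL, and user. The password is left out on purpose
# so that it does not end up in the terminal history or logs.
check_gora_props() {
  for key in driver url user; do
    if value=$(gora_prop "$1" "gora.sqlstore.jdbc.$key"); then
      echo "gora.sqlstore.jdbc.$key=$value"
    else
      echo "gora.sqlstore.jdbc.$key is not set"
    fi
  done
}
```

For example, check_gora_props "${nutch_home}/conf/gora.properties" should list the MySQL driver class and JDBC URL entered above.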

4. Modify the nutch-site configuration file

Add the following inside the configuration node of ${nutch_home}/conf/nutch-site.xml:

<property>
  <name>http.agent.name</name>
  <value>liuxun Nutch Spider</value>
</property>
<property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field. This allows selecting a non-English language as the default one to retrieve. It is a useful setting for search engines built for a certain national group.</description>
</property>
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other information is available</description>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>The Gora DataStore class for storing and retrieving data. Currently the following stores are available: ....</description>
</property>
<!-- Special additions -->
<property>
  <name>generate.batch.id</name>
  <value>*</value>
</property>
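To confirm that a given property made it into nutch-site.xml, the value can be read back with a few lines of awk. This is a rough sketch, not an XML parser: site_prop is a hypothetical helper name, and it assumes at most one property per line, with each name and value element opening and closing on a single line.

```shell
#!/bin/sh
# Hypothetical helper: print the <value> that follows a given <name> in a
# Hadoop-style configuration file. Exits non-zero if the name is absent.
site_prop() {
  awk -v want="$2" '
    {
      # Remember the most recent property name seen.
      if (match($0, /<name>[^<]*<\/name>/)) {
        name = substr($0, RSTART + 6, RLENGTH - 13)
        gsub(/^[ \t]+|[ \t]+$/, "", name)
      }
      # A value belongs to the most recent name.
      if (match($0, /<value>[^<]*<\/value>/) && name == want) {
        value = substr($0, RSTART + 7, RLENGTH - 15)
        gsub(/^[ \t]+|[ \t]+$/, "", value)
        found = 1
      }
    }
    END { if (found) print value; else exit 1 }' "$1"
}
```

With the configuration above in place, site_prop "${nutch_home}/conf/nutch-site.xml" storage.data.store.class is expected to print the SqlStore class name.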

5. Compiling Nutch-2.3.1

    1. Enter the ${nutch_home} directory and run the ant command: ant runtime
    2. After the build succeeds, there will be a runtime directory under ${nutch_home}.
Compiling Nutch
apache-nutch-2.3.1 (master) $ ant
Buildfile: /Users/hackgyj/apache-nutch-2.3.1/build.xml
Trying to override old definition of task javac
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-probe-antlib:

ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-download-unchecked:

ivy-init-antlib:

ivy-init:

init:
    [mkdir] Created dir: /Users/hackgyj/apache-nutch-2.3.1/build
    [mkdir] Created dir: /Users/hackgyj/apache-nutch-2.3.1/build/classes
    [mkdir] Created dir: /Users/hackgyj/apache-nutch-2.3.1/build/release
    [mkdir] Created dir: /Users/hackgyj/apache-nutch-2.3.1/build/test
    [mkdir] Created dir: /Users/hackgyj/apache-nutch-2.3.1/build/test/classes

clean-lib:

resolve-default:
[ivy:resolve] :: Apache Ivy 2.3.0-201301101427 :: http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = /Users/hackgyj/apache-nutch-2.3.1/ivy/ivysettings.xml

To fix the error above, download the Sonar jar package (sonar-ant-task-2.2.jar) and put it into the lib folder inside the extracted apache-nutch-2.3.1 folder. Because the build downloads resources over the network, it takes a while, depending on the connection; it took me about an hour.

Then run on the command line:

ant clean

and run again:

ant runtime

This time the build succeeds without errors. Two new folders appear in the directory, build and runtime; runtime holds the compiled output.
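The clean-and-rebuild sequence can be wrapped in a small script that also checks for the runtime folder afterwards. This is a sketch only: rebuild_nutch and the ANT override variable are hypothetical names introduced here, and the check for runtime/local assumes the local runtime layout used in the next step.

```shell
#!/bin/sh
# Hypothetical wrapper around the two Ant targets used above.
# ANT can be overridden to point at a specific Ant installation.
ANT=${ANT:-ant}

rebuild_nutch() {
  nutch_home=$1
  (
    cd "$nutch_home" || exit 1
    "$ANT" clean || exit 1      # remove the output of the failed build
    "$ANT" runtime || exit 1    # rebuild; Ivy downloads the dependencies
  ) || return 1
  # A successful build is expected to leave runtime/local behind.
  if [ -d "$nutch_home/runtime/local" ]; then
    echo "build ok: $nutch_home/runtime/local"
  else
    echo "build finished but runtime/local is missing" >&2
    return 1
  fi
}
```

Calling rebuild_nutch with the Nutch directory as its only argument stops at the first failing target, so a broken build is not mistaken for a finished one.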

6. Crawling web pages and configuration

    1. Enter the ${nutch_home}/runtime/local directory
    2. Set the sites to crawl

Run the following commands:

mkdir -p urls                                   # create a folder for the seed URLs
echo 'http://www.oschina.net/' > urls/seed.txt  # write the seed URL
bin/nutch crawl urls -depth 3 -topN 5           # start a crawl job

Error: JAVA_HOME is not set.

The message indicates that JAVA_HOME is not set.

On Mac OS X El Capitan 10.11.6, find and set $JAVA_HOME with the following commands:

~ (master) $ which java
/usr/bin/java
~ (master) $ ls -l /usr/bin/java
lrwxr-xr-x  1 root  wheel  74 Oct 20  2015 /usr/bin/java -> /System/Library/Frameworks/JavaVM.framework/Versions/Current/Commands/java
~ (master) $ ls -l /System/Library/Frameworks/JavaVM.framework/Versions
total 64
lrwxr-xr-x  1 root  wheel  10 Oct 20  2015 1.4 -> CurrentJDK
lrwxr-xr-x  1 root  wheel  10 Oct 20  2015 1.4.2 -> CurrentJDK
lrwxr-xr-x  1 root  wheel  10 Oct 20  2015 1.5 -> CurrentJDK
lrwxr-xr-x  1 root  wheel  10 Oct 20  2015 1.5.0 -> CurrentJDK
lrwxr-xr-x  1 root  wheel  10 Oct 20  2015 1.6 -> CurrentJDK
lrwxr-xr-x  1 root  wheel  10 Oct 20  2015 1.6.0 -> CurrentJDK
drwxr-xr-x  10 root  wheel  340 Oct  7 13:55 A
lrwxr-xr-x  1 root  wheel   1 Oct 20  2015 Current -> A
lrwxr-xr-x  1 root  wheel  52 Oct 20  2015 CurrentJDK -> /Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents
~ (master) $ java -version
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
~ (master) $ /usr/libexec/java_home -V
Matching Java Virtual Machines (3):
    1.8.0_91, x86_64: "Java SE 8" /Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home
    1.6.0_65-b14-468, x86_64: "Java SE 6" /Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
    1.6.0_65-b14-468, i386: "Java SE 6" /Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home

/Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home
# Open the user profile and add the line:
# export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home
~ (master) $ open ~/.profile
# Reload the user configuration after saving
~ (master) $ source ~/.profile
~ (master) $ echo $JAVA_HOME
/Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home
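The manual profile edit can also be scripted so that it is safe to run more than once. The sketch below uses a hypothetical helper name, add_java_home; it only looks for a line starting with "export JAVA_HOME=" and does not try to understand other ways a profile might set the variable.

```shell
#!/bin/sh
# Hypothetical helper: append an "export JAVA_HOME=..." line to a profile
# file unless the file already contains one.
add_java_home() {
  profile=$1
  java_home=$2
  if [ -f "$profile" ] && grep -q '^export JAVA_HOME=' "$profile"; then
    echo "JAVA_HOME already set in $profile"
    return 0
  fi
  printf 'export JAVA_HOME=%s\n' "$java_home" >> "$profile"
  echo "added JAVA_HOME to $profile"
}

# Example (macOS only): let /usr/libexec/java_home pick the Java 8 JDK.
# add_java_home "$HOME/.profile" "$(/usr/libexec/java_home -v 1.8)"
```

After running it, source ~/.profile (or open a new terminal) so the current shell picks up the variable.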
Command crawl is deprecated, please use bin/crawl instead

This message appears when running bin/nutch crawl urls -depth 3 -topN 5. After some searching, the reason is that Nutch 2.3.1 does not support this command.

From 1.7 and 2.2.1 onward, bin/nutch crawl is replaced by bin/crawl. The correct form is:

bin/crawl url/test 5

That runs, but then another problem comes up:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/avro/ipc/ByteBufferOutputStream
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:191)
    at org.apache.nutch.storage.StorageUtils.getDataStoreClass(StorageUtils.java:93)
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:77)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218)
    at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
Caused by: java.lang.ClassNotFoundException: org.apache.avro.ipc.ByteBufferOutputStream
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 9 more

At this point I was ready to give up. After more searching, the answer turned out to be that Nutch 2.3.1 does not support MySQL. What?!

The workarounds are:

    1. Either use a 2.2.x release or go back to Nutch 1.x
    2. Or replace MySQL with HBase for storage

My choice was to drop Nutch 2.3.1 and use Nutch 2.2.1. That cost me a lot of time!

If the following error occurs, search this article for "Special additions" to resolve it.

Exception in thread "main" java.lang.RuntimeException: job failed: name=generate: null, jobid=job_local200289520_0002
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:55)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:199)
    at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:152)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
Nutch 2.2.1 success
local (master) $ bin/nutch crawl urls -depth 3 -topN 5
InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records. Hit by time limit :0

Nutch 2.2.1 is installed and configured the same way as above. For details that differ, such as version-number mismatches, correct them according to the error messages; in the end the crawl succeeded.

After the Nutch command above runs, the content fetched by the crawler can be seen in MySQL.

Original address: http://www.bigdataway.net/node/502
