<div property="schema:text" class="field field--name-body field--type-text-with-summary field--label-hidden field__item"><p> Purpose: <em>Automatically store the data crawled by the Nutch crawler engine in MySQL</em> </p>
Part of the series: Nutch + Hadoop + HBase (MySQL) + Elasticsearch + PHP in practice
Installing MySQL on Mac
Download: http://dev.mysql.com/downloads/mysql/
No configuration is required: click Next through the installer and make a note of the password shown in the popup window.
Installing, configuring, and using Nutch
1. Download Nutch 2.3.1 from http://nutch.apache.org/downloads.html and extract it to a local installation directory, referred to below as ${nutch_home}.
2. Configure Nutch to support MySQL by modifying the ${nutch_home}/ivy/ivy.xml file as follows:
1) Locate the following line and uncomment it:
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
2) Modify the following line.
The default is:
<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
Change it to:
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
3) Uncomment the following line:
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default"/>
Note: if changes 2) and 3) are not made, the following exception is thrown:
Exception in thread "main" java.lang.ClassNotFoundException: org.apache.gora.sql.store.SqlStore
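A quick way to confirm that the three ivy.xml edits took effect is to strip XML comments from the file and look for each dependency in what remains. The following is only a sketch: the function names strip_xml_comments and dep_is_active are made up for illustration, and it assumes the artifact name appears as name="..." with no spaces around the equals sign.

```shell
#!/bin/sh
# Sketch (hypothetical helper names): check that a dependency in ivy.xml
# is active, i.e. not inside an XML comment.

# Print the file with everything between <!-- and --> removed,
# including comments that span several lines.
strip_xml_comments() {
    awk '
    {
        line = $0
        out = ""
        while (length(line) > 0) {
            if (in_comment) {
                close_at = index(line, "-->")
                if (close_at == 0) { line = ""; break }
                line = substr(line, close_at + 3)
                in_comment = 0
            } else {
                open_at = index(line, "<!--")
                if (open_at == 0) { out = out line; line = ""; break }
                out = out substr(line, 1, open_at - 1)
                line = substr(line, open_at + 4)
                in_comment = 1
            }
        }
        print out
    }' "$1"
}

# Succeeds (exit status 0) if the artifact is declared outside any comment.
# Usage: dep_is_active <path-to-ivy.xml> <artifact-name>
dep_is_active() {
    strip_xml_comments "$1" | grep -q "name=\"$2\""
}
```

For example, dep_is_active "${nutch_home}/ivy/ivy.xml" gora-sql should succeed once step 3) has been applied.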
3. Database connection configuration
Edit the ${nutch_home}/conf/gora.properties file, comment out the default database connection configuration, and add the following:
###############################
# MySQL properties            #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://192.168.58.1:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=
Fill in the address, user name, and password of the database you need to connect to.
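Before running a crawl it can help to confirm which MySQL host and port gora.properties actually points at. The sketch below pulls them out of the JDBC URL; the function name jdbc_host_port is hypothetical, and it assumes the URL has the jdbc:mysql://host:port/database form shown above with an explicit port.

```shell
#!/bin/sh
# Sketch (hypothetical helper): print "<host> <port>" taken from the active
# gora.sqlstore.jdbc.url line of a gora.properties file.
# Commented-out lines (starting with #) are ignored because the pattern is
# anchored at the start of the line.
# Usage: jdbc_host_port <path-to-gora.properties>
jdbc_host_port() {
    sed -n 's|^gora\.sqlstore\.jdbc\.url=jdbc:mysql://\([^:/]*\):\([0-9]*\)/.*|\1 \2|p' "$1"
}
```

The two fields can then be handed to a port probe such as nc -z to check that the server is reachable from the crawl machine.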
4. Modify the nutch-site configuration file
Add the following inside the configuration node in ${nutch_home}/conf/nutch-site.xml:
<property>
  <name>http.agent.name</name>
  <value>liuxun Nutch Spider</value>
</property>
<property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field. This allows selecting a non-English language as the default one to retrieve. It is a useful setting for search engines built for a certain national group.</description>
</property>
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other information is available</description>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>The Gora DataStore class for storing and retrieving data. Currently the following stores are available: ....</description>
</property>
<!-- Special addition -->
<property>
  <name>generate.batch.id</name>
  <value>*</value>
</property>
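To double-check which value Nutch will see for a given property, the file can be flattened to one line and searched. This is a rough sketch rather than a real XML parser: the function name nutch_property is hypothetical, and it assumes each value element follows its name element directly, as in the entries above.

```shell
#!/bin/sh
# Sketch (hypothetical helper): print the <value> of a named property from a
# Hadoop-style configuration file such as nutch-site.xml.
# Assumes <value> comes right after <name> and contains no nested tags.
# Usage: nutch_property <path-to-nutch-site.xml> <property-name>
nutch_property() {
    site_file=$1
    # Escape dots so that "a.b" does not also match "aXb".
    name_re=$(printf '%s' "$2" | sed 's/\./\\./g')
    # Join all lines so that multi-line <property> blocks can be matched.
    { tr '\n' ' ' < "$site_file"; echo; } |
        sed -n "s|.*<name>[[:space:]]*${name_re}[[:space:]]*</name>[[:space:]]*<value>[[:space:]]*\([^<]*\)</value>.*|\1|p" |
        sed 's/[[:space:]]*$//'
}
```

Running nutch_property "${nutch_home}/conf/nutch-site.xml" storage.data.store.class should print the SqlStore class name if step 4 was applied.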
5. Compiling Nutch 2.3.1
- Enter the ${nutch_home} directory and run the ant command: ant runtime
- After the compilation succeeds, there will be a runtime directory under ${nutch_home}.
Compiling Nutch
$ ant
Buildfile: /Users/hackgyj/apache-nutch-2.3.1/build.xml
Trying to override old definition of task javac
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-probe-antlib:
ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-download-unchecked:
ivy-init-antlib:
ivy-init:
init:
    [mkdir] Created dir: /Users/hackgyj/apache-nutch-2.3.1/build
    [mkdir] Created dir: /Users/hackgyj/apache-nutch-2.3.1/build/classes
    [mkdir] Created dir: /Users/hackgyj/apache-nutch-2.3.1/build/release
    [mkdir] Created dir: /Users/hackgyj/apache-nutch-2.3.1/build/test
    [mkdir] Created dir: /Users/hackgyj/apache-nutch-2.3.1/build/test/classes
clean-lib:
resolve-default:
[ivy:resolve] :: Apache Ivy 2.3.0 - 201301101427 :: http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = /Users/hackgyj/apache-nutch-2.3.1/ivy/ivysettings.xml
To fix the taskdef error above, download the Sonar jar (sonar-ant-task-2.2.jar) and put it in the lib folder inside the extracted apache-nutch-2.3.1 directory. Because the build has to download resources over the network, it takes some time depending on the connection; it took me about an hour!
Then run on the command line:
ant clean
And run again:
ant runtime
OK, no errors this time and the compilation succeeds. The directory now contains two additional folders, build and runtime, where runtime holds the compiled output.
6. Web page crawling and configuration
- Enter the ${nutch_home}/runtime/local directory
- Set the sites to crawl
Run the commands:
mkdir -p urls                                         # create a folder for the crawl seed URLs
echo 'http://www.oschina.net/' > urls/seed.txt        # write the seed URL to crawl
bin/nutch crawl urls -depth 3 -topN 5                 # start a crawl job
Error: JAVA_HOME is not set.
The message says that JAVA_HOME has not been set.
On Mac OS X El Capitan 10.11.6, find and set $JAVA_HOME with the following commands:
$ which java
/usr/bin/java
$ ls -l /usr/bin/java
lrwxr-xr-x  1 root  wheel  74 Oct 20  2015 /usr/bin/java -> /System/Library/Frameworks/JavaVM.framework/Versions/Current/Commands/java
$ ls -l /System/Library/Frameworks/JavaVM.framework/Versions
total 64
lrwxr-xr-x   1 root  wheel   10 Oct 20  2015 1.4 -> CurrentJDK
lrwxr-xr-x   1 root  wheel   10 Oct 20  2015 1.4.2 -> CurrentJDK
lrwxr-xr-x   1 root  wheel   10 Oct 20  2015 1.5 -> CurrentJDK
lrwxr-xr-x   1 root  wheel   10 Oct 20  2015 1.5.0 -> CurrentJDK
lrwxr-xr-x   1 root  wheel   10 Oct 20  2015 1.6 -> CurrentJDK
lrwxr-xr-x   1 root  wheel   10 Oct 20  2015 1.6.0 -> CurrentJDK
drwxr-xr-x  10 root  wheel  340 Oct  7 13:55 A
lrwxr-xr-x   1 root  wheel    1 Oct 20  2015 Current -> A
lrwxr-xr-x   1 root  wheel   52 Oct 20  2015 CurrentJDK -> /Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents
$ java -version
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
$ /usr/libexec/java_home -V
Matching Java Virtual Machines (3):
    1.8.0_91, x86_64: "Java SE 8" /Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home
    1.6.0_65-b14-468, x86_64: "Java SE 6" /Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
    1.6.0_65-b14-468, i386: "Java SE 6" /Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
/Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home
# Open the user profile and add the line:
#   export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home
$ open ~/.profile
# After saving, reload the user configuration
$ source ~/.profile
$ echo $JAVA_HOME
/Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home
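The manual profile edit above can also be scripted so that it is safe to repeat. This is a minimal sketch: the function name set_java_home_in_profile is made up for illustration, and it only adds the export line when that exact line is not already in the file.

```shell
#!/bin/sh
# Sketch (hypothetical helper): append "export JAVA_HOME=<path>" to a profile
# file unless that exact line is already present.
# Usage: set_java_home_in_profile <profile-file> <java-home-path>
set_java_home_in_profile() {
    profile=$1
    line="export JAVA_HOME=$2"
    # Make sure the file exists so grep does not fail on a missing file.
    touch "$profile"
    # -F fixed string, -x match the whole line, -q no output
    if ! grep -Fxq "$line" "$profile"; then
        printf '%s\n' "$line" >> "$profile"
    fi
}
```

On macOS the path could come from /usr/libexec/java_home -v 1.8, for example set_java_home_in_profile ~/.profile "$(/usr/libexec/java_home -v 1.8)", followed by source ~/.profile as in the session above.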
Command crawl is deprecated, please use bin/crawl instead
This error is shown when running bin/nutch crawl urls -depth 3 -topN 5. Searching showed that it is because Nutch 2.3.1 no longer supports this command.
In releases later than 1.7 and 2.2.1, bin/nutch crawl has been replaced by bin/crawl. The correct form is:
bin/crawl url/test 5
Okay, that runs, but then another problem comes up:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/avro/ipc/ByteBufferOutputStream
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:191)
    at org.apache.nutch.storage.StorageUtils.getDataStoreClass(StorageUtils.java:93)
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:77)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218)
    at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
Caused by: java.lang.ClassNotFoundException: org.apache.avro.ipc.ByteBufferOutputStream
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 9 more
Frustrating. After more searching, the answer turned out to be that Nutch 2.3.1 does not support MySQL. What?
The workarounds are:
- Either use a 2.2.x version or go back to a Nutch 1.x version
- Or replace MySQL with HBase for storage
My choice was to drop Nutch 2.3.1 and use Nutch 2.2.1. That cost me a lot of time!
If the following error occurs, search this article for "Special addition" to resolve it.
Exception in thread "main" java.lang.RuntimeException: job failed: name=generate: null, jobid=job_local200289520_0002
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:55)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:199)
    at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:152)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
Nutch 2.2.1 success
$ bin/nutch crawl urls -depth 3 -topN 5
InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records. Hit by time limit :0
Installing and configuring Nutch 2.2.1 works the same way as described above. Where details such as version numbers differ, correct them according to the error messages, and the crawl eventually succeeds.
After the Nutch command in the previous section has run, the crawled data can be seen in MySQL.
Original address: http://www.bigdataway.net/node/502
Mac Nutch+MySQL Integration Notes