This is an original article; if you reproduce it, please credit the source: http://blog.csdn.net/panjunbiao/article/details/12171147
Nutch installation reference documentation: http://wiki.apache.org/nutch/NutchTutorial
Install the required packages:
yum update
yum list java*
yum install java-1.7.0-openjdk-devel.x86_64
Locate the installation path for Java:
Reference: http://serverfault.com/questions/50883/what-is-the-value-of-java-home-for-centos
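If that page is unavailable, one generic way to find the real JDK directory (my addition, not from the original post) is to resolve the java symlink chain:
readlink -f $(which java)
# typically prints something like /usr/lib/jvm/java-1.7.0-openjdk-<version>/jre/bin/java;
# strip the trailing /jre/bin/java (or /bin/java) to get the JAVA_HOME directory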
Set JAVA_HOME:
Reference: http://www.cnblogs.com/zhoulf/archive/2013/02/04/2891608.html
vi /etc/profile
JAVA_HOME=/usr/lib/jvm/java
JRE_HOME=/usr/lib/jvm/java/jre
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
export JAVA_HOME JRE_HOME PATH CLASSPATH
Make the profile effective immediately:
source /etc/profile
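A quick sanity check that the variables are in effect (not part of the original post; the exact version string will vary):
echo $JAVA_HOME    # should print /usr/lib/jvm/java
java -version      # should report OpenJDK 1.7.0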
Download the Nutch binary package:
curl -O http://apache.fayea.com/apache-mirror/nutch/1.7/apache-nutch-1.7-bin.tar.gz
Unpack
tar -xvzf apache-nutch-1.7-bin.tar.gz
Do a test run:
cd apache-nutch-1.7
bin/nutch
A usage/help message is printed, indicating that the installation succeeded.
Edit conf/nutch-site.xml and set the agent name used in HTTP requests:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>friendly crawler</value>
  </property>
</configuration>
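To double-check that the override has been saved (a trivial check of my own; any equivalent works):
grep -A 1 "http.agent.name" conf/nutch-site.xml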
Create a seed folder:
mkdir -p urls
Run the first crawl:
bin/nutch crawl urls -dir crawl
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2013-09-29 12:01:30
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-09-29 12:01:33, elapsed: 00:00:03
Generator: starting at 2013-09-29 12:01:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
Since no seed URL has been set, the crawler exits without doing anything.
Write the seed URL to the file urls/seed.txt:
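The original post does not show the contents of the seed file at this point; judging from the crawl output further below, it presumably contains the 36kr.com home page, for example:
echo "http://www.36kr.com/" > urls/seed.txt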
Then edit conf/regex-urlfilter.txt so that only the seed domain passes the URL filter:
vi conf/regex-urlfilter.txt
# accept anything else
# +.
# Added by Panjunbiao
+36kr.com
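Depending on the Nutch build, the combined URL filters can be tested from the command line with the URLFilterChecker tool (my addition; the exact invocation may differ between versions):
echo "http://www.36kr.com/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
# a leading '+' in the output means the URL is accepted, '-' means it is rejected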
Run the crawler again and notice that the seed page gets skipped:
bin/nutch crawl urls -dir crawl
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2013-09-29 12:10:24
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-09-29 12:10:27, elapsed: 00:00:03
Generator: starting at 2013-09-29 12:10:27
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130929121029
Generator: finished at 2013-09-29 12:10:30, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-09-29 12:10:30
Fetcher: segment: crawl/segments/20130929121029
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit: 0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://www.36kr.com/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-09-29 12:10:32, elapsed: 00:00:02
ParseSegment: starting at 2013-09-29 12:10:32
ParseSegment: segment: crawl/segments/20130929121029
http://www.36kr.com/ skipped. Content of size 67099 was truncated to 59363
ParseSegment: finished at 2013-09-29 12:10:33, elapsed: 00:00:01
CrawlDb update: starting at 2013-09-29 12:10:33
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20130929121029]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-09-29 12:10:34, elapsed: 00:00:01
Generator: starting at 2013-09-29 12:10:34
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2013-09-29 12:10:35
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/root/apache-nutch-1.7/crawl/segments/20130929121029
LinkDb: finished at 2013-09-29 12:10:36, elapsed: 00:00:01
crawl finished: crawl
Why? A packet capture with tcpdump or Wireshark shows that the site returns its page content truncated, and by default Nutch does not parse truncated content; this has to be enabled explicitly. Edit conf/nutch-site.xml and add a parser.skip.truncated property (setting it to false tells the parser to process truncated pages instead of skipping them):
<property>
<name>parser.skip.truncated</name>
<value>false</value>
</property>
Reference: http://lucene.472066.n3.nabble.com/Content-Truncation-in-Nutch-2-1-MySQL-td4038888.html
After this change, run the crawl again; the page is now fetched and parsed normally:
bin/nutch crawl urls -dir crawl
Solr installation
Download the installation file:
curl -O http://mirrors.cnnic.cn/apache/lucene/solr/4.4.0/solr-4.4.0.tgz
tar -xvzf solr-4.4.0.tgz
cd solr-4.4.0/example
java -jar start.jar
Verify the Solr installation (assuming it is installed on the local machine):
http://localhost:8983/solr/
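If no browser is available on the server, a rough alternative is to query the CoreAdmin API with curl (response format varies between Solr versions):
curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"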
Integrating Nutch and Solr
vi /etc/profile
NUTCH_RUNTIME_HOME=/root/apache-nutch-1.7
APACHE_SOLR_HOME=/root/solr-4.4.0
export JAVA_HOME JRE_HOME PATH CLASSPATH NUTCH_RUNTIME_HOME APACHE_SOLR_HOME
source /etc/profile
mkdir ${APACHE_SOLR_HOME}/example/solr/conf
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
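Note: in the Solr 4.4 example distribution the active core's configuration usually lives under example/solr/collection1/conf, so if the copy above seems to have no effect, it may also be necessary to copy the schema there (an assumption on my part, not from the original post):
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/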
Restart Solr:
java -jar start.jar
To build an index:
bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr http://localhost:8983/solr/
Indexing fails with an error:
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : use authentication (default false)
    solr.auth : username for authentication
    solr.auth.password : password for authentication
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:81)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:65)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:155)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Check the SOLR log:
2859895 [qtp1478922764-16] INFO org.apache.solr.update.processor.LogUpdateProcessor - [collection1] webapp=/solr path=/update params={wt=javabin&version=2} {} 0 1
2859902 [qtp1478922764-16] ERROR org.apache.solr.core.SolrCore - org.apache.solr.common.SolrException: ERROR: [doc=http://www.36kr.com/] unknown field 'host'
    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:174)
    at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:73)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:210)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:556)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:692)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:435)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)