Apache Nutch 1.7 + Solr 4.4.0 installation notes on CentOS 6.4


This is an original article; when reposting, please credit the source: http://blog.csdn.net/panjunbiao/article/details/12171147

Nutch installation reference documentation: http://wiki.apache.org/nutch/NutchTutorial

Install the required packages:
yum update
yum list java*
yum install java-1.7.0-openjdk-devel.x86_64

Locate the installation path for Java:
Reference: http://serverfault.com/questions/50883/what-is-the-value-of-java-home-for-centos
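One way to find where the OpenJDK package actually lives on CentOS (a quick check of my own, not from the original notes):
readlink -f $(which java)        # resolves the /etc/alternatives symlink to the real JDK path
alternatives --display java      # lists the registered Java installations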
Set JAVA_HOME:
Reference: http://www.cnblogs.com/zhoulf/archive/2013/02/04/2891608.html

vi +/etc/profile

JAVA_HOME=/usr/lib/jvm/java
JRE_HOME=/usr/lib/jvm/java/jre
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
export JAVA_HOME JRE_HOME PATH CLASSPATH
Make the profile effective immediately:
source /etc/profile
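After sourcing the profile, a quick sanity check (my own addition, not in the original notes) confirms the variables are visible:
echo $JAVA_HOME                  # should print /usr/lib/jvm/java
$JAVA_HOME/bin/java -version
which javac                      # javac is provided by the -devel package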

Download the binary package:
curl -O http://apache.fayea.com/apache-mirror/nutch/1.7/apache-nutch-1.7-bin.tar.gz

Unpack it:
tar -xvzf apache-nutch-1.7-bin.tar.gz

Test the installation:
cd apache-nutch-1.7
bin/nutch
If a usage/help message appears, the installation was successful.

Modify conf/nutch-site.xml to set the agent name used in HTTP requests:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>friendly crawler</value>
</property>
</configuration>
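If libxml2's xmllint tool is available, the edited file can be checked for well-formedness (an optional extra, not in the original notes):
xmllint --noout conf/nutch-site.xml    # prints nothing if the XML is well-formed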

Create a seed folder:
mkdir -p urls

Run the first crawl:
bin/nutch crawl urls -dir crawl
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2013-09-29 12:01:30
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-09-29 12:01:33, elapsed: 00:00:03
Generator: starting at 2013-09-29 12:01:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
Since no seed URL has been set, the crawler exits without doing anything.

Write the seed URL to the file urls/seed.txt:
http://www.36kr.com/
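Equivalently, the seed file can be created from the shell (using the urls directory created in the previous step):
echo "http://www.36kr.com/" > urls/seed.txt
cat urls/seed.txt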
Then edit conf/regex-urlfilter.txt: comment out the default catch-all rule (+.) and accept only the seed site:
vi conf/regex-urlfilter.txt
# Accept anything else
# +.

# Added by Panjunbiao
+36kr.com
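To double-check which filter rules remain active after the edit, strip the comments and blank lines (an optional check of my own):
grep -v '^#' conf/regex-urlfilter.txt | grep -v '^$'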

Run the crawler again and notice that the seed page ends up being skipped:
bin/nutch crawl urls -dir crawl
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2013-09-29 12:10:24
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-09-29 12:10:27, elapsed: 00:00:03
Generator: starting at 2013-09-29 12:10:27
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130929121029
Generator: finished at 2013-09-29 12:10:30, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-09-29 12:10:30
Fetcher: segment: crawl/segments/20130929121029
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit : 0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://www.36kr.com/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-09-29 12:10:32, elapsed: 00:00:02
ParseSegment: starting at 2013-09-29 12:10:32
ParseSegment: segment: crawl/segments/20130929121029
http://www.36kr.com/ skipped. Content of size 67099 was truncated to 59363
ParseSegment: finished at 2013-09-29 12:10:33, elapsed: 00:00:01
CrawlDb update: starting at 2013-09-29 12:10:33
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20130929121029]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-09-29 12:10:34, elapsed: 00:00:01
Generator: starting at 2013-09-29 12:10:34
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2013-09-29 12:10:35
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/root/apache-nutch-1.7/crawl/segments/20130929121029
LinkDb: finished at 2013-09-29 12:10:36, elapsed: 00:00:01
crawl finished: crawl
Why was the page skipped? Capturing the traffic with tcpdump or Wireshark shows that the site returns the page in a way that Nutch stores as truncated content, and with its default settings Nutch refuses to parse truncated content. To parse it anyway, modify conf/nutch-site.xml and add a parser.skip.truncated property:
<property>
<name>parser.skip.truncated</name>
<value>false</value>
</property>
Reference: http://lucene.472066.n3.nabble.com/Content-Truncation-in-Nutch-2-1-MySQL-td4038888.html
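If your bin/nutch usage output lists a parsechecker command, it gives a quick way to confirm the page now parses without re-running the whole crawl (my own suggestion, not part of the original notes):
bin/nutch parsechecker http://www.36kr.com/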

After this change, run the crawl again; this time the page is crawled normally:
bin/nutch crawl urls -dir crawl
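To confirm that the page actually made it into the crawl database, the readdb tool can print statistics (my own addition, not in the original notes):
bin/nutch readdb crawl/crawldb -stats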

Solr installation
Download the installation file:
curl -O http://mirrors.cnnic.cn/apache/lucene/solr/4.4.0/solr-4.4.0.tgz

tar -xvzf solr-4.4.0.tgz

cd solr-4.4.0/example
java -jar start.jar

Verify the Solr installation (assuming it runs on the local machine) by opening:
http://localhost:8983/solr/
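On a server without a browser, roughly the same check can be done with curl against the core admin API and the stock collection1 core (a sketch assuming the default Solr 4.4 example setup, not part of the original notes):
curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"
curl "http://localhost:8983/solr/collection1/select?q=*:*&wt=json&rows=0"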

Integrate Nutch and Solr
vi +/etc/profile

NUTCH_RUNTIME_HOME=/root/apache-nutch-1.7
APACHE_SOLR_HOME=/root/solr-4.4.0
export JAVA_HOME JRE_HOME PATH CLASSPATH NUTCH_RUNTIME_HOME APACHE_SOLR_HOME

source /etc/profile
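A quick check that the new variables are exported (my own addition):
echo $NUTCH_RUNTIME_HOME
echo $APACHE_SOLR_HOME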

mkdir ${APACHE_SOLR_HOME}/example/solr/conf
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/

Restart Solr:
java -jar start.jar

Build the index:
bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr http://localhost:8983/solr/
Indexing fails with an error:
Active IndexWriters:
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication

Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:81)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:65)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:155)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Check the Solr log:
2859895 [qtp1478922764-16] INFO  org.apache.solr.update.processor.LogUpdateProcessor  [collection1] webapp=/solr path=/update params={wt=javabin&version=2} {} 0 1
2859902 [qtp1478922764-16] ERROR org.apache.solr.core.SolrCore  org.apache.solr.common.SolrException: ERROR: [doc=http://www.36kr.com/] unknown field 'host'
    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:174)
    at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:73)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:210)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:556)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:692)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:435)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)