Apache Nutch 1.7 + Solr 4.4.0 installation notes on CentOS 6.4


This is an original article; when reposting, please credit the source: http://blog.csdn.net/panjunbiao/article/details/12171147

Nutch installation reference documentation: http://wiki.apache.org/nutch/NutchTutorial

Install the required packages:
yum update
yum list java*
yum install java-1.7.0-openjdk-devel.x86_64

Locate the installation path for Java:
Reference: http://serverfault.com/questions/50883/what-is-the-value-of-java-home-for-centos
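One way to find where the OpenJDK package actually lives on CentOS (a quick check of my own, not from the original notes):
readlink -f $(which java)        # resolves the /etc/alternatives symlink to the real JDK path
alternatives --display java      # lists the registered Java installations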
Set JAVA_HOME:
Reference: http://www.cnblogs.com/zhoulf/archive/2013/02/04/2891608.html

vi +/etc/profile

JAVA_HOME=/usr/lib/jvm/java
JRE_HOME=/usr/lib/jvm/java/jre
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
export JAVA_HOME JRE_HOME PATH CLASSPATH
Make the profile effective immediately:
source /etc/profile
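After sourcing the profile, a quick sanity check (my own addition, not in the original notes) confirms the variables are visible:
echo $JAVA_HOME                  # should print /usr/lib/jvm/java
$JAVA_HOME/bin/java -version
which javac                      # javac is provided by the -devel package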

Download the binary package:
curl -O http://apache.fayea.com/apache-mirror/nutch/1.7/apache-nutch-1.7-bin.tar.gz

Unpack it:
tar -xvzf apache-nutch-1.7-bin.tar.gz

Test the installation:
cd apache-nutch-1.7
bin/nutch
If a usage/help message appears, the installation was successful.

Modify conf/nutch-site.xml to set the agent name used in HTTP requests:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>friendly crawler</value>
</property>
</configuration>
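If libxml2's xmllint tool is available, the edited file can be checked for well-formedness (an optional extra, not in the original notes):
xmllint --noout conf/nutch-site.xml    # prints nothing if the XML is well-formed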

Create a seed folder:
mkdir -p urls

Run the first crawl:
bin/nutch crawl urls -dir crawl
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2013-09-29 12:01:30
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-09-29 12:01:33, elapsed: 00:00:03
Generator: starting at 2013-09-29 12:01:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
Since no seed URL has been set, the crawler exits without doing anything.

Write the seed URL to the file urls/seed.txt:
http://www.36kr.com/
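Equivalently, the seed file can be created from the shell (using the urls directory created in the previous step):
echo "http://www.36kr.com/" > urls/seed.txt
cat urls/seed.txt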
Then edit conf/regex-urlfilter.txt: comment out the default catch-all rule (+.) and accept only the seed site:
vi conf/regex-urlfilter.txt
# Accept anything else
# +.

# Added by Panjunbiao
+36kr.com
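To double-check which filter rules remain active after the edit, strip the comments and blank lines (an optional check of my own):
grep -v '^#' conf/regex-urlfilter.txt | grep -v '^$'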

Run the crawler again and notice that the seed page ends up being skipped:
bin/nutch crawl urls -dir crawl
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2013-09-29 12:10:24
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-09-29 12:10:27, elapsed: 00:00:03
Generator: starting at 2013-09-29 12:10:27
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130929121029
Generator: finished at 2013-09-29 12:10:30, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-09-29 12:10:30
Fetcher: segment: crawl/segments/20130929121029
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit : 0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://www.36kr.com/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-09-29 12:10:32, elapsed: 00:00:02
ParseSegment: starting at 2013-09-29 12:10:32
ParseSegment: segment: crawl/segments/20130929121029
http://www.36kr.com/ skipped. Content of size 67099 was truncated to 59363
ParseSegment: finished at 2013-09-29 12:10:33, elapsed: 00:00:01
CrawlDb update: starting at 2013-09-29 12:10:33
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20130929121029]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-09-29 12:10:34, elapsed: 00:00:01
Generator: starting at 2013-09-29 12:10:34
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2013-09-29 12:10:35
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/root/apache-nutch-1.7/crawl/segments/20130929121029
LinkDb: finished at 2013-09-29 12:10:36, elapsed: 00:00:01
crawl finished: crawl
Why was the page skipped? Capturing the traffic with tcpdump or Wireshark shows that the site returns the page in a way that Nutch stores as truncated content, and with its default settings Nutch refuses to parse truncated content. To parse it anyway, modify conf/nutch-site.xml and add a parser.skip.truncated property:
<property>
<name>parser.skip.truncated</name>
<value>false</value>
</property>
Reference: http://lucene.472066.n3.nabble.com/Content-Truncation-in-Nutch-2-1-MySQL-td4038888.html
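If your bin/nutch usage output lists a parsechecker command, it gives a quick way to confirm the page now parses without re-running the whole crawl (my own suggestion, not part of the original notes):
bin/nutch parsechecker http://www.36kr.com/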

After this change, run the crawl again; this time the page is crawled normally:
bin/nutch crawl urls -dir crawl
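To confirm that the page actually made it into the crawl database, the readdb tool can print statistics (my own addition, not in the original notes):
bin/nutch readdb crawl/crawldb -stats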

Solr installation
Download the installation file:
curl -O http://mirrors.cnnic.cn/apache/lucene/solr/4.4.0/solr-4.4.0.tgz

tar -xvzf solr-4.4.0.tgz

cd solr-4.4.0/example
java -jar start.jar

Verify the Solr installation (assuming it runs on the local machine) by opening:
http://localhost:8983/solr/
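On a server without a browser, roughly the same check can be done with curl against the core admin API and the stock collection1 core (a sketch assuming the default Solr 4.4 example setup, not part of the original notes):
curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"
curl "http://localhost:8983/solr/collection1/select?q=*:*&wt=json&rows=0"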

Integrate Nutch and Solr
vi +/etc/profile

NUTCH_RUNTIME_HOME=/root/apache-nutch-1.7
APACHE_SOLR_HOME=/root/solr-4.4.0
export JAVA_HOME JRE_HOME PATH CLASSPATH NUTCH_RUNTIME_HOME APACHE_SOLR_HOME

source /etc/profile
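A quick check that the new variables are exported (my own addition):
echo $NUTCH_RUNTIME_HOME
echo $APACHE_SOLR_HOME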

mkdir ${APACHE_SOLR_HOME}/example/solr/conf
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/

Restart Solr:
java -jar start.jar

Build the index:
bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr http://localhost:8983/solr/
Indexing fails with an error:
Active IndexWriters:
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication

Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:81)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:65)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:155)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Check the Solr log:
2859895 [qtp1478922764-16] INFO  org.apache.solr.update.processor.LogUpdateProcessor  [collection1] webapp=/solr path=/update params={wt=javabin&version=2} {} 0 1
2859902 [qtp1478922764-16] ERROR org.apache.solr.core.SolrCore  org.apache.solr.common.SolrException: ERROR: [doc=http://www.36kr.com/] unknown field 'host'
    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:174)
    at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:73)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:210)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:556)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:692)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:435)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)