Original work by the author; please credit the source when reposting: http://blog.csdn.net/panjunbiao/article/details/12171147
Nutch Installation
Reference: http://wiki.apache.org/nutch/NutchTutorial
Install the required packages:
yum update
yum list java*
yum install java-1.7.0-openjdk-devel.x86_64
Find the Java installation path:
Reference: http://serverfault.com/questions/50883/what-is-the-value-of-java-home-for-centos
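If the link above is unavailable, the path can also be recovered directly from the java binary (a quick check; the resolved path varies by system and package version):
readlink -f $(which java)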
Set JAVA_HOME:
Reference: http://www.cnblogs.com/zhoulf/archive/2013/02/04/2891608.html
vi + /etc/profile
JAVA_HOME=/usr/lib/jvm/java
JRE_HOME=/usr/lib/jvm/java/jre
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
export JAVA_HOME JRE_HOME PATH CLASSPATH
Make the profile take effect immediately:
source /etc/profile
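To confirm the new variables are in effect (a quick sanity check):
echo $JAVA_HOME
java -version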
Download the binary package:
curl -O http://apache.fayea.com/apache-mirror/nutch/1.7/apache-nutch-1.7-bin.tar.gz
Unpack it:
tar -xvzf apache-nutch-1.7-bin.tar.gz
Verify that the launcher runs:
cd apache-nutch-1.7
bin/nutch
This prints the usage help, which means the installation succeeded.
Edit conf/nutch-site.xml to set the agent name sent in HTTP requests:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Friendly Crawler</value>
  </property>
</configuration>
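Optionally, the same file can declare http.robots.agents so the agent name above is matched first when robots.txt rules are evaluated; leaving it unset triggers the "should be listed first" warning that shows up in the fetch log further down. A sketch reusing the agent name configured above:
<property>
  <name>http.robots.agents</name>
  <value>Friendly Crawler,*</value>
</property>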
Create the seed directory:
mkdir -p urls
Run the first crawl job:
bin/nutch crawl urls -dir crawl
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2013-09-29 12:01:30
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-09-29 12:01:33, elapsed: 00:00:03
Generator: starting at 2013-09-29 12:01:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
Since no seed URLs have been configured yet, the crawler exits without doing anything.
Write the seed URL to urls/seed.txt (assuming http://www.36kr.com/, the site used throughout this walkthrough):
echo "http://www.36kr.com/" > urls/seed.txt
Then restrict the URL filter to this domain:
vi conf/regex-urlfilter.txt
# accept anything else
# +.
# added by panjunbiao
+36kr.com
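Before crawling again, the filter chain can be tested from the command line with Nutch's URLFilterChecker; the invocation below is from memory, so verify it against your build (a "+" prefix in the output means the URL is accepted, "-" means rejected):
echo "http://www.36kr.com/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined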
Run the crawler again; this time the seed page gets skipped:
bin/nutch crawl urls -dir crawl
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2013-09-29 12:10:24
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-09-29 12:10:27, elapsed: 00:00:03
Generator: starting at 2013-09-29 12:10:27
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130929121029
Generator: finished at 2013-09-29 12:10:30, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-09-29 12:10:30
Fetcher: segment: crawl/segments/20130929121029
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://www.36kr.com/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-09-29 12:10:32, elapsed: 00:00:02
ParseSegment: starting at 2013-09-29 12:10:32
ParseSegment: segment: crawl/segments/20130929121029
http://www.36kr.com/ skipped. Content of size 67099 was truncated to 59363
ParseSegment: finished at 2013-09-29 12:10:33, elapsed: 00:00:01
CrawlDb update: starting at 2013-09-29 12:10:33
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20130929121029]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-09-29 12:10:34, elapsed: 00:00:01
Generator: starting at 2013-09-29 12:10:34
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2013-09-29 12:10:35
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/root/apache-nutch-1.7/crawl/segments/20130929121029
LinkDb: finished at 2013-09-29 12:10:36, elapsed: 00:00:01
crawl finished: crawl
Why? A packet capture with tcpdump or wireshark shows that the site returns the page content in segments, so what Nutch fetches arrives truncated, and by default Nutch does not parse truncated content. This behavior has to be switched off: edit conf/nutch-site.xml and add a parser.skip.truncated property:
<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
</property>
Reference: http://lucene.472066.n3.nabble.com/Content-Truncation-in-Nutch-2-1-MySQL-td4038888.html
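A complementary approach is to raise the fetch content limit so pages of this size are not truncated in the first place; http.content.limit defaults to 65536 bytes, smaller than the 67099-byte page in the log above. For example, in the same conf/nutch-site.xml (262144 is an arbitrary choice):
<property>
  <name>http.content.limit</name>
  <value>262144</value>
</property>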
After setting parser.skip.truncated to false, run the crawl job again; the page is now fetched and parsed correctly:
bin/nutch crawl urls -dir crawl
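To confirm the fetch actually succeeded, the crawl database can be inspected with Nutch's readdb tool (the exact stats layout may differ between versions):
bin/nutch readdb crawl/crawldb -stats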
Solr Installation
Download the installation package:
curl -O http://mirrors.cnnic.cn/apache/lucene/solr/4.4.0/solr-4.4.0.tgz
tar -xvzf solr-4.4.0.tgz
cd solr-4.4.0/example
java -jar start.jar
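Note that start.jar runs in the foreground and holds the terminal; to keep working in the same shell, one option is to background it and redirect its output (a sketch):
nohup java -jar start.jar > solr.log 2>&1 &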
Verify the Solr installation by opening (assuming Solr runs on the local machine):
http://localhost:8983/solr/
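The same check can be scripted with curl against the ping handler of the default collection1 core (the path assumes the stock Solr 4.4 example configuration):
curl "http://localhost:8983/solr/collection1/admin/ping?wt=json"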
Integrating Nutch with Solr
vi + /etc/profile
NUTCH_RUNTIME_HOME=/root/apache-nutch-1.7
APACHE_SOLR_HOME=/root/solr-4.4.0
export JAVA_HOME JRE_HOME PATH CLASSPATH NUTCH_RUNTIME_HOME APACHE_SOLR_HOME
source /etc/profile
mkdir ${APACHE_SOLR_HOME}/example/solr/conf
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
Restart the Solr start program:
java -jar start.jar
Build the index (crawl to depth 2, keep at most the top 5 URLs per round, and push the documents to Solr):
bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr http://localhost:8983/solr/
Indexing fails:
Active IndexWriters :
SOLRIndexWriter
	solr.server.url : URL of the SOLR instance (mandatory)
	solr.commit.size : buffer size when sending to SOLR (default 1000)
	solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
	solr.auth : use authentication (default false)
	solr.auth.username : use authentication (default false)
	solr.auth : username for authentication
	solr.auth.password : password for authentication
Exception in thread "main" java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
	at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
	at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:81)
	at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:65)
	at org.apache.nutch.crawl.Crawl.run(Crawl.java:155)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Check the Solr log:
2859895 [qtp1478922764-16] INFO org.apache.solr.update.processor.LogUpdateProcessor ? [collection1] webapp=/solr path=/update params={wt=javabin&version=2} {} 0 1
2859902 [qtp1478922764-16] ERROR org.apache.solr.core.SolrCore ? org.apache.solr.common.SolrException: ERROR: [doc=http://www.36kr.com/] unknown field 'host'
	at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:174)
	at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:73)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:210)
	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:556)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:692)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:435)
	at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
	at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
	at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
	at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
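The unknown field 'host' error means Solr is still serving its stock example schema, i.e. the schema copied earlier never took effect: in the Solr 4.4 example the active core's configuration lives in example/solr/collection1/conf/, not example/solr/conf/. Nutch 1.7 also ships a Solr 4-specific schema, conf/schema-solr4.xml. A likely fix under those assumptions (restart Solr afterwards and rerun the indexing step):
cp ${NUTCH_RUNTIME_HOME}/conf/schema-solr4.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml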