Original work by the author; please credit the source when reposting: http://blog.csdn.net/panjunbiao/article/details/12171147
Nutch Installation
Reference: http://wiki.apache.org/nutch/NutchTutorial
Install the required packages:
yum update
yum list java*
yum install java-1.7.0-openjdk-devel.x86_64
Find the Java installation path:
Reference: http://serverfault.com/questions/50883/what-is-the-value-of-java-home-for-centos
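If the link above is unavailable, the path can also be recovered directly from the java binary (a quick check; the resolved path varies by system and package version):
readlink -f $(which java)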
Set JAVA_HOME:
Reference: http://www.cnblogs.com/zhoulf/archive/2013/02/04/2891608.html
vi + /etc/profile
JAVA_HOME=/usr/lib/jvm/java
JRE_HOME=/usr/lib/jvm/java/jre
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
export JAVA_HOME JRE_HOME PATH CLASSPATH
Make the profile take effect immediately:
source /etc/profile
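To confirm the new variables are in effect (a quick sanity check):
echo $JAVA_HOME
java -version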
Download the binary package:
curl -O http://apache.fayea.com/apache-mirror/nutch/1.7/apache-nutch-1.7-bin.tar.gz
Unpack it:
tar -xvzf apache-nutch-1.7-bin.tar.gz
Verify that the launcher runs:
cd apache-nutch-1.7
bin/nutch
This prints the usage help, which means the installation succeeded.
Edit conf/nutch-site.xml to set the agent name sent in HTTP requests:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Friendly Crawler</value>
  </property>
</configuration>
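Optionally, the same file can declare http.robots.agents so the agent name above is matched first when robots.txt rules are evaluated; leaving it unset triggers the "should be listed first" warning that shows up in the fetch log further down. A sketch reusing the agent name configured above:
<property>
  <name>http.robots.agents</name>
  <value>Friendly Crawler,*</value>
</property>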
Create the seed directory:
mkdir -p urls
Run the first crawl job:
bin/nutch crawl urls -dir crawl
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2013-09-29 12:01:30
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-09-29 12:01:33, elapsed: 00:00:03
Generator: starting at 2013-09-29 12:01:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
Since no seed URLs have been configured yet, the crawler exits without doing anything.
Write the seed URL to urls/seed.txt (assuming http://www.36kr.com/, the site used throughout this walkthrough):
echo "http://www.36kr.com/" > urls/seed.txt
Then restrict the URL filter to this domain:
vi conf/regex-urlfilter.txt
# accept anything else
# +.
# added by panjunbiao
+36kr.com
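Before crawling again, the filter chain can be tested from the command line with Nutch's URLFilterChecker; the invocation below is from memory, so verify it against your build (a "+" prefix in the output means the URL is accepted, "-" means rejected):
echo "http://www.36kr.com/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined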
Run the crawler again; this time the seed page gets skipped:
bin/nutch crawl urls -dir crawl
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2013-09-29 12:10:24
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-09-29 12:10:27, elapsed: 00:00:03
Generator: starting at 2013-09-29 12:10:27
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130929121029
Generator: finished at 2013-09-29 12:10:30, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-09-29 12:10:30
Fetcher: segment: crawl/segments/20130929121029
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://www.36kr.com/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-09-29 12:10:32, elapsed: 00:00:02
ParseSegment: starting at 2013-09-29 12:10:32
ParseSegment: segment: crawl/segments/20130929121029
http://www.36kr.com/ skipped. Content of size 67099 was truncated to 59363
ParseSegment: finished at 2013-09-29 12:10:33, elapsed: 00:00:01
CrawlDb update: starting at 2013-09-29 12:10:33
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20130929121029]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-09-29 12:10:34, elapsed: 00:00:01
Generator: starting at 2013-09-29 12:10:34
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2013-09-29 12:10:35
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/root/apache-nutch-1.7/crawl/segments/20130929121029
LinkDb: finished at 2013-09-29 12:10:36, elapsed: 00:00:01
crawl finished: crawl
Why? A packet capture with tcpdump or wireshark shows that the site returns the page content in segments, so what Nutch fetches arrives truncated, and by default Nutch does not parse truncated content. This behavior has to be switched off: edit conf/nutch-site.xml and add a parser.skip.truncated property:
<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
</property>
Reference: http://lucene.472066.n3.nabble.com/Content-Truncation-in-Nutch-2-1-MySQL-td4038888.html
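A complementary approach is to raise the fetch content limit so pages of this size are not truncated in the first place; http.content.limit defaults to 65536 bytes, smaller than the 67099-byte page in the log above. For example, in the same conf/nutch-site.xml (262144 is an arbitrary choice):
<property>
  <name>http.content.limit</name>
  <value>262144</value>
</property>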
After setting parser.skip.truncated to false, run the crawl job again; the page is now fetched and parsed correctly:
bin/nutch crawl urls -dir crawl
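To confirm the fetch actually succeeded, the crawl database can be inspected with Nutch's readdb tool (the exact stats layout may differ between versions):
bin/nutch readdb crawl/crawldb -stats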
Solr Installation
Download the installation package:
curl -O http://mirrors.cnnic.cn/apache/lucene/solr/4.4.0/solr-4.4.0.tgz
tar -xvzf solr-4.4.0.tgz
cd solr-4.4.0/example
java -jar start.jar
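Note that start.jar runs in the foreground and holds the terminal; to keep working in the same shell, one option is to background it and redirect its output (a sketch):
nohup java -jar start.jar > solr.log 2>&1 &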
Verify the Solr installation by opening (assuming Solr runs on the local machine):
http://localhost:8983/solr/
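The same check can be scripted with curl against the ping handler of the default collection1 core (the path assumes the stock Solr 4.4 example configuration):
curl "http://localhost:8983/solr/collection1/admin/ping?wt=json"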
Integrating Nutch with Solr
vi + /etc/profile
NUTCH_RUNTIME_HOME=/root/apache-nutch-1.7
APACHE_SOLR_HOME=/root/solr-4.4.0
export JAVA_HOME JRE_HOME PATH CLASSPATH NUTCH_RUNTIME_HOME APACHE_SOLR_HOME
source /etc/profile
mkdir ${APACHE_SOLR_HOME}/example/solr/conf
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
Restart the Solr start program:
java -jar start.jar
Build the index (crawl to depth 2, keep at most the top 5 URLs per round, and push the documents to Solr):
bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr http://localhost:8983/solr/
Indexing fails:
Active IndexWriters :
SOLRIndexWriter
	solr.server.url : URL of the SOLR instance (mandatory)
	solr.commit.size : buffer size when sending to SOLR (default 1000)
	solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
	solr.auth : use authentication (default false)
	solr.auth.username : use authentication (default false)
	solr.auth : username for authentication
	solr.auth.password : password for authentication
Exception in thread "main" java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
	at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
	at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:81)
	at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:65)
	at org.apache.nutch.crawl.Crawl.run(Crawl.java:155)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Check the Solr log:
2859895 [qtp1478922764-16] INFO org.apache.solr.update.processor.LogUpdateProcessor ? [collection1] webapp=/solr path=/update params={wt=javabin&version=2} {} 0 1
2859902 [qtp1478922764-16] ERROR org.apache.solr.core.SolrCore ? org.apache.solr.common.SolrException: ERROR: [doc=http://www.36kr.com/] unknown field 'host'
	at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:174)
	at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:73)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:210)
	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:556)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:692)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:435)
	at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
	at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
	at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
	at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
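The unknown field 'host' error means Solr is still serving its stock example schema, i.e. the schema copied earlier never took effect: in the Solr 4.4 example the active core's configuration lives in example/solr/collection1/conf/, not example/solr/conf/. Nutch 1.7 also ships a Solr 4-specific schema, conf/schema-solr4.xml. A likely fix under those assumptions (restart Solr afterwards and rerun the indexing step):
cp ${NUTCH_RUNTIME_HOME}/conf/schema-solr4.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml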