[Unfinished] Downloading web pages step by step with Nutch commands


This article is unfinished. Whether Nutch can really be used to download pages step by step in this way has not been verified.

1. Basic setup: build the environment

(1) Download the release package and extract it to /usr/search/apache-nutch-2.2.1/
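
For example, assuming you fetch the source release from the Apache archive (the URL below is one possible choice; any Apache mirror works):

# download and unpack the Nutch 2.2.1 source release
# (URL is an assumption; substitute your preferred mirror)
mkdir -p /usr/search
cd /usr/search
wget http://archive.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz
tar -zxvf apache-nutch-2.2.1-src.tar.gz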

(2) Build the runtime

 cd /usr/search/apache-nutch-2.2.1/

ant runtime
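
If the build succeeds, ant populates a runtime directory with two flavours (per the standard Nutch 2.x layout):

ls /usr/search/apache-nutch-2.2.1/runtime
# local  - run Nutch as a standalone local process (used throughout this article)
# deploy - job jar for running the same commands on an existing Hadoop cluster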

(3) Verify that Nutch is installed

[[email protected] apache-nutch-2.2.1]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
[[email protected] bin]# ./nutch 
Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

(4) Identify your crawler: vi /usr/search/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml and add the agent-name property

<property><name>http.agent.name</name><value>My Nutch Spider</value></property>
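
Note that this property must sit inside the file's <configuration> element; a minimal complete nutch-site.xml would therefore be:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>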

(5) Create seed.txt

 cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/

vi seed.txt

http://nutch.apache.org/
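
Note that the inject command used later takes a directory of seed files rather than a single file (its log reports "Injecting urlDir: urls"), so move seed.txt into a urls directory first; a small sketch:

# inject expects a directory; name it "urls" to match the
# "./bin/nutch inject urls" invocation used below
cd /usr/search/apache-nutch-2.2.1/runtime/local
mkdir -p urls
mv bin/seed.txt urls/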

(6) Edit the URL filter so the crawl stays on the target site:

 vi /usr/search/apache-nutch-2.2.1/conf/regex-urlfilter.txt

# accept anything else
+.

Change it to:

# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/
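
One caveat, based on how ant runtime assembles the local runtime: the build copies conf/ into runtime/local/conf/, so if you edit the top-level conf/ after building you should either rerun ant runtime or make the same change in the runtime copy:

vi /usr/search/apache-nutch-2.2.1/runtime/local/conf/regex-urlfilter.txt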


When a user invokes a crawling command in Apache Nutch 1.x, a CrawlDB is generated by Apache Nutch, which is nothing but a directory containing details about the crawl. In Apache Nutch 2.x, the CrawlDB is not present. Instead, Apache Nutch keeps all the crawling data directly in the database. In our case, we have used Apache HBase, so all crawling data would go inside Apache HBase.
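
As an aside, the inject log below actually shows org.apache.gora.memory.store.MemStore, i.e. the default in-memory store rather than HBase. To really persist the crawl in HBase you would point Gora at the HBase store; a sketch of the usual Nutch 2.x settings (check the property names against your own conf/gora.properties and nutch-default.xml):

# conf/gora.properties
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

# conf/nutch-site.xml
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
</property>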
2. InjectJob

[[email protected] local]# ./bin/nutch inject urls
InjectorJob: starting at 2014-07-07 14:15:21
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 2
Injector: finished at 2014-07-07 14:15:24, elapsed: 00:00:03
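
As an optional sanity check you can inspect what is now in the web table; the flags below follow the Nutch 2.x WebTableReader usage (run ./bin/nutch readdb with no arguments to confirm them on your build):

./bin/nutch readdb -stats
./bin/nutch readdb -dump ./webdb_dump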
3. GenerateJob

[[email protected] local]# ./bin/nutch generate
Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]
    -topN <N> - number of top URLs to be selected, default is Long.MAX_VALUE
    -crawlId <id> - the id to prefix the schemas to operate on,
                    (default: storage.crawl.id)
    -noFilter - do not activate the filter plugin to filter the url, default is true
    -noNorm - do not activate the normalizer plugin to normalize the url, default is true
    -adddays - Adds numDays to the current time to facilitate crawling urls already
               fetched sooner then db.fetch.interval.default. Default value is 0.
    -batchId - the batch id
----------------------
Please set the params.
[[email protected] local]# ./bin/nutch generate -topN 3
GeneratorJob: starting at 2014-07-07 14:22:55
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: topN: 3
GeneratorJob: finished at 2014-07-07 14:22:58, time elapsed: 00:00:03
GeneratorJob: generated batch id: 1404714175-1017128204
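
The batch id printed on the last line can be passed to the fetch stage in place of -all so that only this batch is fetched, for example:

./bin/nutch fetch 1404714175-1017128204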
4. FetcherJob

The job of the fetcher is to fetch the URLs that were generated by the GeneratorJob; it uses the input provided by the GeneratorJob. The following command will be used for the FetcherJob:
[[email protected] local]# bin/nutch fetch -all
FetcherJob: starting
FetcherJob: batchId: -all
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
-finishing thread FetcherThread1, activeThreads=0
-finishing thread FetcherThread2, activeThreads=0
-finishing thread FetcherThread3, activeThreads=0
-finishing thread FetcherThread4, activeThreads=0
-finishing thread FetcherThread5, activeThreads=0
-finishing thread FetcherThread6, activeThreads=0
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
Here I have passed -all as the input parameter, which tells the job to fetch all the URLs that were generated by the GeneratorJob. You can use different input parameters according to your needs.
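
For example, the thread count from the "FetcherJob: threads: 10" line can be raised with the fetcher's -threads option (20 here is an arbitrary value):

bin/nutch fetch -all -threads 20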
5. ParserJob

After the FetcherJob, the ParserJob parses the URLs that were fetched by the FetcherJob. The following command will be used for the ParserJob:
[[email protected] local]# bin/nutch parse -all
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: -all
ParserJob: success
Here I have used -all as the input parameter, which parses all the URLs fetched by the FetcherJob. You can use different input parameters according to your needs.
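
If a particular page fails to parse, the parsechecker command listed in the usage output above is a quick way to debug a single URL:

./bin/nutch parsechecker http://nutch.apache.org/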
6. DbUpdaterJob

Finally, the DbUpdaterJob updates the web table with the results of the parse (the "update web table after parsing" step from the command list above):

[[email protected] local]# ./bin/nutch updatedb
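
A whole crawl is just these four commands repeated; a minimal sketch of further rounds (the round count and -topN value are arbitrary choices for illustration):

cd /usr/search/apache-nutch-2.2.1/runtime/local
for i in 1 2 3; do
  ./bin/nutch generate -topN 3
  ./bin/nutch fetch -all
  ./bin/nutch parse -all
  ./bin/nutch updatedb
done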


