Note: this article is still incomplete. Whether Nutch can be used to download step by step is not yet known.
1. Basic operations: building the environment
(1) Download the release package and extract it to /usr/search/apache-nutch-2.2.1/
(2) Build the runtime
cd /usr/search/apache-nutch-2.2.1/
ant runtime
(3) Verify that Nutch is installed
[[email protected] apache-nutch-2.2.1]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
[[email protected] bin]# ./nutch
Usage: nutch COMMAND
where COMMAND is one of:
inject inject new urls into the database
hostinject creates or updates an existing host table from a text file
generate generate new batches to fetch from crawl db
fetch fetch URLs marked during generate
parse parse URLs marked during fetch
updatedb update web table after parsing
updatehostdb update host table after parsing
readdb read/dump records from page database
readhostdb display entries from the hostDB
elasticindex run the elasticsearch indexer
solrindex run the solr indexer on parsed batches
solrdedup remove duplicates from solr
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
plugin load a plugin and run one of its classes main()
nutchserver run a (local) Nutch server on a user defined port
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
(4) vi /usr/search/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml and add the crawl configuration
<property><name>http.agent.name</name><value>My Nutch Spider</value></property>
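For reference, the full nutch-site.xml with that property wrapped in the standard Hadoop-style <configuration> element would look roughly like this (a sketch; the agent name value is whatever you choose):

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Identifies this crawler to web servers; Nutch refuses to fetch without it -->
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
```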
(5) Create seed.txt
cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
vi seed.txt
http://nutch.apache.org/
(6) Modify the URL filter: vi /usr/search/apache-nutch-2.2.1/conf/regex-urlfilter.txt
Change
# accept anything else
+.
to
# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/
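A quick sanity check of the new accept rule, approximated with grep -E outside of Nutch (Nutch itself applies Java regexes, and note the unescaped dots in the pattern also match any character):

```shell
# Approximate check of the accept rule using grep -E (Nutch itself uses Java regex)
pattern='^http://([a-z0-9]*\.)*nutch.apache.org/'
echo "http://nutch.apache.org/" | grep -Eq "$pattern" && echo "accepted"
echo "http://www.example.com/"  | grep -Eq "$pattern" || echo "rejected"
```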
When a user invokes a crawling command in Apache Nutch 1.x, Nutch generates a CrawlDB, which is nothing but a directory containing details about the crawl. In Nutch 2.x, there is no CrawlDB; instead, Nutch keeps all of the crawling data directly in a database. In our case we use Apache HBase, so all crawling data goes into HBase.
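Which database Nutch 2.x writes to is controlled by Apache Gora. Assuming the HBase backend described above, conf/gora.properties would contain something like the following (a sketch; verify against your Gora and HBase versions):

```
# conf/gora.properties (sketch): select HBase as the default Gora datastore
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
```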
2. InjectJob
[[email protected] local]# ./bin/nutch inject urls
InjectorJob: starting at 2014-07-07 14:15:21
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 2
Injector: finished at 2014-07-07 14:15:24, elapsed: 00:00:03
3. GenerateJob
[[email protected] local]# ./bin/nutch generate
Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]
    -topN <N>     - number of top URLs to be selected, default is Long.MAX_VALUE
    -crawlId <id> - the id to prefix the schemas to operate on, (default: storage.crawl.id)
    -noFilter     - do not activate the filter plugin to filter the url, default is true
    -noNorm       - do not activate the normalizer plugin to normalize the url, default is true
    -adddays      - Adds numDays to the current time to facilitate crawling urls already fetched sooner then db.fetch.interval.default. Default value is 0.
    -batchId      - the batch id
----------------------
Please set the params.
[[email protected] local]# ./bin/nutch generate -topN 3
GeneratorJob: starting at 2014-07-07 14:22:55
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: topN: 3
GeneratorJob: finished at 2014-07-07 14:22:58, time elapsed: 00:00:03
GeneratorJob: generated batch id: 1404714175-1017128204
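The batch id printed at the end of the GeneratorJob output is needed by the later fetch and parse steps (unless you pass -all). A small sketch of pulling it out of the log with sed, using the log line format shown above:

```shell
# Extract the generated batch id from a GeneratorJob log line (format as shown above)
log_line="GeneratorJob: generated batch id: 1404714175-1017128204"
batch_id=$(printf '%s\n' "$log_line" | sed -n 's/.*generated batch id: //p')
echo "$batch_id"
```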
4. FetcherJob
The job of the fetcher is to fetch the URLs generated by the GeneratorJob; it uses the input provided by the GeneratorJob. The following command runs the FetcherJob:
[[email protected] local]# bin/nutch fetch -all
FetcherJob: starting
FetcherJob: batchId: -all
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
-finishing thread FetcherThread1, activeThreads=0
-finishing thread FetcherThread2, activeThreads=0
-finishing thread FetcherThread3, activeThreads=0
-finishing thread FetcherThread4, activeThreads=0
-finishing thread FetcherThread5, activeThreads=0
-finishing thread FetcherThread6, activeThreads=0
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
Here I have provided the -all input parameter, which means this job will fetch all the URLs generated by the GeneratorJob. You can use different input parameters according to your needs.
5. ParserJob
After the FetcherJob, the ParserJob parses the URLs fetched by the FetcherJob. The following command runs the ParserJob:
[[email protected] local]# bin/nutch parse -all
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: -all
ParserJob: success
[[email protected] local]#
I have used the -all input parameter, which parses all the URLs fetched by the FetcherJob. You can use different input parameters according to your needs.
6. DbUpdaterJob
[[email protected] local]# ./bin/nutch updatedb
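Putting the steps above together, one crawl round can be scripted roughly as follows (a sketch: the crawl_round function name, the urls directory, and the -topN value are assumptions based on this article, not Nutch's own crawl script; it assumes runtime/local as the working directory):

```shell
# Sketch of one Nutch 2.x crawl round, chaining the five jobs from this article
crawl_round() {
  ./bin/nutch inject urls                          # 1. seed the database
  batch=$(./bin/nutch generate -topN 3 \
          | sed -n 's/.*generated batch id: //p')  # 2. generate a batch, keep its id
  ./bin/nutch fetch "$batch"                       # 3. fetch the generated URLs
  ./bin/nutch parse "$batch"                       # 4. parse the fetched pages
  ./bin/nutch updatedb                             # 5. update the web table
}
```

Run additional rounds by calling crawl_round again; each round fetches newly discovered URLs.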