"Not perfect" use the Nutch command to progressively download Web pages

Source: Internet
Author: User
Tags: solr

This article is not perfect: whether you can actually use Nutch to step through a download this way is unknown.

1 Basic operation: setting up the environment

(1) Download the installation package and unzip it to /usr/search/apache-nutch-2.2.1/

(2) Build the runtime

cd /usr/search/apache-nutch-2.2.1/

ant runtime
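If the build succeeds, Ant creates a runtime directory with two deployments: local for running Nutch standalone (as this walkthrough does) and deploy for running on a Hadoop cluster. A quick check:

# ls /usr/search/apache-nutch-2.2.1/runtime/
deploy  local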

(3) Verify that the Nutch installation is complete

# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
# ./nutch
Usage: nutch COMMAND
where COMMAND is one of:
  inject         inject new urls into the database
  hostinject     creates or updates an existing host table from a text file
  generate       generate new batches to fetch from crawl db
  fetch          fetch URLs marked during generate
  parse          parse URLs marked during fetch
  updatedb       update web table after parsing
  updatehostdb   update host table after parsing
  readdb         read/dump records from page database
  readhostdb     display entries from the hostDB
  elasticindex   run the elasticsearch indexer
  solrindex      run the solr indexer on parsed batches
  solrdedup      remove duplicates from solr
  parsechecker   check the parser for a given url
  indexchecker   check the indexing filters for a given url
  plugin         load a plugin and run one of its classes main()
  nutchserver    run a (local) Nutch server on a user defined port
  junit          runs the given JUnit test
 or
  CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

(4) Edit /usr/search/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml and add the crawler's agent name:

<property>
  <name>http.agent.name</name>
  <value>my Nutch Spider</value>
</property>
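The property must sit inside the file's <configuration> element. For reference, a minimal nutch-site.xml would look like the sketch below; the agent name is the only setting this walkthrough strictly requires, and the comment is mine, not from the original post:

<?xml version="1.0"?>
<configuration>
  <!-- Identifies the crawler to the sites it fetches; Nutch refuses to fetch without it. -->
  <property>
    <name>http.agent.name</name>
    <value>my Nutch Spider</value>
  </property>
</configuration>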

(5) Create seed.txt

cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/

vi seed.txt

http://nutch.apache.org/
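One thing to watch: the inject command in section 2 below passes a directory named urls, not this seed.txt file in bin/. If you follow that convention, create a seed directory and put seed.txt in it first; a minimal sketch (the urls location is inferred from the inject log, not stated in the original post):

# mkdir /usr/search/apache-nutch-2.2.1/runtime/local/urls
# mv seed.txt /usr/search/apache-nutch-2.2.1/runtime/local/urls/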

(6) Modify the URL filter

vi /usr/search/apache-nutch-2.2.1/conf/regex-urlfilter.txt

Change

# accept anything else
+.

to

# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/
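Note that in a regular expression an unescaped dot matches any character, so the pattern above is looser than it looks. A stricter variant (my suggestion, not from the original post) escapes the literal dots:

+^http://([a-z0-9]*\.)*nutch\.apache\.org/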


When a user invokes a crawl command on Apache Nutch 1.x, Nutch generates a crawldb, which is nothing but a directory containing details about the crawl. In Apache Nutch 2.x the crawldb is not present; instead, Nutch keeps all the crawling data directly in a database. In our case we use Apache HBase, so all the crawling data would go into Apache HBase.
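For reference, Nutch 2.x selects its storage backend through Apache Gora. The original post does not show this configuration, so treat the following as a sketch of the usual HBase setup: pick the HBase store in conf/gora.properties and point storage.data.store.class at it in nutch-site.xml. (Note that the inject log in the next section actually reports org.apache.gora.memory.store.MemStore, Gora's in-memory default, so the run captured here was not persisting to HBase.)

# conf/gora.properties -- select HBase as the default Gora data store
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

And in nutch-site.xml:

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
</property>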
2 InjectJob

# ./bin/nutch inject urls
InjectorJob: starting at 2014-07-07 14:15:21
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 2
Injector: finished at 2014-07-07 14:15:24, elapsed: 00:00:03
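To confirm what actually landed in the page database before generating a batch, the readdb command from the listing in section 1 can summarize or dump it. A hedged sketch (the flag names follow Nutch 2.x's WebTableReader; run ./bin/nutch readdb without arguments to see the exact usage in your build):

# ./bin/nutch readdb -stats
# ./bin/nutch readdb -dump /tmp/webtable-dump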
3 GeneratorJob

# ./bin/nutch generate
Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]
    -topN <N>     - number of top URLs to be selected, default is Long.MAX_VALUE
    -crawlId <id> - the id to prefix the schemas to operate on,
                    (default: storage.crawl.id)
    -noFilter     - do not activate the filter plugin to filter the url, default is true
    -noNorm       - do not activate the normalizer plugin to normalize the url, default is true
    -adddays      - adds numDays to the current time to facilitate crawling urls already
                    fetched sooner then db.fetch.interval.default. Default value is 0.
    -batchId      - the batch id
----------------------
set the params.

# ./bin/nutch generate -topN 3
GeneratorJob: starting at 2014-07-07 14:22:55
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: topN: 3
GeneratorJob: finished at 2014-07-07 14:22:58, time elapsed: 00:00:03
GeneratorJob: generated batch id: 1404714175-1017128204
4 FetcherJob

The job of the fetcher is to fetch the URLs that were generated by the GeneratorJob; it uses the input provided by the GeneratorJob. The following command is used for the FetcherJob:
# ./bin/nutch fetch -all
FetcherJob: starting
FetcherJob: batchId: -all
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob: timelimit set for: -1
Using queue mode: byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit: 0
-finishing thread FetcherThread0, activeThreads=0
-finishing thread FetcherThread1, activeThreads=0
-finishing thread FetcherThread2, activeThreads=0
-finishing thread FetcherThread3, activeThreads=0
-finishing thread FetcherThread4, activeThreads=0
-finishing thread FetcherThread5, activeThreads=0
-finishing thread FetcherThread6, activeThreads=0
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
Here I have provided the input parameter -all, which means this job will fetch all the URLs that were generated by the GeneratorJob. You can use different input parameters according to your needs.
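Instead of -all, FetcherJob also accepts a single batch id as its first argument (in Nutch 2.x the usage is (<batchId> | -all | -resume)). A sketch reusing the batch id printed by the GeneratorJob above:

# ./bin/nutch fetch 1404714175-1017128204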
5 ParserJob

After the FetcherJob, the ParserJob's task is to parse the URLs that were fetched by the FetcherJob. The following command is used for the ParserJob:
# ./bin/nutch parse -all
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: -all
ParserJob: success
I have used the input parameter -all to parse all the URLs fetched by the FetcherJob. You can use different input parameters according to your needs.
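Like fetch, parse also accepts a specific batch id in place of -all, for example (reusing the batch id from section 3):

# ./bin/nutch parse 1404714175-1017128204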
6 DbUpdaterJob

Finally, the DbUpdaterJob folds the fetch and parse results back into the web table, so that newly discovered outlinks become candidates for the next generate round:

# ./bin/nutch updatedb
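Putting the walkthrough together: after a single inject, one generate/fetch/parse/updatedb round crawls one level of links, and repeating the round goes deeper. The sketch below strings the commands into a loop; the iteration count, the solrindex step, and the Solr URL are my assumptions (suggested by this page's solr tag and the command listing in section 1), not part of the original post:

#!/bin/sh
# Sketch: one-time inject, then repeated crawl rounds (Nutch 2.x local runtime).
cd /usr/search/apache-nutch-2.2.1/runtime/local

./bin/nutch inject urls                # load the seed URLs once

for round in 1 2 3; do                 # each round crawls one link level deeper
  ./bin/nutch generate -topN 3         # select the best-scoring URLs due for fetch
  ./bin/nutch fetch -all               # download the generated batch
  ./bin/nutch parse -all               # extract text and outlinks
  ./bin/nutch updatedb                 # fold results back into the web table
done

# Optional: index the parsed pages into Solr (assumed local Solr instance).
./bin/nutch solrindex http://localhost:8983/solr/ -all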


