"Not perfect" use the Nutch command to progressively download Web pages

Source: Internet
Author: User
Tags: solr

This article is not perfect: whether you can actually use Nutch to step through a download this way is unknown.

1 Basic operation: setting up the environment

(1) Download the installation package and unzip it to /usr/search/apache-nutch-2.2.1/

(2) Build the runtime

cd /usr/search/apache-nutch-2.2.1/

ant runtime
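If the build succeeds, Ant creates a runtime directory with two deployments: local for running Nutch standalone (as this walkthrough does) and deploy for running on a Hadoop cluster. A quick check:

# ls /usr/search/apache-nutch-2.2.1/runtime/
deploy  local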

(3) Verify that the Nutch installation is complete

# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
# ./nutch
Usage: nutch COMMAND
where COMMAND is one of:
  inject         inject new urls into the database
  hostinject     creates or updates an existing host table from a text file
  generate       generate new batches to fetch from crawl db
  fetch          fetch URLs marked during generate
  parse          parse URLs marked during fetch
  updatedb       update web table after parsing
  updatehostdb   update host table after parsing
  readdb         read/dump records from page database
  readhostdb     display entries from the hostDB
  elasticindex   run the elasticsearch indexer
  solrindex      run the solr indexer on parsed batches
  solrdedup      remove duplicates from solr
  parsechecker   check the parser for a given url
  indexchecker   check the indexing filters for a given url
  plugin         load a plugin and run one of its classes main()
  nutchserver    run a (local) Nutch server on a user defined port
  junit          runs the given JUnit test
 or
  CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

(4) Edit /usr/search/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml and add the crawler's agent name:

<property>
  <name>http.agent.name</name>
  <value>my Nutch Spider</value>
</property>
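The property must sit inside the file's <configuration> element. For reference, a minimal nutch-site.xml would look like the sketch below; the agent name is the only setting this walkthrough strictly requires, and the comment is mine, not from the original post:

<?xml version="1.0"?>
<configuration>
  <!-- Identifies the crawler to the sites it fetches; Nutch refuses to fetch without it. -->
  <property>
    <name>http.agent.name</name>
    <value>my Nutch Spider</value>
  </property>
</configuration>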

(5) Create seed.txt

cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/

vi seed.txt

http://nutch.apache.org/
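One thing to watch: the inject command in section 2 below passes a directory named urls, not this seed.txt file in bin/. If you follow that convention, create a seed directory and put seed.txt in it first; a minimal sketch (the urls location is inferred from the inject log, not stated in the original post):

# mkdir /usr/search/apache-nutch-2.2.1/runtime/local/urls
# mv seed.txt /usr/search/apache-nutch-2.2.1/runtime/local/urls/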

(6) Modify the URL filter

vi /usr/search/apache-nutch-2.2.1/conf/regex-urlfilter.txt

Change

# accept anything else
+.

to

# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/
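Note that in a regular expression an unescaped dot matches any character, so the pattern above is looser than it looks. A stricter variant (my suggestion, not from the original post) escapes the literal dots:

+^http://([a-z0-9]*\.)*nutch\.apache\.org/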


When a user invokes a crawl command on Apache Nutch 1.x, Nutch generates a crawldb, which is nothing but a directory containing details about the crawl. In Apache Nutch 2.x the crawldb is not present; instead, Nutch keeps all the crawling data directly in a database. In our case we use Apache HBase, so all the crawling data would go into Apache HBase.
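For reference, Nutch 2.x selects its storage backend through Apache Gora. The original post does not show this configuration, so treat the following as a sketch of the usual HBase setup: pick the HBase store in conf/gora.properties and point storage.data.store.class at it in nutch-site.xml. (Note that the inject log in the next section actually reports org.apache.gora.memory.store.MemStore, Gora's in-memory default, so the run captured here was not persisting to HBase.)

# conf/gora.properties -- select HBase as the default Gora data store
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

And in nutch-site.xml:

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
</property>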
2 InjectJob

# ./bin/nutch inject urls
InjectorJob: starting at 2014-07-07 14:15:21
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 2
Injector: finished at 2014-07-07 14:15:24, elapsed: 00:00:03
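To confirm what actually landed in the page database before generating a batch, the readdb command from the listing in section 1 can summarize or dump it. A hedged sketch (the flag names follow Nutch 2.x's WebTableReader; run ./bin/nutch readdb without arguments to see the exact usage in your build):

# ./bin/nutch readdb -stats
# ./bin/nutch readdb -dump /tmp/webtable-dump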
3 GeneratorJob

# ./bin/nutch generate
Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]
    -topN <N>     - number of top URLs to be selected, default is Long.MAX_VALUE
    -crawlId <id> - the id to prefix the schemas to operate on,
                    (default: storage.crawl.id)
    -noFilter     - do not activate the filter plugin to filter the url, default is true
    -noNorm       - do not activate the normalizer plugin to normalize the url, default is true
    -adddays      - adds numDays to the current time to facilitate crawling urls already
                    fetched sooner then db.fetch.interval.default. Default value is 0.
    -batchId      - the batch id
----------------------
set the params.

# ./bin/nutch generate -topN 3
GeneratorJob: starting at 2014-07-07 14:22:55
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: topN: 3
GeneratorJob: finished at 2014-07-07 14:22:58, time elapsed: 00:00:03
GeneratorJob: generated batch id: 1404714175-1017128204
4 FetcherJob

The job of the fetcher is to fetch the URLs that were generated by the GeneratorJob; it uses the input provided by the GeneratorJob. The following command is used for the FetcherJob:
# ./bin/nutch fetch -all
FetcherJob: starting
FetcherJob: batchId: -all
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob: timelimit set for: -1
Using queue mode: byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit: 0
-finishing thread FetcherThread0, activeThreads=0
-finishing thread FetcherThread1, activeThreads=0
-finishing thread FetcherThread2, activeThreads=0
-finishing thread FetcherThread3, activeThreads=0
-finishing thread FetcherThread4, activeThreads=0
-finishing thread FetcherThread5, activeThreads=0
-finishing thread FetcherThread6, activeThreads=0
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
Here I have provided the input parameter -all, which means this job will fetch all the URLs that were generated by the GeneratorJob. You can use different input parameters according to your needs.
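Instead of -all, FetcherJob also accepts a single batch id as its first argument (in Nutch 2.x the usage is (<batchId> | -all | -resume)). A sketch reusing the batch id printed by the GeneratorJob above:

# ./bin/nutch fetch 1404714175-1017128204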
5 ParserJob

After the FetcherJob, the ParserJob's task is to parse the URLs that were fetched by the FetcherJob. The following command is used for the ParserJob:
# ./bin/nutch parse -all
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: -all
ParserJob: success
I have used the input parameter -all to parse all the URLs fetched by the FetcherJob. You can use different input parameters according to your needs.
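Like fetch, parse also accepts a specific batch id in place of -all, for example (reusing the batch id from section 3):

# ./bin/nutch parse 1404714175-1017128204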
6 DbUpdaterJob

Finally, the DbUpdaterJob folds the fetch and parse results back into the web table, so that newly discovered outlinks become candidates for the next generate round:

# ./bin/nutch updatedb
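Putting the walkthrough together: after a single inject, one generate/fetch/parse/updatedb round crawls one level of links, and repeating the round goes deeper. The sketch below strings the commands into a loop; the iteration count, the solrindex step, and the Solr URL are my assumptions (suggested by this page's solr tag and the command listing in section 1), not part of the original post:

#!/bin/sh
# Sketch: one-time inject, then repeated crawl rounds (Nutch 2.x local runtime).
cd /usr/search/apache-nutch-2.2.1/runtime/local

./bin/nutch inject urls                # load the seed URLs once

for round in 1 2 3; do                 # each round crawls one link level deeper
  ./bin/nutch generate -topN 3         # select the best-scoring URLs due for fetch
  ./bin/nutch fetch -all               # download the generated batch
  ./bin/nutch parse -all               # extract text and outlinks
  ./bin/nutch updatedb                 # fold results back into the web table
done

# Optional: index the parsed pages into Solr (assumed local Solr instance).
./bin/nutch solrindex http://localhost:8983/solr/ -all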


