Nutch 1.3 Study Notes 1
-----------------------

Source: Internet
Author: User
Tags: solr
1. What is Nutch?

Nutch is an open-source web crawler. It is mainly used to collect web page data, parse it, and build indexes so that the collected pages can be queried through a search interface. Under the hood it uses Hadoop for distributed computation and storage, while indexing is handled by the Solr distributed index framework. Solr is an open-source full-text search framework, and it has been integrated as Nutch's indexing architecture since Nutch 1.3.

2. Where can I download the latest Nutch?

Download the latest Nutch 1.3 binary package and source code from the address below:
http://mirror.bjtu.edu.cn/apache//nutch/

3. How do I configure Nutch?

3.1 Decompress the downloaded package, then cd $HOME/nutch-1.3/runtime/local
3.2 Make the bin/nutch file executable: chmod +x bin/nutch
3.3 Set JAVA_HOME, e.g. export JAVA_HOME=/path/to/your/jdk

4. What preparations should I make before crawling?

4.1 Configure the http.agent.name property (in conf/nutch-site.xml) as follows:

<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
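
For reference, a minimal complete conf/nutch-site.xml carrying this property would look like the sketch below (the configuration element is the standard Hadoop-style wrapper Nutch expects):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>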

4.2 Create a directory for the seed URLs: mkdir -p urls

Create a file in this directory and write some URLs into it, such as

http://nutch.apache.org/
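
Putting 4.2 together as a sketch (the file name seed.txt is arbitrary; Nutch reads every file in this directory):

mkdir -p urls
echo "http://nutch.apache.org/" > urls/seed.txt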

4.3 Run the following command:

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

Here -depth 3 limits the crawl to three generate/fetch/update rounds starting from the seed URLs, and -topN 5 fetches at most five URLs per round. Note that no index is built by this command. If you want to index the crawled data in Solr as well, run the following command instead:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
5. What does the crawling process look like?

5.1 Initialize the crawldb and inject the initial URLs

bin/nutch inject
Usage: Injector <crawldb> <url_dir>



The output of running this command locally is as follows:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch inject db/crawldb urls/
Injector: starting at 2011-08-22 10:50:01
Injector: crawlDb: db/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-08-22 10:50:05, elapsed: 00:00:03
5.2 Generate new URLs

bin/nutch generate
Usage: Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]

The local output is as follows:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch generate db/crawldb db/segments
Generator: starting at 10:52:41
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: db/segments/20110822105243    // a new segment is created here
Generator: finished at 10:52:44, elapsed: 00:00:03
5.3 Fetch the URLs generated above

bin/nutch fetch
Usage: Fetcher <segment> [-threads n] [-noParsing]

Here is the local output:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch fetch db/segments/20110822105243/
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-08-22 10:56:07
Fetcher: segment: db/segments/20110822105243
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
fetching http://www.baidu.com/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-08-22 10:56:09, elapsed: 00:00:02

Let's take a look at the segment directory structure at this point:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ ls db/segments/20110822105243/
content  crawl_fetch  crawl_generate
5.4 Parse the results fetched above

bin/nutch parse
Usage: ParseSegment segment


Local output:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch parse db/segments/20110822105243/
ParseSegment: starting at 2011-08-22 10:58:19
ParseSegment: segment: db/segments/20110822105243
ParseSegment: finished at 2011-08-22 10:58:22, elapsed: 00:00:02


Let's take a look at the directory structure after parsing:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ ls db/segments/20110822105243/
content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text


Three parse directories have been added here: crawl_parse (outlink entries used to update the crawldb), parse_data (per-page metadata and outlinks), and parse_text (the extracted plain text).
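
To peek at what was parsed, Nutch's segment reader can dump a segment to plain text; a sketch (readseg_out is an arbitrary output directory, and the dump lands in a file named dump inside it):

bin/nutch readseg -dump db/segments/20110822105243 readseg_out
cat readseg_out/dump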


5.5 Update the crawl database (crawldb)

bin/nutch updatedb
Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]

Local output:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch updatedb db/crawldb/ -dir db/segments/
CrawlDb update: starting at 2011-08-22 11:00:09
CrawlDb update: db: db/crawldb
CrawlDb update: segments:
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-08-22 11:00:10, elapsed: 00:00:01
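
To verify the update, the crawldb can be inspected with Nutch's readdb tool; a sketch that prints counts by URL status:

bin/nutch readdb db/crawldb -stats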


At this point the command merges the fetch and parse results back into the crawl database (crawldb), which Nutch stores on the file system. Other crawlers make different choices here; Taobao's crawler, for example, is said to keep its link database in Redis, a key-value NoSQL database.
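
Purely as an illustration of that alternative (the key layout below is hypothetical, not Taobao's actual schema), such a link database entry could be kept as a Redis hash:

redis-cli HSET "crawldb:http://www.baidu.com/" status fetched
redis-cli HGET "crawldb:http://www.baidu.com/" status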

5.6 Compute the inverted links

bin/nutch invertlinks
Usage: LinkDb <linkdb> (-dir <segmentsDir> | <seg1> <seg2> ...) [-force] [-noNormalize] [-noFilter]


Local output:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch invertlinks db/linkdb -dir db/segments/
LinkDb: starting at 2011-08-22 11:02:49
LinkDb: linkdb: db/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/lemo/Workspace/java/Apache/Nutch/nutch-1.3/db/segments/20110822105243
LinkDb: finished at 2011-08-22 11:02:50, elapsed: 00:00:01


5.7 Use Solr to index the crawled content

bin/nutch solrindex
Usage: SolrIndexer <solr url> <crawldb> <linkdb> (<segment> ... | -dir <segments>)

The output on the Nutch side is as follows:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch solrindex http://127.0.0.1:8983/solr/ db/crawldb/ db/linkdb/ db/segments/*
SolrIndexer: starting at 2011-08-22 11:05:33
SolrIndexer: finished at 2011-08-22 11:05:35, elapsed: 00:00:02

Part of the Solr log output is as follows:

INFO: SolrDeletionPolicy.onInit: commits:num=1
    commit{dir=/home/lemo/Workspace/java/Apache/Solr/apache-solr-3.3.0/example/solr/data/index,segFN=segments_1,version=1314024228223,generation=1,filenames=[segments_1]
Aug 22, 2011 11:05:35 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1314024228223
Aug 22, 2011 11:05:35 AM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[http://www.baidu.com/]} 0 183
Aug 22, 2011 11:05:35 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=0 QTime=183
Aug 22, 2011 11:05:35 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)
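
Taken together, steps 5.1 through 5.7 are what the one-shot crawl command from section 4.3 automates. A sketch of one crawl round done by hand, assuming the db/ layout used throughout this section:

#!/bin/bash
# One manual crawl round, mirroring steps 5.1-5.7 above.
bin/nutch inject db/crawldb urls                   # 5.1 seed the crawldb
bin/nutch generate db/crawldb db/segments          # 5.2 generate a fetch list
SEGMENT=db/segments/$(ls db/segments | sort | tail -1)   # newest segment
bin/nutch fetch "$SEGMENT"                         # 5.3 fetch the pages
bin/nutch parse "$SEGMENT"                         # 5.4 parse them
bin/nutch updatedb db/crawldb "$SEGMENT"           # 5.5 merge results into the crawldb
bin/nutch invertlinks db/linkdb -dir db/segments   # 5.6 build the linkdb
bin/nutch solrindex http://localhost:8983/solr/ db/crawldb db/linkdb db/segments/*   # 5.7 index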

5.8 Query from the Solr client

Open the Solr admin page in a browser:

http://localhost:8983/solr/admin/

and query for the keyword "baidu". The output XML structure is:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">baidu</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <float name="boost">1.0660036</float>
      <str name="digest">7be5cfd6da4a058001300b21d7d96b0f</str>
      <str name="id">http://www.baidu.com/</str>
      <str name="segment">20110822105243</str>
      <str name="title">baidu, you will know</str>
      <date name="tstamp">2011-08-22T14:56:09.194Z</date>
      <str name="url">http://www.baidu.com/</str>
    </doc>
  </result>
</response>

Note that the page content itself is not returned. If you want the content field stored and displayed as well, change its definition in Solr's schema.xml to:

<field name="content" type="text" stored="true" indexed="true"/>
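
The same query can also be issued from the command line against Solr's standard select handler (assuming the default example port):

curl "http://localhost:8983/solr/select?q=baidu&indent=on"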

6. Reference

http://wiki.apache.org/nutch/RunningNutchAndSolr
