Nutch 1.3 Study Notes 1
-----------------------

Source: Internet
Author: User
Tags: solr
1. What is Nutch?

Nutch is an open-source web crawler. It is mainly used to collect web page data, parse it, and build indexes so that the collected pages can be queried through a search interface. Under the hood it uses Hadoop for distributed computation and storage, while indexing is handled by the Solr distributed index framework. Solr is an open-source full-text search framework, and it has been integrated as Nutch's indexing architecture since Nutch 1.3.

2. Where can I download the latest Nutch?

Download the latest Nutch 1.3 binary package and source code from the address below:
http://mirror.bjtu.edu.cn/apache//nutch/

3. How do I configure Nutch?

3.1 Decompress the downloaded package, then cd $HOME/nutch-1.3/runtime/local
3.2 Make the bin/nutch file executable: chmod +x bin/nutch
3.3 Set JAVA_HOME, e.g. export JAVA_HOME=/path/to/your/jdk

4. What preparations should I make before crawling?

4.1 Configure the http.agent.name property (in conf/nutch-site.xml) as follows:

<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
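
For reference, a minimal complete conf/nutch-site.xml carrying this property would look like the sketch below (the configuration element is the standard Hadoop-style wrapper Nutch expects):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>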

4.2 Create a directory for the seed URLs: mkdir -p urls

Create a file in this directory and write some URLs into it, such as

http://nutch.apache.org/
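
Putting 4.2 together as a sketch (the file name seed.txt is arbitrary; Nutch reads every file in this directory):

mkdir -p urls
echo "http://nutch.apache.org/" > urls/seed.txt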

4.3 Run the following command:

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

Here -depth 3 limits the crawl to three generate/fetch/update rounds starting from the seed URLs, and -topN 5 fetches at most five URLs per round. Note that no index is built by this command. If you want to index the crawled data in Solr as well, run the following command instead:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
5. What does the crawling process look like?

5.1 Initialize the crawldb and inject the initial URLs

bin/nutch inject
Usage: Injector <crawldb> <url_dir>



The output of running this command locally is as follows:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch inject db/crawldb urls/
Injector: starting at 2011-08-22 10:50:01
Injector: crawlDb: db/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-08-22 10:50:05, elapsed: 00:00:03
5.2 Generate new URLs

bin/nutch generate
Usage: Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]

The local output is as follows:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch generate db/crawldb db/segments
Generator: starting at 10:52:41
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: db/segments/20110822105243    // a new segment is created here
Generator: finished at 10:52:44, elapsed: 00:00:03
5.3 Fetch the URLs generated above

bin/nutch fetch
Usage: Fetcher <segment> [-threads n] [-noParsing]

Here is the local output:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch fetch db/segments/20110822105243/
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-08-22 10:56:07
Fetcher: segment: db/segments/20110822105243
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
fetching http://www.baidu.com/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-08-22 10:56:09, elapsed: 00:00:02

Let's take a look at the segment directory structure at this point:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ ls db/segments/20110822105243/
content  crawl_fetch  crawl_generate
5.4 Parse the results fetched above

bin/nutch parse
Usage: ParseSegment segment


Local output:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch parse db/segments/20110822105243/
ParseSegment: starting at 2011-08-22 10:58:19
ParseSegment: segment: db/segments/20110822105243
ParseSegment: finished at 2011-08-22 10:58:22, elapsed: 00:00:02


Let's take a look at the directory structure after parsing:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ ls db/segments/20110822105243/
content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text


Three parse directories have been added here: crawl_parse (outlink entries used to update the crawldb), parse_data (per-page metadata and outlinks), and parse_text (the extracted plain text).
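
To peek at what was parsed, Nutch's segment reader can dump a segment to plain text; a sketch (readseg_out is an arbitrary output directory, and the dump lands in a file named dump inside it):

bin/nutch readseg -dump db/segments/20110822105243 readseg_out
cat readseg_out/dump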


5.5 Update the crawl database (crawldb)

bin/nutch updatedb
Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]

Local output:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch updatedb db/crawldb/ -dir db/segments/
CrawlDb update: starting at 2011-08-22 11:00:09
CrawlDb update: db: db/crawldb
CrawlDb update: segments:
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-08-22 11:00:10, elapsed: 00:00:01
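
To verify the update, the crawldb can be inspected with Nutch's readdb tool; a sketch that prints counts by URL status:

bin/nutch readdb db/crawldb -stats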


At this point the command merges the fetch and parse results back into the crawl database (crawldb), which Nutch stores on the file system. Other crawlers make different choices here; Taobao's crawler, for example, is said to keep its link database in Redis, a key-value NoSQL database.
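
Purely as an illustration of that alternative (the key layout below is hypothetical, not Taobao's actual schema), such a link database entry could be kept as a Redis hash:

redis-cli HSET "crawldb:http://www.baidu.com/" status fetched
redis-cli HGET "crawldb:http://www.baidu.com/" status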

5.6 Compute the inverted links

bin/nutch invertlinks
Usage: LinkDb <linkdb> (-dir <segmentsDir> | <seg1> <seg2> ...) [-force] [-noNormalize] [-noFilter]


Local output:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch invertlinks db/linkdb -dir db/segments/
LinkDb: starting at 2011-08-22 11:02:49
LinkDb: linkdb: db/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/lemo/Workspace/java/Apache/Nutch/nutch-1.3/db/segments/20110822105243
LinkDb: finished at 2011-08-22 11:02:50, elapsed: 00:00:01


5.7 Use Solr to index the crawled content

bin/nutch solrindex
Usage: SolrIndexer <solr url> <crawldb> <linkdb> (<segment> ... | -dir <segments>)

The output on the Nutch side is as follows:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch solrindex http://127.0.0.1:8983/solr/ db/crawldb/ db/linkdb/ db/segments/*
SolrIndexer: starting at 2011-08-22 11:05:33
SolrIndexer: finished at 2011-08-22 11:05:35, elapsed: 00:00:02

Part of the Solr log output is as follows:

INFO: SolrDeletionPolicy.onInit: commits:num=1
    commit{dir=/home/lemo/Workspace/java/Apache/Solr/apache-solr-3.3.0/example/solr/data/index,segFN=segments_1,version=1314024228223,generation=1,filenames=[segments_1]
Aug 22, 2011 11:05:35 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1314024228223
Aug 22, 2011 11:05:35 AM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[http://www.baidu.com/]} 0 183
Aug 22, 2011 11:05:35 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=0 QTime=183
Aug 22, 2011 11:05:35 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)
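
Taken together, steps 5.1 through 5.7 are what the one-shot crawl command from section 4.3 automates. A sketch of one crawl round done by hand, assuming the db/ layout used throughout this section:

#!/bin/bash
# One manual crawl round, mirroring steps 5.1-5.7 above.
bin/nutch inject db/crawldb urls                   # 5.1 seed the crawldb
bin/nutch generate db/crawldb db/segments          # 5.2 generate a fetch list
SEGMENT=db/segments/$(ls db/segments | sort | tail -1)   # newest segment
bin/nutch fetch "$SEGMENT"                         # 5.3 fetch the pages
bin/nutch parse "$SEGMENT"                         # 5.4 parse them
bin/nutch updatedb db/crawldb "$SEGMENT"           # 5.5 merge results into the crawldb
bin/nutch invertlinks db/linkdb -dir db/segments   # 5.6 build the linkdb
bin/nutch solrindex http://localhost:8983/solr/ db/crawldb db/linkdb db/segments/*   # 5.7 index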

5.8 Query from the Solr client

Open the Solr admin page in a browser:

http://localhost:8983/solr/admin/

and query for the keyword "baidu". The output XML structure is:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">baidu</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <float name="boost">1.0660036</float>
      <str name="digest">7be5cfd6da4a058001300b21d7d96b0f</str>
      <str name="id">http://www.baidu.com/</str>
      <str name="segment">20110822105243</str>
      <str name="title">baidu, you will know</str>
      <date name="tstamp">2011-08-22T14:56:09.194Z</date>
      <str name="url">http://www.baidu.com/</str>
    </doc>
  </result>
</response>

Note that the page content itself is not returned. If you want the content field stored and displayed as well, change its definition in Solr's schema.xml to:

<field name="content" type="text" stored="true" indexed="true"/>
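
The same query can also be issued from the command line against Solr's standard select handler (assuming the default example port):

curl "http://localhost:8983/solr/select?q=baidu&indent=on"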

6. Reference

http://wiki.apache.org/nutch/RunningNutchAndSolr
