Using Nutch for Category Search (1): Adding an Index Plugin

Source: Internet
Author: User

When you use Google, you will notice that you can search by category, such as news, blogs, and shopping. This series of articles implements that feature by adding plugins to Nutch. We assume that you have a working understanding of Nutch, can compile and configure it, and can use the crawl command provided by Nutch to fetch webpages.
This article describes how to add our index-type plugin.
When you use Luke to inspect crawled data, you will find more than a dozen fields by default, such as title, url, and content. We need to add a type field that indicates the type of website.
Create the index-type directory under src/plugin (refer to the directory structure of index-basic) and add the following three Java files. Note that the package name must match the directory structure you create.
Typenamefactory implements the IndexingFilter interface; the extension manager uses it as the entry point.
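The original listing for this class is not available here. As a rough, Nutch-free sketch of what an indexing filter of this kind does, the following adds a type field to a document based on its URL. Everything in it is an assumption for illustration: the class name TypeFieldSketch, the use of a plain Map in place of Nutch's document class, the predicates standing in for per-type rule files, and the choice that the first matching type wins.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch of the indexing step; a Map stands in for the index document. */
final class TypeFieldSketch {

    /** Returns the first type whose predicate accepts the URL, or null if none does. */
    static String resolveType(Map<String, Predicate<String>> selectors, String url) {
        for (Map.Entry<String, Predicate<String>> entry : selectors.entrySet()) {
            if (entry.getValue().test(url)) {
                return entry.getKey();
            }
        }
        return null;
    }

    /** Adds a "type" field to the document fields when some type matches the URL. */
    static void addTypeField(Map<String, String> fields,
                             Map<String, Predicate<String>> selectors,
                             String url) {
        String type = resolveType(selectors, url);
        if (type != null) {
            fields.put("type", type);
        }
    }

    public static void main(String[] args) {
        // Insertion order matters here because the first matching type wins (an assumption).
        Map<String, Predicate<String>> selectors = new LinkedHashMap<>();
        selectors.put("news", url -> url.startsWith("http://news.163.com/"));
        selectors.put("blog", url -> url.startsWith("http://blog.example.com/")); // made-up host

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("url", "http://news.163.com/a.html");
        addTypeField(fields, selectors, fields.get("url"));
        System.out.println(fields);
    }
}
```

In a real plugin the same logic would sit inside the filter method of the IndexingFilter implementation, whose exact signature depends on the Nutch version in use.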
Typenamefactory obtains the types from the value of indexer.type.includes, reads the corresponding rule file for each type, and stores the results in a HashMap.
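A minimal sketch of that lookup step, under stated assumptions: the class and method names are hypothetical, the value is passed in as a plain string rather than read from the Nutch configuration, and the file name pattern is inferred from the crawl-urltype-news.txt example used in this article. It only builds the mapping from type to rule file name and does not read any files.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; not part of Nutch or of the original plugin source. */
final class TypeConfigSketch {

    /** Assumed naming pattern, inferred from the crawl-urltype-news.txt example. */
    static String ruleFileFor(String type) {
        return "crawl-urltype-" + type + ".txt";
    }

    /** Splits a value such as "news|blog" and maps each type to its rule file name. */
    static Map<String, String> parseTypes(String includes) {
        Map<String, String> result = new LinkedHashMap<>();
        if (includes == null) {
            return result;
        }
        // "|" is a regex metacharacter, so it must be escaped for split().
        for (String raw : includes.split("\\|")) {
            String type = raw.trim();
            if (!type.isEmpty()) {
                result.put(type, ruleFileFor(type));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Example value only; the article does not list its actual types.
        System.out.println(parseTypes("news|blog|shopping"));
    }
}
```

A LinkedHashMap is used so that the types keep the order in which they appear in the configuration value; a plain HashMap, as the article mentions, would work equally well if order does not matter.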
Typenameselector corresponds to one rule file and provides a filter method that matches URLs against that file's rules.
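The selector's listing is also missing, so the following is only a sketch of how such a rule file could be applied, assuming the format of the news rule file shown in this article: lines starting with "#" are comments, "+" accepts and "-" rejects URLs matching the regular expression that follows. The class name UrlRuleSelectorSketch is hypothetical, the rules are passed in as strings instead of being read from conf, and the first-match-wins behaviour is an assumption modelled on how Nutch's regex URL filter files are commonly described.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a per-type selector; names are illustrative only. */
final class UrlRuleSelectorSketch {

    /** One parsed rule: accept (+) or reject (-) URLs matching the pattern. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses rule lines; blank lines and lines starting with '#' are skipped. */
    UrlRuleSelectorSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            String regex = trimmed.substring(1).trim();
            rules.add(new Rule(sign == '+', Pattern.compile(regex)));
        }
    }

    /** Returns true only if the first rule that matches the URL is an accept rule. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false; // no rule matched
    }

    public static void main(String[] args) {
        UrlRuleSelectorSketch news = new UrlRuleSelectorSketch(Arrays.asList(
                "# accept hosts in",
                "+^http://news.sina.com.cn/",
                "# skip everything else",
                "-."));
        System.out.println(news.accepts("http://news.sina.com.cn/world/"));
        System.out.println(news.accepts("http://www.example.com/"));
    }
}
```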
Add org.apache.nutch.urlfilter.api.Rule to lib-regex-filter.
Add build.xml. Because the plugin uses the lib-regex-filter library, add a reference to it.
Add plugin.xml. Note that the extension point is "org.apache.nutch.indexer.IndexingFilter", which is the public interface for indexing.
Another important step is to add index-type to the build system. Modify src/plugin/build.xml and add the following entries:

  <target name="deploy">
     <ant dir="index-type" target="deploy"/>
  </target>
  <target name="clean">
     <ant dir="index-type" target="clean"/>
  </target>

At this point, building the package also compiles our index-type plugin. Check that the build succeeds and produces index-type.jar.
After a successful build, we can try crawling webpages. Before doing so, modify conf/nutch-site.xml:
Add the indexer.type.includes parameter, whose value lists the webpage types; multiple types are separated by "|". Also add index-type to the value of the plugin.includes parameter so that the index-type plugin is loaded automatically when the system starts.
For each type of webpage, provide a rule file of regular expressions. The following crawl-urltype-news.txt is a rule file written for news.

  # accept hosts in
  +^http://news.sina.com.cn/
  +^http://news.baidu.com/
  +^http://news.163.com/
  # skip everything else
  -.

The other four types work the same way. The file naming rule is "crawl-urltype-(yourwebtype).txt", where yourwebtype corresponds one-to-one to the values in the indexer.type.includes parameter, and the files are placed in the conf directory.
Then set the starting URLs for the crawl and run crawl to fetch data from the network. Use Luke to view the final result and you will find an additional type field. If enough webpages are crawled, five values appear, namely those defined in indexer.type.includes.
At this point, the indexer sets the type field according to each webpage's type and indexes it.
