Use Nutch for classification search (1): add an index plugin

Last Update:2018-12-04 Source: Internet

Author: User


When you use Google, you will find that you can search by category, such as news, blogs, and shopping. This series of articles implements that feature by adding plugins to Nutch. We assume that you have a basic understanding of Nutch, can compile and configure it, and can use the crawl command provided by Nutch to fetch web pages.
This article describes how to add our index-type plugin.
When you use Luke to view the crawled data, you can see that there are more than a dozen fields by default, such as title, url, and content. We need to add a type field that indicates the website type.
Create the index-type directory under the src/plugin directory (refer to the index-basic directory structure) and add the following three Java files. Note that the package name must be consistent with the directory structure you create.
Typenamefactory implements the IndexingFilter interface; the extension manager uses it as the entry point.
Typenamefactory obtains the types from the value of indexer.type.includes, reads the rule file that corresponds to each type, and stores the result in a HashMap.
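The original listing for this class is not available here, so the following is only a minimal sketch of the lookup step, not the article's source. It assumes that the parameter value is a "|"-separated string and that each type maps to a file named after the example crawl-urltype-news.txt shown later in this article. The class name TypeRuleFilesSketch, the method ruleFilesFor, and the sample types blog and shopping are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper; it does not reproduce the article's Typenamefactory. */
final class TypeRuleFilesSketch {

    // Assumed prefix and suffix, modelled on the example file crawl-urltype-news.txt.
    static final String PREFIX = "crawl-urltype-";
    static final String SUFFIX = ".txt";

    /** Splits a value such as "news|blog" and maps each type to its rule file name. */
    static Map<String, String> ruleFilesFor(String includesValue) {
        // LinkedHashMap keeps the order in which the types were listed.
        Map<String, String> files = new LinkedHashMap<>();
        if (includesValue == null) {
            return files;
        }
        // String.split takes a regex and "|" means alternation, so it must be quoted.
        for (String type : includesValue.split(Pattern.quote("|"))) {
            String name = type.trim();
            if (!name.isEmpty()) {
                files.put(name, PREFIX + name + SUFFIX);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        // Sample value; only "news" appears in the article, the rest are assumptions.
        System.out.println(ruleFilesFor("news|blog|shopping"));
    }
}
```

Quoting the separator matters: calling split("|") without quoting would split the string between every character, because an unescaped "|" is an empty alternation in a Java regex.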
Typenameselector corresponds to one rule file and provides a filter interface for filtering URLs.
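As an illustration of what such a selector might do, here is a minimal, self-contained sketch. It is not the article's Typenameselector and does not use the Nutch API. It assumes the rule file uses the same line format as the news example in this article: a leading "+" accepts, a leading "-" rejects, "#" starts a comment, and the first matching rule decides. The names UrlTypeSelectorSketch, fromLines, and accept are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a per-type URL selector. */
final class UrlTypeSelectorSketch {

    /** One rule line: a sign ('+' accept, '-' reject) plus a compiled regex. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Builds a selector from rule-file lines; blank lines and '#' comments are skipped. */
    static UrlTypeSelectorSketch fromLines(List<String> lines) {
        UrlTypeSelectorSketch selector = new UrlTypeSelectorSketch();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            Pattern pattern = Pattern.compile(line.substring(1).trim());
            selector.rules.add(new Rule(sign == '+', pattern));
        }
        return selector;
    }

    /** Returns true when the first rule whose regex is found in the URL is an accept rule. */
    boolean accept(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        // Assumption: a URL that matches no rule does not belong to this type.
        return false;
    }

    public static void main(String[] args) {
        UrlTypeSelectorSketch news = fromLines(Arrays.asList(
                "# accept hosts in",
                "+^http://news.sina.com.cn/",
                "# skip everything else",
                "-."));
        System.out.println(news.accept("http://news.sina.com.cn/china/"));
        System.out.println(news.accept("http://example.com/"));
    }
}
```

Because the first matching rule wins, the catch-all "-." line only rejects URLs that none of the earlier "+" rules accepted, so the order of lines in the rule file is significant.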
Add org.apache.nutch.urlfilter.api.Rule to lib-regex-filter.
Add build.xml. The plugin uses the lib-regex-filter library, so add a reference to it.
Add plugin.xml. Note that the extension point is "org.apache.nutch.indexer.IndexingFilter", which is the public interface for indexing.
Another important step is to add index-type to the build system: modify src/plugin/build.xml and add the following build entries:

<target name="deploy">
  <ant dir="index-type" target="deploy"/>
</target>
<target name="clean">
  <ant dir="index-type" target="clean"/>
</target>

At this point, building the package also compiles our index-type plugin. Check that the build succeeds and produces index-type.jar.
After a successful build, let's try crawling web pages. Before that, you need to modify conf/nutch-site.xml:
Add the indexer.type.includes parameter. Its value lists the web page types, with multiple types separated by "|". Also add index-type to the value of the plugin.includes parameter so that the index-type plugin is loaded automatically when the system starts.
For each type of web page, provide a regular-expression rule file. The following crawl-urltype-news.txt is a rule file designed for news.

# accept hosts in
+^http://news.sina.com.cn/
+^http://news.baidu.com/
+^http://news.163.com/
# skip everything else
-.

There are four other types as well. The file naming rule is "crawl-urltype-<yourwebtype>.txt", where each yourwebtype corresponds to one of the values in the indexer.type.includes parameter, and the files are placed in the conf directory.
Then set the starting URLs for the crawl and run crawl to fetch data from the network. You can use Luke to view the final result and see that there is an additional type field. If enough web pages are crawled, five values are displayed, which are the ones defined in indexer.type.includes.
At this point, the indexer is able to set the type field according to the web page type, and the field is indexed.
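To make the end result concrete, the following sketch shows one way the indexing step could pick a type for a URL and attach it as a field. It is a simplified illustration under stated assumptions, not the plugin's code. A plain Map stands in for the index document, each type is reduced to a single accept pattern, and a page that matches no type is left without a type field. The names TypeFieldSketch, register, typeOf, and index, and the blog.example.com pattern, are hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical illustration of adding a "type" field during indexing. */
final class TypeFieldSketch {

    // One accept pattern per type; in the plugin these would come from the rule files.
    private final Map<String, Pattern> acceptByType = new LinkedHashMap<>();

    void register(String type, String acceptRegex) {
        acceptByType.put(type, Pattern.compile(acceptRegex));
    }

    /** Returns the first registered type whose pattern is found in the URL, or null. */
    String typeOf(String url) {
        for (Map.Entry<String, Pattern> entry : acceptByType.entrySet()) {
            if (entry.getValue().matcher(url).find()) {
                return entry.getKey();
            }
        }
        return null;
    }

    /** Adds a "type" field to the document stand-in when a type is found. */
    Map<String, String> index(String url, Map<String, String> doc) {
        String type = typeOf(url);
        if (type != null) {
            doc.put("type", type);
        }
        return doc;
    }

    public static void main(String[] args) {
        TypeFieldSketch sketch = new TypeFieldSketch();
        sketch.register("news", "^http://news\\.163\\.com/");
        sketch.register("blog", "^http://blog\\.example\\.com/"); // made-up host

        Map<String, String> doc = new HashMap<>();
        doc.put("url", "http://news.163.com/index.html");
        System.out.println(sketch.index(doc.get("url"), doc).get("type"));
    }
}
```

With a field like this in the index, a category search amounts to adding a condition on the type field to the user's query.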

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
