Nutch plug-in development [data collation]

Source: Internet
Author: User

Plugins (plug-ins) provide many of Nutch's most powerful features. Much of Nutch itself is implemented as plugins, and users can develop plugins that better suit their own needs.

The benefits of Nutch's plugin system are:

1: Extensibility. Through plugins, Nutch allows anyone to extend its functionality; all we have to do is provide a simple implementation of a given interface. For example, the LoadBalance download plugin we use in Nutch is an implementation of the Protocol interface.

2: Flexibility. Because everyone can write plugins for their own needs, plugins form a very rich resource pool. A programmer applying Nutch can install the plugins that fit his needs on his own search engine, choosing from the plugins available for Nutch. This is a great boon for developers applying Nutch: there are more content-extraction algorithms to choose from, and it is easy to add filtering rules, download methods, parsers for new file types, and more.

3: Maintainability. Every developer focuses on his own problem. A kernel developer who extends the engine core only has to add an interface description for plugins. A plugin developer only needs to focus on what the plugin does, without knowing how the whole system works; he just needs to know the type of data exchanged between the plug and the plugin. This makes the kernel simpler and easier to maintain.

How the plugin works

Nutch's plugin system is modeled on the plugin system of Eclipse 2.x. Plugins are central to how Nutch works: fetching (download), parsing (analysis), indexing, and searching (query) are all implemented by different plugins.

When you write a plugin, you add one or more extensions to an extension point. Nutch's extension points are themselves defined in a plugin, nutch-extensionpoints (all extension points are listed in that plugin's plugin.xml file). Each extension point defines an interface that an extension must implement.
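
The relationship between an extension point and its extensions can be sketched in plain Java. This is a minimal, self-contained illustration and not Nutch's actual plugin loader: the names `Greeter`, `PlainGreeter`, `LoudGreeter`, `ExtensionRegistry`, and the point ID `example.Greeter` are hypothetical, and real Nutch reads this wiring from each plugin's plugin.xml instead of registering objects by hand.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ExtensionPointDemo {
    public static void main(String[] args) {
        ExtensionRegistry registry = new ExtensionRegistry();
        // Two "plugins" each contribute one extension to the same point.
        registry.register("example.Greeter", new PlainGreeter());
        registry.register("example.Greeter", new LoudGreeter());
        for (Greeter g : registry.extensionsOf("example.Greeter", Greeter.class)) {
            System.out.println(g.greet("nutch"));
        }
    }
}

// Hypothetical extension-point interface: every extension must implement it.
interface Greeter {
    String greet(String name);
}

// Two hypothetical extensions of the Greeter extension point.
class PlainGreeter implements Greeter {
    public String greet(String name) { return "Hello, " + name; }
}

class LoudGreeter implements Greeter {
    public String greet(String name) { return "HELLO, " + name.toUpperCase() + "!"; }
}

// Simplified stand-in for a plugin registry: maps an extension-point ID
// to the extensions registered against it, in registration order.
class ExtensionRegistry {
    private final Map<String, List<Object>> extensions = new HashMap<>();

    void register(String pointId, Object extension) {
        extensions.computeIfAbsent(pointId, k -> new ArrayList<>()).add(extension);
    }

    // Returns the extensions of one point that implement the given interface.
    <T> List<T> extensionsOf(String pointId, Class<T> type) {
        List<T> result = new ArrayList<>();
        for (Object e : extensions.getOrDefault(pointId, List.of())) {
            if (type.isInstance(e)) {
                result.add(type.cast(e));
            }
        }
        return result;
    }
}
```

The core only knows the interface and the point ID; each extension can be written and maintained without touching the registry code.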


These extension points are as follows:

IndexingFilter:
org.apache.nutch.indexer.IndexingFilter
Allows you to add metadata to the fields in the index. All plugins that implement this interface are run one after another on the parsed content.

Parser:
org.apache.nutch.parse.Parser
Implement this interface to parse a new content type in Nutch, or to extract more data from content that can already be parsed. A Parser implementation reads the fetched document and extracts the data that will be indexed.

HtmlParseFilter:
org.apache.nutch.parse.HtmlParseFilter
Adds additional metadata for the HTML parser. This interface is an extension point of the DOM-tree-based HTML parser; it allows you to add metadata to HTML parses.

Protocol:
org.apache.nutch.protocol.Protocol
Implementing a Protocol plugin allows Nutch to crawl data over more network protocols (FTP, HTTP).

URLFilter:
org.apache.nutch.net.URLFilter
Plugins that implement this extension point restrict which URLs Nutch will crawl. RegexURLFilter controls the URLs of the pages Nutch crawls through regular expressions. If you have more complex requirements for URL control, you can write your own URLFilter implementation.
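
A custom URL filter could look roughly like the following sketch. It assumes the contract of Nutch's URLFilter is a `filter` method that returns the URL to accept it, or null to reject it. To stay self-contained it declares a local stand-in interface, `UrlFilterLike`, rather than importing org.apache.nutch.net.URLFilter; the class name `HostSuffixUrlFilter` and the host rule are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UrlFilterDemo {
    public static void main(String[] args) {
        UrlFilterLike filter = new HostSuffixUrlFilter("example.org");
        String[] urls = {
            "http://www.example.org/index.html",
            "http://other.example.com/page",
            "not a url"
        };
        for (String url : urls) {
            String kept = filter.filter(url);
            System.out.println(url + " -> " + (kept == null ? "rejected" : "accepted"));
        }
    }
}

// Local stand-in (assumption) for the shape of Nutch's URLFilter:
// return the URL to keep it, or null to drop it from the crawl.
interface UrlFilterLike {
    String filter(String urlString);
}

// Hypothetical filter that only accepts URLs whose host is the given
// domain or one of its subdomains.
class HostSuffixUrlFilter implements UrlFilterLike {
    private final String domain;

    HostSuffixUrlFilter(String domain) {
        this.domain = domain.toLowerCase();
    }

    @Override
    public String filter(String urlString) {
        try {
            String host = new URI(urlString).getHost();
            if (host == null) {
                return null; // no host component: reject
            }
            host = host.toLowerCase();
            boolean accepted = host.equals(domain) || host.endsWith("." + domain);
            return accepted ? urlString : null;
        } catch (URISyntaxException e) {
            return null; // malformed URL: reject
        }
    }
}
```

In a real plugin the class would implement Nutch's own interface and be declared in plugin.xml against the URLFilter extension point.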

URLNormalizer:
org.apache.nutch.net.URLNormalizer
Normalizes URLs.

ScoringFilter:
org.apache.nutch.scoring.ScoringFilter
Nutch's score calculation interface. Implementing it lets you influence how Nutch computes scores; the default OPIC scoring method is provided by the class OPICScoringFilter. Other classes that take part in scoring include ScoringFilters and ParseOutputFormat. ScoringFilters loads each scoring plugin into the system and runs them as a chain, while ParseOutputFormat, before the parse result is written out, distributes the score of a page to each of its sub-pages, so that Nutch's update module can use this data to update the CrawlDb.

SegmentMergeFilter:
org.apache.nutch.segment.SegmentMergeFilter
Segment merge filter interface. Merges and filters multiple segments according to given rules.

The internal structure of the plugin


Protocol extension point plugin development

Taking Protocol as an example: we changed the way Nutch downloads data so that it downloads in our own way.
First, create a new plugin source directory that contains three XML files and a source code directory.

plugin.xml: describes the plugin to Nutch
build.xml: tells Ant how to compile the plugin
ivy.xml: the plugin's Ivy configuration

Http.java: implements the Protocol interface. The simple approach taken here is to inherit from the HttpBase class, an API base class that already implements the Protocol interface, which saves a lot of work.

HttpResponse.java: performs the download and returns the response information.
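
The division of labor between the base class, Http.java, and HttpResponse.java can be sketched as a template-method pattern. This is a simplified, offline stand-in and not the Nutch API: `BaseProtocolSketch` plays the role of HttpBase, `SketchHttp` the role of Http.java, and `SketchResponse` the role of HttpResponse.java. All three names, the `fetch` and `getResponse` signatures, and the canned response body are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

class ProtocolSketchDemo {
    public static void main(String[] args) {
        BaseProtocolSketch protocol = new SketchHttp();
        SketchResponse response = protocol.fetch("http://example.org/");
        System.out.println(response.code + " "
                + new String(response.content, StandardCharsets.UTF_8));
    }
}

// Plays the role of HttpResponse.java: holds the result of one download.
class SketchResponse {
    final int code;
    final byte[] content;

    SketchResponse(int code, byte[] content) {
        this.code = code;
        this.content = content;
    }
}

// Plays the role of the shared base class: common checks live here,
// and the actual download is delegated to the subclass.
abstract class BaseProtocolSketch {
    SketchResponse fetch(String url) {
        if (url == null || !url.startsWith("http://")) {
            return new SketchResponse(400, new byte[0]); // unsupported URL
        }
        return getResponse(url);
    }

    // The only method a concrete protocol plugin has to supply in this sketch.
    protected abstract SketchResponse getResponse(String url);
}

// Plays the role of Http.java: supplies the download step.
class SketchHttp extends BaseProtocolSketch {
    @Override
    protected SketchResponse getResponse(String url) {
        // A real plugin would open a network connection here; a canned
        // body keeps this sketch runnable without network access.
        byte[] body = ("fetched " + url).getBytes(StandardCharsets.UTF_8);
        return new SketchResponse(200, body);
    }
}
```

Because the shared logic stays in the base class, replacing the download mechanism only requires a new subclass and a new response class.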

Figure 1:


Figure 2, Connect:


plugin.xml description:

<!-- id: plugin ID; name: plugin name; version: plugin version;
     provider-name: plugin provider's ID -->
<plugin
   id="protocol-http-netty"
   name="protocol http Netty Plug-in"
   version="1.0.0"
   provider-name="Pycredit">

   <runtime>
      <!-- the released JAR package -->
      <library name="protocol-http-netty.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <!-- plugins this plugin depends on -->
      <import plugin="nutch-extensionpoints"/>
      <import plugin="lib-http"/>
   </requires>

   <!-- id: extension ID; name: extension name;
        point: ID of the extension point being extended -->
   <extension id="org.apache.nutch.protocol.netty.http"
              name="HttpProtocol"
              point="org.apache.nutch.protocol.Protocol">
      <!-- id: implementation ID; class: implementation class -->
      <implementation id="org.apache.nutch.protocol.netty.http.Http"
                      class="org.apache.nutch.protocol.netty.http.Http">
         <!-- plugin parameter -->
         <parameter name="protocolName" value="http"/>
      </implementation>
   </extension>
</plugin>

After writing the code that implements the interface, modify the configuration as follows:

1. In src/plugin/build.xml, in each of the targets
<target name="deploy">
<target name="test">
<target name="clean">
add an entry for the newly developed plugin, for example (in the deploy target):
<ant dir="protocol-http-netty" target="deploy"/>

2. In the nutch/build.xml file, in
<target name="release" depends="compile-core" description="generate the release distribution">
add the following configuration:
<packageset dir="${plugins.dir}/protocol-http-netty/src/java"/>

3. Check ${plugins.dir}/protocol-http-netty/src/build.xml and make sure the name attribute of its project tag is "protocol-http-netty".

Using the plugin in Nutch

If you want to use a given plugin in Nutch, you need to edit conf/nutch-site.xml and add the plugin name to plugin.includes:

<property>
  <name>searcher.dir</name>
  <value>I:/nutch-0.7.1/crawled</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|recommended</value>
</property>
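
How a plugin.includes value selects plugins can be checked with a few lines of Java. This sketch assumes the value is treated as a regular expression that must match a plugin's whole ID; the helper `included` and the class `PluginIncludesDemo` are hypothetical, and the example value swaps in protocol-http-netty for protocol-http to show the plugin developed here being enabled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class PluginIncludesDemo {
    // Returns the plugin IDs that the includes expression matches in full.
    static List<String> included(String includesRegex, List<String> pluginIds) {
        Pattern includes = Pattern.compile(includesRegex);
        List<String> result = new ArrayList<>();
        for (String id : pluginIds) {
            if (includes.matcher(id).matches()) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed example value: protocol-http-netty replaces protocol-http.
        String includes = "nutch-extensionpoints|protocol-http-netty|urlfilter-regex"
                + "|parse-(text|html)|index-basic";
        List<String> ids = List.of("nutch-extensionpoints", "protocol-http",
                "protocol-http-netty", "parse-html", "parse-pdf", "index-basic");
        System.out.println(included(includes, ids));
    }
}
```

Under that assumption, protocol-http is not selected by this value even though it is a prefix of protocol-http-netty, because the expression has to match the complete ID. Stray spaces inside the value would likewise stop an alternative from matching.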
