Using an example to explain in detail how to write a custom Nutch plug-in

Source: Internet
Author: User

Next, I will use an example to explain in detail how to write a custom plug-in for Nutch.

1. First, create a folder under src/plugin/ named urlfilter-urllength. The name indicates what our custom plug-in does.
   1) In this folder we write a class that implements the URLFilter interface. It has to implement it, otherwise Nutch cannot use it as a URL filter: public class UrlLengthFilter implements URLFilter.
   2) The full source code is as follows:
// Package name taken from the class attribute in plugin.xml.
package com.xp.se.urllength;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

public class UrlLengthFilter implements URLFilter {

    private static final Log LOG = LogFactory.getLog(UrlLengthFilter.class);
    private Configuration conf;

    @Override
    public String filter(String inUrl) {
        LOG.info("begin urllengthfilter is .....");
        String urlFilter = "";
        // Use isEmpty() here: == on strings compares references, not contents.
        if (inUrl == null || inUrl.isEmpty()) return urlFilter;
        String url = inUrl.toLowerCase();
        // Start from the first character after "http://" or "https://",
        // or from the first character if the URL has no such prefix.
        int start = url.startsWith("http://") ? 7 : (url.startsWith("https://") ? 8 : 0);
        url = url.substring(start);
        int end = url.indexOf("/");
        end = end < 0 ? url.length() : end;
        // Keep everything up to the first "/" or the end of the string.
        urlFilter = url.substring(0, end);
        LOG.info("urlfilter is " + urlFilter);
        return urlFilter;
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }
}

2. Next, you need a plugin.xml. What is it for? It tells Nutch which class to load, namely the class we wrote to implement a particular function. The file is pasted here for analysis:
<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="urlfilter-urllength"
   name="suffix url filter"
   version="1.0.0"
   provider-name="xp.com">
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
   <extension id="com.xp.se.test.urllength"
              name="nutch URL Length filter"
              point="org.apache.nutch.net.URLFilter">
      <implementation id="urllengthfilter" class="com.xp.se.urllength.UrlLengthFilter"/>
   </extension>
</plugin>

Analysis:
A. The XML declaration gives the XML version and the encoding.
B. id: the ID of the plug-in.
   name: a description of what the plug-in does.
   version: the version of the plug-in.
   provider-name: who provides the plug-in (its source).
C. nutch-extensionpoints: this plug-in declares the extension points that Nutch provides, so we import it here.
D. extension: as the name implies, this is what makes the plug-in we wrote available to Nutch. Anyone who has read the Nutch source code will know that there is also an extension-point, and that the two are related. If you don't know about it, that doesn't matter for now; you can look it up later.
E. id: the ID we define for the extension.
   name: a description of what this extension does.
   point: the Nutch extension point that this extension plugs into. The value must match the ID of a specific extension point.
   implementation: the class we wrote.
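To make the mapping in the analysis concrete, here is a minimal sketch that reads a cleaned-up copy of the descriptor with the JDK's DOM parser and pulls out the plug-in ID, the extension point and the implementation class. This is not Nutch's own loader; the class PluginDescriptorDemo and the method describe are hypothetical names used only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper for illustration only; Nutch has its own descriptor parser.
final class PluginDescriptorDemo {

    // A cleaned-up copy of the plugin.xml discussed in the text.
    static final String PLUGIN_XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
      + "<plugin id=\"urlfilter-urllength\" name=\"suffix url filter\"\n"
      + "        version=\"1.0.0\" provider-name=\"xp.com\">\n"
      + "  <requires><import plugin=\"nutch-extensionpoints\"/></requires>\n"
      + "  <extension id=\"com.xp.se.test.urllength\"\n"
      + "             name=\"nutch URL Length filter\"\n"
      + "             point=\"org.apache.nutch.net.URLFilter\">\n"
      + "    <implementation id=\"urllengthfilter\"\n"
      + "                    class=\"com.xp.se.urllength.UrlLengthFilter\"/>\n"
      + "  </extension>\n"
      + "</plugin>\n";

    /** Returns {plug-in id, extension point, implementation class}. */
    static String[] describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element plugin = doc.getDocumentElement();
            Element extension =
                (Element) plugin.getElementsByTagName("extension").item(0);
            Element implementation =
                (Element) extension.getElementsByTagName("implementation").item(0);
            return new String[] {
                plugin.getAttribute("id"),
                extension.getAttribute("point"),
                implementation.getAttribute("class")
            };
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse descriptor", e);
        }
    }

    public static void main(String[] args) {
        String[] d = describe(PLUGIN_XML);
        System.out.println("plug-in id:      " + d[0]);
        System.out.println("extension point: " + d[1]);
        System.out.println("implementation:  " + d[2]);
    }
}
```

The point attribute names the interface that the implementation class has to implement, which is why the class attribute refers to a class that implements URLFilter.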
3. There is usually also a build.xml, which is used to compile the plug-in.
4. Add the plug-in we have written to plugin.includes in nutch-site.xml and try it out.
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|urllength)|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin.
  By default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
Careful readers will have noticed that urllength has already been added to the value above. If you haven't spotted it, take a closer look.
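A quick way to check which plug-in directory names the value selects is to test it with java.util.regex. The sketch below assumes the value is treated as an ordinary Java regular expression that must match the whole directory name, as the description suggests; PluginIncludesDemo and isIncluded are hypothetical names, not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not Nutch's own plug-in selection code.
final class PluginIncludesDemo {

    // The plugin.includes value from the property above.
    static final Pattern INCLUDES = Pattern.compile(
        "protocol-http|urlfilter-(regex|urllength)|parse-(text|html|js)"
      + "|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)"
      + "|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)");

    // Assumption: the expression has to match the entire directory name.
    static boolean isIncluded(String pluginDirectoryName) {
        return INCLUDES.matcher(pluginDirectoryName).matches();
    }

    public static void main(String[] args) {
        String[] names = {
            "urlfilter-urllength", "urlfilter-regex",
            "urlfilter-suffix", "protocol-httpclient"
        };
        for (String name : names) {
            System.out.println(name + " -> " + isIncluded(name));
        }
    }
}
```

Under that assumption, urlfilter-urllength and urlfilter-regex are selected, while urlfilter-suffix and protocol-httpclient are not, because neither matches any alternative in full.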
Summary:
You can step through the code yourself to follow what happens. For example, when the URL http://www.163.com/ passes through the filter we wrote, it becomes www.163.com, and the program then finishes. The purpose of this article is to clarify how Nutch loads a plug-in and how to write a custom one.
We can write many more plug-ins, all of which can be implemented using the method described above.
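The trace in the summary can be reproduced without a Nutch installation. The following is a minimal standalone sketch of the same host-extraction logic as the filter; HostExtractor and extractHost are hypothetical names, and the class deliberately does not implement URLFilter so that it runs with the JDK alone.

```java
import java.util.Locale;

// Hypothetical standalone version of the filter logic; no Nutch classes needed.
final class HostExtractor {

    /** Returns the part of the URL between the scheme and the first "/". */
    static String extractHost(String inUrl) {
        if (inUrl == null || inUrl.isEmpty()) {
            return "";
        }
        String url = inUrl.toLowerCase(Locale.ROOT);
        // Skip a leading "http://" (7 characters) or "https://" (8 characters).
        int start = 0;
        if (url.startsWith("http://")) {
            start = 7;
        } else if (url.startsWith("https://")) {
            start = 8;
        }
        url = url.substring(start);
        // Cut at the first "/" if there is one, otherwise keep the rest.
        int end = url.indexOf('/');
        return end < 0 ? url : url.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(extractHost("http://www.163.com/"));      // www.163.com
        System.out.println(extractHost("https://Example.org/a/b"));  // example.org
        System.out.println(extractHost("www.163.com"));              // www.163.com
    }
}
```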

Disclaimer: The copyright of the javaeye article belongs to the author and is protected by law. You shall not reprint this document without the written consent of the author.
