Next, I will use an example to explain in detail how to write a custom plug-in for Nutch.
1. First, create a folder under src/plugin/ named urlfilter-urllength. The name tells you what our custom plug-in does.
In this folder we write a class that implements URLFilter; a URL filter plug-in has no other way to hook in: public class UrlLengthFilter implements URLFilter.
See the detailed source code below:
package com.xp.se.urllength;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

public class UrlLengthFilter implements URLFilter {

    private static final Log LOG = LogFactory.getLog(UrlLengthFilter.class);
    private Configuration conf;

    // Note: Nutch treats a null return value as "reject this URL".
    @Override
    public String filter(String inUrl) {
        LOG.info("begin urllengthfilter is .....");
        String urlfilter = "";
        // Strings must be compared with isEmpty()/equals(), not ==
        if (inUrl == null || inUrl.isEmpty()) return urlfilter;
        String url = inUrl.toLowerCase();
        // Start from the first character after "http://" or "https://",
        // or from the first character if there is no such prefix
        int start = url.startsWith("http://") ? 7 : (url.startsWith("https://") ? 8 : 0);
        url = url.substring(start);
        int end = url.indexOf("/");
        end = end < 0 ? url.length() : end;
        // Keep everything up to the first "/" or the end of the string
        urlfilter = url.substring(0, end);
        LOG.info("urlfilter is " + urlfilter);
        return urlfilter;
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }
}
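To try the string handling without a Nutch build, the steps of the filter can be sketched as a standalone class. HostExtractDemo and extractHost are hypothetical names used only for illustration; they are not part of Nutch or of the plug-in.

```java
// Standalone sketch: HostExtractDemo and extractHost are hypothetical names,
// not part of Nutch. It follows the same steps as the filter method.
class HostExtractDemo {

    // Returns the lower-cased host part of a URL, or "" for null/empty input.
    static String extractHost(String inUrl) {
        if (inUrl == null || inUrl.isEmpty()) {
            return "";
        }
        String url = inUrl.toLowerCase();
        // Skip the scheme prefix if there is one.
        int start = 0;
        if (url.startsWith("http://")) {
            start = 7;
        } else if (url.startsWith("https://")) {
            start = 8;
        }
        url = url.substring(start);
        // Cut at the first '/' after the host, or keep everything.
        int end = url.indexOf('/');
        return end < 0 ? url : url.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(extractHost("http://www.163.com/"));     // www.163.com
        System.out.println(extractHost("https://Example.com/a/b")); // example.com
        System.out.println(extractHost("www.163.com"));             // www.163.com
    }
}
```

Keeping the logic in a plain static method like this also makes it easy to unit-test before wiring it into a plug-in.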
2. Next, of course, comes plugin.xml. What is it for? Simply put, it tells Nutch to load the class we wrote (the class that implements the actual function). The file is pasted below for analysis:
<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="urlfilter-urllength"
   name="suffix url filter"
   version="1.0.0"
   provider-name="xp.com">
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
   <extension id="com.xp.se.test.urllength"
              name="nutch URL Length filter"
              point="org.apache.nutch.net.URLFilter">
      <implementation id="urllengthfilter" class="com.xp.se.urllength.UrlLengthFilter"/>
   </extension>
</plugin>
Analysis:
A. The first line is the XML declaration, which gives the XML version and the encoding.
B. Attributes of the plugin element:
id: the ID of the plug-in.
name: what the plug-in does.
version: the version of the plug-in.
provider-name: who provides it (the source).
C. nutch-extensionpoints: this plug-in declares the extension points that Nutch provides, so we import it here.
D. extension: as the name implies, this registers the plug-in we wrote so that Nutch can use it. Anyone who has read the Nutch source code knows that it has a counterpart, extension-point. If you don't know about that relationship, it doesn't matter; carry on and look it up later.
E. Attributes of the extension element:
id: a name we define.
name: what this extension does. Read it and the purpose should be clear.
point: the Nutch extension point that this extension plugs into. The value must match the ID of a specific extension point.
implementation: the class we wrote.
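The class attribute of implementation is just a fully qualified class name, which the framework later turns into an object. The core idea can be sketched with plain Java reflection. This is a simplified model only: UrlFilterLike, EchoFilter and LoadByNameDemo are hypothetical names, and Nutch's real plug-in loading involves more than this (its own class loaders and plug-in descriptors).

```java
// Simplified model, not Nutch code: all type names here are hypothetical.
class LoadByNameDemo {

    // Instantiates a filter from a fully qualified class name, the same kind
    // of string that the class attribute of <implementation> carries.
    static UrlFilterLike load(String className) {
        try {
            Class<?> clazz = Class.forName(className);
            Object instance = clazz.getDeclaredConstructor().newInstance();
            return (UrlFilterLike) instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load " + className, e);
        }
    }

    public static void main(String[] args) {
        // In a real plug-in the name would come from plugin.xml.
        UrlFilterLike f = load(EchoFilter.class.getName());
        System.out.println(f.filter("http://www.163.com/"));
    }
}

// Stand-in for the extension point interface.
interface UrlFilterLike {
    String filter(String url);
}

// Stand-in for the class we wrote; it returns the URL unchanged.
class EchoFilter implements UrlFilterLike {
    public String filter(String url) {
        return url;
    }
}
```

This is why a typo in the class attribute typically only shows up at run time: the name is resolved when the extension is instantiated, not when the XML is written.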
3. There is usually also a build.xml, which is used to compile the plug-in.
4. Add the plug-in we have written to plugin.includes in nutch-site.xml and try it out.
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|urllength)|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
Careful readers will have noticed that urllength has already been added to the value above. If you missed it, take a closer look.
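Since plugin.includes is a regular expression over plug-in directory names, it is easy to check which names a given value selects. The sketch below uses java.util.regex and assumes that the whole directory name has to match the expression; PluginIncludesDemo and isIncluded are hypothetical names, not Nutch API.

```java
import java.util.regex.Pattern;

// Hypothetical helper for experimenting with a plugin.includes value.
class PluginIncludesDemo {

    // The value from the property above, split only for readability.
    static final String INCLUDES =
        "protocol-http|urlfilter-(regex|urllength)|parse-(text|html|js)"
        + "|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)"
        + "|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)";

    // Assumption: the whole directory name must match the expression.
    static boolean isIncluded(String pluginDir) {
        return Pattern.compile(INCLUDES).matcher(pluginDir).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIncluded("urlfilter-urllength")); // true
        System.out.println(isIncluded("urlfilter-suffix"));    // false
        System.out.println(isIncluded("protocol-httpclient")); // false
    }
}
```

A misplaced parenthesis or pipe in the value silently excludes a plug-in, so a quick check like this can save a debugging session.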
Summary:
You can step through the code yourself to see it at work. For example, if the URL passed in is http://www.163.com/, our filter turns it into www.163.com, and the program then terminates. The purpose of this article is to clarify the loading process and to show how to write a custom Nutch plug-in. We can write many plug-ins, all of which can be implemented in the way described above.
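One thing worth keeping in mind when experimenting: a URL filter's return value replaces the URL for whatever runs after it, and a null return rejects the URL. The following is a simplified, self-contained model of such a chain, not Nutch's own code. FilterChainDemo, stripToHost and the second "http only" filter are hypothetical and exist only to show the effect.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Simplified model of a filter chain; all names are hypothetical.
class FilterChainDemo {

    // Passes the URL through each filter; a null result rejects the URL.
    static String applyAll(List<UnaryOperator<String>> filters, String url) {
        for (UnaryOperator<String> f : filters) {
            if (url == null) {
                return null;
            }
            url = f.apply(url);
        }
        return url;
    }

    // Same idea as the article's filter: keep only the host part.
    static String stripToHost(String url) {
        String u = url.toLowerCase();
        int start = u.startsWith("http://") ? 7 : (u.startsWith("https://") ? 8 : 0);
        u = u.substring(start);
        int end = u.indexOf('/');
        return end < 0 ? u : u.substring(0, end);
    }

    // Host-stripping filter followed by a filter that accepts only http URLs.
    static String runDemoChain(String url) {
        UnaryOperator<String> hostOnly = FilterChainDemo::stripToHost;
        UnaryOperator<String> httpOnly = u -> u.startsWith("http") ? u : null;
        return applyAll(Arrays.asList(hostOnly, httpOnly), url);
    }

    public static void main(String[] args) {
        System.out.println(stripToHost("http://www.163.com/"));  // www.163.com
        System.out.println(runDemoChain("http://www.163.com/")); // null
    }
}
```

In this model the second filter rejects the URL because the first one removed its scheme, which shows why a filter that rewrites URLs should be used with care.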
Disclaimer: The copyright of this JavaEye article belongs to the author and is protected by law. It may not be reprinted without the written consent of the author.