When you use Google, you will find that you can search by category, such as news, blogs, and shopping. This series of articles implements that feature by adding plugins to Nutch. We assume that you have a basic understanding of Nutch, can compile and configure it, and can use the crawl command provided by Nutch to fetch webpages.
This article describes how to add our index-type plugin.
When you use Luke to view the captured data, you will find more than a dozen fields by default, such as title, url, and content. We need to add a type field that indicates the website type.
Create the index-type directory under the src/plugin directory (refer to the index-basic directory structure) and add the following three Java files. Note that the package name must be consistent with the directory structure of the created files.
TypeNameFactory implements the IndexingFilter interface and serves as the entry point for the extension manager.
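The original listing for this file is not reproduced here. As a rough illustration of what the indexing step does, the following minimal sketch adds a type field to a document based on its URL. It does not use the Nutch API: the class name TypeFieldSketch, the method addTypeField, the plain Map standing in for the index document, and the first-match-wins behavior are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch of the indexing step; not the article's class. */
final class TypeFieldSketch {

    /**
     * Adds a "type" field to doc, using the first selector that accepts the URL.
     * doc is a plain map standing in for the real index document (an assumption).
     */
    static Map<String, String> addTypeField(Map<String, String> doc, String url,
                                            Map<String, Predicate<String>> selectors) {
        for (Map.Entry<String, Predicate<String>> entry : selectors.entrySet()) {
            if (entry.getValue().test(url)) {
                doc.put("type", entry.getKey());
                break; // first match wins (an assumption)
            }
        }
        return doc; // left unchanged when no selector accepts the URL
    }

    public static void main(String[] args) {
        // One hand-written selector for the "news" type, for illustration only.
        Map<String, Predicate<String>> selectors = new LinkedHashMap<>();
        selectors.put("news", url -> url.startsWith("http://news.163.com/"));

        Map<String, String> doc = new HashMap<>();
        doc.put("url", "http://news.163.com/a.html");
        System.out.println(addTypeField(doc, "http://news.163.com/a.html", selectors).get("type"));
    }
}
```

In the real plugin this logic would live inside the filter method of the IndexingFilter implementation, whose exact signature depends on the Nutch version.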
TypeNameFactory obtains the types from the value of indexer.type.includes, reads the rule file that corresponds to each type, and stores the result in a HashMap.
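A minimal sketch of that lookup, assuming the types are separated by "|" and that each rule file follows the crawl-urltype-news.txt naming pattern used in this article. The class name TypeRuleFilesSketch, the method ruleFilesFor, and the sample value "news|blog|shopping" are hypothetical; the sketch only derives the file names and does not read the files.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the type-to-rule-file mapping; not the article's class. */
final class TypeRuleFilesSketch {

    /** Maps each type in a "|"-separated list to the name of its rule file. */
    static Map<String, String> ruleFilesFor(String includes) {
        // LinkedHashMap keeps the configured order; a plain HashMap would also work.
        Map<String, String> files = new LinkedHashMap<>();
        // "|" is a regex metacharacter, so it must be escaped for String.split.
        for (String type : includes.split("\\|")) {
            String name = type.trim();
            if (!name.isEmpty()) {
                files.put(name, "crawl-urltype-" + name + ".txt");
            }
        }
        return files;
    }

    public static void main(String[] args) {
        // "news|blog|shopping" is a made-up sample value for indexer.type.includes.
        System.out.println(ruleFilesFor("news|blog|shopping"));
    }
}
```

Escaping the separator matters here: an unescaped "|" in String.split is an empty alternation and would split the value into single characters.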
TypeNameSelector corresponds to one rule file and provides a filter interface for filtering URLs.
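As an illustration of how such a selector might evaluate a rule file, here is a minimal sketch. It assumes each rule line starts with "+" (accept) or "-" (reject) followed by a regular expression, that lines starting with "#" are comments, and that the first matching rule decides. The class name UrlTypeSelectorSketch and the method accepts are hypothetical and not taken from the plugin.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a per-type URL selector; not the article's class. */
final class UrlTypeSelectorSketch {

    /** One rule line: a sign ('+' accept, '-' reject) and a regular expression. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses rule lines; blank lines and lines starting with '#' are ignored. */
    UrlTypeSelectorSketch(List<String> lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1).trim())));
        }
    }

    /** Returns the verdict of the first rule whose pattern is found in the URL. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false; // no rule matched: treated as rejected (an assumption)
    }

    public static void main(String[] args) {
        UrlTypeSelectorSketch news = new UrlTypeSelectorSketch(Arrays.asList(
                "# accept hosts in",
                "+^http://news.sina.com.cn/",
                "# skip everything else",
                "-."));
        System.out.println(news.accepts("http://news.sina.com.cn/a.html"));
        System.out.println(news.accepts("http://example.com/"));
    }
}
```

Because the first matching rule decides, the catch-all "-." line must come last in a rule file; placed first, it would reject every URL.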
Add org.apache.nutch.urlfilter.api.Rule to lib-regex-filter.
Add build.xml. The plugin uses the lib-regex-filter library, so add a reference to it.
Add plugin.xml. Note that the extension point is "org.apache.nutch.indexer.IndexingFilter", which is the public interface for indexing.
Another important step is to add index-type to the build system. Modify src/plugin/build.xml and add the following build entries:
    <target name="deploy">
      <ant dir="index-type" target="deploy"/>
    </target>
    <target name="clean">
      <ant dir="index-type" target="clean"/>
    </target>
At this point, the build also compiles our index-type plugin. Check that the compilation succeeds and produces index-type.jar.
After a successful compilation, let's try to crawl some webpages. Before doing so, you need to modify conf/nutch-site.xml:
Add the indexer.type.includes parameter. Its value lists the webpage types, with multiple types separated by "|". Also add index-type to the value of the plugin.includes parameter, so that the index-type plugin is loaded automatically when the system starts.
For each type of webpage, provide a rule file of regular expressions. The following crawl-urltype-news.txt is a rule file designed for news.
    # accept hosts in
    +^http://news.sina.com.cn/
    +^http://news.baidu.com/
    +^http://news.163.com/
    # skip everything else
    -.
The other four types are handled the same way. The file naming rule is "crawl-urltype-(yourwebtype).txt", where yourwebtype corresponds one-to-one to the values in the indexer.type.includes parameter. The files are placed in the conf directory.
Then set the starting URLs for the crawl and run crawl to fetch data from the network. You can use Luke to view the final result and find that there is an additional type field. If enough webpages are crawled, five values appear, which are the ones defined in indexer.type.includes.
At this point, the indexer can set the type field according to the type of each webpage as it is indexed.