Nutch 1.3 Learning Notes: Extending a Nutch Plug-in to Customize Index Fields

Last Update:2018-12-04 Source: Internet

Author: User

Tags solr

Extending a Nutch plug-in to implement custom index fields

1. Introduction to using Nutch and Solr

1.1 Some basic configuration

Add the http.agent.name property to conf/nutch-site.xml.
Create a seed directory with mkdir -p urls, and put a seed file in it containing a URL such as http://nutch.apache.org/
Edit conf/regex-urlfilter.txt to configure the URL filter. The defaults are usually fine. To crawl only URLs under nutch.apache.org, you can add the following rule:

    +^http://([a-z0-9]*\.)*nutch.apache.org/

Use the following command to crawl the pages:

    bin/nutch crawl urls -dir crawl -depth 3 -topN 5

Parameters:
-dir is the name of the directory that receives the crawl results.
-depth is the crawl depth.
-topN is the maximum number of pages fetched at each level.

When the crawl finishes, the following directories will generally have been created: crawl/crawldb, crawl/linkdb, crawl/segments.
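The URL filter rule from the steps above can be tried out in isolation before starting a crawl. The following is a minimal sketch, not Nutch code: the class name UrlFilterRuleSketch and its accepts method are made up for illustration, and it assumes the rule is applied with partial-match (find) semantics after the leading + (which marks an "include" rule) has been stripped.

```java
import java.util.regex.Pattern;

// Hypothetical helper for experimenting with the rule; not part of Nutch.
final class UrlFilterRuleSketch {
    // Regex part of the rule from conf/regex-urlfilter.txt, without the leading '+'.
    static final Pattern RULE =
            Pattern.compile("^http://([a-z0-9]*\\.)*nutch.apache.org/");

    // Assumption: a URL is accepted when the pattern is found in it.
    static boolean accepts(String url) {
        return RULE.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://nutch.apache.org/",
            "http://wiki.nutch.apache.org/about.html",
            "http://lucene.apache.org/",
            "https://nutch.apache.org/"
        };
        for (String url : urls) {
            // '+' marks a URL the rule would include, '-' one it would not.
            System.out.println((accepts(url) ? "+ " : "- ") + url);
        }
    }
}
```

Note that the rule only covers http:// URLs, so an https:// seed would need its own rule.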

Create the index with Solr

    bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb crawldb/segments/*

Adjust the crawldb, linkdb, and segments paths to match your own crawl output directory. This command only works while the Solr service is running. To start the default Solr service:

    cd ${APACHE_SOLR_HOME}/example
    java -jar start.jar

You can test it by entering the following URLs in your browser:

    http://localhost:8983/solr/admin/
    http://localhost:8983/solr/admin/stats.jsp

To use Solr together with Nutch, Solr also needs a matching schema. Nutch ships a default one in its conf directory; copy it into the Solr configuration directory and restart Solr:

    cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/

After the index is created, you can query it by keyword. Solr returns XML by default.
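A keyword query can also be sent as a plain HTTP request instead of through the admin page. The sketch below only builds such a URL. It assumes the default /select request handler of the example Solr setup and the common q, start, rows, and indent parameters; the class name SolrQueryUrlSketch and the method selectUrl are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; it only assembles a URL and does not contact Solr.
final class SolrQueryUrlSketch {
    // Builds a query URL against the (assumed) default /select handler.
    static String selectUrl(String solrBase, String query, int start, int rows) {
        String base = solrBase.endsWith("/") ? solrBase : solrBase + "/";
        String encoded;
        try {
            // Percent-encode the query so spaces and colons are URL-safe.
            encoded = URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM.
            throw new IllegalStateException(e);
        }
        return base + "select?q=" + encoded
                + "&start=" + start + "&rows=" + rows + "&indent=on";
    }

    public static void main(String[] args) {
        System.out.println(selectUrl("http://localhost:8983/solr/", "title:nutch", 0, 10));
    }
}
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the XML response if Solr is running at that address.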

2. Introduction to Nutch's indexing filter plug-ins

Apart from a few metadata fields such as segment, boost, and digest, all of Nutch's other index fields are produced by indexing filter plug-ins such as index-basic, index-more, and index-anchor, all of which are built on Nutch's plug-in mechanism. To customize index fields, you need to implement the IndexingFilter interface, which is defined as follows:

    /**
     * Extension point for indexing. Permits one to add metadata to the indexed
     * fields. All plugins found which implement this extension point are run
     * sequentially on the parse.
     */
    public interface IndexingFilter extends Pluggable, Configurable {
      /** The name of the extension point. */
      final static String X_POINT_ID = IndexingFilter.class.getName();

      /**
       * Adds fields or otherwise modifies the document that will be indexed for a
       * parse. Unwanted documents can be removed from indexing by returning a null value.
       *
       * @param doc document instance for collecting fields
       * @param parse parse data instance
       * @param url page url
       * @param datum crawl datum for the page
       * @param inlinks page inlinks
       * @return modified (or a new) document instance, or null (meaning the document
       * should be discarded)
       * @throws IndexingException
       */
      NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
        throws IndexingException;
    }

The main task is to implement the abstract filter method.
The reduce method of IndexerMapReduce.java calls the indexing filter plug-ins to populate the index fields.
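
The contract described in the interface comment (filters run one after another, and a null return discards the document) can be sketched without any Nutch classes. Everything below is a simplified stand-in: SimpleFilter, runFilters, and the three demo filters are hypothetical, and a plain Map takes the place of NutchDocument.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration of a sequential filter chain; not the Nutch implementation.
final class FilterChainSketch {
    /** Hypothetical stand-in for an indexing filter. */
    interface SimpleFilter {
        Map<String, Object> filter(Map<String, Object> doc, String url);
    }

    // Runs the filters in order; stops as soon as one of them returns null.
    static Map<String, Object> runFilters(List<SimpleFilter> filters,
                                          Map<String, Object> doc, String url) {
        for (SimpleFilter f : filters) {
            doc = f.filter(doc, url);
            if (doc == null) {
                return null; // the document is discarded and will not be indexed
            }
        }
        return doc;
    }

    // Builds a small demo chain and runs it for one URL.
    static Map<String, Object> demo(String url) {
        List<SimpleFilter> filters = new ArrayList<>();
        filters.add((doc, u) -> { doc.put("url", u); return doc; });              // adds a field
        filters.add((doc, u) -> u.endsWith(".pdf") ? null : doc);                 // discards PDFs
        filters.add((doc, u) -> { doc.put("metadata", "example"); return doc; }); // adds a field
        return runFilters(filters, new LinkedHashMap<String, Object>(), url);
    }

    public static void main(String[] args) {
        System.out.println(demo("http://nutch.apache.org/"));
        System.out.println(demo("http://nutch.apache.org/manual.pdf"));
    }
}
```

In this sketch the third filter never sees a document that the second one rejected, which is the behavior the "return null to discard" rule implies.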

3. Write your own indexing filter plug-in

Suppose we need to customize the field values in the index, for example to generate a metadata field and a fetchtime field that can then be queried and displayed in Solr. The corresponding plug-in code and some notes follow.
First, write an indexing filter. The code is as follows:

    package org.apache.nutch.indexer.metadata;

    /**
     * Licensed to the Apache Software Foundation (ASF) under one or more
     * contributor license agreements. See the NOTICE file distributed with
     * this work for additional information regarding copyright ownership.
     * The ASF licenses this file to You under the Apache License, Version 2.0
     * (the "License"); you may not use this file except in compliance with
     * the License. You may obtain a copy of the License at
     *
     * http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;

    import java.util.Date;

    import org.apache.hadoop.conf.Configuration;

    /**
     * Add (or reset) a few metadata properties as respective fields (if they are
     * available), so that they can be displayed by more.jsp (called by search.jsp).
     *
     * @author Lemo Lu
     */
    public class MetadataIndexingFilter implements IndexingFilter {
      public static final Logger LOG = LoggerFactory.getLogger(MetadataIndexingFilter.class);

      private Configuration conf;

      public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
        throws IndexingException {
        // add metadata field
        addMetadata(doc, parse.getData(), datum);
        // add fetch time field
        addFetchTime(doc, parse.getData(), datum);
        return doc;
      }

      // Adds the fetch time of the page as a Date in the "fetchtime" field.
      private NutchDocument addFetchTime(NutchDocument doc, ParseData data, CrawlDatum datum) {
        long fetchTime = datum.getFetchTime();
        doc.add("fetchtime", new Date(fetchTime));
        return doc;
      }

      // Adds the string form of the parse metadata in the "metadata" field.
      private NutchDocument addMetadata(NutchDocument doc, ParseData data, CrawlDatum datum) {
        String metadata = data.getParseMeta().toString();
        doc.add("metadata", metadata);
        return doc;
      }

      public void setConf(Configuration conf) {
        this.conf = conf;
      }

      public Configuration getConf() {
        return this.conf;
      }
    }

At this point the plug-in code is ready. Package it as a jar and place it in an index-metadata directory under the plugins directory; you need to create this index-metadata directory yourself. Then write the corresponding plugin.xml configuration file so that Nutch's plug-in system can load the module correctly and dynamically. For the plug-in to be activated, its id (index-metadata) must also be matched by the plugin.includes property in conf/nutch-site.xml. The plugin.xml is as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <!--
     Licensed to the Apache Software Foundation (ASF) under one or more
     contributor license agreements. See the NOTICE file distributed with
     this work for additional information regarding copyright ownership.
     The ASF licenses this file to You under the Apache License, Version 2.0
     (the "License"); you may not use this file except in compliance with
     the License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

     Unless required by applicable law or agreed to in writing, software
     distributed under the License is distributed on an "AS IS" BASIS,
     WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     See the License for the specific language governing permissions and
     limitations under the License.
    -->
    <plugin
       id="index-metadata"
       name="Metadata Indexing Filter"
       version="1.0.0"
       provider-name="nutch.org">

       <runtime>
          <library name="index-metadata.jar">
             <export name="*"/>
          </library>
       </runtime>

       <requires>
          <import plugin="nutch-extensionpoints"/>
       </requires>

       <extension id="org.apache.nutch.indexer.more"
                  name="Nutch Metadata Indexing Filter"
                  point="org.apache.nutch.indexer.IndexingFilter">
          <implementation id="MetadataIndexingFilter"
                          class="org.apache.nutch.indexer.metadata.MetadataIndexingFilter"/>
       </extension>
    </plugin>

At this point the custom indexing filter itself is complete.
Two more files must be configured. The first is schema.xml; add the following inside its fields tag:

    <!-- metadata fields -->
    <field name="fetchtime" type="date" stored="true" indexed="true"/>
    <field name="metadata" type="string" stored="true" indexed="true"/>

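Note:
stored indicates whether the value of this field is stored in the Lucene index, so that it can be returned in query results.
indexed indicates whether the value of this field is indexed, so that it can be searched. Whether the value is tokenized (word-segmented) depends on the field type.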

The second is solrindex-mapping.xml. This file maps the field names produced by the indexing filters to the field names in schema.xml. Add the following inside its fields tag:

    <field dest="fetchtime" source="fetchtime"/>
    <field dest="metadata" source="metadata"/>
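
The idea behind the source/dest mapping can be illustrated with a few lines of plain Java. This is only a sketch of the concept, not the code Nutch uses to read solrindex-mapping.xml: the class FieldMappingSketch, its methods, and the alternative destination name fetch_time are hypothetical, and passing unmapped fields through unchanged is an assumption made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Conceptual illustration of a source -> dest field mapping; not Nutch code.
final class FieldMappingSketch {
    // Renames each document field according to the mapping.
    // Assumption: fields without a mapping entry keep their original name.
    static Map<String, Object> applyMapping(Map<String, String> sourceToDest,
                                            Map<String, Object> doc) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            String dest = sourceToDest.getOrDefault(e.getKey(), e.getKey());
            out.put(dest, e.getValue());
        }
        return out;
    }

    // Demo: maps "fetchtime" to the given destination name and "metadata" to itself.
    static Map<String, Object> demo(String fetchTimeDest) {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("fetchtime", fetchTimeDest);
        mapping.put("metadata", "metadata");

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("fetchtime", "2012-04-11T03:19:33.088Z");
        doc.put("metadata", "example metadata");
        doc.put("title", "Welcome to Apache Nutch");
        return applyMapping(mapping, doc);
    }

    public static void main(String[] args) {
        System.out.println(demo("fetchtime"));  // identity mapping, as in this article
        System.out.println(demo("fetch_time")); // hypothetical rename
    }
}
```

In this article source and dest are identical, so the mapping entries only declare that the two new fields should be passed on to Solr under the same names.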

With that, the custom indexing filter plug-in is complete. Remember that the schema.xml referred to here is the one in Solr's conf directory (solr/conf), and that Solr must be restarted after the file is modified. I do not know whether Solr can pick up configuration changes without a restart.

4. Create an index

Use the same command as before to rebuild the index:

    bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb crawldb/segments/*

The index has now been created. Incidentally, Solr's index files are in solr/data/index. You can open them with the Luke tool to look at the index metadata; you should be able to see the fetchtime and metadata fields there.

5. Query

Now open http://localhost:8983/solr/admin/ in a browser and enter a query. The result looks roughly like this:

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        <lst name="params">
          <str name="indent">on</str>
          <str name="start">0</str>
          <str name="q"></str>
          <str name="version">2.2</str>
          <str name="rows">10</str>
        </lst>
      </lst>
      <result name="response" numFound="1" start="0">
        <doc>
          <float name="boost">1.1090536</float>
          <str name="digest">da3aefc69d8a5a7c1ea5447f9680d66d</str>
          <date name="fetchtime">2012-04-11T03:19:33.088Z</date>
          <str name="id">http://nutch.apache.org/</str>
          <str name="metadata">CharEncodingForConversion=UTF-8 OriginalCharEncoding=UTF-8</str>
          <str name="segment">20120410231836</str>
          <str name="title">Welcome to Apache Nutch</str>
          <date name="tstamp">2012-04-11T03:19:33.088Z</date>
          <str name="url">http://nutch.apache.org/</str>
        </doc>
      </result>
    </response>

Can you see the values of the two fields we defined, fetchtime and metadata?
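
The fetchtime value in the response is an ISO-8601 timestamp in UTC, while the filter starts from a millisecond value (datum.getFetchTime()) wrapped in a java.util.Date. The sketch below shows how the two representations relate using only the JDK. The class FetchTimeSketch and its methods are hypothetical, and it is not a claim about how Nutch or Solr perform the conversion internally.

```java
import java.time.Instant;
import java.util.Date;

// Relates epoch milliseconds to the ISO-8601 UTC form seen in the Solr response.
final class FetchTimeSketch {
    // Formats epoch milliseconds as an ISO-8601 instant in UTC.
    static String toSolrDate(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }

    // Parses an ISO-8601 UTC instant back into epoch milliseconds.
    static long fromSolrDate(String isoUtc) {
        return Instant.parse(isoUtc).toEpochMilli();
    }

    public static void main(String[] args) {
        // Timestamp taken from the example response above.
        long millis = fromSolrDate("2012-04-11T03:19:33.088Z");
        Date asDate = new Date(millis); // the type the filter adds to the document
        System.out.println(millis + " -> " + toSolrDate(asDate.getTime()));
    }
}
```

One caveat of this sketch: Instant.toString() omits the fractional part when the milliseconds are zero, so the output is not always padded to three decimal places.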

6. Reference

http://wiki.apache.org/nutch/NutchTutorial#A4._Setup_Solr_for_search

This article is an English version of an article originally written in Chinese on aliyun.com.