Nutch Index Source Code Analysis (I)
An introduction to the indexing method of Nutch integrated with Solr:
/**
 * Indexing
 * @param solrUrl Solr server URL
 * @param crawlDb CrawlDb storage path, e.g. \crawl\crawldb
 * @param linkDb LinkDb storage path, e.g. \crawl\linkdb
 * @param segments segment (metadata) storage path, e.g. \crawl\segments
 * @param noCommit if true, do not commit to the Solr server after indexing
 * @param deleteGone whether to delete obsolete (gone) documents
 * @param solrParams additional Solr parameters
 * @param filter whether to enable URL filtering
 * @param normalize whether to normalize URLs
 * @throws IOException
 */
public void indexSolr(String solrUrl, Path crawlDb, Path linkDb,
    List<Path> segments, boolean noCommit, boolean deleteGone, String solrParams,
    boolean filter, boolean normalize) throws IOException {
  ...
  IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job);
  ...
}
Nutch's indexing is accomplished by a single MapReduce job.
The input to the map phase is the SequenceFiles under the Nutch crawl directory; the key is the URL fetched by Nutch, and the value is a NutchWritable object, a subclass of Hadoop's GenericWritable into which all of Nutch's custom data types are wrapped.
The data types that a NutchWritable can contain are listed in its source code:
CLASSES = new Class[] {
  org.apache.hadoop.io.NullWritable.class,
  org.apache.hadoop.io.BooleanWritable.class,
  org.apache.hadoop.io.LongWritable.class,
  org.apache.hadoop.io.BytesWritable.class,
  org.apache.hadoop.io.FloatWritable.class,
  org.apache.hadoop.io.IntWritable.class,
  org.apache.hadoop.io.MapWritable.class,
  org.apache.hadoop.io.Text.class,
  org.apache.hadoop.io.MD5Hash.class,
  org.apache.nutch.crawl.CrawlDatum.class,
  org.apache.nutch.crawl.Inlink.class,
  org.apache.nutch.crawl.Inlinks.class,
  org.apache.nutch.fetcher.FetcherOutput.class,
  org.apache.nutch.metadata.Metadata.class,
  org.apache.nutch.parse.Outlink.class,
  org.apache.nutch.parse.ParseText.class,
  org.apache.nutch.parse.ParseData.class,
  org.apache.nutch.parse.ParseImpl.class,
  org.apache.nutch.parse.ParseStatus.class,
  org.apache.nutch.protocol.Content.class,
  org.apache.nutch.protocol.ProtocolStatus.class,
  org.apache.nutch.scoring.webgraph.LinkDatum.class,
};
These types cover the data Nutch produces at each stage of the crawl.
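As a rough sketch of the wrapper pattern (the structure below follows Hadoop's GenericWritable contract and is only assumed to match Nutch's NutchWritable), the subclass simply reports the table of supported classes so that heterogeneous values can travel through one reducer under a single type:

    import org.apache.hadoop.io.GenericWritable;
    import org.apache.hadoop.io.Writable;

    // Minimal sketch: GenericWritable serializes a small type index plus the
    // wrapped value, so CrawlDatum, ParseData, Inlinks, ... can all be emitted
    // as the same value class.
    public class NutchWritable extends GenericWritable {

      // populated with the CLASSES array shown above
      private static Class<? extends Writable>[] CLASSES = null;

      public NutchWritable() { }

      public NutchWritable(Writable instance) {
        set(instance); // store the wrapped value
      }

      @Override
      protected Class<? extends Writable>[] getTypes() {
        return CLASSES; // GenericWritable uses this table to (de)serialize
      }
    }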
The map phase does not touch the value; only the URL key is processed, as in the following line:
String urlString = filterUrl(normalizeUrl(key.toString()));
This call filters and normalizes the URL according to the configured filter rules; both steps can be switched on or off from the command line. A rough sketch of the map method is given below.
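The sketch below shows the shape of the map side under the old mapred API. The filterUrl and normalizeUrl helpers are taken from the line quoted above, but the surrounding structure is assumed for illustration rather than copied from Nutch:

    // Sketch of the map side: normalize and filter the URL key, then re-emit
    // the value unchanged so the reducer sees everything grouped per URL.
    public void map(Text key, Writable value,
        OutputCollector<Text, NutchWritable> output, Reporter reporter)
        throws IOException {

      String urlString = filterUrl(normalizeUrl(key.toString()));
      if (urlString == null) {
        // the URL was rejected by the filters; drop the record
        return;
      }
      key.set(urlString);

      // pass the value through untouched, wrapped in NutchWritable
      output.collect(key, new NutchWritable(value));
    }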
The reduce phase handles all of the crawled data; the annotated code follows:
/**
 * Output format: URL as key, index action as value.
 */
public void reduce(Text key, Iterator<NutchWritable> values,
    OutputCollector<Text, NutchIndexAction> output, Reporter reporter)
    throws IOException {
  Inlinks inlinks = null;
  CrawlDatum dbDatum = null;
  CrawlDatum fetchDatum = null;
  ParseData parseData = null;
  ParseText parseText = null;

  while (values.hasNext()) {
    final Writable value = values.next().get(); // unwrap
    // if the value is inlink data (links pointing to this URL)
    if (value instanceof Inlinks) {
      inlinks = (Inlinks) value;
    }
    // if the value is crawl data
    else if (value instanceof CrawlDatum) {
      final CrawlDatum datum = (CrawlDatum) value;
      // if the current datum carries a CrawlDb (injected) status
      if (CrawlDatum.hasDbStatus(datum)) {
        dbDatum = datum;
      }
      // if the current datum carries a fetch-completed status
      else if (CrawlDatum.hasFetchStatus(datum)) {
        // don't index unmodified (empty) pages
        // check whether the fetched page has been modified
        if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
          fetchDatum = datum;

          /**
           * Check if we need to delete 404 NOT FOUND and permanent redirects.
           */
          // delete gone and redirected pages if the option was set on the command line
          if (delete) {
            // if the fetched page is gone, emit a delete action
            if (fetchDatum.getStatus() == CrawlDatum.STATUS_FETCH_GONE) {
              reporter.incrCounter("IndexerStatus", "Documents deleted", 1);
              NutchIndexAction action = new NutchIndexAction(null, NutchIndexAction.DELETE);
              output.collect(key, action);
              return;
            }
            // if the fetched page was permanently redirected to another page, emit a delete action
            if (fetchDatum.getStatus() == CrawlDatum.STATUS_FETCH_REDIR_PERM) {
              reporter.incrCounter("IndexerStatus", "Perm redirects deleted", 1);
              NutchIndexAction action = new NutchIndexAction(null, NutchIndexAction.DELETE);
              output.collect(key, action);
              return;
            }
          }
        }
      }
      // URL discovered via another URL, or a page signature, or page metadata generated by the parser
      else if (CrawlDatum.STATUS_LINKED == datum.getStatus()
          || CrawlDatum.STATUS_SIGNATURE == datum.getStatus()
          || CrawlDatum.STATUS_PARSE_META == datum.getStatus()) {
        continue;
      } else {
        throw new RuntimeException("Unexpected status: " + datum.getStatus());
      }
    }
    // if the value is parse data
    else if (value instanceof ParseData) {
      parseData = (ParseData) value;

      // handle robots meta? https://issues.apache.org/jira/browse/NUTCH-1434
      if (deleteRobotsNoIndex) {
        // get the robots meta data
        String robotsMeta = parseData.getMeta("robots");

        // does it carry a noindex for this URL?
        if (robotsMeta != null && robotsMeta.toLowerCase().indexOf("noindex") != -1) {
          // delete it!
          NutchIndexAction action = new NutchIndexAction(null, NutchIndexAction.DELETE);
          output.collect(key, action);
          return;
        }
      }
    }
    // if the value is the parsed text
    else if (value instanceof ParseText) {
      parseText = (ParseText) value;
    } else if (LOG.isWarnEnabled()) {
      LOG.warn("Unrecognized type: " + value.getClass());
    }
  }

  // if we only have inlinks, with no crawl history or fetch data, return directly
  if (fetchDatum == null || dbDatum == null
      || parseText == null || parseData == null) {
    return; // only have inlinks
  }

  // whether to skip DB_NOTMODIFIED pages:
  // the page was crawled before and has not changed; skipping is enabled on the command line
  if (skip && dbDatum.getStatus() == CrawlDatum.STATUS_DB_NOTMODIFIED) {
    reporter.incrCounter("IndexerStatus", "Skipped", 1);
    return;
  }

  // the page was fetched but parsing failed, or the fetch itself did not succeed: return directly
  if (!parseData.getStatus().isSuccess()
      || fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
    return;
  }

  NutchDocument doc = new NutchDocument();
  // get the page metadata from the parsed data
  final Metadata metadata = parseData.getContentMeta();

  // add segment, used to map from the merged index back to segment files
  doc.add("segment", metadata.get(Nutch.SEGMENT_NAME_KEY));

  // add digest (page signature), used by dedup
  doc.add("digest", metadata.get(Nutch.SIGNATURE_KEY));

  final Parse parse = new ParseImpl(parseText, parseData);
  try {
    // extract information from dbDatum and pass it to
    // fetchDatum so that indexing filters can use it
    final Text url = (Text) dbDatum.getMetaData().get(Nutch.WRITABLE_REPR_URL_KEY);
    if (url != null) {
      fetchDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY, url);
    }
    // run all indexing filters
    doc = this.filters.filter(doc, parse, key, fetchDatum, inlinks);
  } catch (final IndexingException e) {
    if (LOG.isWarnEnabled()) { LOG.warn("Error indexing " + key + ": " + e); }
    reporter.incrCounter("IndexerStatus", "Errors", 1);
    return;
  }

  // skip documents discarded by indexing filters
  if (doc == null) {
    reporter.incrCounter("IndexerStatus", "Skipped by filters", 1);
    return;
  }

  float boost = 1.0f;
  // run scoring filters
  try {
    boost = this.scfilters.indexerScore(key, doc, dbDatum,
        fetchDatum, parse, inlinks, boost);
  } catch (final ScoringFilterException e) {
    if (LOG.isWarnEnabled()) {
      LOG.warn("Error calculating score " + key + ": " + e);
    }
    return;
  }
  // apply the boost (score) to all indexed fields as the document weight
  doc.setWeight(boost);
  // store boost for use by explain and dedup
  doc.add("boost", Float.toString(boost));

  reporter.incrCounter("IndexerStatus", "Documents added", 1);
  NutchIndexAction action = new NutchIndexAction(doc, NutchIndexAction.ADD);
  output.collect(key, action);
}
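The delete, skip, and deleteRobotsNoIndex fields used above are plain booleans read from the job configuration when the reducer is set up, and the command-line switches (-deleteGone and friends) simply set the corresponding properties on the job. A minimal sketch of that wiring is shown below; the property names are illustrative assumptions and should be checked against IndexerMapReduce in your Nutch version:

    // Sketch of how the reducer flags might be pulled from the job configuration.
    // Property names here are illustrative assumptions.
    private boolean skip = false;                 // skip DB_NOTMODIFIED pages
    private boolean delete = false;               // delete gone/redirected pages
    private boolean deleteRobotsNoIndex = false;  // honour robots "noindex" meta

    public void configure(JobConf job) {
      // the real configure also builds the indexing and scoring filter chains
      this.delete = job.getBoolean("indexer.delete", false);
      this.deleteRobotsNoIndex = job.getBoolean("indexer.delete.robots.noindex", false);
      this.skip = job.getBoolean("indexer.skip.notmodified", false);
    }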
The analysis stops here for now; the rest will follow in a later post.