Nutch Index Source Code Analysis (1)


An introduction to the indexing method used when Nutch is integrated with Solr.

/**
 * Indexing
 * @param solrUrl     Solr server URL
 * @param crawlDb     crawl db storage path: \crawl\crawldb
 * @param linkDb      link db storage path: \crawl\linkdb
 * @param segments    segment (metadata) storage path: \crawl\segments
 * @param noCommit    whether to skip the commit to the Solr server after indexing
 * @param deleteGone  whether to delete obsolete (gone) documents
 * @param solrParams  Solr parameters
 * @param filter      whether to enable URL filtering
 * @param normalize   whether to normalize URLs
 * @throws IOException
 */
public void indexSolr(String solrUrl, Path crawlDb, Path linkDb,
    List<Path> segments, boolean noCommit, boolean deleteGone, String solrParams,
    boolean filter, boolean normalize) throws IOException {
  ...
  IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job);
  ...
}
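For orientation, here is a minimal sketch of how these parameters might be supplied from a driver. It is not part of the original post: the paths, the Solr URL, and the indexer variable (assumed to be a SolrIndexer-style object exposing indexSolr) are illustrative assumptions.

import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.fs.Path;

// illustrative values only
Path crawlDb = new Path("crawl/crawldb");
Path linkDb  = new Path("crawl/linkdb");
List<Path> segments = Arrays.asList(new Path("crawl/segments/20130101000000"));

indexer.indexSolr("http://localhost:8983/solr", crawlDb, linkDb, segments,
    false,  // noCommit:   commit to the Solr server when done
    true,   // deleteGone: delete gone / permanently redirected pages
    null,   // solrParams: no extra Solr parameters
    true,   // filter:     apply URL filters
    true);  // normalize:  normalize URLs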

Nutch's indexing is done by a single MapReduce job.

The map input is the SequenceFile data under the Nutch crawl directory: the key is the URL that Nutch crawled, and the value uses Java generics, with all of Nutch's custom data types wrapped into a single NutchWritable object.

The data types that NutchWritable can wrap are listed in the following source code:

CLASSES = new Class[] {
    org.apache.hadoop.io.NullWritable.class,
    org.apache.hadoop.io.BooleanWritable.class,
    org.apache.hadoop.io.LongWritable.class,
    org.apache.hadoop.io.BytesWritable.class,
    org.apache.hadoop.io.FloatWritable.class,
    org.apache.hadoop.io.IntWritable.class,
    org.apache.hadoop.io.MapWritable.class,
    org.apache.hadoop.io.Text.class,
    org.apache.hadoop.io.MD5Hash.class,
    org.apache.nutch.crawl.CrawlDatum.class,
    org.apache.nutch.crawl.Inlink.class,
    org.apache.nutch.crawl.Inlinks.class,
    org.apache.nutch.fetcher.FetcherOutput.class,
    org.apache.nutch.metadata.Metadata.class,
    org.apache.nutch.parse.Outlink.class,
    org.apache.nutch.parse.ParseText.class,
    org.apache.nutch.parse.ParseData.class,
    org.apache.nutch.parse.ParseImpl.class,
    org.apache.nutch.parse.ParseStatus.class,
    org.apache.nutch.protocol.Content.class,
    org.apache.nutch.protocol.ProtocolStatus.class,
    org.apache.nutch.scoring.webgraph.LinkDatum.class,
};

These types cover the data Nutch produces at each stage of the crawl.
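NutchWritable is a generic wrapper around these classes: a concrete value is wrapped before being emitted by the map side and unwrapped again in the reducer. A minimal sketch of that round trip (assuming only the NutchWritable(Writable) constructor and the get() accessor that appear in the reduce code below):

import org.apache.hadoop.io.Writable;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.NutchWritable;

// wrap a concrete value before emitting it from the map side
CrawlDatum datum = new CrawlDatum();
NutchWritable wrapped = new NutchWritable(datum);

// unwrap on the reduce side and dispatch on the concrete type
Writable value = wrapped.get();
if (value instanceof CrawlDatum) {
    CrawlDatum unwrapped = (CrawlDatum) value;
    // handle crawl data ...
}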

The map phase does not process the value; only the URL (the key) is handled, with the following code:

String urlString = filterUrl(normalizeUrl(key.toString()));

This call filters and normalizes the URL according to the configured filtering rules; filtering and normalization can of course be switched on or off from the command line.
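Putting this together, the map method is roughly the following sketch (based on the line above rather than the verbatim Nutch source): drop the record if the URL is rejected by the filters, otherwise re-key it with the cleaned URL and pass the value through wrapped in a NutchWritable.

public void map(Text key, Writable value,
    OutputCollector<Text, NutchWritable> output, Reporter reporter)
    throws IOException {
  // filter and normalize the URL; a null result means the URL was rejected
  String urlString = filterUrl(normalizeUrl(key.toString()));
  if (urlString == null) {
    return;                        // URL filtered out, drop the record
  }
  key.set(urlString);              // re-key with the normalized URL
  output.collect(key, new NutchWritable(value));
}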

The reduce phase handles all of the crawled data; the annotated code is as follows:

/**
 * Output format: URL as key, index action as value
 */
public void reduce(Text key, Iterator<NutchWritable> values,
    OutputCollector<Text, NutchIndexAction> output, Reporter reporter)
    throws IOException {
  Inlinks inlinks = null;
  CrawlDatum dbDatum = null;
  CrawlDatum fetchDatum = null;
  ParseData parseData = null;
  ParseText parseText = null;

  while (values.hasNext()) {
    final Writable value = values.next().get(); // unwrap
    // if the value is the inlink data for this URL
    if (value instanceof Inlinks) {
      inlinks = (Inlinks) value;
    }
    // if the value is crawl data
    else if (value instanceof CrawlDatum) {
      final CrawlDatum datum = (CrawlDatum) value;
      // if the current datum carries a db (injected) status
      if (CrawlDatum.hasDbStatus(datum)) {
        dbDatum = datum;
      }
      // if the current datum carries a fetch (crawl completed) status
      else if (CrawlDatum.hasFetchStatus(datum)) {
        // don't index unmodified (empty) pages
        // check whether the fetched page has been modified
        if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
          fetchDatum = datum;
          /**
           * Check if we need to delete 404 NOT FOUND and permanent redirects.
           */
          // delete gone and redirected pages if the delete option is set
          if (delete) {
            // if the fetched page is gone (expired), emit a delete action
            if (fetchDatum.getStatus() == CrawlDatum.STATUS_FETCH_GONE) {
              reporter.incrCounter("IndexerStatus", "Documents deleted", 1);
              NutchIndexAction action = new NutchIndexAction(null, NutchIndexAction.DELETE);
              output.collect(key, action);
              return;
            }
            // if the fetched page was permanently redirected to another page, emit a delete action
            if (fetchDatum.getStatus() == CrawlDatum.STATUS_FETCH_REDIR_PERM) {
              reporter.incrCounter("IndexerStatus", "Perm redirects deleted", 1);
              NutchIndexAction action = new NutchIndexAction(null, NutchIndexAction.DELETE);
              output.collect(key, action);
              return;
            }
          }
        }
      }
      // the URL was discovered via another URL || page signature || parser-generated metadata
      else if (CrawlDatum.STATUS_LINKED == datum.getStatus()
          || CrawlDatum.STATUS_SIGNATURE == datum.getStatus()
          || CrawlDatum.STATUS_PARSE_META == datum.getStatus()) {
        continue;
      } else {
        throw new RuntimeException("Unexpected status: " + datum.getStatus());
      }
    }
    // if the value is parse data
    else if (value instanceof ParseData) {
      parseData = (ParseData) value;

      // handle robots meta? https://issues.apache.org/jira/browse/NUTCH-1434
      if (deleteRobotsNoIndex) {
        // get the robots meta data
        String robotsMeta = parseData.getMeta("robots");

        // does it carry a noindex for this URL?
        if (robotsMeta != null && robotsMeta.toLowerCase().indexOf("noindex") != -1) {
          // delete it!
          NutchIndexAction action = new NutchIndexAction(null, NutchIndexAction.DELETE);
          output.collect(key, action);
          return;
        }
      }
    }
    // if the value is parsed text
    else if (value instanceof ParseText) {
      parseText = (ParseText) value;
    } else if (LOG.isWarnEnabled()) {
      LOG.warn("Unrecognized type: " + value.getClass());
    }
  }

  // if we only have inlinks but no crawl history or fetched data, return directly
  if (fetchDatum == null || dbDatum == null
      || parseText == null || parseData == null) {
    return; // only have inlinks
  }

  // whether to skip DB_NOTMODIFIED pages:
  // if the page was crawled before but has not been modified, skip it when the skip option is set on the command line
  if (skip && dbDatum.getStatus() == CrawlDatum.STATUS_DB_NOTMODIFIED) {
    reporter.incrCounter("IndexerStatus", "Skipped", 1);
    return;
  }

  // the page was fetched successfully but parsing failed: return directly
  if (!parseData.getStatus().isSuccess()
      || fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
    return;
  }

  NutchDocument doc = new NutchDocument();
  // get the page metadata from the parse data
  final Metadata metadata = parseData.getContentMeta();

  // add segment, used to map from the merged index back to segment files
  doc.add("segment", metadata.get(Nutch.SEGMENT_NAME_KEY));

  // page digest
  // add digest, used by dedup
  doc.add("digest", metadata.get(Nutch.SIGNATURE_KEY));

  final Parse parse = new ParseImpl(parseText, parseData);
  try {
    // extract information from dbDatum and pass it to
    // fetchDatum so that indexing filters can use it
    final Text url = (Text) dbDatum.getMetaData().get(Nutch.WRITABLE_REPR_URL_KEY);
    if (url != null) {
      fetchDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY, url);
    }
    // run all indexing filters
    doc = this.filters.filter(doc, parse, key, fetchDatum, inlinks);
  } catch (final IndexingException e) {
    if (LOG.isWarnEnabled()) { LOG.warn("Error indexing " + key + ": " + e); }
    reporter.incrCounter("IndexerStatus", "Errors", 1);
    return;
  }

  // skip documents discarded by indexing filters
  if (doc == null) {
    reporter.incrCounter("IndexerStatus", "Skipped by filters", 1);
    return;
  }

  float boost = 1.0f;
  // run scoring filters
  try {
    boost = this.scfilters.indexerScore(key, doc, dbDatum,
        fetchDatum, parse, inlinks, boost);
  } catch (final ScoringFilterException e) {
    if (LOG.isWarnEnabled()) {
      LOG.warn("Error calculating score " + key + ": " + e);
    }
    return;
  }

  // use the score as the document weight
  // apply boost to all indexed fields
  doc.setWeight(boost);
  // store boost for use by explain and dedup
  doc.add("boost", Float.toString(boost));

  reporter.incrCounter("IndexerStatus", "Documents added", 1);

  NutchIndexAction action = new NutchIndexAction(doc, NutchIndexAction.ADD);
  output.collect(key, action);
}

That is as far as this analysis goes for now; to be continued.
