Nutch Index Source Code Analysis (I)
An introduction to the indexing method of Nutch integrated with Solr:
/**
 * Indexing
 * @param solrUrl Solr server URL
 * @param crawlDb CrawlDb storage path, e.g. \crawl\crawldb
 * @param linkDb LinkDb storage path, e.g. \crawl\linkdb
 * @param segments segment (metadata) storage path, e.g. \crawl\segments
 * @param noCommit if true, do not commit to the Solr server after indexing
 * @param deleteGone whether to delete obsolete (gone) documents
 * @param solrParams additional Solr parameters
 * @param filter whether to enable URL filtering
 * @param normalize whether to normalize URLs
 * @throws IOException
 */
public void indexSolr(String solrUrl, Path crawlDb, Path linkDb,
    List<Path> segments, boolean noCommit, boolean deleteGone, String solrParams,
    boolean filter, boolean normalize) throws IOException {
  ...
  IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job);
  ...
}
Nutch's indexing is accomplished by a single MapReduce job.
The input to the map phase is the SequenceFiles under the Nutch crawl directory; the key is the URL fetched by Nutch, and the value is a NutchWritable object, a subclass of Hadoop's GenericWritable into which all of Nutch's custom data types are wrapped.
The data types that a NutchWritable can contain are listed in its source code:
CLASSES = new Class[] {
  org.apache.hadoop.io.NullWritable.class,
  org.apache.hadoop.io.BooleanWritable.class,
  org.apache.hadoop.io.LongWritable.class,
  org.apache.hadoop.io.BytesWritable.class,
  org.apache.hadoop.io.FloatWritable.class,
  org.apache.hadoop.io.IntWritable.class,
  org.apache.hadoop.io.MapWritable.class,
  org.apache.hadoop.io.Text.class,
  org.apache.hadoop.io.MD5Hash.class,
  org.apache.nutch.crawl.CrawlDatum.class,
  org.apache.nutch.crawl.Inlink.class,
  org.apache.nutch.crawl.Inlinks.class,
  org.apache.nutch.fetcher.FetcherOutput.class,
  org.apache.nutch.metadata.Metadata.class,
  org.apache.nutch.parse.Outlink.class,
  org.apache.nutch.parse.ParseText.class,
  org.apache.nutch.parse.ParseData.class,
  org.apache.nutch.parse.ParseImpl.class,
  org.apache.nutch.parse.ParseStatus.class,
  org.apache.nutch.protocol.Content.class,
  org.apache.nutch.protocol.ProtocolStatus.class,
  org.apache.nutch.scoring.webgraph.LinkDatum.class,
};
These types cover the data Nutch produces at each stage of the crawl.
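As a rough sketch of the wrapper pattern (the structure below follows Hadoop's GenericWritable contract and is only assumed to match Nutch's NutchWritable), the subclass simply reports the table of supported classes so that heterogeneous values can travel through one reducer under a single type:

    import org.apache.hadoop.io.GenericWritable;
    import org.apache.hadoop.io.Writable;

    // Minimal sketch: GenericWritable serializes a small type index plus the
    // wrapped value, so CrawlDatum, ParseData, Inlinks, ... can all be emitted
    // as the same value class.
    public class NutchWritable extends GenericWritable {

      // populated with the CLASSES array shown above
      private static Class<? extends Writable>[] CLASSES = null;

      public NutchWritable() { }

      public NutchWritable(Writable instance) {
        set(instance); // store the wrapped value
      }

      @Override
      protected Class<? extends Writable>[] getTypes() {
        return CLASSES; // GenericWritable uses this table to (de)serialize
      }
    }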
The map phase does not touch the value; only the URL key is processed, as in the following line:
String urlString = filterUrl(normalizeUrl(key.toString()));
This call filters and normalizes the URL according to the configured filter rules; both steps can be switched on or off from the command line. A rough sketch of the map method is given below.
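The sketch below shows the shape of the map side under the old mapred API. The filterUrl and normalizeUrl helpers are taken from the line quoted above, but the surrounding structure is assumed for illustration rather than copied from Nutch:

    // Sketch of the map side: normalize and filter the URL key, then re-emit
    // the value unchanged so the reducer sees everything grouped per URL.
    public void map(Text key, Writable value,
        OutputCollector<Text, NutchWritable> output, Reporter reporter)
        throws IOException {

      String urlString = filterUrl(normalizeUrl(key.toString()));
      if (urlString == null) {
        // the URL was rejected by the filters; drop the record
        return;
      }
      key.set(urlString);

      // pass the value through untouched, wrapped in NutchWritable
      output.collect(key, new NutchWritable(value));
    }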
The reduce phase handles all of the crawled data; the annotated code follows:
/**
 * Output format: URL as key, index action as value.
 */
public void reduce(Text key, Iterator<NutchWritable> values,
    OutputCollector<Text, NutchIndexAction> output, Reporter reporter)
    throws IOException {
  Inlinks inlinks = null;
  CrawlDatum dbDatum = null;
  CrawlDatum fetchDatum = null;
  ParseData parseData = null;
  ParseText parseText = null;

  while (values.hasNext()) {
    final Writable value = values.next().get(); // unwrap
    // if the value is inlink data (links pointing to this URL)
    if (value instanceof Inlinks) {
      inlinks = (Inlinks) value;
    }
    // if the value is crawl data
    else if (value instanceof CrawlDatum) {
      final CrawlDatum datum = (CrawlDatum) value;
      // if the current datum carries a CrawlDb (injected) status
      if (CrawlDatum.hasDbStatus(datum)) {
        dbDatum = datum;
      }
      // if the current datum carries a fetch-completed status
      else if (CrawlDatum.hasFetchStatus(datum)) {
        // don't index unmodified (empty) pages
        // check whether the fetched page has been modified
        if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
          fetchDatum = datum;

          /**
           * Check if we need to delete 404 NOT FOUND and permanent redirects.
           */
          // delete gone and redirected pages if the option was set on the command line
          if (delete) {
            // if the fetched page is gone, emit a delete action
            if (fetchDatum.getStatus() == CrawlDatum.STATUS_FETCH_GONE) {
              reporter.incrCounter("IndexerStatus", "Documents deleted", 1);
              NutchIndexAction action = new NutchIndexAction(null, NutchIndexAction.DELETE);
              output.collect(key, action);
              return;
            }
            // if the fetched page was permanently redirected to another page, emit a delete action
            if (fetchDatum.getStatus() == CrawlDatum.STATUS_FETCH_REDIR_PERM) {
              reporter.incrCounter("IndexerStatus", "Perm redirects deleted", 1);
              NutchIndexAction action = new NutchIndexAction(null, NutchIndexAction.DELETE);
              output.collect(key, action);
              return;
            }
          }
        }
      }
      // URL discovered via another URL, or a page signature, or page metadata generated by the parser
      else if (CrawlDatum.STATUS_LINKED == datum.getStatus()
          || CrawlDatum.STATUS_SIGNATURE == datum.getStatus()
          || CrawlDatum.STATUS_PARSE_META == datum.getStatus()) {
        continue;
      } else {
        throw new RuntimeException("Unexpected status: " + datum.getStatus());
      }
    }
    // if the value is parse data
    else if (value instanceof ParseData) {
      parseData = (ParseData) value;

      // handle robots meta? https://issues.apache.org/jira/browse/NUTCH-1434
      if (deleteRobotsNoIndex) {
        // get the robots meta data
        String robotsMeta = parseData.getMeta("robots");

        // does it carry a noindex for this URL?
        if (robotsMeta != null && robotsMeta.toLowerCase().indexOf("noindex") != -1) {
          // delete it!
          NutchIndexAction action = new NutchIndexAction(null, NutchIndexAction.DELETE);
          output.collect(key, action);
          return;
        }
      }
    }
    // if the value is the parsed text
    else if (value instanceof ParseText) {
      parseText = (ParseText) value;
    } else if (LOG.isWarnEnabled()) {
      LOG.warn("Unrecognized type: " + value.getClass());
    }
  }

  // if we only have inlinks, with no crawl history or fetch data, return directly
  if (fetchDatum == null || dbDatum == null
      || parseText == null || parseData == null) {
    return; // only have inlinks
  }

  // whether to skip DB_NOTMODIFIED pages:
  // the page was crawled before and has not changed; skipping is enabled on the command line
  if (skip && dbDatum.getStatus() == CrawlDatum.STATUS_DB_NOTMODIFIED) {
    reporter.incrCounter("IndexerStatus", "Skipped", 1);
    return;
  }

  // the page was fetched but parsing failed, or the fetch itself did not succeed: return directly
  if (!parseData.getStatus().isSuccess()
      || fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
    return;
  }

  NutchDocument doc = new NutchDocument();
  // get the page metadata from the parsed data
  final Metadata metadata = parseData.getContentMeta();

  // add segment, used to map from the merged index back to segment files
  doc.add("segment", metadata.get(Nutch.SEGMENT_NAME_KEY));

  // add digest (page signature), used by dedup
  doc.add("digest", metadata.get(Nutch.SIGNATURE_KEY));

  final Parse parse = new ParseImpl(parseText, parseData);
  try {
    // extract information from dbDatum and pass it to
    // fetchDatum so that indexing filters can use it
    final Text url = (Text) dbDatum.getMetaData().get(Nutch.WRITABLE_REPR_URL_KEY);
    if (url != null) {
      fetchDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY, url);
    }
    // run all indexing filters
    doc = this.filters.filter(doc, parse, key, fetchDatum, inlinks);
  } catch (final IndexingException e) {
    if (LOG.isWarnEnabled()) { LOG.warn("Error indexing " + key + ": " + e); }
    reporter.incrCounter("IndexerStatus", "Errors", 1);
    return;
  }

  // skip documents discarded by indexing filters
  if (doc == null) {
    reporter.incrCounter("IndexerStatus", "Skipped by filters", 1);
    return;
  }

  float boost = 1.0f;
  // run scoring filters
  try {
    boost = this.scfilters.indexerScore(key, doc, dbDatum,
        fetchDatum, parse, inlinks, boost);
  } catch (final ScoringFilterException e) {
    if (LOG.isWarnEnabled()) {
      LOG.warn("Error calculating score " + key + ": " + e);
    }
    return;
  }
  // apply the boost (score) to all indexed fields as the document weight
  doc.setWeight(boost);
  // store boost for use by explain and dedup
  doc.add("boost", Float.toString(boost));

  reporter.incrCounter("IndexerStatus", "Documents added", 1);
  NutchIndexAction action = new NutchIndexAction(doc, NutchIndexAction.ADD);
  output.collect(key, action);
}
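The delete, skip, and deleteRobotsNoIndex fields used above are plain booleans read from the job configuration when the reducer is set up, and the command-line switches (-deleteGone and friends) simply set the corresponding properties on the job. A minimal sketch of that wiring is shown below; the property names are illustrative assumptions and should be checked against IndexerMapReduce in your Nutch version:

    // Sketch of how the reducer flags might be pulled from the job configuration.
    // Property names here are illustrative assumptions.
    private boolean skip = false;                 // skip DB_NOTMODIFIED pages
    private boolean delete = false;               // delete gone/redirected pages
    private boolean deleteRobotsNoIndex = false;  // honour robots "noindex" meta

    public void configure(JobConf job) {
      // the real configure also builds the indexing and scoring filter chains
      this.delete = job.getBoolean("indexer.delete", false);
      this.deleteRobotsNoIndex = job.getBoolean("indexer.delete.robots.noindex", false);
      this.skip = job.getBoolean("indexer.skip.notmodified", false);
    }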
The analysis stops here for now; the rest will follow in a later post.