[Nutch source code analysis 2] basic indexing process

Source: Internet
Author: User
Tags: solr


I. Relationships between the main classes

SolrIndexerJob extends IndexerJob.

1. IndexerJob: mainly builds the indexing MapReduce job (createIndexJob()) and contains the IndexerMapper.
2. SolrIndexerJob: mainly provides the program entry point and the Solr-specific setup (Solr server URL, commit).
3. IndexUtil: has only a single method, public NutchDocument index(String key, WebPage page), which builds a Solr document object from the information in a WebPage.
II. Procedure

Looking at the execution script bin/nutch in the Nutch distribution, we find:

```sh
elif [ "$COMMAND" = "solrindex" ] ; then
  CLASS=org.apache.nutch.indexer.solr.SolrIndexerJob
```

So the program entry point is the SolrIndexerJob class.
(1) org.apache.nutch.indexer.solr.SolrIndexerJob

1. Program entry:
```java
public static void main(String[] args) throws Exception {
  final int res = ToolRunner.run(NutchConfiguration.create(),
      new SolrIndexerJob(), args);
  System.exit(res);
}
```
ToolRunner.run() is used to execute the program (for details, refer to the analysis of the basic principles of running a Hadoop program with ToolRunner). The first parameter creates the configuration, loading Hadoop's core-default.xml and core-site.xml together with Nutch's nutch-default.xml. The second parameter supplies the SolrIndexerJob instance whose run(String[]) method will be executed.
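The division of labor inside ToolRunner.run() can be sketched in plain Java. These are hypothetical simplified stand-ins, not Hadoop's real Tool/ToolRunner/Configuration classes: generic "-Dkey=value" options are folded into the shared configuration, and the remaining arguments are handed on to the tool.

```java
// Hypothetical simplified stand-ins for Hadoop's Tool / ToolRunner /
// Configuration, for illustration only -- not the real Hadoop API.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ToolRunnerSketch {

    interface Tool {
        int run(String[] args) throws Exception;
    }

    // Folds generic "-Dkey=value" options into the shared configuration map
    // and hands the remaining arguments to the tool's run(String[]) -- the
    // same division of labor ToolRunner.run() performs for SolrIndexerJob.
    static int run(Map<String, String> conf, Tool tool, String[] args) throws Exception {
        List<String> remaining = new ArrayList<>();
        for (String a : args) {
            if (a.startsWith("-D") && a.contains("=")) {
                String[] kv = a.substring(2).split("=", 2);
                conf.put(kv[0], kv[1]);
            } else {
                remaining.add(a);
            }
        }
        return tool.run(remaining.toArray(new String[0]));
    }
}
```

This is why SolrIndexerJob.run(String[]) only ever sees its own arguments (the Solr URL and batch id), while shared settings arrive through getConf().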
2. Execute the run(String[]) method of the SolrIndexerJob class:
```java
public int run(String[] args) throws Exception {
  if (args.length < 2) {
    System.err.println("Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]");
    return -1;
  }
  if (args.length == 4 && "-crawlId".equals(args[2])) {
    getConf().set(Nutch.CRAWL_ID_KEY, args[3]);
  }
  try {
    indexSolr(args[0], args[1]);
    return 0;
  } catch (final Exception e) {
    LOG.error("SolrIndexerJob: " + StringUtils.stringifyException(e));
    return -1;
  }
}
```
It first validates the arguments (a Solr URL and a batch id are required; an optional trailing -crawlId <id> pair sets the crawl id in the configuration), then calls the indexSolr(String, String) method.
3. Execute the indexSolr(String, String) method:
```java
public void indexSolr(String solrUrl, String batchId) throws Exception {
  LOG.info("SolrIndexerJob: starting");
  run(ToolUtil.toArgMap(
      Nutch.ARG_SOLR, solrUrl,
      Nutch.ARG_BATCH, batchId));
  // do the commits once and for all the reducers in one go
  getConf().set(SolrConstants.SERVER_URL, solrUrl);
  SolrServer solr = SolrUtils.getCommonsHttpSolrServer(getConf());
  if (getConf().getBoolean(SolrConstants.COMMIT_INDEX, true)) {
    solr.commit();
  }
  LOG.info("SolrIndexerJob: done.");
}
```

4. Execute the run(Map<String, Object>) method:
```java
@Override
public Map<String, Object> run(Map<String, Object> args) throws Exception {
  String solrUrl = (String) args.get(Nutch.ARG_SOLR);
  String batchId = (String) args.get(Nutch.ARG_BATCH);
  NutchIndexWriterFactory.addClassToConf(getConf(), SolrWriter.class);
  getConf().set(SolrConstants.SERVER_URL, solrUrl);
  currentJob = createIndexJob(getConf(), "solr-index", batchId);
  currentJob.waitForCompletion(true);
  ToolUtil.recordJobStatus(null, currentJob, results);
  return results;
}
```
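ToolUtil.toArgMap(...) packs alternating key/value varargs into a map, which is how indexSolr() hands the Solr URL and batch id over to run(Map<String, Object>). A minimal stand-alone sketch of that helper (its behavior is assumed from the call sites above, not copied from the Nutch source):

```java
import java.util.HashMap;
import java.util.Map;

public class ArgMapSketch {

    // Packs alternating key/value varargs into a map, e.g.
    // toArgMap(Nutch.ARG_SOLR, solrUrl, Nutch.ARG_BATCH, batchId).
    static Map<String, Object> toArgMap(Object... args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of key/value arguments");
        }
        Map<String, Object> map = new HashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            map.put((String) args[i], args[i + 1]);
        }
        return map;
    }
}
```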


(2) org.apache.nutch.indexer.IndexerJob

1. Execute the createIndexJob() method:
```java
protected Job createIndexJob(Configuration conf, String jobName, String batchId)
    throws IOException, ClassNotFoundException {
  conf.set(GeneratorJob.BATCH_ID, batchId);
  Job job = new NutchJob(conf, jobName);
  // TODO: Figure out why this needs to be here
  job.getConfiguration().setClass("mapred.output.key.comparator.class",
      StringComparator.class, RawComparator.class);
  Collection<WebPage.Field> fields = getFields(job);
  StorageUtils.initMapperJob(job, fields, String.class, NutchDocument.class,
      IndexerMapper.class);
  job.setNumReduceTasks(0);
  job.setOutputFormatClass(IndexerOutputFormat.class);
  return job;
}
```

2. Execute the map-related methods: setup(), map(), and cleanup():
```java
public static class IndexerMapper
    extends GoraMapper<String, WebPage, String, NutchDocument> {
  public IndexUtil indexUtil;
  public DataStore<String, WebPage> store;
  protected Utf8 batchId;

  @Override
  public void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    batchId = new Utf8(conf.get(GeneratorJob.BATCH_ID, Nutch.ALL_BATCH_ID_STR));
    indexUtil = new IndexUtil(conf);
    try {
      store = StorageUtils.createWebStore(conf, String.class, WebPage.class);
    } catch (ClassNotFoundException e) {
      throw new IOException(e);
    }
  }

  protected void cleanup(Context context) throws IOException, InterruptedException {
    store.close();
  }

  @Override
  public void map(String key, WebPage page, Context context)
      throws IOException, InterruptedException {
    ParseStatus pstatus = page.getParseStatus();
    if (pstatus == null || !ParseStatusUtils.isSuccess(pstatus)
        || pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
      return; // filter urls not parsed
    }
    Utf8 mark = Mark.UPDATEDB_MARK.checkMark(page);
    if (!batchId.equals(REINDEX)) {
      if (!NutchJob.shouldProcess(mark, batchId)) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("Skipping " + TableUtil.unreverseUrl(key)
              + "; different batch id (" + mark + ")");
        }
        return;
      }
    }
    NutchDocument doc = indexUtil.index(key, page);
    if (doc == null) {
      return;
    }
    if (mark != null) {
      Mark.INDEX_MARK.putMark(page, Mark.UPDATEDB_MARK.checkMark(page));
      store.put(key, page);
    }
    context.write(key, doc);
  }
}
```
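The batch-id filtering at the top of map() (the REINDEX check plus NutchJob.shouldProcess()) can be condensed into a small stand-alone sketch. The real code compares Utf8 marks, so treat the exact semantics here as a simplification of the actual mark handling:

```java
// Condensed sketch of the skip logic in IndexerMapper.map(); a simplification,
// not the real NutchJob.shouldProcess implementation.
public class BatchFilter {
    static final String ALL = "-all";
    static final String REINDEX = "-reindex";

    static boolean shouldIndex(String updatedbMark, String batchId) {
        if (REINDEX.equals(batchId)) {
            return true;   // reindexing ignores the marks entirely
        }
        if (updatedbMark == null) {
            return false;  // row never went through updatedb, nothing to index
        }
        // "-all" matches any marked row; otherwise the mark must equal the batch id
        return ALL.equals(batchId) || updatedbMark.equals(batchId);
    }
}
```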


3. The call to context.write(): because createIndexJob() set job.setOutputFormatClass(IndexerOutputFormat.class), the documents emitted by the mapper are handed to IndexerOutputFormat, whose writer (here SolrWriter, registered earlier via NutchIndexWriterFactory.addClassToConf()) writes them to the index.

(3) public class IndexUtil

1. The index() method:
```java
public NutchDocument index(String key, WebPage page) {
  NutchDocument doc = new NutchDocument();
  doc.add("id", key);
  doc.add("digest", StringUtil.toHexString(page.getSignature()));
  if (page.getBatchId() != null) {
    doc.add("batchId", page.getBatchId().toString());
  }
  String url = TableUtil.unreverseUrl(key);
  if (LOG.isDebugEnabled()) {
    LOG.debug("Indexing URL: " + url);
  }
  try {
    doc = filters.filter(doc, url, page);
  } catch (IndexingException e) {
    LOG.warn("Error indexing " + key + ": " + e);
    return null;
  }
  // skip documents discarded by indexing filters
  if (doc == null) return null;
  float boost = 1.0f;
  // run scoring filters
  try {
    boost = scoringFilters.indexerScore(url, doc, page, boost);
  } catch (final ScoringFilterException e) {
    LOG.warn("Error calculating score " + key + ": " + e);
    return null;
  }
  doc.setScore(boost);
  // store boost for use by explain and dedup
  doc.add("boost", Float.toString(boost));
  return doc;
}
```
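TableUtil.unreverseUrl(key) turns a row key back into a URL: Nutch 2.x stores keys with the host's dot-separated parts reversed (e.g. "com.example.www:http/index.html") so that pages of one domain sort together. A simplified sketch of the conversion, omitting the port handling and edge cases of the real utility:

```java
public class UrlKeys {

    // Simplified sketch of TableUtil.unreverseUrl: split off the reversed
    // host, the protocol, and the path, then put the host parts back in
    // normal order. (Port handling of the real Nutch utility is omitted.)
    static String unreverseUrl(String key) {
        int colon = key.indexOf(':');
        int slash = key.indexOf('/', colon);
        String revHost = key.substring(0, colon);
        String proto = slash < 0 ? key.substring(colon + 1) : key.substring(colon + 1, slash);
        String path = slash < 0 ? "" : key.substring(slash);
        String[] parts = revHost.split("\\.");
        StringBuilder host = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            host.append(parts[i]);
            if (i > 0) host.append('.');
        }
        return proto + "://" + host + path;
    }
}
```

For example, the key "com.example.www:http/index.html" maps back to "http://www.example.com/index.html".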


III. Field indexing in plugins

1. Indexing of basic fields: public class BasicIndexingFilter implements IndexingFilter
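The filter pattern that BasicIndexingFilter follows can be sketched with simplified stand-in types (a plain map instead of NutchDocument, and a hypothetical SimpleIndexingFilter interface rather than Nutch's real IndexingFilter): each filter receives the document built so far and adds its own fields, or returns null to discard the document.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Map;

public class FilterSketch {

    // Hypothetical simplified stand-in for the IndexingFilter interface,
    // with a plain map instead of NutchDocument.
    interface SimpleIndexingFilter {
        Map<String, String> filter(Map<String, String> doc, String url);
    }

    // Mimics the role of a basic field filter: add simple page-derived
    // fields (here url and host) to the document headed for Solr.
    static class BasicFilterSketch implements SimpleIndexingFilter {
        public Map<String, String> filter(Map<String, String> doc, String url) {
            try {
                doc.put("host", new URL(url).getHost());
            } catch (MalformedURLException e) {
                return null; // discards the document, like a failing filter
            }
            doc.put("url", url);
            return doc;
        }
    }
}
```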
