[Nutch source code analysis 2] basic indexing process

Source: Internet
Author: User
Tags: solr


I. Relationships between the main classes

SolrIndexerJob extends IndexerJob.

1. IndexerJob: mainly builds the indexing MapReduce job (createIndexJob()) and contains the IndexerMapper.
2. SolrIndexerJob: mainly provides the program entry point and the Solr-specific setup (Solr server URL, commit).
3. IndexUtil: has only a single method, public NutchDocument index(String key, WebPage page), which builds a Solr document object from the information in a WebPage.
II. Procedure

Looking at the execution script bin/nutch in the Nutch distribution, we find:

```sh
elif [ "$COMMAND" = "solrindex" ] ; then
  CLASS=org.apache.nutch.indexer.solr.SolrIndexerJob
```

So the program entry point is the SolrIndexerJob class.
(1) org.apache.nutch.indexer.solr.SolrIndexerJob

1. Program entry:
```java
public static void main(String[] args) throws Exception {
  final int res = ToolRunner.run(NutchConfiguration.create(),
      new SolrIndexerJob(), args);
  System.exit(res);
}
```
ToolRunner.run() is used to execute the program (for details, refer to the analysis of the basic principles of running a Hadoop program with ToolRunner). The first parameter creates the configuration, loading Hadoop's core-default.xml and core-site.xml together with Nutch's nutch-default.xml. The second parameter supplies the SolrIndexerJob instance whose run(String[]) method will be executed.
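The division of labor inside ToolRunner.run() can be sketched in plain Java. These are hypothetical simplified stand-ins, not Hadoop's real Tool/ToolRunner/Configuration classes: generic "-Dkey=value" options are folded into the shared configuration, and the remaining arguments are handed on to the tool.

```java
// Hypothetical simplified stand-ins for Hadoop's Tool / ToolRunner /
// Configuration, for illustration only -- not the real Hadoop API.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ToolRunnerSketch {

    interface Tool {
        int run(String[] args) throws Exception;
    }

    // Folds generic "-Dkey=value" options into the shared configuration map
    // and hands the remaining arguments to the tool's run(String[]) -- the
    // same division of labor ToolRunner.run() performs for SolrIndexerJob.
    static int run(Map<String, String> conf, Tool tool, String[] args) throws Exception {
        List<String> remaining = new ArrayList<>();
        for (String a : args) {
            if (a.startsWith("-D") && a.contains("=")) {
                String[] kv = a.substring(2).split("=", 2);
                conf.put(kv[0], kv[1]);
            } else {
                remaining.add(a);
            }
        }
        return tool.run(remaining.toArray(new String[0]));
    }
}
```

This is why SolrIndexerJob.run(String[]) only ever sees its own arguments (the Solr URL and batch id), while shared settings arrive through getConf().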
2. Execute the run(String[]) method of the SolrIndexerJob class:
```java
public int run(String[] args) throws Exception {
  if (args.length < 2) {
    System.err.println("Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]");
    return -1;
  }
  if (args.length == 4 && "-crawlId".equals(args[2])) {
    getConf().set(Nutch.CRAWL_ID_KEY, args[3]);
  }
  try {
    indexSolr(args[0], args[1]);
    return 0;
  } catch (final Exception e) {
    LOG.error("SolrIndexerJob: " + StringUtils.stringifyException(e));
    return -1;
  }
}
```
It first validates the arguments (a Solr URL and a batch id are required; an optional trailing -crawlId <id> pair sets the crawl id in the configuration), then calls the indexSolr(String, String) method.
3. Execute the indexSolr(String, String) method:
```java
public void indexSolr(String solrUrl, String batchId) throws Exception {
  LOG.info("SolrIndexerJob: starting");
  run(ToolUtil.toArgMap(
      Nutch.ARG_SOLR, solrUrl,
      Nutch.ARG_BATCH, batchId));
  // do the commits once and for all the reducers in one go
  getConf().set(SolrConstants.SERVER_URL, solrUrl);
  SolrServer solr = SolrUtils.getCommonsHttpSolrServer(getConf());
  if (getConf().getBoolean(SolrConstants.COMMIT_INDEX, true)) {
    solr.commit();
  }
  LOG.info("SolrIndexerJob: done.");
}
```

4. Execute the run(Map<String, Object>) method:
```java
@Override
public Map<String, Object> run(Map<String, Object> args) throws Exception {
  String solrUrl = (String) args.get(Nutch.ARG_SOLR);
  String batchId = (String) args.get(Nutch.ARG_BATCH);
  NutchIndexWriterFactory.addClassToConf(getConf(), SolrWriter.class);
  getConf().set(SolrConstants.SERVER_URL, solrUrl);
  currentJob = createIndexJob(getConf(), "solr-index", batchId);
  currentJob.waitForCompletion(true);
  ToolUtil.recordJobStatus(null, currentJob, results);
  return results;
}
```
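ToolUtil.toArgMap(...) packs alternating key/value varargs into a map, which is how indexSolr() hands the Solr URL and batch id over to run(Map<String, Object>). A minimal stand-alone sketch of that helper (its behavior is assumed from the call sites above, not copied from the Nutch source):

```java
import java.util.HashMap;
import java.util.Map;

public class ArgMapSketch {

    // Packs alternating key/value varargs into a map, e.g.
    // toArgMap(Nutch.ARG_SOLR, solrUrl, Nutch.ARG_BATCH, batchId).
    static Map<String, Object> toArgMap(Object... args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of key/value arguments");
        }
        Map<String, Object> map = new HashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            map.put((String) args[i], args[i + 1]);
        }
        return map;
    }
}
```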


(2) org.apache.nutch.indexer.IndexerJob

1. Execute the createIndexJob() method:
```java
protected Job createIndexJob(Configuration conf, String jobName, String batchId)
    throws IOException, ClassNotFoundException {
  conf.set(GeneratorJob.BATCH_ID, batchId);
  Job job = new NutchJob(conf, jobName);
  // TODO: Figure out why this needs to be here
  job.getConfiguration().setClass("mapred.output.key.comparator.class",
      StringComparator.class, RawComparator.class);
  Collection<WebPage.Field> fields = getFields(job);
  StorageUtils.initMapperJob(job, fields, String.class, NutchDocument.class,
      IndexerMapper.class);
  job.setNumReduceTasks(0);
  job.setOutputFormatClass(IndexerOutputFormat.class);
  return job;
}
```

2. Execute the map-related methods: setup(), map(), and cleanup():
```java
public static class IndexerMapper
    extends GoraMapper<String, WebPage, String, NutchDocument> {
  public IndexUtil indexUtil;
  public DataStore<String, WebPage> store;
  protected Utf8 batchId;

  @Override
  public void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    batchId = new Utf8(conf.get(GeneratorJob.BATCH_ID, Nutch.ALL_BATCH_ID_STR));
    indexUtil = new IndexUtil(conf);
    try {
      store = StorageUtils.createWebStore(conf, String.class, WebPage.class);
    } catch (ClassNotFoundException e) {
      throw new IOException(e);
    }
  }

  protected void cleanup(Context context) throws IOException, InterruptedException {
    store.close();
  }

  @Override
  public void map(String key, WebPage page, Context context)
      throws IOException, InterruptedException {
    ParseStatus pstatus = page.getParseStatus();
    if (pstatus == null || !ParseStatusUtils.isSuccess(pstatus)
        || pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
      return; // filter urls not parsed
    }
    Utf8 mark = Mark.UPDATEDB_MARK.checkMark(page);
    if (!batchId.equals(REINDEX)) {
      if (!NutchJob.shouldProcess(mark, batchId)) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("Skipping " + TableUtil.unreverseUrl(key)
              + "; different batch id (" + mark + ")");
        }
        return;
      }
    }
    NutchDocument doc = indexUtil.index(key, page);
    if (doc == null) {
      return;
    }
    if (mark != null) {
      Mark.INDEX_MARK.putMark(page, Mark.UPDATEDB_MARK.checkMark(page));
      store.put(key, page);
    }
    context.write(key, doc);
  }
}
```
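The batch-id filtering at the top of map() (the REINDEX check plus NutchJob.shouldProcess()) can be condensed into a small stand-alone sketch. The real code compares Utf8 marks, so treat the exact semantics here as a simplification of the actual mark handling:

```java
// Condensed sketch of the skip logic in IndexerMapper.map(); a simplification,
// not the real NutchJob.shouldProcess implementation.
public class BatchFilter {
    static final String ALL = "-all";
    static final String REINDEX = "-reindex";

    static boolean shouldIndex(String updatedbMark, String batchId) {
        if (REINDEX.equals(batchId)) {
            return true;   // reindexing ignores the marks entirely
        }
        if (updatedbMark == null) {
            return false;  // row never went through updatedb, nothing to index
        }
        // "-all" matches any marked row; otherwise the mark must equal the batch id
        return ALL.equals(batchId) || updatedbMark.equals(batchId);
    }
}
```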


3. The call to context.write(): because createIndexJob() set job.setOutputFormatClass(IndexerOutputFormat.class), the documents emitted by the mapper are handed to IndexerOutputFormat, whose writer (here SolrWriter, registered earlier via NutchIndexWriterFactory.addClassToConf()) writes them to the index.

(3) public class IndexUtil

1. The index() method:
```java
public NutchDocument index(String key, WebPage page) {
  NutchDocument doc = new NutchDocument();
  doc.add("id", key);
  doc.add("digest", StringUtil.toHexString(page.getSignature()));
  if (page.getBatchId() != null) {
    doc.add("batchId", page.getBatchId().toString());
  }
  String url = TableUtil.unreverseUrl(key);
  if (LOG.isDebugEnabled()) {
    LOG.debug("Indexing URL: " + url);
  }
  try {
    doc = filters.filter(doc, url, page);
  } catch (IndexingException e) {
    LOG.warn("Error indexing " + key + ": " + e);
    return null;
  }
  // skip documents discarded by indexing filters
  if (doc == null) return null;
  float boost = 1.0f;
  // run scoring filters
  try {
    boost = scoringFilters.indexerScore(url, doc, page, boost);
  } catch (final ScoringFilterException e) {
    LOG.warn("Error calculating score " + key + ": " + e);
    return null;
  }
  doc.setScore(boost);
  // store boost for use by explain and dedup
  doc.add("boost", Float.toString(boost));
  return doc;
}
```
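TableUtil.unreverseUrl(key) turns a row key back into a URL: Nutch 2.x stores keys with the host's dot-separated parts reversed (e.g. "com.example.www:http/index.html") so that pages of one domain sort together. A simplified sketch of the conversion, omitting the port handling and edge cases of the real utility:

```java
public class UrlKeys {

    // Simplified sketch of TableUtil.unreverseUrl: split off the reversed
    // host, the protocol, and the path, then put the host parts back in
    // normal order. (Port handling of the real Nutch utility is omitted.)
    static String unreverseUrl(String key) {
        int colon = key.indexOf(':');
        int slash = key.indexOf('/', colon);
        String revHost = key.substring(0, colon);
        String proto = slash < 0 ? key.substring(colon + 1) : key.substring(colon + 1, slash);
        String path = slash < 0 ? "" : key.substring(slash);
        String[] parts = revHost.split("\\.");
        StringBuilder host = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            host.append(parts[i]);
            if (i > 0) host.append('.');
        }
        return proto + "://" + host + path;
    }
}
```

For example, the key "com.example.www:http/index.html" maps back to "http://www.example.com/index.html".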


III. Field indexing in plugins

1. Indexing of basic fields: public class BasicIndexingFilter implements IndexingFilter
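The filter pattern that BasicIndexingFilter follows can be sketched with simplified stand-in types (a plain map instead of NutchDocument, and a hypothetical SimpleIndexingFilter interface rather than Nutch's real IndexingFilter): each filter receives the document built so far and adds its own fields, or returns null to discard the document.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Map;

public class FilterSketch {

    // Hypothetical simplified stand-in for the IndexingFilter interface,
    // with a plain map instead of NutchDocument.
    interface SimpleIndexingFilter {
        Map<String, String> filter(Map<String, String> doc, String url);
    }

    // Mimics the role of a basic field filter: add simple page-derived
    // fields (here url and host) to the document headed for Solr.
    static class BasicFilterSketch implements SimpleIndexingFilter {
        public Map<String, String> filter(Map<String, String> doc, String url) {
            try {
                doc.put("host", new URL(url).getHost());
            } catch (MalformedURLException e) {
                return null; // discards the document, like a failing filter
            }
            doc.put("url", url);
            return doc;
        }
    }
}
```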
