I. Relationships between the main classes
SolrIndexerJob extends IndexerJob.
1. IndexerJob: mainly builds the indexing job (createIndexJob()) and defines the IndexerMapper.
2. SolrIndexerJob: mainly provides the program entry point, runs the indexing job, and commits to Solr.
3. IndexUtil: has only one method, public NutchDocument index(String key, WebPage page), which builds and returns a document object for Solr from the web page information.
II. Procedure
Looking at the Nutch launch script (nutch), we find the following:
elif [ "$COMMAND" = "solrindex" ] ; then
  CLASS=org.apache.nutch.indexer.solr.SolrIndexerJob
Therefore, the program entry point is in the SolrIndexerJob class.
(1) org.apache.nutch.indexer.solr.SolrIndexerJob
1. Program entry point
public static void main(String[] args) throws Exception {
  final int res = ToolRunner.run(NutchConfiguration.create(), new SolrIndexerJob(), args);
  System.exit(res);
}
ToolRunner.run() executes the program. For details, see "Basic principle analysis of running Hadoop programs with ToolRunner". The first argument loads the configuration Nutch needs, mainly Hadoop's core-default.xml and core-site.xml and Nutch's nutch-default.xml.
The second argument is the tool to run: ToolRunner calls the run(String[]) method of SolrIndexerJob.
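The dispatch described above can be sketched without any Hadoop dependency. MiniTool, MiniToolRunner and EchoJob below are hypothetical stand-ins, not the real Hadoop API; the real ToolRunner also parses generic command-line options before calling run(String[]), which this sketch leaves out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Hadoop's Tool interface (not the real API).
interface MiniTool {
    void setConf(Map<String, String> conf);
    int run(String[] args);
}

// Hypothetical stand-in for ToolRunner: inject the configuration, then call run(String[]).
final class MiniToolRunner {
    static int run(Map<String, String> conf, MiniTool tool, String[] args) {
        tool.setConf(conf);
        return tool.run(args);
    }
}

// Hypothetical job that only checks what it received.
class EchoJob implements MiniTool {
    private Map<String, String> conf;

    @Override
    public void setConf(Map<String, String> conf) {
        this.conf = conf;
    }

    @Override
    public int run(String[] args) {
        // Succeed only if a configuration was injected and at least one argument was passed.
        return (conf != null && args.length > 0) ? 0 : -1;
    }
}

class ToolRunnerSketch {
    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("example.key", "example value"); // illustrative entry only
        int res = MiniToolRunner.run(conf, new EchoJob(),
                new String[] {"http://localhost:8983/solr", "-all"});
        System.out.println("exit code: " + res);
    }
}
```

The point of the pattern is that the job never builds its own configuration: the runner hands it one, and the job's return value becomes the process exit code.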
2. Execute the run(String[]) method of the SolrIndexerJob class
public int run(String[] args) throws Exception {
  if (args.length < 2) {
    System.err.println("Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]");
    return -1;
  }
  if (args.length == 4 && "-crawlId".equals(args[2])) {
    getConf().set(Nutch.CRAWL_ID_KEY, args[3]);
  }
  try {
    indexSolr(args[0], args[1]);
    return 0;
  } catch (final Exception e) {
    LOG.error("SolrIndexerJob: " + StringUtils.stringifyException(e));
    return -1;
  }
}
The method first validates the arguments and then calls indexSolr(String, String).
3. Execute the indexSolr(String, String) method.
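The argument checks can be tried in isolation with a small sketch. The class SolrIndexArgsSketch, its parse() method, and the key string are hypothetical; the real job stores the crawl id under Nutch.CRAWL_ID_KEY in a Hadoop Configuration, for which a plain Map stands in here.

```java
import java.util.HashMap;
import java.util.Map;

class SolrIndexArgsSketch {
    // Hypothetical key name used only in this sketch.
    static final String CRAWL_ID_KEY = "example.crawl.id";

    // Mirrors the checks in run(String[]): -1 on bad usage, 0 otherwise.
    static int parse(String[] args, Map<String, String> conf) {
        if (args.length < 2) {
            System.err.println(
                "Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]");
            return -1;
        }
        // The crawl id is only picked up when exactly four arguments are given.
        if (args.length == 4 && "-crawlId".equals(args[2])) {
            conf.put(CRAWL_ID_KEY, args[3]);
        }
        return 0;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        int res = parse(new String[] {"http://localhost:8983/solr", "-all", "-crawlId", "test"}, conf);
        System.out.println(res + " " + conf);
    }
}
```

One consequence of the length check is worth noting: if -crawlId is passed without a value (three arguments in total), the option is silently ignored rather than rejected.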
public void indexSolr(String solrUrl, String batchId) throws Exception {
  LOG.info("SolrIndexerJob: starting");
  run(ToolUtil.toArgMap(
      Nutch.ARG_SOLR, solrUrl,
      Nutch.ARG_BATCH, batchId));
  // do the commits once and for all the reducers in one go
  getConf().set(SolrConstants.SERVER_URL, solrUrl);
  SolrServer solr = SolrUtils.getCommonsHttpSolrServer(getConf());
  if (getConf().getBoolean(SolrConstants.COMMIT_INDEX, true)) {
    solr.commit();
  }
  LOG.info("SolrIndexerJob: done.");
}
4. Execute the run(Map<...>) method.
@Override
public Map<String,Object> run(Map<String,Object> args) throws Exception {
  String solrUrl = (String)args.get(Nutch.ARG_SOLR);
  String batchId = (String)args.get(Nutch.ARG_BATCH);
  NutchIndexWriterFactory.addClassToConf(getConf(), SolrWriter.class);
  getConf().set(SolrConstants.SERVER_URL, solrUrl);
  currentJob = createIndexJob(getConf(), "solr-index", batchId);
  currentJob.waitForCompletion(true);
  ToolUtil.recordJobStatus(null, currentJob, results);
  return results;
}
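indexSolr() packs its two strings into a map with ToolUtil.toArgMap(), and run(Map<...>) reads them back by key. The following is a minimal sketch of that round trip, assuming toArgMap() simply pairs alternating names and values; the re-implementation and the key strings "solr" and "batch" are illustrative assumptions, not the Nutch source.

```java
import java.util.HashMap;
import java.util.Map;

class ArgMapSketch {
    // Illustrative key names; the real code uses Nutch.ARG_SOLR and Nutch.ARG_BATCH.
    static final String ARG_SOLR = "solr";
    static final String ARG_BATCH = "batch";

    // Assumed behavior: alternating name/value arguments become map entries.
    static Map<String, Object> toArgMap(Object... args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("expected name/value pairs");
        }
        Map<String, Object> res = new HashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            res.put(String.valueOf(args[i]), args[i + 1]);
        }
        return res;
    }

    public static void main(String[] args) {
        Map<String, Object> argMap = toArgMap(
                ARG_SOLR, "http://localhost:8983/solr",
                ARG_BATCH, "-all");
        // The receiving side casts the values back, as run(Map<...>) does.
        String solrUrl = (String) argMap.get(ARG_SOLR);
        String batchId = (String) argMap.get(ARG_BATCH);
        System.out.println(solrUrl + " " + batchId);
    }
}
```

Passing arguments as a map lets the same run(Map<...>) entry point be called from the command line wrapper and from other code without going through string parsing again.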
(2) org.apache.nutch.indexer.IndexerJob
1. Execute the createIndexJob() method.
protected Job createIndexJob(Configuration conf, String jobName, String batchId)
    throws IOException, ClassNotFoundException {
  conf.set(GeneratorJob.BATCH_ID, batchId);
  Job job = new NutchJob(conf, jobName);
  // TODO: Figure out why this needs to be here
  job.getConfiguration().setClass("mapred.output.key.comparator.class",
      StringComparator.class, RawComparator.class);

  Collection<WebPage.Field> fields = getFields(job);
  StorageUtils.initMapperJob(job, fields, String.class, NutchDocument.class,
      IndexerMapper.class);
  job.setNumReduceTasks(0);
  job.setOutputFormatClass(IndexerOutputFormat.class);
  return job;
}
2. Execute the map-related methods: setup(), map(), and cleanup()
public static class IndexerMapper
    extends GoraMapper<String, WebPage, String, NutchDocument> {
  public IndexUtil indexUtil;
  public DataStore<String, WebPage> store;

  protected Utf8 batchId;

  @Override
  public void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    batchId = new Utf8(conf.get(GeneratorJob.BATCH_ID, Nutch.ALL_BATCH_ID_STR));
    indexUtil = new IndexUtil(conf);
    try {
      store = StorageUtils.createWebStore(conf, String.class, WebPage.class);
    } catch (ClassNotFoundException e) {
      throw new IOException(e);
    }
  }

  protected void cleanup(Context context) throws IOException, InterruptedException {
    store.close();
  }

  @Override
  public void map(String key, WebPage page, Context context)
      throws IOException, InterruptedException {
    ParseStatus pstatus = page.getParseStatus();
    if (pstatus == null || !ParseStatusUtils.isSuccess(pstatus)
        || pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
      return; // filter urls not parsed
    }

    Utf8 mark = Mark.UPDATEDB_MARK.checkMark(page);
    if (!batchId.equals(REINDEX)) {
      if (!NutchJob.shouldProcess(mark, batchId)) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("Skipping " + TableUtil.unreverseUrl(key)
              + "; different batch id (" + mark + ")");
        }
        return;
      }
    }

    NutchDocument doc = indexUtil.index(key, page);
    if (doc == null) {
      return;
    }
    if (mark != null) {
      Mark.INDEX_MARK.putMark(page, Mark.UPDATEDB_MARK.checkMark(page));
      store.put(key, page);
    }
    context.write(key, doc);
  }
}
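The skip logic at the top of map() can be reduced to a small decision function. The sketch below uses plain strings instead of Utf8 and WebPage, and its shouldProcess() is an assumption about how NutchJob.shouldProcess behaves (a page qualifies when it carries a mark and either all batches are requested or the mark equals the requested batch id); the class and method names are hypothetical.

```java
class IndexDecisionSketch {
    // Values taken from the usage string: (<batchId> | -all | -reindex).
    static final String ALL_BATCH = "-all";
    static final String REINDEX = "-reindex";

    // Assumed behavior of NutchJob.shouldProcess, restated on plain strings.
    static boolean shouldProcess(String mark, String batchId) {
        if (mark == null) {
            return false;
        }
        return ALL_BATCH.equals(batchId) || mark.equals(batchId);
    }

    // Returns true when map() would go on to build a document for the page.
    static boolean shouldIndex(boolean parsedOk, boolean redirect, String mark, String batchId) {
        if (!parsedOk || redirect) {
            return false; // page was not parsed successfully, or is a redirect
        }
        if (!REINDEX.equals(batchId) && !shouldProcess(mark, batchId)) {
            return false; // page belongs to a different batch
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex(true, false, "batch-1", "batch-1"));
        System.out.println(shouldIndex(true, false, "batch-1", "batch-2"));
        System.out.println(shouldIndex(true, false, null, REINDEX));
    }
}
```

Under these assumptions, -reindex is the only mode that indexes a page without an update mark; in that case map() still writes the document but skips the INDEX_MARK update, because mark is null.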
3. Call context.write(). Because the job was configured with job.setOutputFormatClass(IndexerOutputFormat.class), this is presumably where the index is written (not yet verified).
(3) public class IndexUtil
1. Call the index() method
public NutchDocument index(String key, WebPage page) {
  NutchDocument doc = new NutchDocument();
  doc.add("id", key);
  doc.add("digest", StringUtil.toHexString(page.getSignature()));
  if (page.getBatchId() != null) {
    doc.add("batchId", page.getBatchId().toString());
  }

  String url = TableUtil.unreverseUrl(key);

  if (LOG.isDebugEnabled()) {
    LOG.debug("Indexing URL: " + url);
  }

  try {
    doc = filters.filter(doc, url, page);
  } catch (IndexingException e) {
    LOG.warn("Error indexing " + key + ": " + e);
    return null;
  }

  // skip documents discarded by indexing filters
  if (doc == null) return null;

  float boost = 1.0f;
  // run scoring filters
  try {
    boost = scoringFilters.indexerScore(url, doc, page, boost);
  } catch (final ScoringFilterException e) {
    LOG.warn("Error calculating score " + key + ": " + e);
    return null;
  }

  doc.setScore(boost);
  // store boost for use by explain and dedup
  doc.add("boost", Float.toString(boost));

  return doc;
}
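The shape of index() is a pipeline: seed the document, pass it through the indexing filters, stop if any filter discards it, then attach the boost. The sketch below shows that shape with a Map standing in for NutchDocument. DocFilter, FilterChainSketch and the sample filters are hypothetical, and the assumption that the chain stops at the first null is not confirmed by the code shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an indexing filter.
interface DocFilter {
    // Returns the (possibly modified) document, or null to discard it.
    Map<String, String> filter(Map<String, String> doc, String url);
}

class FilterChainSketch {
    // Runs the filters in order; assumed to stop as soon as one returns null.
    static Map<String, String> runFilters(List<DocFilter> filters,
                                          Map<String, String> doc, String url) {
        for (DocFilter f : filters) {
            doc = f.filter(doc, url);
            if (doc == null) {
                return null; // a filter discarded the document
            }
        }
        return doc;
    }

    // Same outline as index(): seed fields, filter, then store the boost.
    static Map<String, String> index(String key, String url, List<DocFilter> filters) {
        Map<String, String> doc = new HashMap<>();
        doc.put("id", key);
        doc = runFilters(filters, doc, url);
        if (doc == null) {
            return null;
        }
        float boost = 1.0f; // no scoring filters in this sketch
        doc.put("boost", Float.toString(boost));
        return doc;
    }

    public static void main(String[] args) {
        List<DocFilter> filters = new ArrayList<>();
        filters.add((doc, url) -> { doc.put("url", url); return doc; });
        System.out.println(index("example-key", "http://example.com/", filters));
    }
}
```

Returning null as the discard signal is why the caller, map(), has to check doc == null before writing anything to the context.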
III. Field indexing in plugins
1. The basic fields are indexed in public class BasicIndexingFilter implements IndexingFilter
[Nutch source code analysis 2] basic indexing process