Nutch Index Source Code Analysis (II)

Source: Internet
Author: User

Continuing from the previous article.

In the previous article, Nutch used a MapReduce job to clean the crawled documents and wrap them into a list of actions (NutchIndexAction).

The next step is to describe how the crawled data is pushed to Solr.

----------------------------------------------------Gorgeous split Line---------------------------------------------

Nutch defines a custom IndexerOutputFormat that extends FileOutputFormat and overrides the getRecordWriter method.

The code is as follows:

public class IndexerOutputFormat extends FileOutputFormat<Text, NutchIndexAction> {

  @Override
  public RecordWriter<Text, NutchIndexAction> getRecordWriter(FileSystem ignored,
      JobConf job, String name, Progressable progress) throws IOException {
    ...
  }
}

Inside this method, Nutch calls its custom interface NutchIndexWriter. Let's look at one of its implementation classes, SolrWriter, and focus on its close method. The code is as follows:

public void close() throws IOException {
  try {
    // Send any documents still buffered in memory as one update request.
    if (!inputDocs.isEmpty()) {
      LOG.info("Indexing " + Integer.toString(inputDocs.size()) + " documents");
      if (numDeletes > 0) {
        LOG.info("Deleting " + Integer.toString(numDeletes) + " documents");
      }
      UpdateRequest req = new UpdateRequest();
      req.add(inputDocs);
      req.setParams(params);
      req.process(solr);
      inputDocs.clear();
    }
    // Ask the Solr server to commit the pending changes.
    solr.commit();
  } catch (final SolrServerException e) {
    throw makeIOException(e);
  }
}

On close, Nutch sends the buffered, encapsulated documents to the Solr server as an HTTP POST request. The solr.commit() call corrects an inaccuracy in the previous article: Solr's entire transaction is handled on the Solr server side, unlike the transaction described earlier, and this call sends a further request that commits the current transaction.
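To make the two-request pattern concrete, here is a minimal sketch that runs without Solr or Hadoop. It is not Nutch's code: IndexClient, RecordingClient and BufferingWriter are hypothetical names invented for this illustration, and the client only records what would be sent instead of making HTTP calls. It shows the same idea as close(): flush whatever is still buffered in one request, then send a separate commit.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedIndexSketch {

    /** Hypothetical stand-in for a Solr client; this is NOT a SolrJ type. */
    interface IndexClient {
        void add(List<String> docs); // one update request carrying a batch of documents
        void commit();               // a separate request asking the server to commit
    }

    /** Records requests instead of sending them, so the sketch runs offline. */
    static class RecordingClient implements IndexClient {
        final List<String> log = new ArrayList<>();

        @Override
        public void add(List<String> docs) {
            log.add("add:" + docs.size());
        }

        @Override
        public void commit() {
            log.add("commit");
        }
    }

    /** Buffers documents in memory and flushes them when closed. */
    static class BufferingWriter implements AutoCloseable {
        private final IndexClient client;
        private final List<String> buffer = new ArrayList<>();

        BufferingWriter(IndexClient client) {
            this.client = client;
        }

        void write(String doc) {
            buffer.add(doc);
        }

        @Override
        public void close() {
            if (!buffer.isEmpty()) {
                // First request: send the remaining documents as one batch.
                client.add(new ArrayList<>(buffer));
                buffer.clear();
            }
            // Second request: the commit happens on the server side.
            client.commit();
        }
    }

    public static void main(String[] args) {
        RecordingClient client = new RecordingClient();
        try (BufferingWriter writer = new BufferingWriter(client)) {
            writer.write("doc-1");
            writer.write("doc-2");
        }
        System.out.println(client.log); // prints [add:2, commit]
    }
}
```

The point of the sketch is that adding documents and committing are two independent requests, so the client side holds no transaction state of its own.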

The index additions and deletions mentioned earlier are also implemented in this class; if you want to know more, read the class implementation.

At this point, Nutch's entire indexing process is basically complete.

Because time was short, I have only outlined the general process. If you are interested in the details, leave a comment or send me a private message on the site.

This is my first write-up, so comments are welcome.
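As a rough illustration of how add and delete actions could be routed to a writer, here is a small self-contained sketch. It is an assumption about the general shape, not a copy of Nutch's code: IndexAction, IndexWriter, LoggingWriter and dispatch are hypothetical names, and documents are plain strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ActionDispatchSketch {

    enum ActionType { ADD, DELETE }

    /** Hypothetical simplified action: a key, an optional document, and what to do. */
    static class IndexAction {
        final ActionType type;
        final String key;
        final String doc; // null for deletes

        IndexAction(ActionType type, String key, String doc) {
            this.type = type;
            this.key = key;
            this.doc = doc;
        }
    }

    /** Hypothetical writer interface with the two operations discussed above. */
    interface IndexWriter {
        void write(String doc);
        void delete(String key);
    }

    /** Records calls so the routing can be inspected without a real index. */
    static class LoggingWriter implements IndexWriter {
        final List<String> calls = new ArrayList<>();

        @Override
        public void write(String doc) {
            calls.add("write:" + doc);
        }

        @Override
        public void delete(String key) {
            calls.add("delete:" + key);
        }
    }

    /** Routes one action to every configured writer based on its type. */
    static void dispatch(IndexAction action, List<? extends IndexWriter> writers) {
        for (IndexWriter writer : writers) {
            switch (action.type) {
                case ADD:
                    writer.write(action.doc);
                    break;
                case DELETE:
                    writer.delete(action.key);
                    break;
            }
        }
    }

    public static void main(String[] args) {
        LoggingWriter writer = new LoggingWriter();
        List<LoggingWriter> writers = Collections.singletonList(writer);
        dispatch(new IndexAction(ActionType.ADD, "url-1", "page body"), writers);
        dispatch(new IndexAction(ActionType.DELETE, "url-2", null), writers);
        System.out.println(writer.calls); // prints [write:page body, delete:url-2]
    }
}
```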
