Nutch Index Source Code Interpretation (II)
Blog category: Big data processing research
Tags: Nutch, Solr, Hadoop, crawler
Continuing from the previous article ~~~~~
In the previous article, Nutch used a MapReduce (MR) job to clean the crawled documents and wrap them into a list of actions.
This article describes how the crawled data is then pushed to Solr.
---------------------------------------------------- Fancy divider ----------------------------------------------------
Nutch defines a custom IndexerOutputFormat that extends FileOutputFormat and overrides the getRecordWriter method.
The code is as follows:
// Uses the old JobConf-based org.apache.hadoop.mapred API.
public class IndexerOutputFormat extends FileOutputFormat<Text, NutchIndexAction> {
    @Override
    public RecordWriter<Text, NutchIndexAction> getRecordWriter(FileSystem ignored,
            JobConf job, String name, Progressable progress) throws IOException {
        ...
    }
}
Inside this method, Nutch calls its own interface, NutchIndexWriter. Let's look at one of its implementations, SolrWriter, and in particular at its close method. The code is as follows:
public void close() throws IOException {
    try {
        // Flush any documents that are still buffered locally.
        if (!inputDocs.isEmpty()) {
            LOG.info("Indexing " + Integer.toString(inputDocs.size()) + " documents");
            if (numDeletes > 0) {
                LOG.info("Deleting " + Integer.toString(numDeletes) + " documents");
            }
            UpdateRequest req = new UpdateRequest();
            req.add(inputDocs);
            req.setParams(params);
            req.process(solr);
            inputDocs.clear();
        }
        // Ask the Solr server to commit, whether or not anything was flushed.
        solr.commit();
    } catch (final SolrServerException e) {
        throw makeIOException(e);
    }
}
When the writer is closed, Nutch sends any documents that are still buffered to the Solr server as one update request over HTTP POST. The solr.commit() call corrects an inaccuracy in the previous article: Solr's transaction is handled entirely on the Solr server side, unlike the transaction described earlier, so this call sends one more request that commits the current transaction.
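The flush-then-commit behaviour of close can be sketched without a Solr dependency. What follows is a minimal illustration in plain Java, not the actual Nutch or SolrJ code: IndexServer, RecordingServer, and BufferingWriter are hypothetical names invented for this example, documents are plain strings, and error handling is left out. It only shows that the buffered documents go out in one update request and that the commit is a separate request handled by the server.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a remote index server; this is NOT the SolrJ API.
interface IndexServer {
    void update(List<String> docs); // one batched update request
    void commit();                  // a separate commit request
}

// In-memory fake that records each request it receives, in order.
class RecordingServer implements IndexServer {
    final List<String> requests = new ArrayList<>();  // log of requests received
    final List<String> pending = new ArrayList<>();   // received but not yet committed
    final List<String> committed = new ArrayList<>(); // committed documents

    @Override
    public void update(List<String> docs) {
        requests.add("update:" + docs.size());
        pending.addAll(docs);
    }

    @Override
    public void commit() {
        requests.add("commit");
        committed.addAll(pending); // the server finalizes its own transaction
        pending.clear();
    }
}

// Hypothetical writer that mirrors the shape of the close method shown above.
class BufferingWriter {
    private final IndexServer server;
    private final List<String> inputDocs = new ArrayList<>();

    BufferingWriter(IndexServer server) {
        this.server = server;
    }

    // Buffer a document locally; nothing is sent to the server yet.
    void write(String doc) {
        inputDocs.add(doc);
    }

    // Send whatever is still buffered in one request, then request a commit.
    void close() {
        if (!inputDocs.isEmpty()) {
            server.update(new ArrayList<>(inputDocs)); // copy, so clear() below cannot affect the server
            inputDocs.clear();
        }
        server.commit(); // sent even when the buffer was empty
    }
}
```

With this sketch, writing two documents to a BufferingWriter backed by a RecordingServer and then calling close leaves the request log as "update:2" followed by "commit". The writer itself never holds a transaction; it only asks the server to commit.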
The index additions and deletions mentioned earlier are also implemented in this class; see its source if you want to know more.
At this point, Nutch's entire indexing process is basically complete.
Because I was short on time, I have only outlined the general process. If you are interested in the details, leave a comment or send me a private message on the site.
This is my first write-up, so comments are welcome ~ ~ ~ ~