Author: Jiangnan Baiyi
Nutch is a complete web search engine solution built on Lucene, similar in spirit to Google. Its Hadoop-based distributed processing model takes care of scale and performance, and its Eclipse-like plug-in mechanism makes the system customizable and easy to embed in your own applications.
Nutch 0.8 rewrites the backbone code entirely on top of Hadoop and makes many other sensible corrections, so it is well worth upgrading to.
1. Installing and running Nutch 0.8
There is already plenty of Chinese installation documentation for Nutch 0.7.2; for Nutch 0.8, see the official Tutorial (0.8). Two points deserve attention:
First, the urls argument of the crawl command now names a directory rather than a file, so what used to be urls must become urls/foo.
Second, the http.agent.name property is empty by default in nutch-default.xml; you must set it in nutch-site.xml, otherwise an error occurs.
Note that the output produced during crawling is now written with log4j to the logs/ directory; by default it is no longer printed to the screen unless you set fetcher.verbose to true in the configuration file.
Luke (http://www.getopt.org/luke) is a must-have tool for reading the index.
In addition, Nutch needs to run on Unix; to use it on Windows, install Cygwin first (download the small setup.exe and the online installation finishes quickly).
Finally, note that the recrawl script for Nutch 0.8 is also different.
2. The Nutch you should know
2.1 One document
Nutch does not have much documentation, and most of it covers installation. To dig into Nutch you must read, without skipping a word:
Introduction to Nutch, Part 1: Crawling and Introduction to Nutch, Part 2: Searching
Then read the source code. Fortunately, Nutch's source is uniform and concise, with nothing fancy, so it is easy to understand.
2.2 Three data directories
First, get to know Nutch's three data directories:
1. crawldb, linkdb: the web link databases, storing the URLs and the link relationships between them. They are the basis for crawling and re-crawling; pages expire after 30 days by default.
2. segments: the main directory holding the fetched pages, stored both as raw content (byte[]) and as parsed text. Nutch crawls breadth-first, so each round of crawling produces a new segment directory.
3. index: the Lucene index directory, the complete index obtained by merging all the partial indexes under indexes. Note that the index only indexes the page content without storing it, so you must go back to the segments directory to get the page content, as the sketch below illustrates.
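An easy way to see this for yourself is to open the merged index with plain Lucene (Nutch 0.8 bundles Lucene 1.9) and look at one stored document. A throwaway sketch, assuming your crawl directory puts the merged index at something like crawl/index:
// Quick check that "content" is indexed but not stored: stored fields come
// back from Document.get(), and an unstored field simply returns null.
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class InspectIndex {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(args[0]);        // e.g. crawl/index
    Document doc = reader.document(0);                     // first indexed page
    System.out.println("url     = " + doc.get("url"));
    System.out.println("title   = " + doc.get("title"));
    System.out.println("content = " + doc.get("content")); // null: not stored
    reader.close();
  }
}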
2.3 The crawling process
The crawling process is described in detail in Introduction to Nutch, Part 1: Crawling; you can also read the Crawl class directly to understand it.
Pictured more intuitively, the crawl is bounded by three things: the entry (seed) URLs, the URL filter regular expressions, and the crawl depth.
Because Nutch uses Hadoop (more on that later), its code is written in the Hadoop style to gain distributed processing. You will read it much more easily if you first understand Hadoop and the roles of the Mapper, Reducer, InputFormat, and OutputFormat classes.
1. The Fetcher class: run() starts multiple FetcherThreads; each thread calls the appropriate protocol plug-in (HTTP, FTP, and other protocols are supported) to fetch the content, calls the appropriate parser to turn the content into text, and puts the result into a FetcherOutput; finally the FetcherOutputFormat class defines how it is all written out to disk under segments.
2. The Indexer class: uses Hadoop to traverse all the segments directories, reads the parse_data files back into ParseData objects, pulls the various fields out of them, calls the indexing plug-ins, and finally has the OutputFormat class write the index.
Note: if you only want Nutch's crawler and not its indexing, you can write your own implementation along the lines of Indexer, for example one that moves the segments content straight into a database; a rough sketch follows below.
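The sketch below walks one segment's parsed text directly and just prints it; you would replace the print with inserts into your database. It assumes, as described above, that parse_text is a Hadoop MapFile keyed by URL (UTF8) with ParseText values, living under segments/<segment>/parse_text/part-00000; verify those details against your Nutch 0.8 tree.
// Rough sketch: read one segment's parse_text MapFile directly, bypassing Indexer.
// Assumptions (check your Nutch 0.8 source): keys are UTF8 URLs, values are
// ParseText, and the MapFile sits at segments/<segment>/parse_text/part-00000.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.UTF8;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.util.NutchConfiguration;

public class DumpSegmentText {
  public static void main(String[] args) throws Exception {
    String part = args[0];            // e.g. crawl/segments/2006xxxxxxxxxx/parse_text/part-00000
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader = new MapFile.Reader(fs, part, conf);
    UTF8 url = new UTF8();
    ParseText text = new ParseText();
    while (reader.next(url, text)) {  // iterate url -> parsed text
      // Replace this with an INSERT into your own database.
      System.out.println(url + "\t" + text.toString().length() + " chars");
    }
    reader.close();
  }
}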
3. The fields of each index record in Nutch:
url: the unique key of the record, generated by the BasicIndexingFilter class.
segment: generated by the Indexer class. The page content fetched by Nutch lives in the segments directory, and Lucene only indexes it without storing the original, so at query time segment and url together act as a foreign key: the FetchedSegments class uses the HitDetails to fetch the content from the segments directory.
boost: the priority, computed by the Indexer class by calling plug-ins.
title: the displayed title, both indexed and stored by the BasicIndexingFilter plug-in.
content: the field that is mainly searched, indexed (but not stored) by the BasicIndexingFilter plug-in.
2.4 The search process
Typical search code looks like this:
NutchBean bean = new NutchBean();
Query query = Query.parse(args[0]);
Hits hits = bean.search(query, NUM_HITS, "title", true);
for (int i = 0; i < hits.getLength(); i++) {
  Hit hit = hits.getHit(i);
  HitDetails details = bean.getDetails(hit);
  String title = details.getValue("title");
  String url = details.getValue("url");
  String summary = bean.getSummary(details, query);
}
Here NutchBean does several things for us:
First, it sorts the results by the title field.
Second, it supports distributed search: if search servers are configured, it calls the NutchBeans on all the servers through Hadoop's IPC system and merges their results into the final answer.
Third, each site contributes only its highest-scoring page; to see the other results from the same site you have to go through moreHitsExclude[].
Fourth, it generates the summary: it reads the segments directory, fetches the content by segment and URL, and extracts fragments of the document containing the keywords according to certain rules.
3. Modifying the source code or writing plug-ins
Nutch's source code is easy to modify and recompile; just note that the newly compiled classes must be packed back into nutch-0.8.job (which is really a jar) before they take effect.
Nutch's plug-in mechanism and descriptors are similar to Eclipse's, so the same ideas carry over. The descriptor of the index-basic plug-in, for example, looks like this:
<plugin id="index-basic" version="1.0.0" provider-name="nutch.org">
   <runtime>
      <library name="index-basic.jar">
         <export name="*"/>
      </library>
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
   <extension id="org.apache.nutch.indexer.basic"
              name="Nutch Basic Indexing Filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="BasicIndexingFilter"
                      class="org.apache.nutch.indexer.basic.BasicIndexingFilter"/>
   </extension>
</plugin>
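For the Java side of such a plug-in, a minimal sketch modeled on BasicIndexingFilter might look like the following. The package and class names here are made up, and the exact filter() signature and the Configurable methods are assumptions to verify against the IndexingFilter extension point in your Nutch 0.8 source tree:
// A made-up indexing-filter plug-in, modeled on BasicIndexingFilter.
// Assumed (verify in your Nutch 0.8 tree): IndexingFilter declares
// filter(Document, Parse, UTF8, CrawlDatum, Inlinks) and extends Configurable.
package org.apache.nutch.indexer.example;             // hypothetical package

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.UTF8;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class ExampleIndexingFilter implements IndexingFilter {
  private Configuration conf;

  // Add one extra field, then hand the document on to the next filter in the chain.
  public Document filter(Document doc, Parse parse, UTF8 url,
                         CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    doc.add(new Field("example", "some value",
                      Field.Store.YES, Field.Index.UN_TOKENIZED));
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}
Such a class is registered through an <implementation> element like the one above, packaged with a matching plugin.xml under the plugins directory, and enabled by adding its plug-in id to the plugin.includes property in nutch-site.xml.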
Finally, a bit of gossip: Doug Cutting's talk on the development of search engines, translated by comrade dedian.