Nutch 0.8 Notes: A Google-style Search Engine Implementation

Author: Jiangnan Baiyi

Nutch is a complete web search engine solution built on Lucene, similar in spirit to Google. Its Hadoop-based distributed processing model provides the system's performance, while its Eclipse-like plug-in mechanism keeps the system customizable and easy to integrate into your own applications.

Nutch 0.8 rewrote the backbone code entirely on Hadoop and made sensible fixes in many other places, so the upgrade is worthwhile.

1. Installing and running Nutch 0.8

Chinese installation documentation for Nutch 0.7.2 is plentiful; for Nutch 0.8, see the Tutorial (0.8). Pay attention to the following two points:

First, the urls argument of the crawl command changed from a file to a directory; that is, the original urls file must become a file inside a directory, such as urls/foo.

Second, the http.agent.name property in nutch-default.xml is empty by default and must be set in nutch-site.xml, otherwise an error occurs (see the snippet below).

Also note that log4j now writes the crawl-time information to the logs/ directory. By default it is no longer printed directly to the screen, unless you set fetcher.verbose to true in the configuration file.
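A minimal nutch-site.xml covering both the required agent name and the optional verbose switch might look like the following; the agent-name value is only a placeholder:

<?xml version="1.0"?>
<configuration>
  <!-- required: identify your crawler (the value below is a placeholder) -->
  <property>
    <name>http.agent.name</name>
    <value>MyNutchSpider</value>
  </property>
  <!-- optional: echo per-URL fetch information to the console -->
  <property>
    <name>fetcher.verbose</name>
    <value>true</value>
  </property>
</configuration>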

Luke (http://www.getopt.org/luke) is a must-have tool for inspecting the index.

In addition, Nutch needs to run on Unix; to use it on Windows, install Cygwin first (download the small setup.exe and the online installation finishes quickly).

Finally, the recrawl script for Nutch 0.8 is different.

2. What you should know about Nutch

2.1 Documents

Nutch does not have many documents, and most of them cover installation. To go deeper into Nutch, you must read the following without missing a word:

Introduction to Nutch, Part 1: Crawling and Introduction to Nutch, Part 2: Searching

Then read the source code. Fortunately, Nutch's source is uniform and concise, with nothing fancy, so it is easy to understand.

2.2 Three data directories

First, understand Nutch's three data directories:

1. crawldb and linkdb: the web link directories. They store the URLs and the interconnection relationships between them, and serve as the basis for crawling and re-crawling. A page expires after 30 days by default.

2. segments: the main directory storing the fetched pages. Page content is kept both as raw byte[] content and as parsed text. Nutch crawls breadth-first, so each round of crawling generates a new segment directory.

3. index: the Lucene index directory, the complete index produced by merging all the partial indexes under indexes/. Note that the index only indexes the page content and does not store it, so you must go back to the segments directory to retrieve the page content.
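For orientation, the on-disk layout of a crawl directory looks roughly like this (the segment timestamp is illustrative):

crawl/
  crawldb/                 <- the URL database
  linkdb/                  <- the link database
  segments/
    20060801123456/        <- one segment per crawl round
      content/             <- raw fetched content
      crawl_fetch/         <- fetch status of each URL
      crawl_generate/      <- the fetch list for this round
      crawl_parse/         <- outlinks used to update the crawldb
      parse_data/          <- parsed metadata such as titles and outlinks
      parse_text/          <- parsed plain text
  indexes/                 <- per-segment Lucene indexes
  index/                   <- the final, merged Lucene index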

2.3 The crawling process

The crawling process is described in detail in Introduction to Nutch, Part 1: Crawling; you can also look directly at the Crawl class to understand it.

The original post illustrated this with a figure of the crawl cycle; in short, the crawl is bounded by the entry addresses, the URL regular-expression filter, and the crawl depth.
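The sequence the Crawl class drives can be sketched as the following Java-flavored pseudocode; the variable and method names here paraphrase the steps rather than quote Crawl.java, so check the source for the exact calls:

// Pseudocode outline of the Nutch 0.8 crawl cycle (see Crawl.java).
injector.inject(crawlDb, rootUrlDir);          // seed the crawldb with entry URLs
for (int i = 0; i < depth; i++) {              // breadth-first: one segment per round
  Path segment = generator.generate(crawlDb, segments, topN);
  fetcher.fetch(segment, threads);             // fetch pages via protocol plug-ins
  parser.parse(segment);                       // parse content via parser plug-ins
  crawlDbTool.update(crawlDb, segment);        // fold new outlinks into the crawldb
}
linkDbTool.invert(linkDb, segments);           // build the link database
indexer.index(indexes, crawlDb, linkDb, segments); // index each segment
dedup.dedup(indexes);                          // remove duplicate pages
merger.merge(indexes, index);                  // merge into the final index/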

Because Hadoop is used throughout (more on this later), Nutch's code is written in the Hadoop style to gain distributed capability. You should therefore understand Hadoop first, and know the roles of the Mapper, Reducer, InputFormat, and OutputFormat classes, to read the code more easily.
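For readers new to Hadoop, the shape of a map task in the 0.8-era (pre-generics) org.apache.hadoop.mapred API is sketched below; the class name is made up for illustration, but Nutch jobs such as Injector and Indexer are assembled from exactly these pieces:

import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Illustrative mapper in the old, non-generic Hadoop API used by Nutch 0.8.
public class DemoMapper extends MapReduceBase implements Mapper {
  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter) throws IOException {
    // In a Nutch job the key is typically the URL and the value a CrawlDatum;
    // emit whatever key/value pair the job's reducer expects.
    output.collect(key, value);
  }
}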

1. The Fetcher class: run() starts multiple FetcherThreads, each of which calls the appropriate protocol plug-in (HTTP, FTP, and other protocols are supported) to fetch the content, then calls the appropriate parser to turn the content (HTML, PDF, Excel) into text, placing the result in a FetcherOutput object. Finally, the FetcherOutputFormat class defines how the results are written to disk under segments.

2. The Indexer class: uses Hadoop to traverse all the segments directories, reads each parse_data file back into ParseData objects to obtain the various fields, calls the indexing plug-ins to build the document, and finally has the OutputFormat class write the index.

Note: if you only want Nutch's web crawler rather than its indexing function, you can write your own implementation modeled on Indexer, for example one that moves the segments content straight into a database (see the sketch below).
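As a starting point, here is a hedged sketch that walks one part file of a segment's parse_text with Hadoop's SequenceFile reader instead of indexing it. The path is a placeholder, the println stands in for your database insert, and the key/value classes are created reflectively so the sketch does not hard-code the exact Writable types:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.nutch.util.NutchConfiguration;

// Sketch: dump a segment's parsed text instead of indexing it.
public class SegmentDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // placeholder path: one part file of one segment's parse_text MapFile
    Path data = new Path("crawl/segments/20060801123456/parse_text/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Writable key = (Writable) reader.getKeyClass().newInstance();
    Writable value = (Writable) reader.getValueClass().newInstance();
    while (reader.next(key, value)) {
      // key is the URL, value the ParseText; replace println with a DB insert
      System.out.println(key + "\t" + value);
    }
    reader.close();
  }
}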

3. The fields of each index record in Nutch:

url: the unique identifier of a record, generated by the BasicIndexingFilter class.

segment: generated by the Indexer class. The page content fetched by Nutch stays in the segments directory, and Lucene only indexes it without storing the original, so segment and url together act as a foreign key at query time: the FetchedSegments class retrieves the content from the segments directory according to the HitDetails.

boost: the priority, calculated by the Indexer class calling plug-ins.

title: the display title, indexed and stored by the BasicIndexingFilter plug-in.

content: the main field searched, indexed by the BasicIndexingFilter plug-in.

2.4 The search process

Typical code looks like this:

    // NutchBean (in org.apache.nutch.searcher) wraps access to the index
    // and the segments directories.
    NutchBean bean = new NutchBean();
    Query query = Query.parse(args[0]);
    // up to NUM_HITS results, sorted on the "title" field in reverse order
    Hits hits = bean.search(query, NUM_HITS, "title", true);

    for (int i = 0; i < hits.getLength(); i++) {
      Hit hit = hits.getHit(i);
      HitDetails details = bean.getDetails(hit);

      String title = details.getValue("title");
      String url = details.getValue("url");
      String summary = bean.getSummary(details, query);
    }
Here NutchBean does several things for us:

First, it sorts the results by the title field.

Second, it supports distributed search. If search servers are configured, Hadoop's IPC system is used to call the NutchBeans on all servers and merge the overall results (see the note after this list).

Third, like Google, only the highest-scoring page of each site is shown; to see the other results from the same site, a further API call is needed.

Fourth, it generates the summary: the content is retrieved from the segments directory by segment and url, and fragments containing the keywords are extracted, Google-style, according to certain algorithms.
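On the distributed-search point: as far as I know, the server list is a plain search-servers.txt file placed in the directory named by the searcher.dir property, one host and port per line (the host names below are placeholders), with each listed machine running a search-server process over its local index and segments. Treat the exact file name and setup as assumptions to verify against your version:

search1.example.com 9999
search2.example.com 9999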

3. Modifying the source code or writing plug-ins

Nutch's source code is easy to modify and recompile. Note that newly compiled classes must be packed back into nutch-0.8.job (actually just a jar) to take effect.

Nutch's plug-in mechanism is modeled closely on Eclipse's, and the design is worth borrowing from. Here is the plugin.xml descriptor of the index-basic plug-in:

<plugin id="index-basic" version="1.0.0" provider-name="nutch.org">
   <runtime>
      <library name="index-basic.jar">
         <export name="*"/>
      </library>
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
   <extension id="org.apache.nutch.indexer.basic"
              name="Nutch Basic Indexing Filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="BasicIndexingFilter"
                      class="org.apache.nutch.indexer.basic.BasicIndexingFilter"/>
   </extension>
</plugin>
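To plug code into the extension point declared above, you implement the IndexingFilter interface. Below is a rough sketch: the filter() signature shown is my recollection of the 0.8-era API (URLs passed as Hadoop UTF8), so verify it against IndexingFilter in your source tree, and the "site" field logic is purely illustrative:

import java.net.MalformedURLException;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.UTF8;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

// Illustrative indexing filter: adds an untokenized "site" field to each document.
public class SiteIndexingFilter implements IndexingFilter {
  private Configuration conf;

  public Document filter(Document doc, Parse parse, UTF8 url,
                         CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    try {
      String host = new URL(url.toString()).getHost();
      doc.add(new Field("site", host, Field.Store.YES, Field.Index.UN_TOKENIZED));
    } catch (MalformedURLException e) {
      // skip the extra field on malformed URLs
    }
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}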
