Nutch 0.8 Notes: A Google-Style Search Engine Implementation

Nutch is a Lucene-based, complete web search engine solution similar to Google's. Its Hadoop-based distributed processing model ensures system performance, and an Eclipse-like plug-in mechanism keeps the system customizable and easy to integrate into your own applications.

Nutch 0.8 rewrites the backbone code entirely on top of Hadoop, and many other parts have received sensible fixes, so it is worth the upgrade.

1. Install and run Nutch 0.8

Chinese installation guides for Nutch 0.7.2 are all over the place; for installing Nutch 0.8, see the official Tutorial (0.8). Pay attention to the following two points:

First, the urls parameter of the crawl command now names a directory rather than a file; the old single urls file must become a file inside that directory, e.g. urls/foo, as in the sketch below.
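A minimal sketch of such a run (the seed URL and the -dir/-depth/-topN values are only illustrative):

    $ mkdir urls
    $ echo 'http://lucene.apache.org/nutch/' > urls/foo
    $ bin/nutch crawl urls -dir crawl -depth 3 -topN 50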

Second, the http.agent.name property in nutch-default.xml is empty by default and must be set in nutch-site.xml; otherwise an error occurs.

Also note that the information produced during crawling now goes through log4j into the logs/ directory; by default it is no longer printed to the screen, unless you set fetcher.verbose to true in the configuration file. A nutch-site.xml covering both points is sketched below.
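A minimal nutch-site.xml covering both points, assuming the Hadoop-style <configuration> root that 0.8 uses (the agent name value is just a placeholder):

    <?xml version="1.0"?>
    <configuration>
      <!-- required: identify your crawler; an empty value makes the fetcher fail -->
      <property>
        <name>http.agent.name</name>
        <value>MyNutchSpider</value>
      </property>
      <!-- optional: make the fetcher log verbosely -->
      <property>
        <name>fetcher.verbose</name>
        <value>true</value>
      </property>
    </configuration>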

Luke (http://www.getopt.org/luke) is an indispensable tool for reading the index.

In addition, Nutch needs to run on Unix; to use it on Windows, install Cygwin first (download the local setup.exe and the online installation finishes quickly).

Finally, the recrawl script for Nutch 0.8 is also different from before.

2. What you should know about Nutch
2.1 Documentation

Nutch does not have much documentation, and most of it covers installation. To dig into Nutch, the following must be read without missing a word:

Introduction to Nutch, Part 1: Crawling and Introduction to Nutch, Part 2: Searching.

After that, read the source code. Fortunately, Nutch's source is uniform and concise, with nothing fancy, so it is easy to understand.

2.2 Three data directories

First, understand Nutch's three data directories:

1. crawldb, linkdb: the web link directories, which store the URLs and the link relationships between them. They serve as the basis for crawling and re-crawling; a page expires after 30 days by default.

2. segments: the main directory, which stores the fetched pages, both as raw byte[] content and as parsed text. Nutch crawls breadth-first, so each round of crawling generates a new segment directory.

3. index: the Lucene index directory, the complete index produced by merging all the partial indexes under indexes. Note that the index only indexes the page content and does not store it, so you must go back to the segments directory to retrieve page content. The layout sketched below shows where everything lives.
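After a bin/nutch crawl run the output directory typically looks like this (the timestamped segment names are illustrative):

    crawl/
      crawldb/
      linkdb/
      segments/
        20060801121725/
        20060801123014/
      indexes/
      index/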

2.3 The crawling process

The crawling process is described in detail in Introduction to Nutch, Part 1: Crawling; reading the Crawl class is also a good way to understand it.

A more intuitive picture (figure omitted): the crawl is bounded by three things, namely the entry (seed) URLs, the URL filter regular expressions, and the crawl depth. An example filter file follows.
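For example, the regular expressions applied by the crawl command live in conf/crawl-urlfilter.txt; a sketch that restricts a crawl to one site (apache.org is just the stock tutorial example):

    # skip common binary file types
    -\.(gif|jpg|png|ico|css|zip|ppt|xls|gz|rpm|tgz|mov|exe)$
    # accept anything under apache.org
    +^http://([a-z0-9]*\.)*apache.org/
    # reject everything else
    -.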

Because Hadoop is used (more on this later), Nutch's code is written in the Hadoop style to gain distributed capability. You should therefore understand Hadoop first; knowing the roles of the Mapper, Reducer, InputFormat, and OutputFormat classes makes the code much easier to read. A minimal Mapper sketch follows.
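For orientation, here is a minimal do-nothing mapper against the pre-generics org.apache.hadoop.mapred API of the Hadoop 0.x line that Nutch 0.8 builds on (the class name is invented; verify the interface shape against your Hadoop version):

    import java.io.IOException;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class IdentityStyleMapper implements Mapper {
      public void configure(JobConf job) { /* read job parameters here */ }
      public void close() throws IOException { }

      // Called once per input record; whatever is collect()ed goes on to the Reducer.
      public void map(WritableComparable key, Writable value,
                      OutputCollector output, Reporter reporter) throws IOException {
        output.collect(key, value);
      }
    }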

1. The Fetcher class. Its run() method launches multiple FetcherThreads; each thread calls the appropriate protocol plug-in (HTTP, FTP, and other protocols are supported) to fetch the content, then calls the appropriate parser to turn the content (HTML, PDF, Excel) into text. The results are placed in the FetcherOutput class, and the FetcherOutputFormat class defines how they are written to disk as a segment.

2. The Indexer class. It uses Hadoop to traverse all the segments directories, deserializes the parse_data files into ParseData objects to obtain the various fields, calls the plug-ins to do the indexing, and finally has the OutputFormat class write out the index.

Note: if you only want Nutch's web crawler and not its indexing, you can write your own implementation modeled on Indexer, for example one that moves segment content directly into a database; a sketch follows.
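A hypothetical sketch of that idea, assuming the 0.8-era types (UTF8 keys and MapFile-based parse_text data; the segment path and class name are invented):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.UTF8;
    import org.apache.nutch.parse.ParseText;
    import org.apache.nutch.util.NutchConfiguration;

    public class DumpSegmentText {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // args[0] e.g. crawl/segments/20060801121725/parse_text/part-00000
        MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);
        UTF8 url = new UTF8();
        ParseText text = new ParseText();
        while (reader.next(url, text)) {
          // Replace this println with an INSERT into your database.
          System.out.println(url + "\t" + text.getText());
        }
        reader.close();
      }
    }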

3. The fields of each index record in Nutch:

url: the unique key, generated by the BasicIndexingFilter class.

segment: generated by the Indexer class. The page content Nutch fetches lives in the segments directory, and Lucene only indexes it without storing the original, so queries must use segment plus url as a foreign key; the FetchedSegments class retrieves the content from the segments directory.

boost: the priority score, calculated by the Indexer class via plug-ins.

title: the display title; indexed and stored by the BasicIndexingFilter plug-in.

content: the main field searched; indexed by the BasicIndexingFilter plug-in.

2.4 The search process

Typical code looks like this:

    // All of these classes come from org.apache.nutch.searcher.
    NutchBean bean = new NutchBean();
    Query query = Query.parse(args[0]);
    // Ask for NUM_HITS results, sorted on the "title" field (see below).
    Hits hits = bean.search(query, NUM_HITS, "title", true);

    for (int i = 0; i < hits.getLength(); i++) {
      Hit hit = hits.getHit(i);
      HitDetails details = bean.getDetails(hit);

      String title = details.getValue("title");
      String url = details.getValue("url");
      String summary = bean.getSummary(details, query);
    }
Here NutchBean does several things for us:

First, it sorts the results by the title field.

Second, distributed search is supported: if search servers are configured, Hadoop's IPC system is used to call the NutchBeans on all servers and merge their results into the final result set.

Third, like Google, each site contributes only its single highest-scoring page; to see the other results from the same site, you need a further API call.

Fourth, it generates the summary: the content is fetched from the segments directory by segment and url, and, as Google does, fragments containing the keywords are extracted according to certain algorithms.

3. Modifying the source code or writing plug-ins

Nutch's source code is easy to modify and recompile; note that newly compiled classes must be packed back into nutch-0.8.job (actually a jar) to take effect.

Nutch's plug-in mechanism, and the degree to which it is used, is similar to Eclipse's. Following http://wiki.apache.org/nutch/WritingPluginExample, you implement a plug-in interface and then declare the class, extension point, and dependent jars in plugin.xml, for example:


    <plugin
       id="index-basic"
       version="1.0.0"
       provider-name="nutch.org">

       <runtime>
          <library name="index-basic.jar">
             <export name="*"/>
          </library>
       </runtime>

       <requires>
          <import plugin="nutch-extensionpoints"/>
       </requires>

       <extension id="org.apache.nutch.indexer.basic"
                  name="Nutch Basic Indexing Filter"
                  point="org.apache.nutch.indexer.IndexingFilter">
          <implementation id="BasicIndexingFilter"
                          class="org.apache.nutch.indexer.basic.BasicIndexingFilter"/>
       </extension>
    </plugin>
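To go with it, a hypothetical filter skeleton (the class name and the extra field are invented for illustration, and the filter() signature follows the 0.8-era IndexingFilter interface as I recall it; verify against IndexingFilter.java in your source tree):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.UTF8;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.parse.Parse;

    public class MyIndexingFilter implements IndexingFilter {
      private Configuration conf;

      // Signature assumed from the 0.8-era interface; later versions differ.
      public Document filter(Document doc, Parse parse, UTF8 url,
                             CrawlDatum datum, Inlinks inlinks)
          throws IndexingException {
        // Invented example: index the parsed text under an extra field.
        String text = parse.getText();
        if (text != null) {
          doc.add(new Field("mycopy", text, Field.Store.NO, Field.Index.TOKENIZED));
        }
        return doc;
      }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }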

Finally, some gossip: Doug Cutting on the development of search engines, translated by comrade dedian. And thanks, too, to the translator who coined the phrase "C++ must know". :)
