Introduction to Lucene and Nutch

Source: Internet
Author: User

Lucene

Introduction

Lucene began as a subproject of the Apache Software Foundation's Jakarta project. It is an open source full-text search toolkit rather than a complete full-text search engine: it provides an architecture consisting of a complete query engine, a complete indexing engine, and a partial text analysis engine (covering two Western languages, English and German). Lucene's purpose is to give software developers an easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on top of it.

Advantages

The index file format is independent of the application platform. Lucene defines an index file format based on 8-bit bytes, so compatible systems or applications on different platforms can share the index files they build.

Building on the inverted index of traditional full-text search engines, Lucene implements block indexing: it can build a small index file for newly added files, which speeds up indexing, and then merge it with the existing index to optimize the result.
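The idea of building a small index for new files and then merging it into the existing one can be illustrated with a minimal sketch. It uses only the Java standard library, and the class and method names (SegmentMergeSketch, buildSegment, merge) are hypothetical; Lucene's real index structures and merge logic are considerably more involved.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of "small index per batch, then merge"; not Lucene's real classes.
class SegmentMergeSketch {

    // Build one small inverted index: term -> sorted set of document ids.
    static Map<String, TreeSet<Integer>> buildSegment(Map<Integer, String> docs) {
        Map<String, TreeSet<Integer>> postings = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            for (String term : doc.getValue().toLowerCase().split("\\s+")) {
                if (term.isEmpty()) continue;
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return postings;
    }

    // Merge several segments into one by unioning the posting list of each term.
    static Map<String, TreeSet<Integer>> merge(List<Map<String, TreeSet<Integer>>> segments) {
        Map<String, TreeSet<Integer>> merged = new TreeMap<>();
        for (Map<String, TreeSet<Integer>> segment : segments) {
            for (Map.Entry<String, TreeSet<Integer>> e : segment.entrySet()) {
                merged.computeIfAbsent(e.getKey(), t -> new TreeSet<>()).addAll(e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // The existing index and a small index for newly added files.
        Map<String, TreeSet<Integer>> older = buildSegment(Map.of(1, "apache lucene", 2, "lucene index"));
        Map<String, TreeSet<Integer>> newer = buildSegment(Map.of(3, "nutch uses lucene"));
        System.out.println(merge(List.of(older, newer)).get("lucene"));
    }
}
```

Indexing the new batch touches only its own small map, which is why adding documents stays cheap; the cost of combining posting lists is paid once, at merge time.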

The text analysis interface is designed to be independent of language and file format. The indexer builds index files by consuming a token stream, so to support a new language or file format a user only needs to implement the text analysis interface.
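A rough sketch of this separation, using only the Java standard library, might look as follows. The interface TextAnalyzer and both implementations are hypothetical names chosen for illustration; Lucene's actual Analyzer and TokenStream classes have a different shape.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Hypothetical names throughout; Lucene's real analysis API differs.
class AnalyzerSketch {

    // The only contract the indexer depends on: text in, tokens out.
    interface TextAnalyzer {
        Iterator<String> tokenStream(String text);
    }

    // A simple analyzer for languages that separate words, such as English.
    static class SimpleWesternAnalyzer implements TextAnalyzer {
        public Iterator<String> tokenStream(String text) {
            List<String> tokens = new ArrayList<>();
            // Split on any run of characters that are neither letters nor digits.
            for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {
                if (!raw.isEmpty()) tokens.add(raw.toLowerCase(Locale.ROOT));
            }
            return tokens.iterator();
        }
    }

    // A character-bigram analyzer, a common fallback for text without word spaces.
    static class BigramAnalyzer implements TextAnalyzer {
        public Iterator<String> tokenStream(String text) {
            List<String> tokens = new ArrayList<>();
            String compact = text.replaceAll("\\s+", "");
            for (int i = 0; i + 2 <= compact.length(); i++) {
                tokens.add(compact.substring(i, i + 2));
            }
            return tokens.iterator();
        }
    }

    // Stand-in for the indexer: it consumes the stream and never inspects the language.
    static List<String> collectTokens(TextAnalyzer analyzer, String text) {
        List<String> out = new ArrayList<>();
        analyzer.tokenStream(text).forEachRemaining(out::add);
        return out;
    }
}
```

Calling collectTokens with a different TextAnalyzer is the only change needed to support another language, which is the extension point the paragraph describes.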

A powerful query engine is implemented by default, so users get strong query capabilities without writing their own code. Lucene's default query implementation supports Boolean operations, fuzzy queries (fuzzy search), grouped queries, and so on.
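To make these query types concrete, here is a minimal standard-library sketch of Boolean operations and fuzzy matching over an inverted index. All names are hypothetical, and the fuzzy match is assumed to use plain Levenshtein edit distance with a linear scan over terms; this is not how Lucene's query classes are implemented.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; the index maps each term to the ids of documents containing it.
class QuerySketch {

    // Boolean AND: documents present in both posting sets.
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.retainAll(b);
        return r;
    }

    // Boolean OR: documents present in either posting set.
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.addAll(b);
        return r;
    }

    // Boolean AND NOT: documents in a that are absent from b.
    static Set<Integer> andNot(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.removeAll(b);
        return r;
    }

    // Levenshtein distance computed with two rolling rows.
    static int editDistance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) prev[j] = j;
        for (int i = 1; i <= s.length(); i++) {
            int[] cur = new int[t.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[t.length()];
    }

    // Fuzzy query: union the postings of every indexed term within maxEdits of the query term.
    static Set<Integer> fuzzy(Map<String, Set<Integer>> index, String term, int maxEdits) {
        Set<Integer> hits = new TreeSet<>();
        for (Map.Entry<String, Set<Integer>> e : index.entrySet()) {
            if (editDistance(e.getKey(), term) <= maxEdits) hits.addAll(e.getValue());
        }
        return hits;
    }
}
```

A grouped query such as (a OR b) AND NOT c is then just a composition of these calls: andNot(or(a, b), c).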

Importing the JAR Packages

Seven packages need to be imported: analysis, document, index, queryparser, search, store, and util.

Nutch

Introduction

Nutch is an open source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler.

Composition

Nutch consists of two parts: the crawler and the searcher. The crawler fetches pages from the Web and builds indexes for them. The searcher uses those indexes to look up the user's search keywords and produce search results. The index is the only interface between the two, so apart from the index the two parts are very loosely coupled.

The main reason for separating the crawler and the searcher is so that the two parts can be distributed across hardware, for example by placing the crawler and the searcher on separate hosts, which can improve performance.

Crawler

Two aspects of the crawler matter most: its workflow, and the format and meaning of the data files involved. The data files fall into three categories: the Web database, a series of segments, and the index. Their physical files are stored under the crawl results directory, in the WebDB subfolder of the db directory, the segments folder, and the index folder respectively. So what information does each of the three store?

A crawl produces many segments. Each segment stores the pages the crawler fetched in one crawl cycle, together with the index of those pages. During a crawl, the crawler uses the links in WebDB and a crawling strategy to generate the fetchlist needed for each crawl cycle. The fetcher then fetches and indexes the pages at the URLs in the fetchlist and stores them in a segment. Segments have a limited lifetime: once the crawler has fetched these pages again, the segments produced by the earlier fetch become obsolete. On disk, each segment folder is named after its creation time, which makes it easy to delete obsolete segments and save storage space.
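The crawl cycle and the time-based segment naming can be sketched roughly as below, using only the Java standard library. Several details are assumptions made for illustration and are not taken from Nutch itself: the crawling strategy is reduced to "fetch every URL whose due time has passed", the segment name format is assumed to be yyyyMMddHHmmss, and all class and method names are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of fetchlist generation and segment expiry; not Nutch's real code.
class CrawlCycleSketch {

    // Assumed naming scheme: the segment folder name encodes its creation time.
    static final DateTimeFormatter SEGMENT_NAME = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Assumed strategy: the fetchlist holds URLs that are due at or before "now",
    // in sorted order and capped at "limit" entries.
    static List<String> generateFetchlist(Map<String, LocalDateTime> dueTimes, LocalDateTime now, int limit) {
        List<String> due = new ArrayList<>();
        for (Map.Entry<String, LocalDateTime> e : dueTimes.entrySet()) {
            if (!e.getValue().isAfter(now)) due.add(e.getKey());
        }
        Collections.sort(due);
        return due.size() > limit ? new ArrayList<>(due.subList(0, limit)) : due;
    }

    // Name a new segment after the moment it was created.
    static String segmentName(LocalDateTime createdAt) {
        return createdAt.format(SEGMENT_NAME);
    }

    // Segments older than maxAgeDays are considered obsolete and can be deleted.
    static List<String> expiredSegments(List<String> segmentNames, LocalDateTime now, long maxAgeDays) {
        LocalDateTime cutoff = now.minusDays(maxAgeDays);
        List<String> expired = new ArrayList<>();
        for (String name : segmentNames) {
            if (LocalDateTime.parse(name, SEGMENT_NAME).isBefore(cutoff)) expired.add(name);
        }
        return expired;
    }
}
```

Because the name itself carries the creation time, deciding which segments to delete needs no extra metadata: parsing the folder name is enough.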

The index covers all pages the crawler has fetched and is obtained by merging the indexes of all the individual segments. Nutch uses Lucene for indexing, so the Lucene interfaces for operating on an index work equally well on a Nutch index. Note, however, that a segment in Lucene is not the same as a segment in Nutch: a Lucene segment is part of the index, whereas a Nutch segment only holds the content and index of one portion of the pages in WebDB. Once the final index has been generated from them, it no longer depends on these segments.

The Web database, also known as WebDB, stores the link structure between the pages that have been crawled. It is used only by the crawler and has nothing to do with the searcher. WebDB stores two kinds of entities: pages and links. A page entity represents an actual web page by describing its characteristics. Because there are many pages to describe, WebDB indexes page entities in two ways: by the page's URL and by the MD5 of the page's content. The characteristics recorded for a page mainly include the number of links in the page, the time the page was fetched and other fetch-related information, and an importance score for the page. Similarly, a link entity describes the link between two page entities. Together they form the link graph of the crawled pages, in which page entities are the nodes and link entities are the edges.
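As a rough illustration of this data model, the following standard-library sketch keeps page entities in two lookup tables (by URL and by content MD5) and link entities as graph edges. The class, field, and method names are hypothetical, and the in-memory maps are a simplification; Nutch's actual on-disk WebDB format is not shown here.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of pages (nodes) and links (edges).
class WebDbSketch {

    static final class Page {
        final String url;
        final String contentMd5;
        final long fetchedAtMillis;  // when the page was fetched
        final double score;          // importance rating of the page
        int outlinkCount;            // number of links found in the page

        Page(String url, String contentMd5, long fetchedAtMillis, double score) {
            this.url = url;
            this.contentMd5 = contentMd5;
            this.fetchedAtMillis = fetchedAtMillis;
            this.score = score;
        }
    }

    static final class Link {
        final String fromUrl;
        final String toUrl;

        Link(String fromUrl, String toUrl) {
            this.fromUrl = fromUrl;
            this.toUrl = toUrl;
        }
    }

    // The two indexing methods described above: by URL and by content MD5.
    private final Map<String, Page> pagesByUrl = new HashMap<>();
    private final Map<String, Page> pagesByMd5 = new HashMap<>();
    private final List<Link> links = new ArrayList<>();

    // Hex-encoded MD5 of the page content, zero-padded to 32 characters.
    static String md5Hex(String content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    Page addPage(String url, String content, long fetchedAtMillis, double score) {
        Page page = new Page(url, md5Hex(content), fetchedAtMillis, score);
        pagesByUrl.put(url, page);
        pagesByMd5.put(page.contentMd5, page);
        return page;
    }

    // Adds an edge and updates the link count of the source page.
    void addLink(String fromUrl, String toUrl) {
        Page from = pagesByUrl.get(fromUrl);
        if (from == null) throw new IllegalArgumentException("unknown page: " + fromUrl);
        links.add(new Link(fromUrl, toUrl));
        from.outlinkCount++;
    }

    Page pageByUrl(String url) { return pagesByUrl.get(url); }

    Page pageByMd5(String md5) { return pagesByMd5.get(md5); }

    // Targets of all edges leaving the given page.
    List<String> outlinks(String fromUrl) {
        List<String> targets = new ArrayList<>();
        for (Link link : links) {
            if (link.fromUrl.equals(fromUrl)) targets.add(link.toUrl);
        }
        return targets;
    }
}
```

In this simplified version, two pages with identical content share one MD5 entry, so the MD5 lookup returns whichever of them was added last.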
