Things About Search: Lucene in Detail (2): Components of a Lucene Search Application

Source: Internet
Author: User


Before working with Lucene, it helps to understand the overall component structure of a search application. We start with a simple overview of the whole and then work through each part in turn. Many beginners think Lucene is a complete search application. This is a misunderstanding: Lucene is only the core indexing and search module of a search application. As mentioned earlier, Lucene covers two processes, indexing and searching, which together involve index creation, the index itself, and search. Let's take a closer look at the components and workflow of a Lucene-based application:

[Figure: components and workflow of a Lucene-based search application]

Next, let's look at the two most important components of a Lucene-based search application.

I. Index Components

An index lets you quickly locate specific information in a body of data. In database terms, an index is a structure that sorts the values of one or more columns of a table. It is a separate, physical data structure: a collection of the values in one or more columns, together with a list of logical pointers to the data pages that physically store those values. Imagine looking for records in a file without an index. The simplest approach is to scan the records one by one. With a small amount of data this is not a problem, but when the data reaches millions or tens of millions of records, the search time becomes unacceptable. To search text with Lucene, you must first build an index over the text, which converts the content into a format that can be searched quickly and avoids the inefficiency of a slow sequential scan. You can think of an index as a data structure that provides random access to the words in the text. Next, let's walk through the entire indexing process.
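
To make the difference concrete, here is a minimal Python sketch that answers the same question two ways: by scanning every record in sequence, and by looking the word up in a prebuilt structure commonly called an inverted index. It is only an illustration of the idea, not how Lucene stores its index; the sample texts and the function names are made up for this example.

```python
from collections import defaultdict

# Hypothetical sample data: three tiny "files", identified by their position.
docs = [
    "lucene is a search library",
    "an index makes search fast",
    "sequential scan reads every record",
]

def sequential_scan(docs, word):
    """Check every document in turn: the cost grows with the amount of text."""
    return [doc_id for doc_id, text in enumerate(docs) if word in text.split()]

def build_inverted_index(docs):
    """Map each word to the sorted ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def index_lookup(index, word):
    """A single dictionary lookup, however many documents were indexed."""
    return index.get(word, [])

index = build_inverted_index(docs)
print(sequential_scan(docs, "search"))  # [0, 1]
print(index_lookup(index, "search"))    # [0, 1]
```

Both calls return the same documents. The index costs time and space to build once, and in exchange every later lookup avoids rereading the text.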

1. Get content

Lucene is a core search library and does not provide any tools or components for obtaining content; developers must supply their own programs for this. This step typically uses a web crawler or spider to find and delimit the content to be indexed. Data sources may also include databases, distributed file systems, and local XML files. A large number of open-source projects can help with this step, for example: Solr, a Lucene subproject; Nutch, an Apache project that includes a large-scale crawler for fetching and parsing website data; Grub, a popular open-source web crawler; Heritrix, the Internet Archive's open-source web crawler; and Aperture, which supports crawling websites, file systems, and mailboxes, and parsing and indexing their text data.

After obtaining the content, the next step is to turn it into small units of data, called documents.

2. Create a document

After obtaining the raw content, you must convert it into the units used by the search engine, called documents. A document mainly consists of several fields with values, such as title, body, abstract, author, and link. If a particular document or field is especially important, you can also give it a higher weight. Once the document structure is designed, you need to extract the text from the raw content and write it into each document. Document filters are useful in this step; open-source projects such as Tika handle document filtering well. If the raw content is stored in a database, some projects connect the content acquisition and document creation steps seamlessly, which makes it easy to index and search database tables. Examples include DBSight, Hibernate Search, LuSQL, Compass, and Oracle/Lucene integration projects.
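
As a rough picture of what "a document made of fields with values and optional weights" means, here is a small Python sketch. The classes `ToyField` and `ToyDocument`, the `make_document` helper, the sample record, and the choice to weight the title twice as heavily are all assumptions made for illustration; they are not Lucene classes.

```python
from dataclasses import dataclass, field

@dataclass
class ToyField:
    name: str
    value: str
    boost: float = 1.0  # hypothetical weight; 1.0 means no extra importance

@dataclass
class ToyDocument:
    fields: list = field(default_factory=list)

    def add(self, name, value, boost=1.0):
        self.fields.append(ToyField(name, value, boost))

    def get(self, name):
        """Return the value of the first field with this name, or None."""
        for f in self.fields:
            if f.name == name:
                return f.value
        return None

def make_document(raw):
    """Copy the known fields of one raw record (a dict) into a document."""
    doc = ToyDocument()
    for name in ("title", "body", "abstract", "author", "link"):
        if name in raw:
            # Assumption for illustration: the title counts twice as much.
            doc.add(name, raw[name], boost=2.0 if name == "title" else 1.0)
    return doc

raw = {
    "title": "Lucene search program components",
    "body": "Lucene is the core indexing and search module.",
    "author": "User",
    "link": "http://example.com/lucene-components",
}
doc = make_document(raw)
print(doc.get("title"))  # Lucene search program components
```

The record has no abstract, so that field is simply skipped; documents in the same index do not all need the same fields.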

3. Document Analysis

Search engines cannot index text directly: the text must first be split into a series of independent atomic elements called tokens. Each token roughly corresponds to a word in the language. This step determines how the text fields of a document are split into a sequence of tokens. Lucene provides a large number of built-in analyzers that make this step easy to control.
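
The sketch below shows the general idea of splitting text into vocabulary units (tokens) in Python: lowercase the text, cut it at anything that is not a letter or digit, and drop a few very common words. The `analyze` function and the stop-word list are illustrative assumptions, far simpler than Lucene's built-in analyzers.

```python
import re

# Illustrative stop-word list, not the one any Lucene analyzer uses.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}

def analyze(text):
    """Split text into lowercase tokens and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

print(analyze("The Search Engine is a tool to FIND documents."))
# ['search', 'engine', 'tool', 'find', 'documents']
```

The same analysis has to be applied to the user's query later on, otherwise the query words will not match the tokens stored in the index.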

4. Document Index

In this step, the document is added to the index. Lucene provides a powerful API for this; you only need to call a few methods to index a document.

To provide a good search experience, the index must be built with care: when designing and customizing the indexing program, focus on how it will improve the user's search experience.
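
To give a feel for what adding a document to an index involves, here is a toy in-memory index writer in Python. It assigns each document an id and records the positions at which every token occurs. `ToyIndexWriter` and its `add_document` method are hypothetical names for this sketch; Lucene's real index is a set of on-disk files with a far richer format.

```python
from collections import defaultdict

class ToyIndexWriter:
    """Hypothetical in-memory index: term -> {doc_id: [positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add_document(self, tokens):
        """Assign the next document id and record where each token occurs."""
        doc_id = self.doc_count
        self.doc_count += 1
        for position, term in enumerate(tokens):
            self.postings[term].setdefault(doc_id, []).append(position)
        return doc_id

writer = ToyIndexWriter()
writer.add_document(["lucene", "search", "library"])
writer.add_document(["search", "engine", "search", "quality"])

print(dict(writer.postings["search"]))  # {0: [1], 1: [0, 2]}
```

Storing positions as well as document ids is what later makes phrase matching possible, and the number of positions per document doubles as a term frequency.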

II. Search Components

The search component takes the search phrase the user enters, segments it into words, and looks those words up in the index to find the documents that contain them. Search quality is measured by precision and recall. The details of searching are fairly complex and are one of the main topics we will cover in later posts on Lucene. Search speed and the ability to handle large volumes of data are particularly important in search technology. The search component mainly includes the following:

1. User search interface: the page the user interacts with, that is, what is displayed in the browser. The main concern here is the page's UI design; a good UI design is an important part of attracting users.

2. Create a query: the user enters a search phrase, which is submitted to the backend server through an ordinary HTML form or an Ajax request. The phrase is then passed to the search engine in the backend. This is a simple query process.

3. Search and query: query the index and return the documents that match the query terms, then sort the results as the query request specifies. The search and query component covers most of the complex parts of a search engine.

The following are common search theory models:

[Figure: common search theory models]

4. Display results: a front-end page that presents the results to the user, similar to the search interface in item 1. For a search engine, user experience always comes first, and the front-end display plays an important role in it.
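
The steps above, together with the precision and recall measures mentioned earlier, can be sketched in a few lines of Python. The example assumes a tiny hand-written inverted index, a query where every word must match, and a hand-labelled set of relevant documents; the function names and all the data are hypothetical.

```python
def search_all_terms(index, query):
    """Return the ids of documents containing every word of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], ()))
    for term in terms[1:]:
        result &= set(index.get(term, ()))
    return result

def precision_recall(returned, relevant):
    """Precision: share of returned documents that are relevant.
    Recall: share of relevant documents that were returned."""
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical inverted index: word -> ids of the documents containing it.
index = {
    "lucene": {0, 1, 2, 3},
    "search": {0, 1, 2, 3, 5},
    "index": {1, 3, 4},
}

returned = search_all_terms(index, "Lucene search")
relevant = {0, 1, 2, 4, 5, 6}  # hand-labelled for illustration
precision, recall = precision_recall(returned, relevant)
print(sorted(returned), precision, recall)  # [0, 1, 2, 3] 0.75 0.5
```

Here three of the four returned documents are relevant (precision 0.75), but only three of the six relevant documents were found (recall 0.5). Loosening the query usually raises recall at the cost of precision.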

OK, the above covers the two important components of a search application. This is only a brief introduction; we will cover them in detail in future blog posts. Finally, let's take a brief look at the APIs Lucene provides for these two components.

[Figure: Lucene APIs used by the indexing and search components]

The figure above, briefly explained:

1. A document to be indexed is represented by a Document object.

2. IndexWriter adds the document to the index through the addDocument method to create the index.

3. Lucene indexes are inverted indexes.

4. When a user submits a search request, a Query object represents the user's query.

5. IndexSearcher searches the Lucene index through the search method.

6. IndexSearcher calculates the Term Weight and Score and returns the result to the user.

7. The collection of documents returned to the user is represented by TopDocsCollector.
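
To illustrate what "calculating term weight and score" and "returning the top documents" can look like, here is a basic TF-IDF ranking sketch in Python. It is a simplified stand-in chosen for illustration: the formula, the `score_documents` function, and the sample documents are assumptions, and Lucene's own scoring implementations are more elaborate.

```python
import math
from collections import Counter

def score_documents(docs, query_terms, top_n=2):
    """Rank tokenized documents by a basic TF-IDF sum and keep the top_n."""
    doc_count = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in docs for term in set(tokens))
    scored = []
    for doc_id, tokens in enumerate(docs):
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term]:
                # Rare terms get a higher weight than common ones.
                idf = math.log(doc_count / df[term]) + 1.0
                score += tf[term] * idf
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties are broken by the lower document id.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]

docs = [
    ["lucene", "index", "search"],
    ["search", "search", "engine"],
    ["crawler", "fetches", "pages"],
    ["lucene", "lucene", "search"],
]
for doc_id, score in score_documents(docs, ["lucene", "search"]):
    print(doc_id, round(score, 3))
```

With this data, document 3 ranks first because it contains the rarer term "lucene" twice, followed by document 0, which contains each query term once.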


This article is from the "Cao Sheng Huan" blog. Please be sure to keep this source: http://javacsh.blog.51cto.com/3545281/1309131
