Those Things About Search: A Detailed Look at Lucene (2): Lucene Search Program Components

Last Update:2013-12-29 Source: Internet

Author: User


Before studying a search library like Lucene in depth, it helps to understand the overall structure of a search program. We will first get a simple picture of the whole and then work through the parts one by one. Many beginners think Lucene is a complete search program. This is a misunderstanding: Lucene is only the core indexing and search module of a search program. As mentioned earlier, working with Lucene involves two processes: indexing (creating the index) and searching. Let's take a closer look at the components and workflow of Lucene:

[Figure: components and workflow of a Lucene search program]

Next, let's take a look at the two most important components of Lucene.

I. Index Components

An index lets you reach specific information in your data quickly. In a database, an index is a separate, physical data structure that sorts the values of one or more columns of a table: it holds those values together with a list of logical pointers to the data pages where the corresponding records are stored. Imagine looking for records in a file without an index. The simplest approach is to scan the records one by one. With a small amount of data this is no problem, but once the data reaches millions or tens of millions of records, you can imagine how long a search takes. To use an index in Lucene, you first build it over your text, converting the content into a format that can be searched quickly. This removes the inefficiency of slow sequential scans. You can think of an index as a data structure that provides random access to the words stored in the text. Next, let's walk through the entire indexing process.
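The difference between a sequential scan and an index lookup can be sketched with plain Java collections. This is only an illustration of the idea, not Lucene's implementation or API: the class name `ScanVersusIndex` and its methods are hypothetical, and the word splitting is deliberately naive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene.
class ScanVersusIndex {

    // Sequential scan: every search reads every document in full.
    static List<Integer> scan(List<String> docs, String term) {
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String word : docs.get(id).toLowerCase().split("\\W+")) {
                if (word.equals(term)) {
                    hits.add(id);
                    break; // one match per document is enough
                }
            }
        }
        return hits;
    }

    // Index build: done once, maps each word to the sorted ids of the documents containing it.
    static Map<String, TreeSet<Integer>> buildIndex(List<String> docs) {
        Map<String, TreeSet<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String word : docs.get(id).toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    index.computeIfAbsent(word, k -> new TreeSet<>()).add(id);
                }
            }
        }
        return index;
    }

    // Index lookup: a single map lookup; no document text is read at search time.
    static List<Integer> lookup(Map<String, TreeSet<Integer>> index, String term) {
        return new ArrayList<>(index.getOrDefault(term, new TreeSet<>()));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Lucene is a search library",
                "An index makes search fast",
                "Sequential scans are slow");
        Map<String, TreeSet<Integer>> index = buildIndex(docs);
        System.out.println(scan(docs, "search"));    // [0, 1]
        System.out.println(lookup(index, "search")); // [0, 1]
    }
}
```

Both calls return the same documents. The scan repeats its work on every search, while the index pays the cost once at build time and answers each later search with one lookup.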

1. Get content

Lucene is a core search library and does not provide tools or components for obtaining content; developers must supply this with their own programs. This step uses a web crawler or spider to find and delimit the content to be indexed. Data sources may also include databases, distributed file systems, and local XML files. A large number of open-source projects can fill this role, such as Solr, a Lucene subproject; Nutch, an Apache project that includes a large-scale crawler for fetching and distinguishing website data; Grub, a popular open-source web crawler; Heritrix, an open-source Internet document crawler; and Aperture, which supports crawling websites, file systems, and mailboxes, and parsing and indexing their text data.

Once the content has been obtained, the next step is to turn it into small chunks of data, namely documents.

2. Create a document

After obtaining the raw content, you must convert it into the units that the search engine works with: documents. A document mainly consists of several fields with values, such as title, body, abstract, author, and link. If a particular document or field is more important, you can also give it a weight. Once the document structure is designed, you need to extract the text from the raw content and write it into each document. Document filters help in this step; open-source projects such as Tika implement document filtering well. If the raw content is stored in a database, some projects connect the content acquisition and document creation steps seamlessly, making it easy to index and search database tables, for example DBSight, Hibernate Search, LuSQL, Compass, and the Oracle/Lucene integration project.
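As a rough picture of "a document is a set of named fields, some weighted more than others", here is a minimal sketch using only the Java standard library. The names `DocumentSketch`, `FieldSketch`, and `boostOf` are hypothetical and chosen for illustration; Lucene has its own Document and Field classes, whose API is not reproduced here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not Lucene's Document.
class DocumentSketch {

    // One field: its text value plus a weight (1.0 means no extra importance).
    static final class FieldSketch {
        final String value;
        final float boost;

        FieldSketch(String value, float boost) {
            this.value = value;
            this.boost = boost;
        }
    }

    // LinkedHashMap keeps fields in the order they were added.
    private final Map<String, FieldSketch> fields = new LinkedHashMap<>();

    DocumentSketch add(String name, String value) {
        return add(name, value, 1.0f);
    }

    DocumentSketch add(String name, String value, float boost) {
        fields.put(name, new FieldSketch(value, boost));
        return this; // allow chained calls
    }

    String get(String name) {
        FieldSketch f = fields.get(name);
        return f == null ? null : f.value;
    }

    // Missing fields report the neutral weight 1.0.
    float boostOf(String name) {
        FieldSketch f = fields.get(name);
        return f == null ? 1.0f : f.boost;
    }

    List<String> fieldNames() {
        return new ArrayList<>(fields.keySet());
    }

    public static void main(String[] args) {
        DocumentSketch doc = new DocumentSketch()
                .add("title", "Lucene search program components", 2.0f) // title weighted higher
                .add("author", "User")
                .add("body", "Lucene is the core index and search module.");
        System.out.println(doc.fieldNames());     // [title, author, body]
        System.out.println(doc.boostOf("title")); // 2.0
    }
}
```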

3. Document Analysis

A search engine cannot index text directly: the text must first be split into a series of independent atomic elements called tokens. Each token roughly corresponds to a word in the language. This step determines how the text fields of a document are split into a sequence of tokens. Lucene provides a large number of built-in analyzers that make this step easy to control.
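What an analyzer does can be sketched in a few lines of plain Java: split the text, normalize case, and drop very common words. This is an assumption-laden toy, not one of Lucene's built-in analyzers; the class `AnalyzerSketch` and its stop-word list are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not a Lucene analyzer.
class AnalyzerSketch {

    // A tiny, arbitrary stop-word list chosen for this example.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and"));

    // Splits on anything that is not a letter or digit, lowercases, removes stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick brown fox, and the lazy dog!"));
        // [quick, brown, fox, lazy, dog]
    }
}
```

Splitting on non-letter characters only works for languages that separate words with spaces or punctuation, which is one reason a real library offers many analyzers rather than one.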

4. Document Index

In this step the document is added to the index. Lucene provides a powerful API for this: you only need to call the methods it provides to index a document.

To provide a good user experience, indexing must be handled well: when designing and customizing an indexing program, focus on how it improves the user's search experience.

II. Search Components

The search component takes the search phrase a user enters, splits it into words, and looks those words up in the index to find the documents that contain them. Search quality is measured by precision and recall. The details of search are fairly complex and are one of the main topics of later posts on Lucene. Search speed and the ability to search large volumes of data are especially important in search technology. The search component mainly includes the following:

1. User search interface: the page that interacts with the user, that is, what is seen in the browser. The main concern here is UI design. A good UI is an important part of attracting users.

2. Create a query: the user enters the phrase to search for and submits it to the backend server through an ordinary HTML form or an Ajax request. The words are then passed to the search engine in the backend. This is a simple query process.

3. Search and query: query the index and return the documents that match the query terms, then sort the results as the query requests. This component covers most of the complex content in a search engine.

The following are common search theory models:

[Figure: common search theory models]

4. Display results: present the results to the user. Like the search interface in item 1, this is a front-end page that interacts with the user. For a search engine, user experience always comes first, and the front-end display plays an important role in it.
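Precision and recall, the two quality measures mentioned at the start of this section, can be made concrete with a small calculation. The sketch below assumes you already know which documents are truly relevant for a query, which in practice requires human judgment. The class `SearchQuality` and the document ids are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class for the two quality measures.
class SearchQuality {

    // Precision: the fraction of returned documents that are relevant.
    static double precision(Set<Integer> returned, Set<Integer> relevant) {
        if (returned.isEmpty()) {
            return 0.0;
        }
        return (double) overlap(returned, relevant) / returned.size();
    }

    // Recall: the fraction of relevant documents that were returned.
    static double recall(Set<Integer> returned, Set<Integer> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) overlap(returned, relevant) / relevant.size();
    }

    // Number of document ids present in both sets.
    private static int overlap(Set<Integer> a, Set<Integer> b) {
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    public static void main(String[] args) {
        Set<Integer> returned = new HashSet<>(Arrays.asList(1, 2, 3, 4));    // what the search returned
        Set<Integer> relevant = new HashSet<>(Arrays.asList(2, 4, 5, 6, 7)); // what it should have found
        System.out.println(precision(returned, relevant)); // 0.5 (2 of 4 returned are relevant)
        System.out.println(recall(returned, relevant));    // 0.4 (2 of 5 relevant were returned)
    }
}
```

The two measures pull against each other: returning more documents tends to raise recall and lower precision, so a search program has to balance them.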

The above describes the two important components of a search program. This is only a brief introduction; we will cover them in detail in later blog posts. Finally, let's take a brief look at the APIs that Lucene provides for these two components.

[Figure: Lucene APIs used by the index and search components]

The figure above shows, in brief:

1. The document to be indexed is represented by a Document object.

2. IndexWriter adds the document to the index through the addDocument method, which builds the index.

3. Lucene indexes are inverted indexes.

4. When a user submits a search request, a Query object represents the user's query.

5. IndexSearcher searches the Lucene index through the search method.

6. IndexSearcher calculates the term weights and scores and returns the results to the user.

7. The collection of documents returned to the user is gathered by a TopDocsCollector.
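To give a feel for what "calculating term weight and score" in item 6 can mean, here is a sketch of one classic weighting scheme, TF-IDF, for a single-term query over already tokenized documents. This is an assumption made for illustration and is not Lucene's actual scoring formula; the class `TfIdfSketch`, its methods, and the smoothing in `idf` are all hypothetical choices.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not Lucene's scoring code.
class TfIdfSketch {

    // Term frequency: how often the term occurs in one tokenized document.
    static int tf(List<String> doc, String term) {
        int count = 0;
        for (String token : doc) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Inverse document frequency: terms found in fewer documents get a higher weight.
    // The +1 terms are a smoothing choice made for this sketch.
    static double idf(List<List<String>> docs, String term) {
        int df = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) {
                df++;
            }
        }
        return Math.log((double) (docs.size() + 1) / (df + 1)) + 1.0;
    }

    static double score(List<List<String>> docs, int docId, String term) {
        return tf(docs.get(docId), term) * idf(docs, term);
    }

    // Returns the ids of documents with a positive score, best first.
    static List<Integer> rank(List<List<String>> docs, String term) {
        List<Integer> ids = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            if (score(docs, id, term) > 0) {
                ids.add(id);
            }
        }
        ids.sort((a, b) -> Double.compare(score(docs, b, term), score(docs, a, term)));
        return ids;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("lucene", "index", "search"),
                Arrays.asList("search", "search", "engine"),
                Arrays.asList("web", "crawler"));
        System.out.println(rank(docs, "search")); // [1, 0]
    }
}
```

Document 1 ranks first because the query term occurs twice in it, and document 2 is left out because it does not contain the term at all.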

This article is from the "Cao Sheng Huan" blog; please be sure to retain this source link: http://javacsh.blog.51cto.com/3545281/1309131

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
