Design class Diagram

Source: Internet
Author: User

1. Search class diagram

2. Index efficiency of Lucene

Books often carry a keyword index at the back (for example: Beijing: pages 12, 34; Shanghai: pages 3, 77; ...), which helps the reader find the page of the relevant content quickly. A database index speeds up queries on the same principle: looking a term up in the index at the back of a book is many times faster than leafing through the content page by page. Another reason an index is efficient is that it is sorted. For a retrieval system, the core problem is a sorting problem.
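As a rough illustration of why a sorted index is fast, the sketch below looks up a keyword by binary search instead of scanning every page. The data and the names book_index and lookup are made up for this example; it does not show how Lucene or any database stores its index.

```python
from bisect import bisect_left

# Hypothetical back-of-book index: (keyword, page numbers), kept sorted by keyword.
book_index = [
    ("Beijing", [12, 34]),
    ("Shanghai", [3, 77]),
]
keywords = [keyword for keyword, _ in book_index]

def lookup(keyword):
    """Binary-search the sorted keyword list: O(log n) rather than a full scan."""
    i = bisect_left(keywords, keyword)
    if i < len(keywords) and keywords[i] == keyword:
        return book_index[i][1]
    return []

pages = lookup("Shanghai")  # [3, 77]
```

Because the keyword list is sorted, each comparison halves the remaining search range, which is the property the paragraph above relies on.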

Database indexes are not designed for full-text indexing, so they do not help when you query with LIKE "%keyword%". Such a query degenerates into a page-by-page traversal, which is why LIKE does enormous harm to the performance of database services that rely on fuzzy queries. Fuzzy matching on several keywords, such as LIKE "%keyword1%" AND LIKE "%keyword2%" ..., is correspondingly worse. The key to an efficient retrieval system is therefore a reverse (inverted) indexing mechanism, similar to a scientific and technical index: while the data sources (for example, a set of articles) are stored in order, a separate sorted keyword list stores the keyword ==> article mapping. Each entry has the form [keyword ==> numbers of the articles in which the keyword appears, number of occurrences (possibly also positions: start offset, end offset), frequency]. Retrieval then turns a fuzzy query into a logical combination of several exact queries that can take advantage of the index. This greatly improves the efficiency of multi-keyword queries, so the problem of full-text retrieval ultimately comes down to a sorting problem.
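The mapping described above can be sketched in a few lines. This is a minimal in-memory illustration, not Lucene's actual index format: the tokenizer (split on non-alphanumerics), the sample documents, and the function names are all assumptions made for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase alphanumeric runs, returned as (term, start_offset, end_offset)."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z0-9]+", text)]

def build_index(docs):
    """Map term -> {doc_id: [(start, end), ...]}; frequency is the list length."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, start, end in tokenize(text):
            index[term].setdefault(doc_id, []).append((start, end))
    return index

def search_all(index, *terms):
    """AND query: intersect the document sets of several exact term lookups."""
    postings = [set(index.get(term.lower(), {})) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical sample articles.
docs = {
    1: "Lucene builds an inverted index",
    2: "A database index is sorted",
    3: "Lucene is not a database",
}
index = build_index(docs)
hits = search_all(index, "lucene", "database")  # [3]
```

Each keyword costs one exact lookup, and the multi-keyword query is a set intersection, so no article text is scanned at query time.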

It can be seen that, compared with exact queries, fuzzy queries are a highly uncertain problem for a database, which is why most databases offer only limited support for full-text retrieval. Lucene's central feature is that it implements, through a special index structure, the full-text indexing that traditional databases do not excel at, and provides extension interfaces so that it can be customized for different applications.

The following table compares Lucene with a database's fuzzy query:

Index
  Lucene full-text indexing engine: Builds a reverse index over all the data from the data source.
  Database: For LIKE queries the traditional index is not used at all. The data has to be traversed record by record for grep-style fuzzy matching, which is several orders of magnitude slower than an indexed search.

Match effect
  Lucene full-text indexing engine: Matches by term. Implementations of the language analysis interface can add support for Chinese and other non-English languages.
  Database: LIKE "%net%" also matches "netherlands". Multi-keyword fuzzy matching with LIKE "%com%net%" cannot match the reverse order, xxx.net..xxx.com.

Match degree
  Lucene full-text indexing engine: Has a matching-degree (similarity) algorithm; results with a higher matching degree are ranked first.
  Database: No control over matching degree: a record in which "net" appears 5 times and a record in which it appears once are treated the same.

Result output
  Lucene full-text indexing engine: A special algorithm outputs the 100 best-matching results first; the result set is read in small buffered batches.
  Database: Returns the entire result set; when there are very many matching entries (for example, tens of thousands), a large amount of memory is needed to hold the temporary result set.

Customizable
  Lucene full-text indexing engine: Different implementations of the language analysis interface make it easy to customize indexing rules to the application's needs (including support for Chinese).
  Database: No interface, or the interface is complex; cannot be customized.

Conclusion
  Lucene full-text indexing engine: Suited to high-load fuzzy query applications, complex fuzzy query rules, and large volumes of indexed data.
  Database: Suited to low usage, simple fuzzy matching rules, or small amounts of data that need fuzzy queries.
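The "match effect", "match degree" and "result output" rows can be made concrete with a small sketch. It contrasts a substring test (what LIKE "%net%" does) with term matching, then ranks records and keeps only the best n. Plain term frequency is used as the score purely for illustration; it is an assumption of this example and not Lucene's similarity formula. The records and function names are also made up.

```python
import heapq
import re

# Hypothetical records.
records = {
    1: "net net net net net",
    2: "the Netherlands",
    3: "a net",
}

def like_match(text, keyword):
    """Emulates LIKE '%keyword%': a case-insensitive substring test."""
    return keyword.lower() in text.lower()

def terms(text):
    """Split text into lowercase terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def top_n(keyword, n):
    """Score records by term frequency and return the n best as (doc_id, score)."""
    keyword = keyword.lower()
    scored = ((terms(text).count(keyword), doc_id)
              for doc_id, text in records.items())
    best = heapq.nlargest(n, (pair for pair in scored if pair[0] > 0))
    return [(doc_id, score) for score, doc_id in best]

like_hits = [doc_id for doc_id, text in records.items() if like_match(text, "net")]
# like_hits == [1, 2, 3]: the substring test also returns "the Netherlands"
ranked = top_n("net", 2)
# ranked == [(1, 5), (3, 1)]: record 2 is excluded and record 1 ranks first
```

The substring test returns every record with no ordering, while the term-based version drops the false match and never materializes more than n results.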

3. Chinese word segmentation mechanism

For Chinese, full-text indexing must first solve a language analysis problem. In English, the words in a sentence are naturally separated by spaces, but in CJK (Chinese, Japanese, Korean) text the characters follow one another with no separator. Since a sentence has to be indexed by word, how to cut those words out is a big problem.

First of all, single characters (unigrams) certainly cannot be used as the index unit; otherwise a search for 上海 (Shanghai) would also match text containing 海上 (at sea). But given the phrase 北京天安门 (Beijing Tiananmen), how should the computer segment it according to Chinese language habits: 北京 / 天安门, or 北 / 京 / 天安门? To segment the way a speaker would, the machine usually needs a fairly rich thesaurus to identify the words in a sentence accurately. Another solution is an automatic segmentation algorithm: cut the text into 2-grams (bigrams), for example 北京天安门 ==> 北京 京天 天安 安门. At query time, whether the query is 北京 or 天安门, the query phrase is segmented by the same rule (北京, or 天安 and 安门), the resulting keywords are combined with "and", and the query maps correctly onto the corresponding index entries. This approach applies equally to other Asian languages such as Korean and Japanese.
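A minimal sketch of the bigram approach described above follows. The sample documents and function names are assumptions for illustration, and the "and" combination here only checks that every bigram occurs somewhere in the document; it does not verify that the bigrams are adjacent.

```python
def bigrams(text):
    """Split a string into overlapping two-character tokens."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

def build_index(docs):
    """Map each bigram to the set of documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in bigrams(text):
            index.setdefault(token, set()).add(doc_id)
    return index

def search(index, query):
    """Segment the query with the same rule and AND the per-token results."""
    sets = [index.get(token, set()) for token in bigrams(query)]
    return sorted(set.intersection(*sets)) if sets else []

# Hypothetical documents.
docs = {1: "北京天安门", 2: "海上日出", 3: "上海外滩"}
index = build_index(docs)

tokens = bigrams("北京天安门")           # 北京, 京天, 天安, 安门
tiananmen_hits = search(index, "天安门")  # [1]
shanghai_hits = search(index, "上海")     # [3]; document 2 (海上) is not matched
```

Because the document and the query pass through the same segmentation rule, no word list is needed at either end.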

The biggest advantages of automatic segmentation are that there is no thesaurus maintenance cost and the implementation is simple; the disadvantage is low index efficiency. For small and medium-sized applications, however, bigram segmentation is sufficient. An index built on bigram segmentation is generally about the same size as the source files, whereas for English the index files are generally only 30%-40% of the original size. The two approaches compare as follows:

Implementation
  Automatic segmentation: Very simple to implement.
  Thesaurus segmentation: Complex to implement.

Query
  Automatic segmentation: Increases the complexity of query analysis.
  Thesaurus segmentation: Suitable for implementing more complex query syntax rules.

Storage efficiency
  Automatic segmentation: Index redundancy is large; the index is almost as large as the original.
  Thesaurus segmentation: High index efficiency; the index is about 30% of the original size.

Maintenance cost
  Automatic segmentation: No thesaurus maintenance cost.
  Thesaurus segmentation: Very high thesaurus maintenance cost: Chinese, Japanese, and Korean each need to be maintained separately, and the thesaurus also has to include word frequency statistics and similar data.

Applicable fields
  Automatic segmentation: Embedded systems (limited resources in the runtime environment); distributed systems (no thesaurus synchronization problem); multilingual environments (no thesaurus maintenance cost).
  Thesaurus segmentation: Professional search engines with high requirements for query and storage efficiency.
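To make the storage-efficiency contrast concrete, the sketch below segments the same phrase both ways. The article does not say which thesaurus-based algorithm it has in mind; forward maximum matching is used here only as one simple, well-known possibility, and the two-word lexicon is a toy assumption.

```python
def fmm_segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon word,
    falling back to a single character when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if size == 1 or word in lexicon:
                tokens.append(word)
                i += size
                break
    return tokens

def bigram_segment(text):
    """Automatic segmentation: overlapping two-character tokens."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

lexicon = {"北京", "天安门"}  # hypothetical toy lexicon
text = "北京天安门"

by_lexicon = fmm_segment(text, lexicon)  # 北京, 天安门 (2 tokens)
by_bigram = bigram_segment(text)         # 北京, 京天, 天安, 安门 (4 tokens)
```

The bigram version emits roughly one token per character, which is where the index redundancy in the table comes from, while the lexicon version emits fewer tokens but depends entirely on the quality of the word list.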
