Design class Diagram

Last Update:2015-06-15 Source: Internet

Author: User

1. Search class diagram

2. Index efficiency of Lucene

Books often include a keyword index at the back (for example: Beijing: pages 12, 34; Shanghai: pages 3, 77; ...), which helps readers find the pages with the relevant content more quickly. A database index speeds up queries on the same principle: looking a term up in the index at the back of a book is many times faster than paging through the content one page at a time. Another reason an index is efficient is that it is sorted. For a retrieval system, the core problem is therefore a sorting problem.

Database indexes are not designed for full-text indexing, so they do not help when you use like "%keyword%". With such a query, the search degenerates into a page-by-page traversal, which is why like is so damaging to the performance of database services that rely on fuzzy queries. If you need to fuzzy-match several keywords at once (like "%keyword1%" and "%keyword2%" ...), the efficiency is easy to imagine. The key to an efficient retrieval system is therefore an inverted index, similar to the index at the back of a technical book. Besides storing the data sources (such as a collection of articles) in order, the system keeps a second, sorted list of keywords that stores the keyword ==> article mapping: [keyword ==> numbers of the articles in which the keyword appears, number of occurrences (even the positions: start offset, end offset), frequency]. Retrieval then turns the fuzzy query into a logical combination of several exact queries, each of which can use the index. This greatly improves the efficiency of multi-keyword queries, so the full-text retrieval problem ultimately comes down to a sorting problem.
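The keyword ==> article mapping can be sketched in a few lines of Python. This is only an illustration of the idea, not Lucene's actual index format: the whitespace tokenizer, the function names, and the sample documents are all assumptions made for the example. Each keyword maps to the documents that contain it and to its positions in each document, so the occurrence count is simply the length of the position list.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real analyzer."""
    return text.lower().split()

def build_index(docs):
    """Map each term to {doc_id: [positions]}; frequency is len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def search_all(index, query):
    """Return sorted ids of documents containing every query term (AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    # Each term is an exact lookup; the fuzzy query becomes a set intersection.
    postings = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*postings))

# Hypothetical sample documents.
docs = {
    1: "lucene builds an inverted index",
    2: "a database index speeds up exact queries",
    3: "lucene queries use the inverted index",
}
index = build_index(docs)
print(search_all(index, "lucene index"))  # [1, 3]
print(index["index"])                     # {1: [4], 2: [2], 3: [5]}
```

A multi-keyword query never scans the documents themselves: it performs one exact lookup per keyword and intersects the results.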

It follows that a fuzzy query is a far less well-defined problem than a database's exact query, which is why most databases offer only limited support for full-text retrieval. Lucene's central feature is a special index structure that implements the full-text indexing that traditional databases are not good at, together with extension interfaces that make it easy to customize for different applications.

The following table compares Lucene's full-text retrieval with a database's fuzzy query:

	Lucene full-text indexing engine	Database
Index	Builds an inverted index over the data from the data source.	A like query makes no use of the traditional index at all. Every record must be traversed and fuzzy-matched grep-style, which is several orders of magnitude slower than an indexed search.
Match effect	Matches on terms; language analysis interface implementations make it possible to support Chinese and other non-English languages.	like "%net%" also matches "Netherlands". Fuzzy matching on multiple keywords with like "%com%net%" cannot match the reverse order, as in xxx.net..xxx.com.
Match degree	Has a matching degree algorithm that places results with a higher matching degree (similarity) first.	No control over matching degree: a record in which "net" appears 5 times and one in which it appears once are treated the same.
Result output	A special algorithm outputs the 100 best-matching results first, and the result set is read in small buffered batches.	Returns the entire result set; when there are very many matching entries (such as tens of thousands), a large amount of memory is needed to hold the temporary result set.
Customizable	Different language analysis interface implementations make it easy to customize indexing rules to the application's needs (including support for Chinese).	No interface, or the interface is complex; cannot be customized.
Conclusion	Suited to heavily loaded fuzzy-query applications with complex fuzzy-query rules and a large volume of indexed data.	Suited to low usage, simple fuzzy-matching rules, or a small amount of data requiring fuzzy queries.
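The two like behaviors in the "Match effect" row can be reproduced with Python's built-in sqlite3 module. This is a small sketch using SQLite as a stand-in database; the table name, column names, and sample rows are invented for the illustration, and other databases may differ in details such as case sensitivity (SQLite's like is case-insensitive for ASCII by default).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO pages (id, body) VALUES (?, ?)",
    [
        (1, "travel guide to the Netherlands"),
        (2, "example.com and example.net"),
        (3, "example.net and example.com"),
        (4, "net net net net net"),
    ],
)

def like_ids(pattern):
    """Return ids of rows whose body matches the LIKE pattern."""
    rows = conn.execute(
        "SELECT id FROM pages WHERE body LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in rows]

# Substring match: "Netherlands" is a hit even though "net" is not a word in it.
substring_hits = like_ids("%net%")
# Ordered wildcard: only "com" followed by "net" matches, not the reverse order.
ordered_hits = like_ids("%com%net%")
print(substring_hits)  # [1, 2, 3, 4]
print(ordered_hits)    # [2]
```

Row 4, where "net" appears five times, is returned on equal terms with rows where it appears once, which is the "Match degree" limitation from the table.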

3. Chinese word segmentation mechanism

For Chinese, full-text indexing must first solve a language analysis problem. In English, the words in a sentence are naturally separated by spaces, but in CJK (Chinese, Japanese, Korean) text the characters run together with no delimiters. To index a sentence by its words, the words must first be cut out of it, and how to do that is a big problem.

First of all, a single character (unigram) certainly cannot be used as the index unit; otherwise a search for "Shanghai" (上海) would also match text containing 海上 ("at sea"). But given a phrase such as 北京天安门 ("Beijing Tiananmen"), how should the computer segment it according to Chinese language habits: as 北京 / 天安门, or as 北 / 京天安门? For the computer to segment text the way a reader would, it usually needs a fairly rich thesaurus so that it can identify the words in a sentence accurately. Another solution is an automatic segmentation algorithm that splits the text into 2-grams (bigrams), for example: 北京天安门 ==> 北京 京天 天安 安门. At query time, whether the query is 北京 or 天安门, the query phrase is split by the same rule (北京, or 天安 安门), the resulting keywords are combined with an "and" relationship, and the query still maps correctly onto the index. This approach works equally well for other Asian languages such as Korean and Japanese.
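The bigram approach described above can be sketched as follows. This is a minimal illustration, not Lucene's analyzer: the function names and the three sample documents are assumptions made for the example. Both the documents and the query go through the same bigrams function, and the query bigrams are combined with AND.

```python
def bigrams(text):
    """Split text into overlapping two-character units."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

def build_index(docs):
    """Map each bigram to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for gram in bigrams(text):
            index.setdefault(gram, set()).add(doc_id)
    return index

def search(index, query):
    """Split the query by the same rule and AND the bigram lookups together."""
    grams = bigrams(query)
    if not grams:
        return []
    postings = [index.get(gram, set()) for gram in grams]
    return sorted(set.intersection(*postings))

# Hypothetical sample documents.
docs = {1: "北京天安门", 2: "上海外滩", 3: "海上日出"}
index = build_index(docs)
print(bigrams("北京天安门"))     # ['北京', '京天', '天安', '安门']
print(search(index, "天安门"))  # [1]
print(search(index, "上海"))    # [2]
```

The query 上海 matches document 2 only, not document 3 (海上日出), which is the false match that single-character indexing would produce. A plain AND of bigrams can still return a document in which the bigrams occur far apart; checking stored positions would rule that out, which this sketch omits.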

The biggest advantages of automatic segmentation are that it has no thesaurus maintenance cost and is simple to implement. Its disadvantage is lower index efficiency, but for small and medium-sized applications bigram segmentation is sufficient. An index built on bigram segmentation is generally about the same size as the source files, whereas for English the index is generally only 30%-40% of the size of the original files. The two approaches differ as follows:

	Automatic segmentation	Thesaurus segmentation
Implementation	Very simple to implement	Complex to implement
Query	Increases the complexity of query analysis	Suited to implementing more complex query syntax rules
Storage efficiency	Index redundancy is large; the index is almost as large as the original text	High index efficiency; the index is about 30% of the original size
Maintenance costs	No thesaurus maintenance cost	Thesaurus maintenance costs are very high: Chinese, Japanese, and Korean each need their own thesaurus, which must also include word frequency statistics and similar data
Applicable fields	Embedded systems: limited runtime resources; distributed systems: no thesaurus synchronization problem; multilingual environments: no thesaurus maintenance cost	Professional search engines with high requirements for query and storage efficiency
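For contrast with automatic segmentation, here is a minimal sketch of thesaurus-based segmentation. The article does not say which algorithm a thesaurus-based segmenter uses; this example assumes greedy forward maximum matching, one common and simple choice, and the three-word vocabulary is invented for the illustration. A real thesaurus would be far larger and, as the table notes, would also carry word frequency statistics.

```python
def segment(text, vocabulary):
    """Greedy forward maximum matching: take the longest known word at each position."""
    max_len = max((len(word) for word in vocabulary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a known word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so unknown text still advances.
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical toy thesaurus.
vocabulary = {"北京", "天安门", "上海"}
print(segment("北京天安门", vocabulary))  # ['北京', '天安门']
```

The output contains two index terms where bigram segmentation produces four, which illustrates the smaller index in the "Storage efficiency" row; the price is that the result depends entirely on the quality of the vocabulary.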

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
