Learn from Lucene


I. Introduction to Lucene

(Pronounced ['lusen].) I often misread the name as the Chinese 鹿神, "deer god": what a mighty-sounding deer! Lucene is an excellent open-source full-text search engine, on which we can develop all kinds of full-text search applications. Lucene is well known abroad and is now a top-level Apache project.

II. Overview of inverted index principles

Lucene is a high-performance Java full-text retrieval toolkit that uses an inverted-file index structure. I will not explain the algorithm in theory; I will explain it with an example. If you read the example carefully and really understand the idea, you will certainly understand the basic principle of Lucene's index! Remember: understanding! Express the example in your own language, that is, translate it into your own terms. Then, when you want to write it yourself, translate it once more into a computer language!
Lucene's structure, and the algorithm that produces it, are as follows.
Suppose we have two articles:
Article 1: "Tom lives in Guangzhou, I live in Guangzhou too."
Article 2: "He once lived in Shanghai."

1) Full-text analysis:

Because Lucene indexes and queries by keyword, we first need to extract the keywords of the two articles. Generally, the following steps are taken:
A. The article content we have is just a string, so first we must find all the words in the string, that is, tokenize it. English is easy to process because words are separated by spaces; Chinese characters run together and need special word segmentation (dictionary-based methods are the common, more accurate approach for Chinese).
B. Words such as "in", "once", and "too" carry no real retrieval meaning; in Chinese, a word such as "是" ("is") likewise usually has no specific meaning. Such words, which do not represent concepts, can be filtered out: this is stop-word filtering during analysis.
C. A user querying "he" usually also wants articles containing "He", so all words should be normalized to lower case. This is extra processing during analysis (you can add more such processing as needed).
D. A user querying "live" usually also wants articles containing "lives" and "lived", so "lives" and "lived" should be reduced to "live". This stemming step is a further optimization (more humane and friendly).
E. Punctuation marks in an article do not represent any concept and can likewise be filtered out.
In Lucene, all of the steps described above are carried out by the Analyzer class! Once you understand the process, you can follow the relevant steps through the source code of whichever feature interests you, or extend the functionality further.

After the above processing:
All keywords of article 1: [tom] [live] [guangzhou] [live] [guangzhou]
All keywords of article 2: [he] [live] [shanghai]
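As a concrete illustration, here is a minimal sketch that runs one of Lucene's stock analyzers over article 1 and prints the resulting tokens. It assumes a recent Lucene version (class locations and stop-word defaults vary across releases), so treat it as a demonstration of the analysis steps above rather than as the article's original code:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {
    public static void main(String[] args) throws IOException {
        // StandardAnalyzer tokenizes and lower-cases (step C). Depending on
        // the Lucene version and the configured stop-word set, words like
        // "in" and "too" may also be filtered (step B). Stemming (step D)
        // would need an extra filter such as PorterStemFilter.
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream ts = analyzer.tokenStream("content",
                     "Tom lives in Guangzhou, I live in Guangzhou too.")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term);
            }
            ts.end();
        }
    }
}
```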

2) Inverted index:

With the keywords in hand, we can build the inverted index.
The relation produced above maps "article number" to "all keywords in the article".
That direction matches our normal habit of thinking. The index reverses it, which requires thinking in reverse; that is why the algorithm is called "inverted". The inverted index flips the relation into: "keyword" to "all document numbers containing that keyword". Articles 1 and 2 convert to:
Keyword      Document No.
guangzhou    1
he           2
i            1
live         1, 2
shanghai     2
tom          1
To extend the basic inverted index above: usually, knowing only which articles a keyword appears in is not enough. We also want to know how many times, and at which positions, the keyword occurs in each article. Two kinds of position are common: a) character position, i.e. recording the keyword's character offset within the article (advantage: fast keyword location when highlighting); b) keyword position, i.e. recording the keyword as the n-th word of the article (advantage: smaller index, and fast phrase queries). Lucene records the latter.

After the "occurrence frequency" and "position" information is added, our index structure is enriched into:

Keyword      Article No. [Frequency]   Position
guangzhou    1 [2]                     3, 6
he           2 [1]                     1
i            1 [1]                     4
live         1 [2]                     2, 5
             2 [1]                     2
shanghai     2 [1]                     3
tom          1 [1]                     1

Take the "live" rows as an example of how to read this structure:
live appears twice in article 1 and once in article 2. What does the position list "2, 5, 2" mean? It has to be read together with the document numbers and frequencies. Article 1 has two occurrences, so "2, 5" are the two positions of live in article 1; article 2 has one occurrence, so the remaining "2" means live is the 2nd keyword of article 2.
The above is the core of the Lucene index structure. Easy to describe, isn't it? Yes, it really is that simple! Algorithms are all made by people. As Chairman Mao said, there are always more solutions than difficulties, and computer algorithms are no exception. Why fear them? They are easy!
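To make the structure concrete, here is a toy in-memory inverted index in plain Java. It is only a sketch of the keyword => document => positions idea from this section; it uses a crude regex tokenizer instead of a real analyzer, so without stemming or stop-word removal its positions differ from the hand-worked table above:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class ToyInvertedIndex {
    // keyword -> (article number -> 1-based keyword positions in that article)
    private final SortedMap<String, SortedMap<Integer, List<Integer>>> index = new TreeMap<>();

    public void addDocument(int docId, String text) {
        String[] words = text.toLowerCase().split("\\W+"); // crude analysis step
        for (int pos = 0; pos < words.length; pos++) {
            if (words[pos].isEmpty()) continue;
            index.computeIfAbsent(words[pos], k -> new TreeMap<>())
                 .computeIfAbsent(docId, k -> new ArrayList<>())
                 .add(pos + 1); // keyword position, as in the table above
        }
    }

    public Map<Integer, List<Integer>> postings(String keyword) {
        return index.getOrDefault(keyword.toLowerCase(), Collections.emptySortedMap());
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.addDocument(1, "Tom lives in Guangzhou, I live in Guangzhou too.");
        idx.addDocument(2, "He once lived in Shanghai.");
        // Without stemming, "lives"/"lived" stay distinct from "live":
        System.out.println(idx.postings("guangzhou")); // {1=[4, 8]}
        System.out.println(idx.postings("live"));      // {1=[6]}
    }
}
```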

[Further exploration; skip this part if you do not want to go deeper]

If you have studied computer algorithms and data structures, you will easily notice that the inverted-index keywords generated above are sorted in alphabetical order (Lucene does not use a B-tree structure), so Lucene can use a binary search algorithm to locate a keyword quickly.
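A tiny illustration of that point, using the sorted keyword list from our example (the idea only, not Lucene's actual lookup code):

```java
import java.util.Arrays;

public class TermLookup {
    public static void main(String[] args) {
        // The term dictionary is kept sorted, so a keyword is found in
        // O(log n) comparisons with binary search.
        String[] terms = {"guangzhou", "he", "i", "live", "shanghai", "tom"};
        System.out.println(Arrays.binarySearch(terms, "live")); // prints 3
    }
}
```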

In the concrete implementation, Lucene stores the three columns above as a term dictionary file, a frequencies file, and a positions file. The dictionary file not only stores every keyword, it also keeps pointers into the frequency file and the position file, through which a keyword's frequency and position information can be found.

Lucene uses the concept of a field to express where a piece of information is located (for example, in the title, in the body, or in a URL). When the index is created, field information is also recorded in the dictionary file: every keyword carries field information, because each keyword necessarily belongs to one or more fields.

To reduce the size of the index files, Lucene also uses compression. First, the keywords in the dictionary file are compressed: each keyword is stored as <prefix length, suffix>. For example, if the previous word is "阿拉伯" (Arab) and the current word is "阿拉伯语" (Arabic language), then "阿拉伯语" is compressed to <3, 语>. Second, numbers, which occur in great quantity, are compressed by storing only the difference from the previous value (this shortens the number and thus reduces the bytes needed to store it). For example, if the current document number is 16389 (which would take 3 bytes uncompressed) and the previous document number is 16382, then only the difference, 7, is stored, which takes a single byte. Note the role of the "previous word": because the dictionary is sorted, this compression method is very effective.
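Here is a small Java sketch of both tricks. It illustrates the idea only, not Lucene's actual file format; the variable-byte scheme (7 data bits per byte, high bit as a continuation flag) is one common encoding, assumed here, that exploits the small deltas:

```java
import java.io.ByteArrayOutputStream;

public class CompressionSketch {

    // Prefix compression: "阿拉伯" followed by "阿拉伯语" -> <3, 语>.
    static String prefixCompress(String previous, String current) {
        int p = 0;
        while (p < previous.length() && p < current.length()
                && previous.charAt(p) == current.charAt(p)) {
            p++;
        }
        return "<" + p + ", " + current.substring(p) + ">";
    }

    // Delta + variable-byte encoding: store each document number as the gap
    // from the previous one, 7 bits per byte, high bit = "more bytes follow".
    static byte[] deltaVByteEncode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            int gap = docId - previous; // e.g. 16389 - 16382 = 7
            previous = docId;
            while (gap >= 0x80) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap); // a gap of 7 fits in one byte
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(prefixCompress("阿拉伯", "阿拉伯语")); // <3, 语>
        // 16382 takes two bytes, then the gap 7 takes one: 3 bytes in all.
        System.out.println(deltaVByteEncode(new int[]{16382, 16389}).length);
    }
}
```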

III. Implementation mechanism of the full-text search framework:

Lucene's API design is quite generic. Its input and output structures resemble a database's table ==> record ==> field model, so the files and databases of many traditional applications can easily be mapped onto Lucene's storage structure and interface. Overall, you can regard Lucene as a database system with full-text indexing support.

Compare the differences between Lucene data storage and traditional relational databases:

Lucene:

  index data source: doc(field1, field2 ...) doc(field1, field2 ...)
                    \     Indexer     /
                   ____________________
                  |    Lucene Index    |
                   --------------------
                    /    Searcher     \
  result output: Hits(doc(field1, field2) doc(field1 ...))

Database:

  index data source: record(field1, field2 ...) record(field1 ..)
                    \    SQL: insert   /
                   ____________________
                  |      DB Index      |
                   --------------------
                    /   SQL: select   \
  result output: Results(record(field1, field2 ..) record(field1 ...))

Document: the unit of indexing, made up of multiple Fields     Record: a record, containing multiple fields
Field: field                                                   Field: field
Hits: the query result set, made up of matching Documents      RecordSet: the query result set, made up of multiple Records
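The mapping above can be demonstrated end to end in a few lines. The sketch below indexes the two example articles and then searches them; it assumes a recent Lucene release (roughly the 8.x/9.x API, with ByteBuffersDirectory as an in-memory store), which differs from the API of the era in which this article was written:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class IndexAndSearchDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // the in-memory "Lucene Index"
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Indexer: doc(field1, field2 ...) goes in.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            for (String text : new String[]{
                    "Tom lives in Guangzhou, I live in Guangzhou too.",
                    "He once lived in Shanghai."}) {
                Document doc = new Document();
                doc.add(new TextField("content", text, Field.Store.YES));
                writer.addDocument(doc);
            }
        }

        // Searcher: Hits(doc(...)) come out, the rough analogue of SQL select.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("content", analyzer).parse("guangzhou");
            for (ScoreDoc sd : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(sd.doc).get("content"));
            }
        }
    }
}
```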

Remember how full-text retrieval is usually attempted in database SQL: like "%keyword%".

Because database indexes are not designed for full-text search, a like "%keyword%" query cannot use the index at all. A like query scans records one by one, much like flipping through pages. So for database workloads that involve fuzzy matching, like hurts performance badly. And if you need to fuzzy-match several keywords at once, like "%keyword1%" and like "%keyword2%" ..., you can imagine the efficiency.

Thick books usually end with a keyword index (for example, "Beijing: pages 12, 34; Shanghai: pages 3, 77 ..."), which helps readers find the pages for a topic quickly. The principle by which a database index greatly speeds up queries is the same: imagine how many times faster searching through a book's index is than flipping page by page. The other reason an index is efficient is that it is sorted; for a retrieval system, the core is sorting.

Therefore, the key to building an efficient retrieval system is to build a reverse-index mechanism like that book-style keyword index: store the data sources (for example, many articles) in order, and maintain a separate, sorted keyword list that stores the keyword ==> article mapping, i.e. [keyword => document numbers containing the keyword, occurrence count (possibly even positions: start offset, end offset), frequency]. Searching then turns the fuzzy query into a logical combination of several exact queries that can each use the index. This greatly improves the efficiency of multi-keyword queries, so the full-text retrieval problem ultimately comes down to a sorting problem.
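That "logical combination of several exact queries" has a direct expression in Lucene's query classes. A minimal sketch, assuming a recent Lucene version and a hypothetical "content" field:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class FuzzyToExact {
    // The analogue of: like "%keyword1%" and like "%keyword2%", but built
    // from two exact term lookups that both use the inverted index.
    static Query bothKeywords(String keyword1, String keyword2) {
        return new BooleanQuery.Builder()
                .add(new TermQuery(new Term("content", keyword1)), BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("content", keyword2)), BooleanClause.Occur.MUST)
                .build();
    }
}
```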

It follows that a fuzzy query is, compared with a database's exact queries, a highly uncertain problem; that is also why most databases have only limited support for full-text retrieval.

Lucene's core feature is that, through a special index structure, it implements the full-text indexing that traditional databases are not good at, and it provides extension interfaces that make it easy to customize for different applications.

Let us compare Lucene's full-text indexing with a relational database's fuzzy (like) queries, point by point:

Index
  Lucene full-text engine: builds an inverted index over all the data in the data source.
  Database: for like queries, traditional data indexes cannot be used at all; every record must be scanned for grep-style fuzzy matching, which is more than an order of magnitude slower than an indexed search.

Matching effect
  Lucene full-text engine: matches by term, and through the language-analysis interface can support non-English languages such as Chinese.
  Database: like "%net%" also matches "Netherlands"; and a multi-keyword fuzzy match such as like "%com%net%" cannot match "xxx.net...xxx.com".

Relevance
  Lucene full-text engine: has a relevance algorithm that puts better-matching results at the top.
  Database: no relevance control; whether the keyword "net" appears once or many times in a record, the results rank the same.

Result output
  Lucene full-text engine: uses a special algorithm to output the 100 best-matching results first, and reads the remainder of the result set in small buffered batches.
  Database: returns the entire result set; when there are many matches (say tens of thousands), a great deal of memory is needed to hold the temporary result set.

Customizability
  Lucene full-text engine: through the pluggable language-analysis interfaces, index rules can easily be customized to the application (including Chinese support).
  Database: cannot be customized, or only through complex interfaces.

Conclusion
  Lucene full-text engine: suited to high-load fuzzy-query applications that need customized fuzzy-match rules and index large volumes of data.
  Database: suited to low usage, simple fuzzy-match rules, or small query volumes.

The biggest difference between full-text search and database applications: only the most relevant results matter, and the top 100 results satisfy the needs of more than 98% of users.

Lucene's innovations:

Most search (and database) engines use B-tree structures to maintain indexes, so updating the index triggers a large number of I/O operations. Lucene improves slightly on this: rather than maintaining a single index file, it keeps creating new segment files as the index grows, and periodically merges these small segment files into the existing large index (the merge batch size can be adjusted for different update policies). This improves indexing efficiency without sacrificing retrieval efficiency.
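In recent Lucene versions this merge behavior is exposed through a MergePolicy. A minimal sketch of tuning it, assuming the current TieredMergePolicy API; the knob values are illustrative assumptions, not recommendations from the article:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class MergeTuning {
    static IndexWriterConfig config() {
        TieredMergePolicy mp = new TieredMergePolicy();
        mp.setSegmentsPerTier(10.0);        // merge once ~10 small segments accumulate
        mp.setMaxMergedSegmentMB(5 * 1024); // cap the size of merged segments
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        cfg.setMergePolicy(mp);             // the batch-merge policy described above
        return cfg;
    }
}
```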
Comparison between Lucene and other full-text retrieval systems/applications:

Incremental and batch indexing
  Lucene: supports incremental indexing (append) as well as batch indexing of large data volumes, and the interface design is optimized for both batch indexing and small incremental updates.
  Other open-source full-text retrieval systems: many support only batch indexing; sometimes a small addition to the data source forces a complete re-index.

Data source
  Lucene: defines no specific data source, only a document structure, so it adapts very flexibly to all kinds of applications (as long as the front end has a suitable converter that turns the data source into that structure).
  Other systems: many target only web pages and lack flexibility for documents in other formats.

Index content control
  Lucene: a document is made up of multiple fields; you can control which fields are indexed and which are not, and the indexed fields can further be divided into those that need word segmentation and those that do not:
    fields that need segmentation, such as an article's title and body;
    fields that do not, such as author or date.
  Other systems: lack this generality; often the entire document is indexed wholesale.

Language analysis
  Lucene: implemented through different extensions of the language analyzer:
    unwanted words can be filtered out: a, the, of, etc.;
    Western-language stemming: jumps, jumped, and jumper are all reduced to jump for indexing and search;
    non-English support: Asian and Arabic languages can be indexed.
  Other systems: lack a universal analyzer interface.

Query analysis
  Lucene: by implementing the query-parsing interface, you can customize your own query syntax rules, for example the +, -, and, or relationships among multiple keywords.
  Other systems: weaker functionality here.

Concurrent access
  Lucene: supports multi-user access.
  Other systems: weaker functionality here.

IV. Word segmentation in Asian languages

For Chinese, full-text indexing must first solve the problem of language analysis. In English, the words in a sentence are naturally separated by spaces, but in the CJK languages (Chinese, Japanese, Korean) a sentence is one unbroken run of characters. Everything must first be indexed "by word", so how to cut a sentence into words becomes a big problem.

First, a single character certainly cannot serve as the index unit; otherwise, when querying "上海" (Shanghai), text containing "海上" ("at sea") would also match. But if we cut by word, how should a computer split "北京天安门" (Tiananmen Square, Beijing) according to Chinese language habits? As "北京 / 天安门", or as "北京 / 天安 / 门"? For the computer to split the way the language is actually used, the machine usually needs a rich dictionary so it can accurately identify the words in a sentence.

Another solution is automatic segmentation: split the text into overlapping two-character units according to a bigram model. For example:

"北京天安门" (Tiananmen Square, Beijing) => "北京 京天 天安 安门".

In this way, when querying "北京" (Beijing) or "天安门" (Tiananmen), the query phrase is split by the same rule ("北京"; "天安 安门"), the multiple keywords are combined in an "and" relationship, and the query maps correctly onto the corresponding index entries. This approach applies equally to other Asian languages such as Korean and Japanese.

The biggest advantage of automatic segmentation is that it has no dictionary maintenance cost and is easy to implement. Its disadvantage is lower index efficiency, but for small and medium applications, bigram-based segmentation is entirely sufficient. After bigram splitting, the index is roughly the same size as the source file, whereas for English an index file is generally only 30%-40% of the size of the original.
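A bigram splitter fits in a few lines of plain Java, as the sketch below shows (Lucene itself ships a CJKAnalyzer that takes essentially this overlapping two-character approach):

```java
import java.util.ArrayList;
import java.util.List;

public class BigramTokenizer {
    // Emit every overlapping two-character unit of the input.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("北京天安门")); // [北京, 京天, 天安, 安门]
    }
}
```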

Implementation
  Automatic segmentation: easy to implement.
  Dictionary-based word segmentation: complicated to implement.

Query
  Automatic segmentation: adds complexity to query analysis.
  Dictionary-based word segmentation: well suited to implementing complex query syntax rules.

Storage efficiency
  Automatic segmentation: high index redundancy; the index grows to roughly the size of the original text.
  Dictionary-based word segmentation: efficient indexing, about 30% of the original size.

Maintenance cost
  Automatic segmentation: no dictionary maintenance cost.
  Dictionary-based word segmentation: very high dictionary maintenance cost; Chinese, Japanese, Korean and other languages each need their own dictionary, which must also include word-frequency statistics and similar data.

Applicable fields
  Automatic segmentation: embedded systems with limited runtime resources; distributed systems, which then have no dictionary-synchronization problem; multilingual environments, which carry no dictionary maintenance cost.
  Dictionary-based word segmentation: professional search engines that demand high query and storage efficiency.

Currently, the language-analysis algorithms of large search engines are generally based on a combination of these two mechanisms.

V. Lucene structure framework:
Note: some of the more complex lexical analysis in Lucene is generated with JavaCC (Java Compiler Compiler, a pure-Java lexical-analyzer generator). If you want to modify QueryParser from the source code or customize your own lexical analyzer, you will also need to download JavaCC from https://javacc.dev.java.net.
Lucene's composition: for external applications, index and search are the main entry points.

org.apache.lucene.search/       search entry point
org.apache.lucene.index/        index entry point
org.apache.lucene.analysis/     language analyzers
org.apache.lucene.queryParser/  query parser
org.apache.lucene.document/     storage structure
org.apache.lucene.store/        underlying IO/storage structure
org.apache.lucene.util/         some common data structures

VI. Learning more from Lucene:

Lucene is truly a model of object-oriented design. If you want to learn object-oriented programming, or you want to learn Java, I suggest you study Lucene: play with a real project, and a famous foundational one at that. You will learn not only how to use the project itself, but also Java idioms and the coding skills of masters. Whatever you go on to do with Lucene afterwards, you will be able to do more! Why?

    • Every concern is made easy to extend and reuse through an extra abstraction layer: you can achieve your goal by re-implementing one piece without touching the other modules.
    • Simple application entry points, Searcher and Indexer, call a series of underlying components that cooperate to complete the search task.
    • Every object's task is very specific. For example, in the search process, the QueryParser analyzes the query statement and turns it into a combination of a series of precise low-level queries, the underlying index-reading structure IndexReader reads the index, and the corresponding scoring tool scores and sorts the search results. All the functional modules are quite atomic, so each can be re-implemented without modifying the others.
    • Besides the flexible application-interface design, Lucene also provides some language-analyzer implementations suitable for most applications (SimpleAnalyzer, StandardAnalyzer), which is one important reason newcomers get started quickly (a minimal custom-analyzer sketch follows below).
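As a taste of that extensibility, here is a minimal custom Analyzer assembled from stock components. It assumes a recent Lucene version (5.x or later, where createComponents takes only the field name); swapping in a different Tokenizer or TokenFilter changes the analysis chain without touching any other module:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class MyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();       // split text into words
        TokenStream chain = new LowerCaseFilter(source);  // case normalization (step C earlier)
        return new TokenStreamComponents(source, chain);
    }
}
```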
