Full-text search is an index of big data text. In the created index, you can search for the words to be searched and locate the text data including the words to be searched. Therefore, all the work of full-text search is to create an index and search and locate the index. All the work is centered around these two aspects. Next we will introduce them one by one.
It is very important to create a full-text index. One is how to split text, and the other is how to create an index data structure. Word Segmentation is basically a binary word segmentation method, a maximum matching method, and a statistical method. The index data structure basically adopts the inverted index structure.
Word Segmentation is related to the accuracy of the query and the size of the generated index. In the development of Chinese word segmentation, binary word segmentation is often used in the early days. The basic principle of this method is to separate sentences that contain Chinese Characters in binary form without considering the meaning of words, only binary words are indexed. Therefore, this method separates a large number of words, resulting in a huge number of indexes. useless data is retrieved during queries. The advantage is that the algorithm is simple and does not miss the retrieved data. Then the maximum matching word segmentation method is developed. This method is further divided into forward maximum word segmentation and reverse maximum word segmentation. The principle is similar to dictionary query. A dictionary is generated for commonly used words, which analyzes the largest matching word in the dictionary during the sentence process, and splits the sentence into meaningful word chains. The forward Word Segmentation Method in the maximum matching method is prone to errors in distinguishing Partial Formal words. For example, "jewelry and clothing" may separate "and" kimono "as words. DM database adopts the improved inverse largest word segmentation method, which is more accurate than positive. The most complex method is word segmentation through statistical methods. This method uses an implicit Markov chain, that is, the probability of occurrence of the last word depends on the probability of occurrence of the previous word, and finally counts the maximum probability of occurrence of all words as the basis for word segmentation. The recognition of new terms and place names is much higher than that of the largest matching method. The accuracy increases with the increase of the number of sample texts.
The binary word segmentation method and statistical method do not depend on the dictionary, while the maximum matching method relies on the dictionary. The content of the dictionary determines the word segmentation structure.
Full-text search indexes are called inverted indexes. They become inverted indexes because each word is used as an index and the text containing the word is searched based on the index. Therefore, the index is a one-to-many relationship between the word and the unique record text. Sort the indexed words and locate the text containing the words based on the sorted words.
Reverse word segmentation process description
Step 1) read the entire sentence to the variable STR and go to step 2.
Step 2) read 1 word from the end of the sentence to the variable word and go to step 3
Step 3) search for words saved in word in the dictionary. If the word exists, save the word and go to Step 4. Otherwise, go to Step 5)
Step 4) if it is the largest word in the dictionary or exceeds the maximum number of words (identified as a new word), remove the word from the end of the sentence and return Step 2
Step 5) read the previous word to the word to form a new word and go to step 3)
The memory data structure of the dictionary and the matching algorithm of words in the dictionary
Words in memory are stored in hierarchies
Assume that the dictionary contains the following words: people's democracy in the Republic of China
In the memory, arrange by layer as follows. Each square represents a word, and the arrow points to the first word of the word.
Match words as follows
For example, search for the word "Republic of China"
First, find the word "country" using the bipartite method in the first layer.
Obtain the array of "Chinese people" in the lower layer of "country"
In this layer, we use the binary method to find the "people" and obtain the array "China" under the "people" layer"
In this layer, use the binary method to find "China" and obtain the array "medium" under "China"
At this point, the matching is completed.
Index format
The index format is the inverted index format, that is, a word corresponds to several text representations.
In DM database, the object for full-text indexing is the field in Rec. The inverted index is generated and stored using the B tree in the database.
A full-text index is performed on a character field in the database. Therefore, the rowid of REC must be recorded as the Field identifier on the rec.
Therefore, the format of inverted index storage is as follows.
Field 1 |
Field 2 |
Word 1 |
Rowid1, rowid2... |
Word 1 |
Rowid1, rowid2... |
... |
... |
Because the field length of B tree is limited, the format of B tree storage is
Field 1 |
Field 2 |
Field 3 |
Word 1 |
Number of 1 rowid Cells |
Rowid1, rowid2... |
Word 1 |
Number of 2 rowid Cells |
Rowid1, rowid2... |
... |
... |
... |
Full-text index Query
Full-text index query first performs word segmentation on the words to be queried, and then finds all rowids containing these words in the B tree that stores inverted indexes, based on the rowid, the row containing the data is filtered out in the B-tree that stores the actual data.
A complete SQL Server full-text index example. (Taking the pubs database as an example) first, we will introduce the Exploitation System
1) Start the full-text processing function of the database (sp_fulltext_database)
2) create a full-text directory (sp_fulltext_catalog)
3) register the table in the full-text directory that requires full-text indexing (sp_fulltext_table)
4) Name of the column that requires full-text indexing in the table (sp_fulltext_column)
5) create a full-text index for the table (sp_fulltext_table)
6) Fill in the full-text directory (sp_fulltext_catalog)
--------- ********* Example ********-------------
To create a full-text index for the title and notes columns of the pubs database, and then use the index to query the title column or the Notes column
Name of a book that contains a Datebase or computer string: before that, you must install the Microsoft Search Service
User pubs -- open the database
Go
-- Check whether the database pubs supports full-text indexing. If not
-- Use sp_fulltext_database to enable this function.
If (select databaseproperty ('pubs', 'isfulltextenabled ') = 0
Execute sp_fulltext_database 'enable' -- create the full-text directory ft_pubs
Execute sp_fulltext_catalog 'ft _ pubs ', 'create' -- create full-text index data element for the title table
Execute sp_fulltext_table 'title', 'create', 'ft _ pubs', 'upkcl _ titleidind '-- Set
Execute sp_fulltext_column 'title', 'title', 'add'
Execute sp_fulltext_column 'title', 'note', 'add' -- create a full-text index
-- Activate: Enables full-text retrieval of a table, that is, registering the table in the full-text directory.
Execute sp_fulltext_table 'title', 'activate' -- fill in the full-text INDEX DIRECTORY
Execute sp_fulltext_catalog 'ft _ pubs ', 'start _ full'
Go -- check the full-text directory filling status while fulltextcatalogproperty ('ft _ pubs', 'populatestatus') <> 0
Begin -- if the full-text directory is in the filled state, wait 30 seconds before detection.
Waitfor delay '0: 0: 30'
End -- after the full-text directory is filled, you can use the full-text directory to retrieve the select title
Form
Where contains (title, 'database ')
Or contains (title, 'computer ')
Or contains (notes, 'database ')
Or contains (notes, 'database ')
'-------------- The following describes the full-text operating system stored procedures.
Procedure name: sp_fulltext_service
Execution permission: serveradmin or System Administrator
Usage: set full-text search attributes
Procedure name: sp_fulltext_catalog
Execution permission: db_owner and higher role members
For use: Create and delete a full-text directory, and start or stop the index operation of a full-text directory.
Procedure name: sp_fulltext_database
Execution permission: Member of db_owner role
For use: Initialize the full-text index or delete all full-text directories in the database
Procedure name: sp_fulltext_table
Execution permission: db_ddladnmin or db_owner role member
For use: Identify a table as a full-text index table or a non-full-text index table.
Procedure name: sp_fulltext_column
Execution permission: Member of the db_ddladnmin role
Usage: indicates the columns in a full-text index table. If the full-text index is left or not