MySql5.7 creating a full-text index
InnoDB the default full-text index parser is well suited for Latin, because Latin is a word with empty glyd. But for Chinese, Japanese, and Korean, there is no such delimiter. A word can consist of multiple words, so we need to deal with it in different ways. In MySQL 5.7.6 we can use a new full-text indexing plug-in to process them: N-gram parser.
What is N-gram?
In a full-text index, N-gram is a sequential n-word sequence in a paragraph of text. For example, using N-gram to "information system" to the word segmentation, the results are as follows:
How do I use N-gram Parser in InnoDB?
N-gram parser is loaded into MySQL by default and can be used directly. We only need to use the with PARSER Ngram when creating the full-text index in the DDL.
We introduced a new global variable called Ngram_token_size. It determines the size of n in N-gram, which is the size of the word. Its default value is 2, and this time, we are using Bigram. Its legal value range is 1 to 10. Now, it's natural to think of a question: How should you set the size of the ngram_token_size value in the actual application? Of course, we recommend the use of 2. But you can also choose any legal value by following this simple rule: set to the size of the smallest word you want to be able to query. If you want to query a single word, then we need to set it to 1. The smaller the value of the ngram_token_size is, the less space the full-text index takes up. In general, a query that is exactly equal to the ngram_token_size word is faster, but a word or phrase that is longer than it is queried will become slower.
N-gram word processing
N-gram parser and the system default full-text index parser have the following differences:
- Word size check: Because of the ngram_token_size, Innodb_ft_min_token_size and innodb_ft_max_token_size will not apply to N-gram.
- Useless words (stopword) Processing: Usually, for a new word, we will look up the Stopwords table to see if there are any matching words. If so, the word is not added to the full-text index. But in N-gram, we'll look at the Stopwords table to see if it contains the words. The reason for this is that there are a lot of meaningless characters, words and punctuation in the text of CJK. For example, if we add ' the ' to the Stopwords table, then for the system of the sentence ' information ', by default we'll end up with the word ' information ', ' system '. Where ' the ' and ' system ' is filtered out.
We can information_schema by querying. Innodb_ft_index_cache and INFORMATION_SCHEMA. Innodb_ft_table_table to query which words are inside the full-text index. This is a very useful debugging tool. If we find a document that contains a word that does not appear in the query results as we expect it to, then the word may be for some reason not in the full-text index. For example, it contains Stopword, or its size is smaller than ngram_token_size and so on. This time we can check the two tables to confirm.
MySQL Full-Text Search Ngram Plugin