MySQL Full-Text Search Ngram Plugin

Source: Internet
Author: User

MySql5.7 creating a full-text index

InnoDB the default full-text index parser is well suited for Latin, because Latin is a word with empty glyd. But for Chinese, Japanese, and Korean, there is no such delimiter. A word can consist of multiple words, so we need to deal with it in different ways. In MySQL 5.7.6 we can use a new full-text indexing plug-in to process them: N-gram parser.

What is N-gram?

In a full-text index, N-gram is a sequential n-word sequence in a paragraph of text. For example, using N-gram to "information system" to the word segmentation, the results are as follows:

How do I use N-gram Parser in InnoDB?

N-gram parser is loaded into MySQL by default and can be used directly. We only need to use the with PARSER Ngram when creating the full-text index in the DDL.

We introduced a new global variable called Ngram_token_size. It determines the size of n in N-gram, which is the size of the word. Its default value is 2, and this time, we are using Bigram. Its legal value range is 1 to 10. Now, it's natural to think of a question: How should you set the size of the ngram_token_size value in the actual application? Of course, we recommend the use of 2. But you can also choose any legal value by following this simple rule: set to the size of the smallest word you want to be able to query. If you want to query a single word, then we need to set it to 1. The smaller the value of the ngram_token_size is, the less space the full-text index takes up. In general, a query that is exactly equal to the ngram_token_size word is faster, but a word or phrase that is longer than it is queried will become slower.

N-gram word processing

N-gram parser and the system default full-text index parser have the following differences:

    1. Word size check: Because of the ngram_token_size, Innodb_ft_min_token_size and innodb_ft_max_token_size will not apply to N-gram.
    2. Useless words (stopword) Processing: Usually, for a new word, we will look up the Stopwords table to see if there are any matching words. If so, the word is not added to the full-text index. But in N-gram, we'll look at the Stopwords table to see if it contains the words. The reason for this is that there are a lot of meaningless characters, words and punctuation in the text of CJK. For example, if we add ' the ' to the Stopwords table, then for the system of the sentence ' information ', by default we'll end up with the word ' information ', ' system '. Where ' the ' and ' system ' is filtered out.

We can information_schema by querying. Innodb_ft_index_cache and INFORMATION_SCHEMA. Innodb_ft_table_table to query which words are inside the full-text index. This is a very useful debugging tool. If we find a document that contains a word that does not appear in the query results as we expect it to, then the word may be for some reason not in the full-text index. For example, it contains Stopword, or its size is smaller than ngram_token_size and so on. This time we can check the two tables to confirm.

MySQL Full-Text Search Ngram Plugin

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.