In-depth analysis of MySQL 5.7 Chinese full-text search, MySQL

Source: Internet
Author: User

In-depth analysis of MySQL 5.7 Chinese full-text search, MySQL

Preface

In fact, full-text retrieval has been supported for a long time in MySQL, but it has always supported only English. The reason is that he has always used space as the separator for word segmentation. for Chinese, it is obviously inappropriate to use space, and it is necessary to perform word segmentation for Chinese Semantics. This is not the case. Starting from MySQL 5.7, MySQL has a built-in ngram full-text retrieval plug-in that supports Chinese Word Segmentation and is effective for the MyISAM and InnoDB engines.

Before using the Chinese search word segmentation plug-in ngram, you must set the word segmentation size in the MySQL configuration file, for example,

[mysqld]ngram_token_size=2

Set the word segmentation size to 2. Remember, the larger the word segmentation SIZE, the larger the index SIZE, so you must set the appropriate SIZE according to your own situation.

Example Table Structure:

CREATE TABLE articles (   id INTUNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,   titleVARCHAR(200),   body TEXT,   FULLTEXT (title,body) WITH PARSER ngram  ) ENGINE=InnoDBCHARACTER SET utf8mb4;

The sample data has 6 rows of records.

Mysql> select * from articles \ G *************************** 1. row ************************** id: 1 title: database management body: in this tutorial, I will show you how to manage databases ***************************** 2. row ************************** id: 2 title: database application development body: learn how to develop database applications **************************** 3. row ************************** id: 3 title: MySQL full manual body: learn everything about MySQL *************************** 4. row ************************* id: 4 title: database and transaction processing body: introduction to database transactions: ****************************** 5. row ************************** id: 5 title: NoSQL essence body: learn about the various types of unstructured databases ***************************** 6. row ************************** id: 6 title: SQL language details body: for more information, see SQL6 rows inset (0.00 sec)

Explicitly specify the full-text retrieval table Source

mysql> SETGLOBAL innodb_ft_aux_table="new_feature/articles";Query OK, 0 rows affected (0.00 sec)

Through the system table, you can view how the data in articles is divided.

Mysql> SELECT * FROM information_schema.INNODB_FT_INDEX_CACHE LIMIT 20, 10; + ------ + -------------- + upper + ----------- + -------- + ---------- + | WORD | Upper | Lower | DOC_COUNT | DOC_ID | POSITION | + ------ + -------------- + ----------- + -------- + ------------ + | medium me | 2 | 2 | 1 | 2 | 28 | Xi m | 4 | 4 | 1 | 4 | 21 | learned | 6 | 6 | 1 | 6 | 16 | Xi Kai | 3 | 3 | 1 | 3 | 25 | Number of Xi | 5 | 5 | 1 | 5 | 37 | understand | 6 | 7 | 2 | 6 | 19 | understand | 6 | 7 | 2 | 7 | 23 | transaction | 5 | 5 | 1 | 5 | 12 | transaction | 5 | 5 | 1 | 5 | 40 | He Guan | 2 | 2 | 1 | 2 | 52 | + ------ + -------------- + ------------- + ----------- + -------- + ---------- + 10 rows in set (0.00 sec)

We can see that the tokenization length is set to 2, and there are only two groups of data. The preceding data also contains the row location, ID, and other information.

Next, I will perform a series of search demonstrations, which are used in the same way as the original English search.

1. Search in natural language mode:

1. Obtain the number of qualified instances,

Mysql> select count (*) FROM articles-> where match (title, body) AGAINST ('database' in naturallanguage mode); + ---------- + | COUNT (*) | + ---------- + | 4 | + ---------- + 1 row in set (0.05 sec)

2. Obtain the matching ratio,

Mysql> SELECT id, MATCH (title, body) AGAINST ('database' in natural language mode) AS score FROM articles; + ---- + -------------------- + | id | score | + ---- + -------------------- + | 1 | 0.12403252720832825 | 2 | 0.12403252720832825 | 3 | 0 | 4 | 0.12403252720832825 | 5 | 0.062016263604164124 | | 6 | 0 | + ---- + -------------------- + 6 rows in set (0.00 sec)

2. Searching in Boolean mode is more complex than searching in Natural Mode:

1. Matching records with both management and database,

Mysql> SELECT * FROM articles where match (title, body)-> AGAINST ('+ database + Management' in boolean mode ); + ---- + ------------ + response + | id | title | body | + ---- + ------------ + -------------------------------------- + | 1 | database management | in this tutorial, I will show you how to manage databases | + ---- + ------------ + -------------------------------------- + 1 rowin set (0.00 sec)

2. Matching with a database but no management records,

Mysql> SELECT * FROM articles where match (title, body)-> AGAINST ('+ database-Management' in boolean mode ); + ---- + ------------------ + response + | id | title | body | + ---- + ------------------ + ---------------------------- + | 2 | Database Application Development | learn to develop database applications | 4 | database and transactions processing | introduction to system learning database transactions | 5 | NoSQL essence | learn about various unstructured databases | + ---- + ------------------ + -------------------------------- + 3 rows in set (0.00 sec)

3. Match MySQL, but reduce the database relevance,

Mysql> SELECT * FROM articles where match (title, body)-> AGAINST ('> database + MySQL 'inboolean MODE ); + ---- + --------------- + | id | title | body | + ---- + ----------------- + | 3 | MySQL full manual | learn everything about MySQL | + ---- + --------------- + tips + 1 rowin set (0.00 sec)

Iii. query extension mode. For example, to search for a database, MySQL, oracle, and DB2 will also be searched,

Mysql> SELECT * FROM articles-> where match (title, body)-> AGAINST ('database' with query expansion ); + ---- + ------------------ + certificate + | id | title | body | + ---- + ------------------ + ------------------------------------ + | 1 | database management | in this tutorial, I will show you how to manage databases | | 4 | database and transaction processing | system learning introduction to database transactions | 2 | Database Application Development | learning to develop database applications | 5 | NoSQL essence | learning to understand various unstructured data database | 6 | detailed description of the SQL language | learn more about using various SQL statements | 3 | MySQL full manual | learn about MySQL everything | + ---- + -------------------- + ------------------------------ + 6 rows in set (0.01 sec)

Of course, I will only present the function here. For more performance tests, you can perform detailed tests if you are interested. N-grm is a commonly used word segmentation algorithm for Chinese search and has been widely used on the Internet. Therefore, the integration of N-grm into MySQL won't be too effective.

Summary

The above is all about this article. I hope this article will help you in your study or work. If you have any questions, please leave a message.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.