Mysql full-text index _ MySQL

Source: Internet
Author: User
Mysql full-text index bitsCN.com

When MySQL has high concurrent connections and a large number of database records, SELECT... WHERE... LIKE '%... % 'The full-text search method is not only inefficient, but also cannot be used for queries starting with the wildcard "%" and "_". a full table scan is required, which puts a lot of pressure on the database. MySQL provides a full-text index solution to solve this problem, which not only improves performance and efficiency (because MySQL indexes these fields to optimize search ), in addition, it achieves higher quality search. However, up to now, MySQL does not support full-text Chinese indexing correctly.

An important difference between Chinese and Western words, such as English, is that western words are separated by spaces in units of words. Chinese characters are in the unit of words. words are composed of one or more words. there is no space between words and words. When you try to use full-text search in a field containing Chinese characters, you will not get the correct result because the Chinese characters do not define words like English spaces and cannot be separated by spaces, index Chinese words.

I. features of MySQL full-text index plug-in mysqlcft:
1. advantages:
① High accuracy: the "three-byte crossover splitting algorithm" is used to separate Chinese statements without a Chinese word segmentation Dictionary. The search accuracy is far higher than the Chinese word segmentation algorithm, LIKE '%... %.
② Fast query speed: Search speed is 3 ~ faster than LIKE '%... % ~ 50 times. There are test results at the end of the article;
③ Standard plug-in type: developed in the form of a standard plug-in for full-text index of MySQL 5.1, without modifying the MySQL source code or affecting other functions of MySQL, you can quickly follow up the new version of MySQL;
④ Support multiple versions: support all MySQL 5.1 Release Candidate versions, that is, MySQL 5.1.22 RC ~ The latest MySQL 5.1.25 RC;
⑤ Supported character sets: MySQL character sets including GBK, GB2312, UTF-8, Latin1, BIG5 are supported (other character sets have not been tested );
6. good system compatibility: i386 and x86_64 are available, and 32-bit (i386) and 64-bit (x86_64) CPUs and Linux systems are supported;
7. suitable for distributed: it is very suitable for the MySQL Slave distributed system architecture, with no dictionary maintenance cost and no dictionary synchronization problem.

2. disadvantages:
①. Mysqlcft full-text index is only applicable to MyISAM tables, because MySQL only supports FULLTEXT index for MyISAM tables;
② MySQL cannot be statically compiled and installed; otherwise, mysqlcft plug-in cannot be installed;
③ The index file based on the "three-byte crossover splitting algorithm" is slightly larger than the index file based on the "Chinese word segmentation algorithm", such as the massive index files such as ft-hightman. According to my tests, the. MYI index file of mysqlcft full-text index is 2 ~ of the. MYD data file ~ 5 times.

Add
[Mysqld]
Ft_min_word_len = 1

Appendix: Optimization of MySQL configuration files in full-text index applications

[Mysqld]
# Key_buffer specifies the size of the buffer used for the index. in the full-text index, you can add it to get better index processing and query performance.
Key_buffer = 512 M

# Sort_buffer_size is the buffer size that can be used for query sorting. after the full-text index SQL statement is executed, order by is usually used to sort the SQL statement. increasing this will speed up the SQL statement execution time. The allocated memory corresponding to this parameter is exclusive to each connection, and the memory used by 100 connections will be 32 M * 100 = 3200 M
Sort_buffer_size = 32 M

# To perform the group by or order by operation on a table larger than the available memory, add the value of read_rnd_buffer_size to accelerate reading of rows after the sort operation.
Read_rnd_buffer_size = 64 M

# If the TABLE fails or the index fails, the buffer size used in REPAIR TABLE
Myisam_sort_buffer_size = 128 M

# Determine the size limit of the index value of the filesort algorithm used
Max_length_for_sort_data = 64

# Minimum length of keywords used for MySQL full-text index query (do not change this value)
Ft_min_word_len = 1

# Reduce the UPDATE priority and set query priority
Low_priority_updates = 1
Wget http://mysqlcft.googlecode.com/files/mysqlcft-1.0.0-i386-bin.tar.gz
Tar zxvf mysqlcft-1.0.0-i386-bin.tar.gz
Cp mysqlcft. so/usr/local/mysql1/lib/mysql/plugin/
-- Install the engine
Install plugin mysqlcft SONAME 'mysqlcft. so ';
-- Check whether the installation is successful
SELECT * FROM mysql. plugin;
Show plugins;
-- Create an index
Use test
Alter ignore table pa_gposts add fulltext index full_text_title (title) with parser mysqlcft;
-- Index repair
Repair table pa_gposts QUICK;
Performance Comparison
Before adding an index
SELECT * FROM pa_gposts where match (title) AGAINST ('Hospital 'in boolean mode) limit 0, 30;
4 rows in set (1 min 12.69 sec)
The results of this query must contain stopword before and after the word, and the query speed is still slow, because there is no index
Mysql> explain SELECT * FROM pa_gposts where match (title) AGAINST ('Hospital 'in boolean mode) limit;
+ ---- + ------------- + ----------- + ------ + --------------- + ------ + --------- + ------ + -------- + ------------- +
| Id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+ ---- + ------------- + ----------- + ------ + --------------- + ------ + --------- + ------ + -------- + ------------- +
| 1 | SIMPLE | pa_gposts | ALL | NULL | 213193 | Using where |
+ ---- + ------------- + ----------- + ------ + --------------- + ------ + --------- + ------ + -------- + ------------- +
After creating an index
SELECT * FROM pa_gposts where match (title) AGAINST ('Hospital 'in boolean mode) limit 0, 30;
30 rows in set (1.07 sec)
SELECT * FROM pa_gposts WHERE title LIKE '% hospital %' limit 0, 30;
30 rows in set (4.81 sec)

Mysql> explain SELECT * FROM pa_gposts WHERE title LIKE '% hospital %' limit;
+ ---- + ------------- + ----------- + ------ + --------------- + ------ + --------- + ------ + -------- + ------------- +
| Id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+ ---- + ------------- + ----------- + ------ + --------------- + ------ + --------- + ------ + -------- + ------------- +
| 1 | SIMPLE | pa_gposts | ALL | NULL | 213193 | Using where |
+ ---- + ------------- + ----------- + ------ + --------------- + ------ + --------- + ------ + -------- + ------------- +

Mysql> explain SELECT * FROM pa_gposts where match (title) AGAINST ('Hospital 'in boolean mode) limit;
+ ---- + ------------- + ----------- + ---------- + ----------------- + --------------- + --------- + ------ + ------------- +
| Id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+ ---- + ------------- + ----------- + ---------- + ----------------- + --------------- + --------- + ------ + ------------- +
| 1 | SIMPLE | pa_gposts | fulltext | full_text_title | 0 | 1 | Using where |
+ ---- + ------------- + ----------- + ---------- + ----------------- + --------------- + --------- + ------ + ------------- +

SELECT * FROM pa_gposts where match (title) AGAINST ('loan' in boolean mode) limit;
30 rows in set (1.93 sec)
SELECT * FROM pa_gposts WHERE title LIKE '% couples %' limit 0, 30;
30 rows in set (10.17 sec)
SELECT * FROM pa_gposts where match (title) AGAINST ('Moonlight 'in boolean mode) limit 0, 30;
13 rows in set (0.56 sec)
SELECT * FROM pa_gposts WHERE title LIKE '% Moonlight %' limit 0, 30;
13 rows in set (50.98 sec)
Boolean full-text search has the following features:

They do not use the 50% domain value ..
They do not classify rows in the order of weak correlations. You can see this from the above query results: the row with the highest relevance is a row containing two "MySQL", but it is listed at the last position rather than the start position.
Even if FULLTEXT is not available, they can still work, although this method of search execution is very slow.
The full-text parameter and maximum-word-length full-text parameter are applicable.
The stopword applies.

The performance of Boolean full-text search supports the following operators:

+

A leading plus sign indicates that the word must appear at the beginning of each returned line.

-

A leading minus sign indicates that the word cannot appear in any returned row.

(No operator)

In the default state (when + or-is not specified), the word is dispensable, but the row level containing the word is higher. This is similar to the operation when MATCH ()... AGAINST () does not use in boolean mode to modify the program.

> <

These two operators are used to change the effect of a word on the values related to a row. > The operator enhances its influence, while <操作符则减弱其影响。请参见下面的例子。

()

Parentheses are used to divide words into subexpressions. The section enclosed in parentheses can be nested.

~

A leading font size is used as a negative character to deny the influence of a word on the correlation of the row. This is useful for marking "noise (useless information)" words. Rows containing such words are of lower grade than other rows, but because they may be used together with the hyphen (-), the police station will not use any information at any time.

*

An asterisk is used as a truncation character. Different from other symbols, it should be appended to the word to be truncated.

"

A phrase ('"') enclosed in double quotation marks matches only the row containing the input format of the phrase on the word. The full text engine splits a phrase into a word and searches for the word in the FULLTEXT index. Non-word characters do not require strict match: phrase search only requires that the words contained in the search phrase are in the same order and the words are in the same order. For example, "test phrase" matches "test, phrase ".

If the word contained in the phrase does not exist in the index, the result is null. For example, if all words are disabled or the length is smaller than the minimum length of the indexed words, the result is null.

The following example shows some search strings that use Boolean full-text symbols:

'Apple bana'

Search for rows that contain at least two words.

'+ Apple + juice'

Search for rows containing both words.

'+ Apple macintosh'

Find the row that contains the word "apple". if the row also contains the word "macintosh", it is listed as a higher level.

'+ Apple-macintosh'

Find a line that contains the word "apple" but does not contain the word "macintosh.

'+ Apple + (> turnover
Find the rows that contain the words "apple" and "turnover", or those that contain "apple" and "strudel" (in no order ), however, rows that contain "apple turnover" are arranged at a higher level than those that contain "apple strudel.

'Apple *'

Search for rows that contain "apple", "apples", "applesauce", or "applet.

'"Some words "'

Search for rows that contain the original phrase "some words" (for example, rows that contain "some words of wisdom", rather than rows that contain "some noise words ). Note that the '"' symbol of the surrounding phrase is an operator character that defines the phrase. They are not quotation marks surrounding the search string itself.

BitsCN.com

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.