MySQL full-text index

In MySQL, searching text with SELECT ... WHERE ... LIKE '%...%' is inefficient: a pattern that starts with the wildcard "%" or "_" cannot use an index, so a full table scan is required, which puts heavy pressure on the database. MySQL provides full-text indexes to solve this problem. They improve performance (because MySQL indexes the searched fields) and also give higher-quality search results. However, up to now MySQL does not handle Chinese full-text indexing correctly.

An important difference between Chinese and Western text such as English is that Western text is made up of words separated by spaces. Chinese text is made up of characters; a word consists of one or more characters, and there are no spaces between words. When you run a full-text search on a field containing Chinese, you do not get correct results, because Chinese has no delimiter like the English space to mark word boundaries, so Chinese words cannot be split apart and indexed.

I. Features of the MySQL full-text index plug-in mysqlcft

1. Advantages:
① High accuracy: it uses a self-developed "three-byte crossover splitting algorithm" to split Chinese text and needs no Chinese word segmentation dictionary. Its search accuracy is far higher than that of Chinese word segmentation algorithms and is comparable to that of LIKE '%...%'.
② Fast queries: searches run 3 to 50 times faster than LIKE '%...%'; test results are at the end of the article.
③ Standard plug-in: it is developed as a standard full-text index plug-in for MySQL 5.1. It does not modify the MySQL source code or affect other MySQL functions, so it can quickly keep up with new MySQL versions.
④ Multiple versions supported: all MySQL 5.1 Release Candidate versions are supported, from MySQL 5.1.22 RC to the latest MySQL 5.1.25 RC.
⑤ Supported character sets: MySQL character sets including GBK, GB2312, UTF-8, Latin1 and BIG5 are supported (other character sets have not been tested).
⑥ Good system compatibility: it ships in i386 and x86_64 versions and supports 32-bit (i386) and 64-bit (x86_64) CPUs and Linux systems.
⑦ Suitable for distributed systems: it fits the MySQL Slave distributed architecture well, with no dictionary maintenance cost and no dictionary synchronization problem.

2. Disadvantages:
① The mysqlcft full-text index applies only to MyISAM tables, because MySQL only supports FULLTEXT indexes on MyISAM tables.
② MySQL must not be statically compiled; otherwise the mysqlcft plug-in cannot be installed.
③ Index files based on the "three-byte crossover splitting algorithm" are slightly larger than index files based on a "Chinese word segmentation algorithm", such as those of ft-hightman. According to my tests, the .MYI index file of a mysqlcft full-text index is 2 to 5 times the size of the .MYD data file.

Add the following to the configuration file:

    [mysqld]
    ft_min_word_len = 1

Appendix: MySQL configuration file optimization for full-text index applications

    [mysqld]
    # key_buffer specifies the size of the buffer used for indexes. For full-text indexes, increasing it gives better index handling and query performance.
    key_buffer = 512M
    # sort_buffer_size is the size of the buffer available for sorting query results. Full-text SQL statements are usually followed by ORDER BY, so increasing it can speed up statement execution.
    # The memory for this parameter is allocated separately for each connection, so 100 connections would use 32M * 100 = 3200M.
    sort_buffer_size = 32M
    # When GROUP BY or ORDER BY runs on tables larger than the available memory, increase read_rnd_buffer_size to speed up reading rows after the sort.
    read_rnd_buffer_size = 64M
    # Buffer size used by REPAIR TABLE when a table or index is damaged.
    myisam_sort_buffer_size = 128M
    # Size limit on index values that determines which filesort algorithm is used.
    max_length_for_sort_data = 64
    # Minimum keyword length for MySQL full-text index queries (do not change this value).
    ft_min_word_len = 1
    # Lower the priority of UPDATE so that queries take precedence.
    low_priority_updates = 1

Download and install the plug-in:

    wget http://mysqlcft.googlecode.com/files/mysqlcft-1.0.0-i386-bin.tar.gz
    tar zxvf mysqlcft-1.0.0-i386-bin.tar.gz
    cp mysqlcft.so /usr/local/mysql1/lib/mysql/plugin/

    -- Install the plug-in
    INSTALL PLUGIN mysqlcft SONAME 'mysqlcft.so';
    -- Check whether the installation succeeded
    SELECT * FROM mysql.plugin;
    SHOW PLUGINS;
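The per-connection memory figure in the sort_buffer_size comment above can be checked with a few lines of arithmetic. This is only a rough worst-case sketch: the function name total_sort_buffer_mb is hypothetical, and it assumes every connection allocates a full sort buffer at the same time, which real workloads rarely do.

```python
# Worst-case estimate of memory used by per-connection sort buffers.
# The function name is hypothetical; 32 MB and 100 connections are the
# figures quoted in the configuration comment.

def total_sort_buffer_mb(sort_buffer_mb: int, connections: int) -> int:
    """Return the memory in MB if every connection allocates a full sort buffer."""
    return sort_buffer_mb * connections


if __name__ == "__main__":
    print(total_sort_buffer_mb(32, 100))  # 3200
```

Because the cost grows linearly with the number of connections, a large sort_buffer_size is best weighed against the expected connection count before it is raised.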
    -- Create the index
    USE test;
    ALTER IGNORE TABLE pa_gposts ADD FULLTEXT INDEX full_text_title (title) WITH PARSER mysqlcft;
    -- Repair the index
    REPAIR TABLE pa_gposts QUICK;

Performance comparison, before creating the index:

    SELECT * FROM pa_gposts WHERE MATCH (title) AGAINST ('hospital' IN BOOLEAN MODE) LIMIT ...;
    4 rows in set (1 min 12.69 sec)

In the rows returned, the keyword must have a stop word before and after it, and the query is still very slow because it scans the whole table, as EXPLAIN shows:

    mysql> EXPLAIN SELECT * FROM pa_gposts WHERE MATCH (title) AGAINST ('hospital' IN BOOLEAN MODE) LIMIT ...;
    +----+-------------+-----------+------+---------------+------+---------+------+--------+-------------+
    | id | select_type | table     | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
    +----+-------------+-----------+------+---------------+------+---------+------+--------+-------------+
    |  1 | SIMPLE      | pa_gposts | ALL  | NULL          | NULL | NULL    | NULL | 213193 | Using where |
    +----+-------------+-----------+------+---------------+------+---------+------+--------+-------------+
After creating the index:

    SELECT * FROM pa_gposts WHERE MATCH (title) AGAINST ('hospital' IN BOOLEAN MODE) LIMIT ...;
    30 rows in set (1.07 sec)

    SELECT * FROM pa_gposts WHERE title LIKE '%hospital%' LIMIT ...;
    30 rows in set (4.81 sec)

    mysql> EXPLAIN SELECT * FROM pa_gposts WHERE title LIKE '%hospital%' LIMIT ...;
    +----+-------------+-----------+------+---------------+------+---------+------+--------+-------------+
    | id | select_type | table     | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
    +----+-------------+-----------+------+---------------+------+---------+------+--------+-------------+
    |  1 | SIMPLE      | pa_gposts | ALL  | NULL          | NULL | NULL    | NULL | 213193 | Using where |
    +----+-------------+-----------+------+---------------+------+---------+------+--------+-------------+

    mysql> EXPLAIN SELECT * FROM pa_gposts WHERE MATCH (title) AGAINST ('hospital' IN BOOLEAN MODE) LIMIT 0, 30;
    +----+-------------+-----------+----------+-----------------+-----------------+---------+------+------+-------------+
    | id | select_type | table     | type     | possible_keys   | key             | key_len | ref  | rows | Extra       |
    +----+-------------+-----------+----------+-----------------+-----------------+---------+------+------+-------------+
    |  1 | SIMPLE      | pa_gposts | fulltext | full_text_title | full_text_title | 0       |      |    1 | Using where |
    +----+-------------+-----------+----------+-----------------+-----------------+---------+------+------+-------------+

    SELECT * FROM pa_gposts WHERE MATCH (title) AGAINST ('loan' IN BOOLEAN MODE) LIMIT ...;
    30 rows in set (1.93 sec)

    SELECT * FROM pa_gposts WHERE title LIKE '%couples%' LIMIT ...;
    30 rows in set (10.17 sec)

    SELECT * FROM pa_gposts WHERE MATCH (title) AGAINST ('moonlight' IN BOOLEAN MODE) LIMIT ...;
    13 rows in set (0.56 sec)

    SELECT * FROM pa_gposts WHERE title LIKE '%...%' LIMIT ...;
    13 rows in set (50.98 sec)

Boolean full-text searches have the following characteristics:
They do not use the 50% threshold.
They do not automatically sort rows in order of decreasing relevance.
In the MySQL manual's example, the row with the highest relevance is the one that contains "MySQL" twice, yet it is listed last rather than first.
They work even without a FULLTEXT index, although a search executed this way is very slow.
The minimum and maximum word length full-text parameters apply.
The stopword list applies.

Boolean full-text search supports the following operators:

+  A leading plus sign indicates that the word must be present in every row returned.
-  A leading minus sign indicates that the word must not be present in any row returned.
(no operator)  By default (when neither + nor - is specified) the word is optional, but rows that contain it are rated higher. This mimics the behavior of MATCH() ... AGAINST() without the IN BOOLEAN MODE modifier.
> <  These two operators change a word's contribution to the relevance value assigned to a row. The > operator increases the contribution and the < operator decreases it. See the examples below.
( )  Parentheses group words into subexpressions. Parenthesized groups can be nested.
~  A leading tilde acts as a negation operator, making the word's contribution to the row's relevance negative. This is useful for marking "noise" words. A row containing such a word is rated lower than others but is not excluded altogether, as it would be with the - operator.
*  The asterisk is the truncation operator. Unlike the other operators, it is appended to the word to be truncated.
"  A phrase enclosed in double quotation marks (") matches only rows that contain the phrase literally, as it was typed. The full-text engine splits the phrase into words and searches for them in the FULLTEXT index.
Nonword characters need not match exactly: phrase searching requires only that a match contains the same words as the phrase, in the same order. For example, "test phrase" matches "test, phrase". If none of the words in the phrase are in the index, the result is empty. For example, if all the words are stopwords or are shorter than the minimum length of indexed words, the result is empty.

The following examples show some search strings that use Boolean full-text operators:

'apple banana'  Find rows that contain at least one of the two words.
'+apple +juice'  Find rows that contain both words.
'+apple macintosh'  Find rows that contain the word "apple", but rank rows higher if they also contain "macintosh".
'+apple -macintosh'  Find rows that contain the word "apple" but not the word "macintosh".
'+apple +(>turnover <strudel)'  Find rows that contain the words "apple" and "turnover", or "apple" and "strudel" (in any order), but rank "apple turnover" higher than "apple strudel".
'apple*'  Find rows that contain words such as "apple", "apples", "applesauce", or "applet".
'"some words"'  Find rows that contain the exact phrase "some words" (for example, rows that contain "some words of wisdom" but not "some noise words"). Note that the " characters enclosing the phrase are operator characters that delimit the phrase. They are not the quotation marks that enclose the search string itself.
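The operator examples above can also be assembled programmatically. The following is a minimal Python sketch, not part of mysqlcft or MySQL: the helper name boolean_query and its parameters are hypothetical, and it assumes the input words contain no operator characters of their own. It only builds the string that would be passed to AGAINST (... IN BOOLEAN MODE).

```python
# Hypothetical helper that builds a Boolean full-text search string.
# Assumption: the words passed in contain no operator characters
# (+ - > < ( ) ~ * ") themselves; no escaping is attempted.

def boolean_query(required=(), excluded=(), optional=(), phrases=()):
    """Return a search string for MATCH ... AGAINST (... IN BOOLEAN MODE)."""
    parts = []
    parts += ["+" + word for word in required]       # word must be present
    parts += ["-" + word for word in excluded]       # word must not be present
    parts += list(optional)                          # optional, raises the rank
    parts += ['"' + text + '"' for text in phrases]  # literal phrase
    return " ".join(parts)


if __name__ == "__main__":
    print(boolean_query(required=["apple"], excluded=["macintosh"]))  # +apple -macintosh
    print(boolean_query(required=["apple", "juice"]))                 # +apple +juice
    print(boolean_query(phrases=["some words"]))                      # "some words"
```

The resulting string is best passed to the database as a bound query parameter rather than concatenated into the SQL text, so that the quotation marks of the phrase operator are not confused with the quotes around the search string.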