Mysql full-text search FULLTEXT

Source: Internet
Author: User
Tags mysql tutorial

MySQL full-text search

Up to 3.23.23, MySQL began to support full-text indexing and searching. Full-text index is a FULLTEXT index in MySQL. FULLTEXT indexes are used for MyISAM tables. You can use alter table or create index to CREATE them on CHAR, VARCHAR, or TEXT columns at or after create table. For large databases, it is very fast to load data to a TABLE without FULLTEXT indexes and then CREATE an INDEX using alter table (or create index. Loading data to a table with FULLTEXT indexes will be very slow.

Full-text search is completed using the MATCH () function.

Mysql> create table articles (
-> Id int unsigned AUTO_INCREMENT not null primary key,
-& Gt; title VARCHAR (200 ),
-> Body TEXT,
-> FULLTEXT (title, body)
-> );
Query OK, 0 rows affected (0.00 sec)

Mysql> insert into articles VALUES
-> (NULL, 'mysql Tutorial ', 'dbms stands for DataBase ...'),
-> (NULL, 'how To Use MySQL Efficiently ', 'after you went through ...'),
-> (NULL, 'optimising mysql', 'In this tutorial we will show ...'),
-> (NULL, '1001 MySQL Tricks ', '1. Never run mysqld as root. 2 ....'),
-> (NULL, 'mysql vs. yoursql', 'In the following database comparison ...'),
-> (NULL, 'mysql security', 'When configured properly, MySQL ...');
Query OK, 6 rows affected (0.00 sec)
Records: 6 Duplicates: 0 Warnings: 0

Mysql> SELECT * FROM articles
-> Where match (title, body) AGAINST ('database ');
+ ---- + ------------------- + -------------------------------------------- +
| Id | title | body |
+ ---- + ------------------- + -------------------------------------------- +
| 5 | MySQL vs. YourSQL | In the following database comparison... |
| 1 | MySQL Tutorial | DBMS stands for DataBase... |
+ ---- + ------------------- + -------------------------------------------- +
2 rows in set (0.00 sec)

The function MATCH () searches for a string in a natural language against a text set (a column set that contains one or more columns in a FULLTEXT index. The search string is given as a parameter of AGAINST. Search to ignore uppercase/lowercase letters. MATCH () returns a correlation value for each record row in the table. That is, the similarity scale between the text of the specified column in the MATCH () List of the search string and record row.

When MATCH () is used in a WHERE clause (see the preceding example), the returned record rows are automatically sorted in the descending order of relevance. The correlation value is a non-negative floating point number. Zero correlation means they are not similar. Correlation is calculated based on the number of words in the record row, the number of unique words in the row, the total number of words in the set, and the number of documents (record row) that contain a special word.

It can also perform a logical search. This is described in the following chapter.

The preceding example describes the usage of the MATCH () function. Record rows are returned in descending order of similarity.

The next example shows how to retrieve a specific similarity value. If neither the WHERE clause nor the order by clause exists, the returned rows are not sorted.

Mysql> SELECT id, MATCH (title, body) AGAINST ('utorial ') FROM articles;
+ ---- + ----------------------------------------- +
| Id | MATCH (title, body) AGAINST ('utorial ') |
+ ---- + ----------------------------------------- +
| 1 | 0.64840710366884 |
| 2 | 0 |
| 3 | 0.66266459031789 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
+ ---- + ----------------------------------------- +
6 rows in set (0.00 sec)

The following example is more complex. Query returns similarity and returns record rows in descending order of similarity. To complete this result, you should specify MATCH () twice. This will not cause additional overhead, because the MySQL optimizer will notice two identical MATCH () calls and only call the full-text search code once.

Mysql> SELECT id, body, MATCH (title, body) AGAINST
-> ('Security implications of running MySQL as root') AS score
-> FROM articles where match (title, body) AGAINST
-> ('Security implications of running MySQL as root ');
+ ---- + ------------------------------------- + ----------------- +
| Id | body | score |
+ ---- + ------------------------------------- + ----------------- +
| 4 | 1. Never run mysqld as root. 2... | 1.5055546709332 |
| 6 | When configured properly, MySQL ...... | 1.31140957288 |
+ ---- + ------------------------------------- + ----------------- +
2 rows in set (0.00 sec)

MySQL uses a very simple parser to separate text into words. A word is any character sequence consisting of text, Data, '', and. Any "word" that appears on the stopword list or is too short (3 characters or less) will be ignored.

Each appropriate word in a set and query is measured based on its importance in the set and query. In this way, a word that appears in multiple documents has a lower weight (or even a zero weight), because in this particular set, it has a lower Semantic Value. Otherwise, if the word is less, it will get a higher weight. Then, the word weight is combined to calculate the similarity of record rows.

Such a technical work can work well with a large set (in fact, it will be careful with it ). For very small tables, word classification is insufficient to fully reflect their semantic values. Sometimes this pattern may produce strange results.

Mysql> SELECT * FROM articles where match (title, body) AGAINST ('mysql ');
Empty set (0.00 sec)

In the above example, the search term MySQL does not get any result because it appears in more than half of the record lines. Similarly, it is effectively processed as a stopword (that is, a word with zero Semantic Value ). This is the ideal behavior-a query in a natural language should not return each row (second row) from a 1 GB table ).

Words that match half of the record rows in a table may rarely find relevant documents. In fact, it may find many irrelevant documents. As we all know, this often happens when we try to search for something through a search engine on the Internet. For this reason, in this special dataset, such a row is set with a low Semantic Value.

By 4.0.1, MySQL can also use the in boolean mode modifier to perform a logical full-text search.

Mysql> SELECT * FROM articles where match (title, body)
-> AGAINST ('+ MySQL-yoursql' in boolean mode );
+ ---- + ------------------------------ + ------------------------------------- +
| Id | title | body |
+ ---- + ------------------------------ + ------------------------------------- +
| 1 | MySQL Tutorial | DBMS stands for DataBase... |
| 2 | How To Use MySQL Efficiently | After you went through a... |
| 3 | Optimising MySQL | In this tutorial we will show... |
| 4 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2... |
| 6 | MySQL Security | When configured properly, MySQL... |
+ ---- + ------------------------------ + ------------------------------------- +

This query returns all record lines containing the word MySQL (note: the threshold value of 50% is not used), but it does not contain the word YourSQL. Note that searches in a logical mode do not automatically record rows in descending order of similar values. You can see from the above results that the highest similarity value (the one that contains MySQL twice) is listed at the end, not the first. A logical full-text search can work even if there is no FULLTEXT index, but it is slower.

Logical full-text search supports the following operators:

+
A leading plus sign indicates that the word must appear in each returned record row.

-
A leading minus sign indicates that the word must not appear in each returned record row.

The default (when neither the plus sign nor the negative sign is specified) is random, but the records containing it will be arranged a little higher. This imitates the action of MATCH ()... AGAINST () without the in boolean mode modifier.

<>
These two operators are used to change the base value of the similarity value of a word. <Operator reduces the base value,> operator adds it. See the following example.

()
Parentheses are used to group words in subexpressions.

~
A leading negative number acts like a negative operator. The base value of a word that causes row similarity is negative. It is useful for marking a noise word. A record containing such words will be arranged a little lower, but not completely excluded, because the-operator can be used.

*
An asterisk is the truncation operator. If you do not want other operators, it should be appended to a word and not prefixed.

"
The phrase, enclosed in double quotation marks, matches only the record lines that contain the phrase (literally, as typed.
Here are some examples:

Apple banana
Find record lines containing at least one of the above words
+ Apple + juice
Both words are included.
+ Apple macintosh
... Contains the word "apple", but if "macintosh" is included at the same time, it will be arranged in a higher order.
+ Apple-macintosh
... Contains "apple" but not "macintosh"
+ Apple + (> pie <strudel)
... Include "apple" and "pie", or include "apple" and "strudel" (in any order ), however, "apple pie" is arranged a little higher than "apple strudel ".
Apple *
... Contains "apple", "apples", "applesauce", and "applet"
"Some words"
... Can contain "some words of wisdom", but not "some noise words"

Full text restrictions
All parameters of the MATCH () function must be columns from the same table and must be part of the same FULLTEXT index, unless MATCH () is in boolean mode.

The MATCH () column list must exactly MATCH the column list defined IN a FULLTEXT index of the table, unless MATCH () is in boolean mode.

The parameter of AGAINST () must be a constant string.

Fine-tune MySQL full-text search
Unfortunately, full-text search still has very few user-adjustable parameters, although the addition of some is very high on TODO. If you have a MySQL source code release (see Chapter 2.3 to install a MySQL source code release), you can use more control over full-text search.

Note that full-text search is the best search effect and has been carefully adjusted. Modifying the default value will only make the search result worse in most cases. Do not modify the source code of MySQL unless you know what you are doing!

The minimum length of the indexed word is specified by the MySQL variable ft_min_word_len. See section 4.5.6.4 show variables. Change it to the expected value and recreate your FULLTEXT index. (This variable is only supported since MySQL 4.0)

The stopword list can be read from the file specified by the ft_stopword_file variable. See section 4.5.6.4 show variables. After modifying the stopword list, rebuild your FULLTEXT index. (This variable is only supported since MySQL 4.0.10)

50% the threshold value is determined by the selected special measurement mode. To disable it, modify the following line in the 'myisam/ftdefs. H' file: # define GWS_IN_USE GWS_PROB

Change to: # define GWS_IN_USE GWS_FREQ

Then re-compile MySQL. In this case, no re-indexing is required. Note: using this will significantly reduce MySQL's ability to provide sufficient similarity values for MATCH. If you really need to search for such a public word, you 'd better use the in boolean mode search instead, which does not comply with the 50% threshold.

Sometimes search engine maintainers want to change the operators used for logical full-text search. These are defined by the ft_boolean_syntax variable. See section 4.5.6.4 show variables. However, this variable is read-only and its value is set in 'myisam/ft_static.c.
For these changes, you need to re-create your FULLTEXT index. For a MyISAM table, the following statement is the easiest way to re-create the index file:

Mysql> repair table tbl_name QUICK;

Full-text search TODO
Make all operations on FULLTEXT indexes faster
Proximity Operator
Support for "always-index words. They can be any string that users want to process AS a word, such AS "C ++", "AS/400", "TCP/IP", etc.
Support full-text search in the MERGE table
Support for multi-byte characters
Create a stopword list based on the Data Language
Stemming (of course, dependent on the Data Language)
Generic user-suppliable UDF preparser.
Make the mode more flexible (add some adjustable parameters to FULLTEXT in CREATE/alter table)

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.