MySQL Full-Text Search Tips

Source: Internet
Author: User


MySQL Reference Manual for version 4.1.0-alpha.



--------------------------------------------------------------------------------

6.8 MySQL Full-Text Search

As of version 3.23.23, MySQL supports full-text indexing and searching. A full-text index in MySQL is an index of type FULLTEXT. FULLTEXT indexes are used with MyISAM tables only and can be created on CHAR, VARCHAR, or TEXT columns, either at CREATE TABLE time or afterwards with ALTER TABLE or CREATE INDEX. For large data sets, it is much faster to load the data into a table that has no FULLTEXT index and then create the index with ALTER TABLE (or CREATE INDEX). Loading data into a table that already has a FULLTEXT index is much slower.

Full-text searching is performed with the MATCH() function.

mysql> CREATE TABLE articles (
    ->   id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    ->   title VARCHAR(200),
    ->   body TEXT,
    ->   FULLTEXT (title,body)
    -> );
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO articles VALUES
    -> (NULL,'MySQL Tutorial','DBMS stands for DataBase ...'),
    -> (NULL,'How to use MySQL efficiently','After you went through a ...'),
    -> (NULL,'Optimising MySQL','In this tutorial we''ll show ...'),
    -> (NULL,'1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
    -> (NULL,'MySQL vs. YourSQL','In the following database comparison ...'),
    -> (NULL,'MySQL Security','When configured properly, MySQL ...');
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM articles
    -> WHERE MATCH (title,body) AGAINST ('database');
+----+-------------------+------------------------------------------+
| id | title             | body                                     |
+----+-------------------+------------------------------------------+
|  5 | MySQL vs. YourSQL | In the following database comparison ... |
|  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
+----+-------------------+------------------------------------------+
2 rows in set (0.00 sec)

The MATCH() function performs a natural language search for a string against a text collection (a set of one or more columns included in a FULLTEXT index). The search string is given as the argument to AGAINST(). The search is case-insensitive. For every row in the table, MATCH() returns a relevance value, that is, a similarity measure between the search string and the text in that row in the columns named in the MATCH() list.

When MATCH() is used in a WHERE clause (see the example above), the rows returned are automatically sorted with the highest relevance first. Relevance values are non-negative floating-point numbers. Zero relevance means no similarity. Relevance is computed from the number of words in the row, the number of unique words in that row, the total number of words in the collection, and the number of documents (rows) that contain a particular word.

It is also possible to perform a boolean mode search. This is described later in this section.

The preceding example is a basic illustration of how to use the MATCH() function. Rows are returned in order of decreasing relevance.

The next example shows how to retrieve the relevance values explicitly. Because neither a WHERE nor an ORDER BY clause is present, the returned rows are not sorted.

mysql> SELECT id, MATCH (title,body) AGAINST ('Tutorial') FROM articles;
+----+-----------------------------------------+
| id | MATCH (title,body) AGAINST ('Tutorial') |
+----+-----------------------------------------+
|  1 |                        0.64840710366884 |
|  2 |                                       0 |
|  3 |                        0.66266459031789 |
|  4 |                                       0 |
|  5 |                                       0 |
|  6 |                                       0 |
+----+-----------------------------------------+
6 rows in set (0.00 sec)
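
The relationship between these raw scores and a WHERE MATCH() query can be sketched in a few lines of Python. This only illustrates the semantics described above (rows with zero relevance are dropped, the rest are sorted highest first); it is not MySQL's implementation. The helper name where_match is hypothetical, and the scores are copied from the result table.

```python
# Relevance scores copied from the result table above (id -> score).
scores = {
    1: 0.64840710366884,
    2: 0.0,
    3: 0.66266459031789,
    4: 0.0,
    5: 0.0,
    6: 0.0,
}


def where_match(scores):
    """Hypothetical helper mimicking WHERE MATCH() ... AGAINST().

    Keeps rows whose relevance is greater than zero and returns them
    with the highest relevance first.
    """
    hits = [(row_id, score) for row_id, score in scores.items() if score > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


ranked = where_match(scores)
print(ranked)  # row 3 first, then row 1
```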

The following example is more complex. The query returns the relevance values and also sorts the rows in order of decreasing relevance. To achieve this, specify MATCH() twice. This causes no additional overhead, because the MySQL optimizer notices that the two MATCH() calls are identical and invokes the full-text search code only once.

mysql> SELECT id, body, MATCH (title,body) AGAINST
    -> ('Security implications of running MySQL as root') AS score
    -> FROM articles WHERE MATCH (title,body) AGAINST
    -> ('Security implications of running MySQL as root');
+----+-------------------------------------+-----------------+
| id | body                                | score           |
+----+-------------------------------------+-----------------+
|  4 | 1. Never run mysqld as root. 2. ... | 1.5055546709332 |
|  6 | When configured properly, MySQL ... |   1.31140957288 |
+----+-------------------------------------+-----------------+
2 rows in set (0.00 sec)

MySQL uses a very simple parser to split text into words. A "word" is any sequence of characters consisting of letters, digits, "'" and "_". Any word that is present in the stopword list or is too short (3 characters or less) is ignored.
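
The word-splitting rule just described can be approximated as follows. This is a minimal sketch, not MySQL's parser: it assumes ASCII letters only, and the STOPWORDS set is a tiny made-up sample rather than MySQL's built-in list. The function name index_words is hypothetical.

```python
import re

# Assumption: a tiny illustrative stopword sample, not MySQL's real list.
STOPWORDS = {"the", "and", "for", "that", "this", "with"}

# Words of 3 characters or fewer are ignored, as described above.
MIN_WORD_LEN = 4

# A "word" is a run of letters, digits, "'" and "_" (ASCII only here).
WORD_RE = re.compile(r"[A-Za-z0-9'_]+")


def index_words(text):
    """Return the lowercased words that survive the length and stopword filters."""
    words = []
    for raw in WORD_RE.findall(text):
        word = raw.lower()
        if len(word) < MIN_WORD_LEN or word in STOPWORDS:
            continue
        words.append(word)
    return words


print(index_words("1. Never run mysqld as root."))
# -> ['never', 'mysqld', 'root']
```

Here "1", "run", and "as" are dropped because they are too short, which is why very short search terms cannot match anything.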

Every correct word in the collection and in the query is weighted according to its significance in the collection or query. Thus a word that is present in many documents has a lower weight (and may even have a zero weight), because it has lower semantic value in this particular collection. Conversely, if the word is rare, it receives a higher weight. The weights of the words are then combined to compute the relevance of the row.

Such a technique works best with large collections (in fact, it was carefully tuned for them). For very small tables, the word distribution does not adequately reflect semantic value, and this model may sometimes produce bizarre results.

mysql> SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('MySQL');
Empty set (0.00 sec)

In the above example, the search for the word MySQL produces no results, because that word is present in more than half of the rows. As such, it is effectively treated as a stopword (that is, a word with zero semantic value). This is the most desirable behavior: a natural language query should not return every second row from a 1GB table.

A word that matches half of the rows in a table is unlikely to locate relevant documents. In fact, it will most likely find plenty of irrelevant ones. We all know this happens far too often when we try to find something on the Internet with a search engine. It is for this reason that such words are assigned a low semantic value in this particular data set.
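
One way to see why a very common word ends up with no weight is a small numeric sketch. The formula below, log((N - n) / n) clamped at zero, is a textbook probabilistic inverse document frequency chosen here as an assumption for illustration; this text does not state MySQL's exact formula. The name word_weight is hypothetical.

```python
import math


def word_weight(total_rows, rows_with_word):
    """Illustrative weight: log((N - n) / n), clamped at zero.

    Assumption: a generic probabilistic inverse document frequency,
    used only to show the effect of the 50% threshold.
    """
    if rows_with_word <= 0 or rows_with_word >= total_rows:
        return 0.0
    weight = math.log((total_rows - rows_with_word) / rows_with_word)
    return max(0.0, weight)


# The example table has 6 rows.
print(word_weight(6, 2))  # word in 2 of 6 rows: positive weight
print(word_weight(6, 3))  # word in exactly half of the rows: 0.0
print(word_weight(6, 6))  # word in every row, like "MySQL": 0.0
```

Under this assumed formula, any word found in half or more of the rows contributes nothing to the relevance, which matches the empty result for the MySQL search above.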

As of version 4.0.1, MySQL can also perform boolean full-text searches using the IN BOOLEAN MODE modifier.

mysql> SELECT * FROM articles WHERE MATCH (title,body)
    -> AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE);
+----+------------------------------+-------------------------------------+
| id | title                        | body                                |
+----+------------------------------+-------------------------------------+
|  1 | MySQL Tutorial               | DBMS stands for DataBase ...        |
|  2 | How to use MySQL efficiently | After you went through a ...        |
|  3 | Optimising MySQL             | In this tutorial we'll show ...     |
|  4 | 1001 MySQL Tricks            | 1. Never run mysqld as root. 2. ... |
|  6 | MySQL Security               | When configured properly, MySQL ... |
+----+------------------------------+-------------------------------------+

This query retrieves all the rows that contain the word MySQL (note: the 50% threshold is not used) but do not contain the word YourSQL. Note that a boolean mode search does not automatically sort rows in order of decreasing relevance. You can see this from the result above: the row with the highest relevance (the one that contains MySQL twice) is listed last, not first. A boolean full-text search can also work without a FULLTEXT index, although it is slower.

Boolean full-text search supports the following operators:

+
A leading plus sign indicates that the word must be present in every row returned.

-
A leading minus sign indicates that the word must not be present in any row returned.

By default (when neither plus nor minus is specified) the word is optional, but the rows that contain it are rated higher. This mimics the behavior of MATCH() ... AGAINST() without the IN BOOLEAN MODE modifier.

< >
These two operators change a word's contribution to the relevance value that is assigned to a row. The < operator decreases the contribution and the > operator increases it. See the example below.

( )
Parentheses are used to group words into subexpressions.

~
A leading tilde acts as a negation operator, causing the word's contribution to the row relevance to be negative. It is useful for marking noise words. A row that contains such a word is rated lower than others, but is not excluded altogether, as it would be with the - operator.

*
An asterisk is the truncation operator. Unlike the other operators, it is appended to the word, not prepended.

"
A phrase that is enclosed in double quotes matches only rows that contain the phrase literally, as it was typed.

Here are some examples:

apple banana
Find rows that contain at least one of these words.
+apple +juice
... both words.
+apple macintosh
... the word "apple", but ranked higher if they also contain "macintosh".
+apple -macintosh
... the word "apple" but not "macintosh".
+apple +(>pie <strudel)
... "apple" and "pie", or "apple" and "strudel" (in any order), but with "apple pie" ranked higher than "apple strudel".
apple*
... "apple", "apples", "applesauce", and "applet".
"some words"
... "some words of wisdom", but not "some noise words".
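
The simplest of the operators above can be mimicked with a short Python sketch. It is a rough approximation under stated assumptions, not MySQL's boolean search code: it handles only the leading +, the leading -, plain optional words, and the trailing *, and it ignores < >, ( ), ~, double-quoted phrases, and ranking. All function names are hypothetical.

```python
import re

WORD_RE = re.compile(r"[A-Za-z0-9'_]+")


def parse_query(query):
    """Split a query into (required, excluded, optional) term lists."""
    required, excluded, optional = [], [], []
    for token in query.lower().split():
        if token.startswith("+"):
            required.append(token[1:])
        elif token.startswith("-"):
            excluded.append(token[1:])
        else:
            optional.append(token)
    return required, excluded, optional


def term_matches(term, words):
    """A trailing * matches any word that starts with the term."""
    if term.endswith("*"):
        prefix = term[:-1]
        return any(word.startswith(prefix) for word in words)
    return term in words


def boolean_match(query, text):
    """Return True if text satisfies the +, - and plain terms in query."""
    words = {w.lower() for w in WORD_RE.findall(text)}
    required, excluded, optional = parse_query(query)
    if any(term_matches(t, words) for t in excluded):
        return False
    if required:
        return all(term_matches(t, words) for t in required)
    # With no required terms, at least one optional term must be present.
    return any(term_matches(t, words) for t in optional)


print(boolean_match("+apple -macintosh", "Apple pie recipe"))        # True
print(boolean_match("+apple -macintosh", "Apple Macintosh review"))  # False
print(boolean_match("apple*", "Fresh applesauce"))                   # True
```

Note that the operators are separated from the previous term by a space; "+apple -macintosh" is two terms, while "+apple-macintosh" would be read by this sketch as one required term.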
6.8.1 Full-Text Restrictions
All parameters to the MATCH() function must be columns from the same table that are part of the same FULLTEXT index, unless the MATCH() is IN BOOLEAN MODE.

The MATCH() column list must exactly match the column list in some FULLTEXT index definition for the table, unless the MATCH() is IN BOOLEAN MODE.

The argument to AGAINST() must be a constant string.
6.8.2 Fine-Tuning MySQL Full-Text Search
Unfortunately, full-text search still has few user-tunable parameters, although adding some is very high on the TODO. If you have a MySQL source distribution (see section 2.3 Installing a MySQL Source Distribution), you can exert more control over full-text search behavior.

Note that full-text search was carefully tuned for the best search effectiveness. Modifying the default behavior will, in most cases, only make the search results worse. Do not alter the MySQL sources unless you know what you are doing!

The minimum length of words to be indexed is defined by the MySQL variable ft_min_word_len. See section 4.5.6.4 SHOW VARIABLES. Change it to the value you prefer, and rebuild your FULLTEXT indexes. (This variable is only available from MySQL version 4.0.)

The stopword list can be loaded from the file specified by the ft_stopword_file variable. See section 4.5.6.4 SHOW VARIABLES. Rebuild your FULLTEXT indexes after modifying the stopword list. (This variable is only available from MySQL version 4.0.10.)

The 50% threshold is determined by the particular weighting scheme chosen. To disable it, change the following line in 'myisam/ftdefs.h':
#define GWS_IN_USE GWS_PROB

To:
#define GWS_IN_USE GWS_FREQ

Then recompile MySQL. There is no need to rebuild the indexes in this case. Note: by doing this you severely decrease MySQL's ability to provide adequate relevance values for the MATCH() function. If you really need to search for such common words, it is better to search using IN BOOLEAN MODE instead, which does not observe the 50% threshold.

Sometimes the search engine maintainer would like to change the operators used for boolean full-text searches. These are defined by the ft_boolean_syntax variable. See section 4.5.6.4 SHOW VARIABLES. However, this variable is read-only; its value is set in 'myisam/ft_static.c'.

For changes that require you to rebuild your FULLTEXT indexes, the easiest way to rebuild the index file of a MyISAM table is the following statement:

mysql> REPAIR TABLE tbl_name QUICK;

6.8.3 Full-Text Search TODO
Make all operations with FULLTEXT indexes faster.
Proximity operators.
Support for "always-index words". They could be any strings the user wants to treat as words, such as "C++", "AS/400", "TCP/IP", and so on.
Support for full-text search in MERGE tables.
Support for multi-byte character sets.
Make the stopword list depend on the language of the data.
Stemming (dependent on the language of the data, of course).
Generic user-suppliable UDF preparser.
Make the model more flexible (by adding some adjustable parameters to FULLTEXT in CREATE/ALTER TABLE).