mysql-Chinese Full Text Search

Source: Internet
Author: User
Tags mysql tutorial urlencode

I. Overview

MySQL Full-text search uses the correlation between query keyword and query column content to retrieve, can use full-text index to improve the speed of matching.

Second, the grammar

12 MATCH (col1,col2,...) AGAINST (expr [search_modifier])search_modifier: { INBOOLEAN MODE | WITHQUERY EXPANSION }

For example: SELECT * from Tab_name WHERE MATCH (' column name 1, column Name 2 ... Column name n ') against (' Word 1 Word 2 word 3 ... Word m ');

That is, match corresponds to the column to match, and against is what you are looking for.

The table here needs to be of type MyISAM, col1, col2 must be char, varchar, or text type, and a full-text index (col2 index) should be established on col1 and Fulltext before querying.

Iii. Methods of Retrieval

1. Natural language search: in NATURAL LANGUAGE MODE

2. Boolean search: In Boolean MODE

Eliminate half of the words that match the line above, for example, every line has this word, then use this to check, will not find any results, this is very useful in the number of records, the reason is that the database think it is not meaningful to find all the rows, at this time, this is almost regarded as a stopword (interrupt word), but if there are only two lines of records, there is no ghost can be found, because each word appears 50% (or more), to avoid this situation, please use in BOOLEAN MODE.

Features in BOOLEAN mode:

• Do not exclude more than 50% compliant row.

• Do not automatically reverse-sort by relevance.

• You can search for fields that do not have Fulltext index, but are very slow.

• Limit the longest and shortest strings.

• Apply Stopwords.

Search Grammar Rules:

+ must have (the data bar that does not contain the keyword is ignored).

-Not available (excludes specified keywords, which are ignored).

> Increase the weight value of the matching data for this bar.

< reduce the weight value of the matching data.

~ Turns its correlation from positive to negative, indicating that owning the word decreases the correlation (but not like-excludes it), but only the lower weight value.

* Universal Word, unlike other syntax in front, this is to be followed by a string.

"" Enclose a sentence in double quotation marks to make it fully consistent and not chaizi.

-+ denotes and, that must be included. -represents not, which must not be included. That is: The return record must contain Apple and cannot contain banner. SELECT * from articles WHERE MATCH (title,content) against (' +apple-banana ' in BOOLEAN MODE);--Between Apple and banana is a space, and a space indicates or. That is, the return record contains at least one of Apple, banana. SELECT * from articles WHERE MATCH (title,content) against (' Apple banana ' in BOOLEAN MODE);--The return record must contain Apple, and banana can include or Do not include, if included, will get a higher weight. SELECT * from articles WHERE MATCH (title,content) against (' +apple banana ' in BOOLEAN MODE);--~ is a familiar XOR operator. The return record must contain Apple, and if it contains banana, it will reduce the weight. --but it is not +apple-banana strict, because the latter if it contains banana does not return at all. SELECT * FROM articles WHERE MATCH (title,content) against (' +apple ~banana ' in BOOLEAN MODE);--return must include "Apple banana" or a required must also contain a record of "Apple Orange". -If you include both Apple banana and Apple Orange Records, Apple Banana has a higher weight than apple orange. SELECT * from articles WHERE MATCH (title,content) against (' +apple + (>banana <orange) ' in BOOLEAN MODE);

  

3. Query extension search: With query EXPANSION

Iv. condition Limitations of MySQL Full text search

1, under MySQL5.6, only the MyISAM table supports full-text search. The InnoDB engine table above MySQL5.6 also provides support for full-text indexing.

2, the corresponding fields to establish Fulltext index

V. System variables related to full-text search:

Ft_min_word_len = Minimum license character for full-text search (default 4, viewable by show VARIABLES like ' Ft_min_word_len '), Chinese is usually two words is a word, so Chinese words need to change this value to 2 best.

Vi. Summary of matters

1, the default search is not the case, to divide the case, columne character set to change from UTF8 to Utf8_bin.

2. Preset MATCH ... Against are sorted by relevance, from high to low.

3. The field in MATCH (title, content) must be identical to the field in Fulltext (title, content).

If you just Tancha title or content one field, you have to build another fulltext (title) or fulltext (content), and because of this, the MATCH () field must not cross table, but the other two search methods seem to be possible.

4, MySQL does not support the Chinese full-text index, the reason is very simple: different from English, the Chinese text is attached to write together, the middle no MySQL can find the place of participle, up to now MySQL5.6 version is so, but there are workarounds, is the whole sentence of Chinese word, and according to UrlEncode, Location code, base64, pinyin, etc. are encoded so that they are stored in the database in the form of "Letter + number".

Vii. Experimental part

Step 1 Configure My.ini, add the following at the end of My.ini:

# The minimum license character for a full-text search is 2 characters or kanji Ft_min_word_len = 2

After completing the "Restart MySQL service" and using SHOW VARIABLES like ' ft_min_word_len ' query to get the correct result value of 2.

Step 1 Create a data table

CREATE table if not EXISTS ' zzx_articles ' (    ' id ' int (ten) unsigned not NULL auto_increment,     ' title ' char (254) Defau Lt null COMMENT ' title ',    ' content ' text COMMENT ' contents ',    ' author ' char ($) Default NULL COMMENT ' author ',    ' Title_fc ' Cha R (254) default NULL COMMENT ' title of Word breaker ',    ' content_fc ' text COMMENT ' content of word breaker ',    PRIMARY KEY  (' id '),    fulltext KEY ' Zzx_title_fc ' (' Title_fc '),/** build Full-text index */    fulltext KEY ' Zzx_content_fc ' (' Content_fc '),/** build Full-text index *    / Fulltext KEY ' Zzx_title_con_fc ' (' Title_fc ', ' Content_fc ')/** build full-text index */) Engine=myisam DEFAULT Charset=utf8;

Step 2 Insert test data

INSERT into ' zzx_articles ' (TITLE_FC,CONTENT_FC) VALUES (' MySQL Tutorial Linux red ', ' DBMS stands for DataBase ok '), Use MySQL "Well '", ' after you went through Blue '), (' Optimizing MySQL OK ', ' on this tutorial we'll optimizing '), (' MySQL vs t His yoursql blue Red ', ' 1. Never run mysqld as Root Red '), (' MySQL Tricks blue ', ' in the following database '), (' MySQL Security ', ' when configured proper Ly, MySQL '), (' Zhonghua ', ' People's Republic of China '), (' China Harmony ', ' Shanghai Harmony '), (' Pollution of the city ', ' Hello I am Beijing People '), (' Beijing Spirit ', ' Innovation Patriotic Inclusion Audacity ')

Step 3 search for grammar rules, sorting experiments

Description: Matching statement match (Col1,col2,...) Against (expr [search_modifier]), when the match is complete, returns the weight value of this data (the sum of the weight values 1≈ the matching result weights of each word), and we use this weight value "from high to low" to refine the query results

? Lab 1: Full-text search only for Title_fc indexed fields and display weight values for each piece of data

SELECT *,match (TITLE_FC) against (' Optimizing OK Red Blue ') as Title_score from   zzx_articles    WHERE MATCH (title_f c) against (' Optimizing OK Red Blue ' in BOOLEAN MODE) Order by Title_score DESC

Results such as:

Summarize:

1. When there is no +-such filter symbol, these keywords are "or" relations, namely: Match optimizing, either match OK, or match red, or match blue.

2. Through the above experiment, found that when a data has more than one keyword match (such as: Red blue), the weight value of this data will be slightly higher: This data weight value ≈optimizing weight value + OK weight value + red weight value + blue weight value theoretically, The more keywords a data can match, the higher the weight value of the data, the more forward the ranking.

? Lab 2: Filter criteria: Must contain "red" keywords

SELECT *,match (TITLE_FC) against (' Optimizing OK Red ') as Title_score from   zzx_articles WHERE MATCH (TITLE_FC) AGAINS T (' Optimizing OK +red ' in BOOLEAN MODE) Order by Title_score DESC

  

Summary: Using the filter symbol +, indicates that the query results, any one of the data must contain the word "red", does not include the word "red" line is ignored.

? Experiment 3: Filter condition: must include "red" keyword, if the matching line also contains "blue" keyword, then this data will increase the weight:

SELECT *,match (TITLE_FC) against (' Optimizing OK Red Blue ') as Title_score from   zzx_articles    WHERE MATCH (title_f c) against (' Optimizing OK +red blue ' in BOOLEAN MODE) Order by Title_score DESC

or the wording:

SELECT *,match (TITLE_FC) against (' Optimizing OK Red >blue ') as Title_score from   zzx_articles    WHERE MATCH (Tit LE_FC) against (' Optimizing OK +red >blue ' in BOOLEAN MODE) Order by Title_score DESC

Summary: Compared to experiment 2, when included in the red line, if also contains the Blue keyword, the weight does increase (such as: id=4 this).

? Experiment 4: Filter Condition: Must contain "red" keyword, and cannot contain "blue" keyword

SELECT *,match (TITLE_FC) against (' Optimizing OK Red Blue ') as Title_score from   zzx_articles    WHERE MATCH (title_f c) against (' Optimizing OK +red-blue ' in BOOLEAN MODE) Order by Title_score DESC

Summary: Visible +-These two symbols are meant to mean "and", that is: must contain the RED keyword and cannot contain blue keywords.

? Experiment 5: Filter Condition: Must contain "red" keyword, if the match to the line also contains "blue" keyword will reduce the weight of this data

SELECT *,match (TITLE_FC) against (' Optimizing OK +red ~blue ') as Title_score from   zzx_articles    WHERE MATCH (title _FC) against (' Optimizing OK +red ~blue ' in BOOLEAN MODE) Order by Title_score DESC

Summary: This experiment does not see the obvious effect, but the ~ filter is really to reduce the weight of this character

? Experiment 6: Filter: Matches rows that contain the word "red" and "Linux", or rows that contain "red" and "blue" (no order) However, rows that contain "apple Linux" have higher weight values than those that contain "Apple blue".

SELECT *,match (TITLE_FC) against (' +red  + (>linux <blue) ') as Title_score from   zzx_articles    WHERE MATCH (TITLE_FC) against (' +red  + (>linux <blue) ' in BOOLEAN MODE) Order by Title_score DESC

  

? Experiment 7: Filter criteria: Match keywords starting with RE, or data rows starting with BL

SELECT *,match (TITLE_FC) against (' re* bl* ') as Title_score from   zzx_articles    WHERE MATCH (TITLE_FC) against (' re* bl* ' in BOOLEAN MODE) Order by Title_score DESC

Summary: NOTE * is written in the back Oh, at this time the equivalent of like fuzzy match, no weight value

? Lab 8: Filter Criteria: Match Find string "to use MySQL" keyword

SELECT *,match (TITLE_FC) against (' to use MySQL ') as Title_score from   zzx_articles    WHERE MATCH (TITLE_FC) AGA INST (' To use MySQL ' ' in BOOLEAN MODE) Order by Title_score DESC

Summary: At this time is the double quotation marks in the string as a keyword, if not double quotation marks will be used MySQL three key words to match, the difference between the two;

? Experiment 9: On the basis of experiment 1, the weight value of blue is ignored (pay attention to compare with experiment 1)

SELECT *,match (TITLE_FC) against (' Optimizing OK Red ') as Title_score from   zzx_articles    WHERE MATCH (TITLE_FC) AG Ainst (' Optimizing OK Red Blue ' in BOOLEAN MODE) Order by Title_score DESC

Summary: On the basis of experiment 1, the Blue keyword in the Select field condition is removed, but the Blue keyword remains in the where. My intention is to properly match the "Optimizing OK Red blue" key words, but do not want the weight value of blue (ignoring the weight value of blue).

The result of the query is a slightly lower weight value for the data that contains the Blue keyword.

By "weight reduction"-ignoring some keyword weight values can reduce the weight of some of the data, and then affect the sorting.

? Experiment 10: On the basis of experiment 9, add several red keywords in the Select field, where the key words remain the same (note compared to experiment 1 experiment 9).

SELECT *,match (TITLE_FC) against (' Optimizing OK Red Red ') as Title_score from   zzx_articles    WHERE MATCH (titl E_FC) against (' Optimizing OK Red Blue ' in BOOLEAN MODE) Order by Title_score DESC

Summary: It is found that as long as the weight values of the data containing the Red keyword are increased, the sorting is also changed. It is indicated that the weight value of some data can be increased by the way of "lifting weight"-repeating several keyword weights, which will affect the sorting.

? Lab 11: Full-Text search for Title_fc and Content_fc two fields at the same time

SELECT *,match (TITLE_FC) against (' Optimizing OK Red Blue ') as Title_score,match (CONTENT_FC) against (' Optimizing OK red Blue ') as Content_score from   zzx_articles    WHERE MATCH (TITLE_FC,CONTENT_FC) against (' Optimizing OK red blue ' in BOOLEAN MODE) Order by Title_score Desc,content_score

  

Summary: Through the experiment found, and successfully took the Content_fc field matching weight value, sorting is the first priority by the title field weight Descending sort, and minor by Content_fc weight descending sort. In addition I found OK this keyword in the "Title_fc char (254)" field match to get a match value of 2.1xxxxxxx, but the "Content_fc text" field to match the weight value goes to 0, which is MySQL to each English word weight value of the given has its own algorithm , we have no right to interfere. So when we find that some words have a weight value of 0 or even negative, don't be overly tangled, because MySQL has its own algorithm.

About sorting, first by title_score field weight descending sort, minor by content_score weight descending sort, such collation looks more scientific, but I think the excellent search engine is not as simple as this, continue the following experiment:

? Experiment 12: Further refine the collation look at an SQL statement prototype, query field 1, Field 2, and add the sum of field 1 and field 2 for each data as field 3 fields:

Select field 1, Field 2, Field 1 + Field 2 as field 3 from       table name where .....       

The following Title_fc and Content_fc two weight values are summed, placed in the new field score1, and sorted by Score1 first, Title_score second, Content_score again:

SELECT *,match (TITLE_FC) against (' Optimizing OK Red Blue ') as Title_score,match (CONTENT_FC) against (' Optimizing OK red Blue ') as Content_score,match (TITLE_FC) against (' Optimizing OK Red Blue ') + MATCH (CONTENT_FC) against (' Optimizing OK Red Blue ') as Score1 from   zzx_articles    WHERE MATCH (TITLE_FC,CONTENT_FC) against (' Optimizing OK Red blue ' in BOOL EAN MODE) Order by Score1 Desc,title_scort

Summary: Relatively speaking, if the Title_fc and Content_fc are matched, should be given the top ranking it. So first of all according to its TITLE_FC and Content_fc two weight value of the sum ranking, the secondary consideration Title_fc, CONTENT_FC sort.

? Experiment 13: Another angle to look at a sort of SQL statement prototype, if the gender field value is "1" display "male" otherwise show "female"

Select *,if (sex=1, "male", "female") as Ssva from table name where id = 1

  

My new sort of thinking: if Title_fc and Content_fc both match on the row to do the first sort, then on the only match on the Title_fc to do the minor sort, only match on Content_fc again to sort. (For the Order of the experiment 5 is not scientific: if there is a CONTENT_FC keyword matching weight is high, resulting in the sum of Score1 value is also high, but the matching weight of Title_fc to 0, because the Score1 value is high but row to the front this is not necessarily science? )

  

SELECT *,match (TITLE_FC) against (' Optimizing OK Red Blue ') as Title_score,match (CONTENT_FC) against (' Optimizing OK red Blue ') as Content_score,if (Match (TITLE_FC) against (' Optimizing OK Red Blue ') > 0 and MATCH (CONTENT_FC) against (' Op timizing OK Red Blue ') > 0,1,0) as Score1 from   zzx_articles    WHERE MATCH (TITLE_FC,CONTENT_FC) against (' Optimiz ing OK red blue ' in BOOLEAN MODE] ORDER by Score1 Desc,title_score

Summary: If TITLE_FC and Content_fc are matched, take precedence, but should also consider the problem of excessive local weight value oh.

? Experiment 14: Optimize experiment 13 to support more complex conditional ordering look at an SQL statement prototype, case and then END structure:

Case < single-value expression > when              < expression value > then <sql statement or return value > when              < expression value > then <sql statement or return value >               ...              When < expression value > then <sql statement or return value >              ELSE <sql Statement or return value >          END

There is a table query: greater than or equal to 80 indicates "excellent", greater than or equal to 60 is displayed as "pass", less than 60 points shown as "failed."

Select (case when language >=80 then ' excellent ' when language >=60 then ' pass ' ELSE ' fail ' END) as language (case when              math >=80 then ' excellent ' W  HEN Math >=60 Then ' else ' failed ' end ' as math (case when              English >=80 then ' excellent ' when English >=60 then ' pass ' else ' fail ' end) As English from       table_name

# example test a bit

SELECT *, (case if Id<8 then ' 5 ' when id=8 then ' 4 ' ELSE ' 0 ' END) as Newfield from       zzx_articles       where id>5

My new sort of thinking: if the weights of Title_fc and Content_fc "at the same time greater than 0 and equal" for the first order, "at the same time greater than 0 and not equal" for the secondary sort, "Title_fc more than 0 to sort again", if using if () seemingly bad implementation, see the following statement:

SELECT *,match (TITLE_FC) against (' Optimizing OK Red Blue ') as Title_score,match (CONTENT_FC) against (' Optimizing OK red Blue ') as Content_score,             (case when MATCH (TITLE_FC) against (' Optimizing OK Red Blue ') > 0 and MATCH (TITLE_FC) A Gainst (' Optimizing OK red blue ') = match (CONTENT_FC) against (' Optimizing OK Red Blue ') Then ' 3 ' when match (TITLE_FC) A Gainst (' Optimizing OK Red Blue ') > 0 and MATCH (TITLE_FC) against (' Optimizing OK Red Blue ') <> match (Content_f c) against (' Optimizing OK Red Blue ') Then ' 2 ' when MATCH (TITLE_FC) against (' Optimizing OK Red Blue ') > 0 Then ' 1 ' EL SE ' 0 ' END) as Score1 from       zzx_articles        WHERE MATCH (TITLE_FC,CONTENT_FC) against (' Optimizing OK Red blue ' in BOO LEAN MODE) Order by Score1 Desc,title_score

Summary: The order of this experiment may not be scientific, but lead to a more complex rules of the sorting method, angle, a variety of combinations to make a more reasonable sort, in order to make your search engine more intelligent. To offer, perhaps you have a better sort, please also share to me.

? Experiment 15: Chinese Full text Search

MySQL does not support Chinese full-text search, because the Chinese word is attached to the writing, not like the English word between the spaces separated. The solution is Chinese word segmentation (see other articles on Chinese participle), if your MySQL is installed on the Windows platform, you can use full-text indexing without transcoding directly, as in this example. But if your MySQL is installed on Linux, you need to do a transcoding (urlencode/base64_encode/json_encode/location/pinyin) and other programs, see other blog posts.

SELECT *,match (TITLE_FC) against (' China Beijing Harmony Security ') as Title_score,match (CONTENT_FC) against (' China Beijing Harmony Security ') as Content_score, (case when MATCH (TITLE_FC) against (' China Beijing Harmony Security ') > 0 and MATCH (CONTENT_FC) against (' China Beijing Harmony Security ') > 0 Then ' 5 ' when match (TITLE_FC) against (' China Beijing Harmony Security ') > 0 and MATCH (CONTENT_FC) against (' Zhonghua Beijing Harmony security ') = 0 Then ' 4 ' else ' 0 ' end) as Score1 from      zzx_articles       WHERE MATCH (TITLE_FC,CONTENT_FC) against (' China Beijing Harmony Security ' in BOOLEAN MODE) Order by Score1 Desc,title_score Desc,content_score DESC

All articles are only for study records, not for any commercial use has been technology sharing.

mysql-Chinese Full Text Search

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.