How MySQL supports Chinese full-text search

Source: Internet
Author: User

How to implement full-text search in MySQL

Without affecting MySQL's architecture or its other functions, this solution addresses MySQL's current inability to handle Chinese full-text search correctly and improves MySQL's performance on Chinese searches. (The package currently supports simple dictionary-based forward maximum matching word segmentation, and the UTF-8, GBK, and BIG5 character sets.)
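To make the idea of dictionary-based forward maximum matching concrete, here is a minimal sketch in Python. It is not the package's implementation: the function name, the toy dictionary, and the single-character fallback are assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment text left to right, taking the longest dictionary word each time."""
    if max_len is None:
        # The longest dictionary entry bounds how far ahead we need to look.
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, for illustration only.
TOY_DICT = {"中文", "全文", "检索", "全文检索", "支持"}

print(forward_max_match("支持中文全文检索", TOY_DICT))
```

Because the longest match wins, "全文检索" is kept as one word even though "全文" and "检索" are also in the dictionary. That greedy choice is what makes the method simple, and it is also why it can mis-segment ambiguous text.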

In testing, the results were acceptable: searches over about 1.4 million rows, roughly 1.4 GB of data (excluding index space), took between 0.0x and 0.x seconds. Building a small full-text search becomes very simple.

MySQL has supported full-text indexes on MyISAM tables since a minor version of MySQL 3.23. Because of the nature of the CJK (Chinese, Japanese, and Korean) character sets and grammar (words are not separated by spaces as they are in English), MySQL's full-text indexing has never properly supported multi-byte character sets, and it has no word segmentation capability.

MySQL, as an excellent partner for PHP and other web scripting languages, is used almost everywhere, yet searching in MySQL gives most developers a headache. The WHERE ... LIKE '%...%' approach is not only inefficient (how could scanning the entire table not be slow?), but for languages such as Chinese it also suffers from a serious ambiguity problem, because the word, not the character, is the smallest semantic unit. For a while I was also constrained by MySQL's search limitations and had to look for other solutions, but none was satisfactory.
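The ambiguity problem can be shown with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, since the LIKE substring semantics are the same for this purpose. The table name, the sample text, and the hand-written segmentation are assumptions for illustration.

```python
import sqlite3


def like_search(conn, keyword):
    """Return the bodies matched by a LIKE '%keyword%' substring scan."""
    rows = conn.execute(
        "SELECT body FROM posts WHERE body LIKE ?", (f"%{keyword}%",)
    ).fetchall()
    return [body for (body,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (body TEXT)")
conn.execute("INSERT INTO posts (body) VALUES (?)", ("中华人民共和国",))

# Assumed word segmentation of the stored text (written by hand for this example).
words = ["中华", "人民", "共和国"]

substring_hit = like_search(conn, "华人")  # matches across a word boundary
word_hit = "华人" in words                 # the text contains no such word

print(substring_hit, word_hit)
```

The substring scan reports a hit for "华人" because the characters happen to be adjacent, although the text is made of the words "中华" and "人民". A word-based index avoids this kind of false positive.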

--
By the way, a company called Hylanda, which specializes in word segmentation technology, built a mysql-chinese hack long ago but never published the source code in the GNU spirit, so I decided to write one myself.

-- Example code: full-text search with PHP and MySQL (4.0.x) --

<?php
// Note: the mysql_* functions belong to PHP's legacy mysql extension
// (removed in PHP 7). SEGMENT() is provided by the patched MySQL server.
$str = '';
if (isset($_GET['q']) && ($q = trim($_GET['q'])))
{
    mysql_connect();
    mysql_select_db("dot66");
    $q = mysql_escape_string($q);
    // Ask the server to split the query into space-separated words.
    $r = mysql_query("SELECT SEGMENT('$q')");
    $n = mysql_fetch_row($r);
    $x = explode(" ", $n[0]);
    $m = '';
    $f = $t = array();
    foreach ($x as $tmp)
    {
        if ($tmp === '') continue;
        // Remember the longest word; the excerpt is cut around it.
        if (strlen($tmp) > strlen($m)) $m = $tmp;
        $f[] = '/(' . preg_quote($tmp, '/') . ')/i';
        $t[] = '<font color=red>$1</font>';
    }
    $s1 = "SELECT id, board, owner, chrono, title, SUBSTRING(body, LOCATE('$m', body) - 50, 200) AS xbody ";
    $s2 = "FROM bbs_posts WHERE MATCH(title, body) AGAINST ('$q'";
    // Use boolean mode when the query contains boolean operators.
    $s2 .= (preg_match('/[<>()~*"+-]/', $q) ? ' IN BOOLEAN MODE' : '');
    $s2 .= ") LIMIT 100";

    $r = mysql_query("SELECT COUNT(*) $s2");
    $x = mysql_fetch_row($r);
    $x = $x[0];  // total number of matching rows
    $s = $s1 . $s2;
    echo '<div style="border: 1px solid red; background-color: #e0e0e0; font-family: tahoma; padding: 8px;"><b>The Core SQL Query String:</b><br>' . $s . '</div><br>';
    $r = mysql_query($s);
    while ($tmp = mysql_fetch_assoc($r))
    {
        // Cut the owner off at the first '.' or '@'.
        if (($pos = strpos($tmp['owner'], '.')) || ($pos = strpos($tmp['owner'], '@')))
            $tmp['owner'] = substr($tmp['owner'], 0, $pos) . '.';
        $str .= "<li><a href=\"?id=$tmp[id]\" onclick=\"return false;\" style=\"text-decoration: underline; color: blue\" title=\"click to browse...\">$tmp[title]</a> <small><font color=green>(Board: $tmp[board])</font></small><br>\n";
        $str .= "<small>... $tmp[xbody] ...</small><br>\n";
        $str .= "<small style=\"color: green; font-family: tahoma;\">Author: <u>$tmp[owner]</u> Date: " . date("Y-m-d H:i", $tmp['chrono']) . "</small>\n";
        $str .= "</li><br>\n";
    }
    // Highlight the matched words, then strip ANSI color escape sequences.
    $f[] = '/\x1b\[.*?m/';
    $t[] = '';
    $str = preg_replace($f, $t, $str);
    mysql_close();
}
?>
<title>BBS (Board search: powered by MySQL fulltext+)</title>
<style type="text/css">
body { line-height: 125%; }
</style>
<h2>BBS (Board search: powered by MySQL fulltext+)</h2>
<form method=get style="margin: 0">
<input type=text name=q size=30 value="<?php echo htmlspecialchars(isset($_GET['q']) ? $_GET['q'] : ''); ?>">
<input type=submit value="Search!">
<small>
(Enter a search string; the simple + and - operators are also supported. About 1.4 million articles are indexed.)
</small>
</form>
<ol>
<?php echo $str; ?>
</ol>

The solution above was written by a fellow developer on his own.

Hylanda's word segmentation technology is currently regarded in the industry as the best Chinese word segmentation technology in China. Its segmentation accuracy is said to exceed 99%, which keeps the error rate of search results very low.

Hylanda: http://www.hylanda.com/server/
Download MySQL5.0.37--LinuxX86-Chinese+
You do not need to install MySQL beforehand. After downloading, run:
groupadd mysql
useradd -g mysql mysql
cd /usr/local
gunzip < /root/mysql-chplus-5.0.37-linux-i686.tar.gz | tar xvf -
ln -s /usr/local/mysql-chplus-5.0.37 /usr/local/mysql
cd mysql
scripts/mysql_install_db --user=mysql
chown -R mysql data
chown -R mysql .
/usr/local/mysql/bin/mysqld_safe --user=mysql &
Test:
create table test (testid int(4) not null, testtitle varchar(256), testbody varchar(256), fulltext (testtitle, testbody));
insert into test values
    -> (NULL, '', ' '),
    -> (NULL, 'Hello you', 'Hello you');
select * from test where match (testtitle, testbody) against (' ' in boolean mode);


Hightman's scws
http://www.hightman.cn/index.php?myft
http://www.hightman.cn/bbs/viewthread.php?tid=18&extra=page%3D1&page=1
I gave up on this option because of the following article.
Reference: http://www.blogjava.net/cap
The MySQL Chinese word segmentation plug-in is a good thing, but it only supports MySQL 4.0 and the MySQL 5.1 beta. It does not support MySQL 5.0.x, which I am currently using, and since there is no ready-made version available, I had to give up.

Another one I found but have not tested
Reference: http://www.lietu.com/doc/MysqlSeg.htm

I. Background
By default, the parser for MySQL full-text search splits text on spaces, so it cannot directly support full-text search of Chinese. Starting with version 5.1, MySQL allows full-text parsers to be provided as plug-ins. This Chinese word segmentation module follows the MySQL plug-in format.
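A short Python sketch shows why splitting on spaces fails for Chinese. It only imitates the effect of a space-delimited parser; the function name and the sample strings are assumptions, and this is not MySQL's parser code.

```python
def whitespace_tokens(text):
    """Split on runs of whitespace, as a space-delimited parser would."""
    return text.split()


english = whitespace_tokens("full text search")
chinese = whitespace_tokens("中文全文检索")

print(english)  # three separate words
print(chinese)  # the whole sentence comes back as a single token
```

With no spaces to split on, the whole Chinese sentence is indexed as one "word", so a search for any individual word inside it finds nothing. A segmentation plug-in replaces this splitting step.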
II. Environment and Installation
First download MySQL. Then start the MySQL service in a gb2312 or gbk environment.

# cd /usr/local/mysql/bin
# ./mysqld-max --user=mysql --default-character-set=gbk
Copy seg.so to the plug-in directory. The default is /usr/local/mysql/lib/mysql.
# cp ./seg.so /usr/local/mysql/lib/mysql/seg.so
Copy the dictionary to the Dic subdirectory of the MySQL data path. The default is /usr/local/mysql/data/Dic/
Enter MySQL:
# mysql --default-character-set=gbk
Install the plug-in:
mysql> INSTALL PLUGIN cn_parser SONAME 'seg.so';
III. Using the Chinese word segmentation plug-in
Create a table:
mysql> CREATE TABLE t (c VARCHAR(255), FULLTEXT (c) WITH PARSER cn_parser) DEFAULT CHARSET gbk;
mysql> INSERT INTO t VALUES ('test Chinese');
mysql> INSERT INTO t VALUES ('instructor briefing Day');
mysql> INSERT INTO t VALUES ('purchase cell phone');
Query:
mysql> SELECT MATCH(c) AGAINST('description'), c FROM t;
+---------------------------------+-------------------------+
| MATCH(c) AGAINST('description') | c                       |
+---------------------------------+-------------------------+
|                               0 | test Chinese            |
|                               0 | instructor briefing Day |
|                               0 | purchase cell phone     |
+---------------------------------+-------------------------+
3 rows in set (0.00 sec)

mysql> SELECT MATCH(c) AGAINST(''), c FROM t;
+----------------------+-------------------------+
| MATCH(c) AGAINST('') | c                       |
+----------------------+-------------------------+
|                    0 | test Chinese            |
|     0.58370667695999 | instructor briefing Day |
|                    0 | purchase cell phone     |
+----------------------+-------------------------+
3 rows in set (0.00 sec)
