Implementing Chinese full-text search in MySQL

Source: Internet
Author: User


Without affecting MySQL's architecture or its other features, this package fixes MySQL's inability to handle Chinese full-text search correctly and optimizes its performance for Chinese retrieval. (At present the package supports simple dictionary-based forward maximum matching word segmentation, and works with character sets including UTF-8, GBK, BIG5 ...)
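To make "dictionary-based forward maximum matching" concrete, here is a minimal Python sketch of the general technique. It is not the package's actual code: the function name and the toy dictionary are assumptions made up for illustration.

```python
def forward_max_match(text, dictionary, max_word_len=None):
    """Greedy segmentation: at each position take the longest dictionary word."""
    if max_word_len is None:
        max_word_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as the fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, for illustration only.
toy_dict = {"老师", "说明", "明天", "开会", "手机"}

print(forward_max_match("买手机", toy_dict))    # ['买', '手机']
print(forward_max_match("明天开会", toy_dict))  # ['明天', '开会']
```

Because the match is greedy, it can pick the wrong word at an ambiguous position: with this toy dictionary, "老师说明天开会" comes out as 老师 / 说明 / 天 / 开会. That is the usual trade-off of the simple forward approach.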

Test results are acceptable: searching about 1.4 million records, roughly 1.4 GB of data (excluding index space), takes somewhere between 0.0x and 0.x seconds. With it, building a small full-text search becomes very simple.

MySQL has supported FULLTEXT indexes on MyISAM tables since one of the 3.23 minor releases. However, because of the CJK (Chinese, Japanese, Korean) character sets and the nature of these languages (there are no spaces between words), MySQL has not supported full-text search on wide, multibyte character sets, nor has it offered any word segmentation capability.

MySQL, the perfect partner for web scripting languages such as PHP, is used everywhere, yet searching with it gives most developers a headache. The SELECT ... WHERE ... LIKE '%...%' approach is not only inefficient (it scans the whole table on every query, so how could it not be slow?), it also suffers from serious ambiguity in languages such as Chinese, where the word, not the character, is the smallest semantic unit. I was constrained by MySQL's search limitations for a while myself and had to look for other solutions, none of which worked very well.
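The ambiguity can be shown with a small Python sketch. The two documents and their hand-made segmentation are hypothetical. Substring search stands in for LIKE '%...%' and word lookup stands in for a word-based index.

```python
# Hypothetical documents, already segmented into words by hand.
documents = {
    1: ["老师", "说", "明天", "开会"],  # "The teacher says the meeting is tomorrow"
    2: ["请", "看", "说明"],            # "Please read the description"
}


def like_search(term):
    """Substring match over the raw text, similar in spirit to LIKE '%term%'."""
    return sorted(doc_id for doc_id, words in documents.items()
                  if term in "".join(words))


def word_search(term):
    """Match only whole words, as a word-based full-text index would."""
    return sorted(doc_id for doc_id, words in documents.items()
                  if term in words)


print(like_search("说明"))  # [1, 2]
print(word_search("说明"))  # [2]
```

Document 1 never contains the word 说明 ("description"). The substring match finds it only because the last character of 说 and the first character of 明天 happen to sit next to each other.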

--
Incidentally, a word segmentation technology company called Hylanda (海量) produced a mysql-chinese hack very early on, but did not release the source code promptly in keeping with the GNU spirit, so I decided to do it myself.

-- Example code: full-text search with PHP and MySQL (4.0.x) --

<?php
// Requires the legacy mysql_* extension (removed in PHP 7) and a MySQL build
// that provides the SEGMENT() function described in this article.
$q = isset($_GET['q']) ? trim($_GET['q']) : '';
$str = '';
if ($q !== '')
{
    mysql_connect();
    mysql_select_db("dot66");
    $q = mysql_real_escape_string($q);
    // Ask the server to segment the query; the words come back separated by spaces.
    $r = mysql_query("SELECT SEGMENT('$q')");
    $n = mysql_fetch_row($r);
    $x = explode(" ", $n[0]);
    $m = '';
    $f = $t = array();
    foreach ($x as $tmp)
    {
        if ($tmp === '') continue;
        // Remember the longest word; the excerpt is taken from around it.
        if (strlen($tmp) > strlen($m)) $m = $tmp;
        // One highlight pattern per word, quoted so regex metacharacters stay literal.
        $f[] = '/(' . preg_quote($tmp, '/') . ')/i';
        $t[] = '<font color=red>\1</font>';
    }
    $m = mysql_real_escape_string($m);
    $s1 = "SELECT id,board,owner,chrono,title,SUBSTRING(body,LOCATE('$m',body)-50,200) AS xbody ";
    $s2 = "FROM bbs_posts WHERE MATCH (title,body) AGAINST ('$q'";
    // Switch to boolean mode only when the query contains boolean operators.
    $s2 .= (preg_match('/[<>()~*"+-]/', $q) ? ' IN BOOLEAN MODE' : '');
    $s2 .= ") LIMIT 100";

    // Number of matching rows (computed as in the original, not displayed).
    $r = mysql_query("SELECT COUNT(*) $s2");
    $x = mysql_fetch_row($r);
    $x = $x[0];
    $s = $s1 . $s2;
    echo '<div style="border:1px solid red;background-color:#e0e0e0;font-family:tahoma;padding:8px;"><b>The core SQL query string:</b><br>' . htmlspecialchars($s) . '</div><br>';
    $r = mysql_query($s);
    while ($tmp = mysql_fetch_assoc($r))
    {
        // Cut the owner name at the first '.' or '@'. The string appended after
        // the cut is truncated in the source, so nothing is appended here.
        if (($pos = strpos($tmp['owner'], '.')) || ($pos = strpos($tmp['owner'], '@')))
            $tmp['owner'] = substr($tmp['owner'], 0, $pos);
        $str .= "<li><a href=\"?id=$tmp[id]\" onclick=\"return false;\" style=\"text-decoration:underline;color:blue\" title=\"Click to browse ...\">$tmp[title]</a> <small><font color=green>(board: $tmp[board])</font></small><br>\n";
        $str .= "<small> ... $tmp[xbody] ... </small><br>\n";
        $str .= "<small style=\"color:green;font-family:tahoma;\">Author: <u>$tmp[owner]</u> Date: " . date("Y-m-d H:i", $tmp['chrono']) . "</small>\n";
        $str .= "</li><br><br>\n";
    }
    // Strip ANSI colour escape sequences left over from the BBS posts.
    $f[] = '/\x1b\[.*?m/';
    $t[] = '';
    $str = preg_replace($f, $t, $str);
    mysql_close();
}
?>
<title>BBS (Board Search: Powered by MySQL fulltext+)</title>
<style type="text/css">
body {line-height:125%;}
</style>
<form method=get style="margin:0">
<input type=text name=q size=30 value="<?php echo htmlspecialchars($q); ?>">
<input type=submit value="search!">
<small>
(Enter any string to search; the simple + and - operators also work. About 1.4 million articles are indexed.)
</small>
</form>
<ol>
<?php echo $str; ?>
</ol>
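The example decides between natural-language and boolean mode by checking the query for boolean operator characters. A minimal Python sketch of that one check follows. The function name is an assumption; the character class is the one used in the PHP code.

```python
import re

# Same character class as the PHP preg_match() call: < > ( ) ~ * " + -
BOOLEAN_OPERATORS = re.compile(r'[<>()~*"+-]')


def against_modifier(query):
    """Return the modifier to append inside AGAINST (...) for this query."""
    return " IN BOOLEAN MODE" if BOOLEAN_OPERATORS.search(query) else ""


print(repr(against_modifier("+mysql -oracle")))  # ' IN BOOLEAN MODE'
print(repr(against_modifier("mysql search")))    # ''
```

One side effect of this heuristic is that any query containing a hyphen, such as "full-text", is also sent in boolean mode, where the hyphen acts as an operator.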

It's a friend of mine.

Industry reviews regard Hylanda's word segmentation technology as the best Chinese word segmentation technology currently available, with a segmentation accuracy above 99%, which keeps the error rate of search results very low.

Hylanda: http://www.hylanda.com/server/
Download mysql5.0.37--linuxx86-chinese+
There is no need to install MySQL beforehand. Then run:
groupadd mysql
useradd -g mysql mysql
cd /usr/local
gunzip < /root/mysql-chplus-5.0.37-linux-i686.tar.gz | tar xvf -
ln -s /usr/local/mysql-chplus-5.0.37 /usr/local/mysql
cd mysql
scripts/mysql_install_db --user=mysql
chown -R mysql data
chown -R mysql .
/usr/local/mysql/bin/mysqld_safe --user=mysql
Test:
CREATE TABLE test (testid INT(4) NOT NULL, testtitle VARCHAR(256), testbody VARCHAR(256), FULLTEXT (testtitle, testbody));
INSERT INTO test VALUES
(NULL, 'hello', 'How are you?'),
(NULL, 'Good hello', 'good Hello');
SELECT * FROM test WHERE MATCH (testtitle, testbody) AGAINST ('Hello' IN BOOLEAN MODE);


Hightman's SCWS
http://www.hightman.cn/index.php?myft
http://www.hightman.cn/bbs/viewthread.php?tid=18&extra=page%3D1&page=1
I gave this one up because of the following article:
Reference: http://www.blogjava.net/cap
This MySQL Chinese word segmentation patch is very good, but it only supports MySQL 4.0 and the MySQL 5.1 beta, not the MySQL 5.0.x I am currently running. Since no ready-made version was available, I had to give it up as well.

I also found another one, which I have not tested:
Reference: http://www.lietu.com/doc/MysqlSeg.htm

1. Background
By default, MySQL's full-text parser splits text on spaces, so it cannot directly support full-text search in Chinese. Starting with version 5.1, the full-text parser can be supplied as a plugin. Lietu (猎兔, "Rabbit Hunter") provides a Chinese word segmentation module in the MySQL plugin format.
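A short Python sketch shows why splitting on spaces breaks down for Chinese. This is a simplification for illustration, not MySQL's actual parser code, and the function name is an assumption.

```python
def whitespace_tokens(text):
    """Split on whitespace only, as a space-delimited parser would."""
    return text.split()


print(whitespace_tokens("full text search"))  # ['full', 'text', 'search']
print(whitespace_tokens("老师说明天开会"))     # ['老师说明天开会']
```

The whole Chinese sentence comes back as a single token, so a search for any individual word inside it finds nothing in the index. A segmentation plugin replaces this step with dictionary-based word splitting.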
2. Environment and installation
First download MySQL 5.1. Then start the MySQL server in a gb2312 or GBK environment.

# cd /usr/local/mysql/bin
# ./mysqld-max --user=mysql --default-character-set=gbk
Copy seg.so to the plugin directory, by default /usr/local/mysql/lib/mysql:
# cp ./seg.so /usr/local/mysql/lib/mysql/seg.so
Copy the dictionary to the dic subdirectory under the MySQL data path, by default /usr/local/mysql/data/dic/
Start the mysql client:
# mysql --default-character-set=gbk
Install the plugin:
mysql> INSTALL PLUGIN cn_parser SONAME 'seg.so';
3. Using the Chinese word segmentation plugin
Create a table:
mysql> CREATE TABLE t (c VARCHAR(255), FULLTEXT (c) WITH PARSER cn_parser) DEFAULT CHARSET gbk;
mysql> INSERT INTO t VALUES ('Test Chinese');
mysql> INSERT INTO t VALUES ('The teacher said the meeting tomorrow');
mysql> INSERT INTO t VALUES ('Buy Mobile Phone');
Query:
mysql> SELECT MATCH (c) AGAINST ('description'), c FROM t;
+-----------------------------------+---------------------------------------+
| MATCH (c) AGAINST ('description') | c                                     |
+-----------------------------------+---------------------------------------+
|                                 0 | Test Chinese                          |
|                                 0 | The teacher said the meeting tomorrow |
|                                 0 | Buy Mobile Phone                      |
+-----------------------------------+---------------------------------------+
3 rows in set (0.00 sec)

mysql> SELECT MATCH (c) AGAINST ('say'), c FROM t;
+---------------------------+---------------------------------------+
| MATCH (c) AGAINST ('say') | c                                     |
+---------------------------+---------------------------------------+
|                         0 | Test Chinese                          |
|          0.58370667695999 | The teacher said the meeting tomorrow |
|                         0 | Buy Mobile Phone                      |
+---------------------------+---------------------------------------+
3 rows in set (0.00 sec)
