Tutorial: implementing Chinese full-text search in MySQL
Without changing MySQL's system structure or affecting its other functionality, this work addresses MySQL's inability to correctly support Chinese full-text search, and optimizes MySQL's performance for Chinese retrieval. (At present the package supports simple dictionary-based forward maximum matching segmentation, and the UTF-8, GBK, and BIG5 character sets.)
Test results are reasonable: over roughly 1.4 million records (about 1.4 GB of data, excluding index space), a retrieval takes roughly 0.0x to 0.x seconds. With it, building a small full-text search becomes very simple.
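To make "forward maximum matching" concrete, here is a minimal sketch of the idea (this is an illustration of the general algorithm, not the package's actual code; the tiny dictionary is a stand-in for a real lexicon):

```python
def fmm_segment(text, dictionary, max_word_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest dictionary word starting there; fall back to one character."""
    words = []
    i = 0
    while i < len(text):
        matched = None
        # Try the longest candidate first, shrinking until a dictionary hit.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                matched = candidate
                break
        if matched is None:
            matched = text[i]  # unknown character: emit as a single-char word
        words.append(matched)
        i += len(matched)
    return words

dictionary = {"全文", "检索", "中文", "支持"}
print(fmm_segment("支持中文全文检索", dictionary))
# ['支持', '中文', '全文', '检索']
```

The greedy longest-first scan is why this strategy is called "maximum" matching; it is simple and fast, at the cost of occasional mis-splits that more sophisticated segmenters avoid.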
MySQL has supported FULLTEXT indexed fields on MyISAM tables since early in the 3.23 series. But because of the peculiarities of the CJK (Chinese, Japanese, Korean) character sets and their syntax (there are no gaps between words), MySQL has supported neither these multibyte wide character sets in full-text search nor any word segmentation capability.
Yet MySQL, as the perfect partner of web scripting languages like PHP, is used in every corner of the web, and searching it gives most developers a headache. The SELECT ... WHERE ... LIKE '%...%' approach is not only inefficient (a dynamic scan of the whole table can hardly be fast), it also suffers serious ambiguity problems in languages like Chinese, where the word, not the character, is the smallest semantic unit. I too was constrained by MySQL's search limitations for a while and had to look for other solutions, none of them very good.
--
Incidentally, a well-known word-segmentation technology company produced a MySQL-Chinese hack very early on, but never released the source code in the spirit of the GNU license, so I decided to do it myself.
-- Example code: using PHP with MySQL (4.0.x) full-text search --
<?php
if (!empty($_GET['q']) && ($q = trim($_GET['q'])))
{
    mysql_connect();
    mysql_select_db("dot66");
    $q = mysql_escape_string($q);

    // Segment the query string with the SEGMENT() function provided by the patch
    $r = mysql_query("SELECT SEGMENT('$q')");
    $n = mysql_fetch_row($r);
    $x = explode(" ", $n[0]);

    $m = $str = '';
    $f = $t = array();
    foreach ($x as $tmp)
    {
        if (strlen($tmp) > strlen($m)) $m = $tmp;  // remember the longest term for the excerpt
        $f[] = "/($tmp)/i";                        // pattern to highlight this term
        $t[] = '<font color=red>\\1</font>';
    }

    // $s1 selects the columns plus a ~200-byte excerpt around the longest term
    $s1 = "SELECT id,board,owner,chrono,title,SUBSTRING(body,LOCATE('$m',body)-50,200) AS xbody";
    $s2 = " FROM bbs_posts WHERE MATCH(title,body) AGAINST('$q'";
    // switch to boolean mode when the query contains boolean operators
    $s2 .= (preg_match('/[<>()~*"+-]/', $q) ? " IN BOOLEAN MODE" : "");
    $s2 .= ") LIMIT 100";

    $r = mysql_query("SELECT COUNT(*) $s2");
    $x = mysql_fetch_row($r);
    $x = $x[0];

    $s = $s1 . $s2;
    echo '<div style="border:1px solid red;background-color:#e0e0e0;font-family:tahoma;padding:8px;"><b>The core SQL query string:</b><br>' . $s . '</div><br>';

    $r = mysql_query($s);
    while ($tmp = mysql_fetch_assoc($r))
    {
        if (($pos = strpos($tmp['owner'], '.')) || ($pos = strpos($tmp['owner'], '@')))
            $tmp['owner'] = substr($tmp['owner'], 0, $pos) . "...";  // mask the rest of the address
        $str .= "<li><a href=\"?id=$tmp[id]\" onclick=\"return false;\" style=\"text-decoration:underline;color:blue\" title=\"Click to browse...\">$tmp[title]</a> <small><font color=green>(board: $tmp[board])</font></small><br>\n";
        $str .= "<small>... $tmp[xbody] ...</small><br>\n";
        $str .= "<small style=\"color:green;font-family:tahoma;\">Author: <u>$tmp[owner]</u> Date: " . date("Y-m-d H:i", $tmp['chrono']) . "</small>\n";
        $str .= "</li><br><br>\n";
    }

    // strip any ANSI escape sequences, then apply the highlight patterns
    $f[] = '/\x1b\[.*?m/';
    $t[] = '';
    $str = preg_replace($f, $t, $str);

    mysql_close();
}
?>
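The boolean-mode decision in the PHP example above is worth spelling out: the query is run with IN BOOLEAN MODE whenever it contains any of MySQL's boolean full-text operators, otherwise natural-language mode is used. A minimal re-expression of that check in Python (mirroring the preg_match test; the reconstructed character class is an assumption, since the original line was garbled):

```python
import re

# Characters treated as boolean full-text operators,
# matching the class tested by the PHP preg_match call.
BOOLEAN_OPERATORS = re.compile(r'[<>()~*"+-]')

def needs_boolean_mode(query: str) -> bool:
    """Return True if the query should be run with IN BOOLEAN MODE."""
    return bool(BOOLEAN_OPERATORS.search(query))

print(needs_boolean_mode('+mysql -oracle'))  # True
print(needs_boolean_mode('hello world'))     # False
```

This keeps plain keyword searches on the default relevance-ranked path while still letting power users write +word/-word expressions.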
<title>BBS (board search: powered by MySQL fulltext+)</title>
<style type="text/css">
body {line-height:125%;}
</style>
<form method=get style="margin:0">
<input type=text name=q size=30 value="<?php echo $_GET['q']; ?>">
<input type=submit value="Search!">
<small>
(Enter any string to search; simple boolean operators such as + and - also work. About 1.4 million articles are indexed.)
</small>
</form>
<ol>
<?php echo $str; ?>
</ol>
A friend of mine recommended Hylanda's segmentation technology, which industry reviews currently rate as the best Chinese word segmentation available; its segmentation accuracy exceeds 99%, which also keeps the error rate in search results very low.
Hylanda: http://www.hylanda.com/server/
Download mysql-chplus-5.0.37 (MySQL 5.0.37 for Linux x86, Chinese+).
There is no need to install MySQL beforehand; just run:
groupadd mysql
useradd -g mysql mysql
cd /usr/local
gunzip < /root/mysql-chplus-5.0.37-linux-i686.tar.gz | tar xvf -
ln -s /usr/local/mysql-chplus-5.0.37 /usr/local/mysql
cd mysql
scripts/mysql_install_db --user=mysql
chown -R mysql data
chown -R mysql .
/usr/local/mysql/bin/mysqld_safe --user=mysql
Test:
mysql> CREATE TABLE test (testid INT(4) NOT NULL, testtitle VARCHAR(256), testbody VARCHAR(256), FULLTEXT (testtitle,testbody));
mysql> INSERT INTO test VALUES
    -> (null, 'hello', 'how are you?'),
    -> (null, 'good hello', 'good hello');
mysql> SELECT * FROM test WHERE MATCH (testtitle,testbody) AGAINST ('hello' IN BOOLEAN MODE);
Hightman's SCWS:
http://www.hightman.cn/index.php?myft
http://www.hightman.cn/bbs/viewthread.php?tid=18&extra=page%3D1&page=1
I gave up on this because of the articles below.
Reference: http://www.blogjava.net/cap
The MySQL Chinese word breaker described there is very good, but it only supports MySQL 4.0 and the MySQL 5.1 beta, not my current MySQL 5.0.x. Since no ready-made build was available, I had to give it up as well.
I also found one more option, but have not tested it:
Reference: http://www.lietu.com/doc/MysqlSeg.htm
1. Background
MySQL's default full-text search parser splits text on spaces, so it cannot directly support Chinese full-text search. Starting with version 5.1, the full-text parser can be supplied as a plugin, and Rabbit Hunter (lietu) provides a Chinese word segmentation module packaged in MySQL's plugin format.
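To see why the default parser fails on Chinese, note that splitting on whitespace leaves a Chinese sentence as one indivisible token, so no individual word can ever be indexed or matched. A tiny illustration of the effect (an analogy, not MySQL's actual parser code):

```python
def whitespace_tokenize(text):
    # What a space-splitting full-text parser effectively does.
    return text.split()

# English: each word becomes an indexable token.
print(whitespace_tokenize("the teacher said"))
# ['the', 'teacher', 'said']

# Chinese: no spaces, so the whole sentence is one token,
# useless for word-level search.
print(whitespace_tokenize("老师说明天开会"))
# ['老师说明天开会']
```

A segmentation-aware parser plugin replaces this split step with dictionary-based word segmentation, so each Chinese word becomes its own index token.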
2. Environment and installation
First download the MySQL 5.1 release, then start the MySQL service in a gb2312 or GBK environment:
# cd /usr/local/mysql/bin
# ./mysqld-max --user=mysql --default-character-set=gbk
Copy seg.so to the plugin path, by default /usr/local/mysql/lib/mysql:
# cp ./seg.so /usr/local/mysql/lib/mysql/seg.so
Copy the dictionary to the dic subdirectory under the MySQL data path, by default /usr/local/mysql/data/dic/.
Enter MySQL:
# mysql --default-character-set=gbk
Install the plugin:
mysql> INSTALL PLUGIN cn_parser SONAME 'seg.so';
3. Using the Chinese segmentation plugin
Create a table and insert some rows:
mysql> CREATE TABLE t (c VARCHAR(255), FULLTEXT (c) WITH PARSER cn_parser) DEFAULT CHARSET gbk;
mysql> INSERT INTO t VALUES ('test Chinese');
mysql> INSERT INTO t VALUES ('the teacher said the meeting tomorrow');
mysql> INSERT INTO t VALUES ('buy a mobile phone');
Query:
mysql> SELECT MATCH (c) AGAINST ('description'), c FROM t;
+-----------------------------------+---------------------------------------+
| MATCH (c) AGAINST ('description') | c                                     |
+-----------------------------------+---------------------------------------+
|                                 0 | test Chinese                          |
|                                 0 | the teacher said the meeting tomorrow |
|                                 0 | buy a mobile phone                    |
+-----------------------------------+---------------------------------------+
3 rows in set (0.00 sec)
mysql> SELECT MATCH (c) AGAINST ('say'), c FROM t;
+---------------------------+---------------------------------------+
| MATCH (c) AGAINST ('say') | c                                     |
+---------------------------+---------------------------------------+
|                         0 | test Chinese                          |
|          0.58370667695999 | the teacher said the meeting tomorrow |
|                         0 | buy a mobile phone                    |
+---------------------------+---------------------------------------+
3 rows in set (0.00 sec)