PHP is not really the right tool for Chinese word segmentation :p
The following is a simple word segmentation program based on a dictionary file I found online.
(Note: the dictionary is in gdbm format; each key is a word and its value is that word's frequency. It contains about 40 thousand common words.)
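To make the dictionary idea concrete before reading the full program, here is a minimal sketch of dictionary-based segmentation using reverse maximum matching, one common approach of this kind. It is an illustration only and not necessarily what the program below does: the `$dict` array is a hypothetical in-memory stand-in for the gdbm file, and `dict_find`, `rmm_split` and the `$maxlen` parameter are names made up for this example.

```php
<?php
// Hypothetical stand-in for the gdbm dictionary: word => frequency.
$dict = array(
    '中文' => 500,
    '分词' => 300,
    '系统' => 400,
);

// Return the frequency of a word, or -1 if it is not in the dictionary.
function dict_find($dict, $word) {
    return isset($dict[$word]) ? $dict[$word] : -1;
}

// Reverse maximum matching: working backward from the end of the sentence,
// take the longest dictionary word that ends at the current position;
// if none matches, fall back to a single character.
function rmm_split($dict, $sentence, $maxlen = 4) {
    // Split the UTF-8 string into an array of characters.
    $chars = preg_split('//u', $sentence, -1, PREG_SPLIT_NO_EMPTY);
    $words = array();
    $end = count($chars);
    while ($end > 0) {
        $matched = 1; // default: one character
        for ($len = min($maxlen, $end); $len > 1; $len--) {
            $candidate = implode('', array_slice($chars, $end - $len, $len));
            if (dict_find($dict, $candidate) >= 0) {
                $matched = $len;
                break;
            }
        }
        // Prepend, because words are discovered from the end of the sentence.
        array_unshift($words, implode('', array_slice($chars, $end - $matched, $matched)));
        $end -= $matched;
    }
    return $words;
}

$result = rmm_split($dict, '中文分词系统');
echo implode(' / ', $result), "\n"; // prints "中文 / 分词 / 系统"
```

The sketch ignores word frequency when choosing a match; a real segmenter could use the stored frequencies to decide between competing candidates.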
<?php
// Simple implementation of a Chinese word segmentation system
// Sentence delimiters: any character with an ASCII value < 128
// Common double-byte symbols: “ 。 ， ？ ” ； ： ！ ￥ …… ％ ＄ ＃ ＠ ＾ ＆ ＊ （ ） ［ ］ ｛ ｝ ｜ ＼ ／ ＂ ＇
// Very common single Chinese characters (roughly 'of', 'and', 'is', 'not') could also be used as split points (however, some words contain those characters, such as 'Playing' and 'Zheng He' ... :p)
// Timing helper
function getmicrotime() {
    // microtime() returns the string "usec sec"; add both parts to get a float timestamp
    list($usec, $sec) = explode(' ', microtime());
    return ((float)$usec + (float)$sec);
}
$time_start = getmicrotime();
// Dictionary class
class ch_dictionary {
    var $_id; // dba handle of the opened dictionary
    // Constructor: optionally load a dictionary file right away
    function __construct($fname = '') {
        if ($fname != '') {
            $this->load($fname);
        }
    }
    // Load the dictionary (gdbm data file) by file name
    function load($fname) {
        $this->_id = dba_popen($fname, 'r', 'gdbm');
        if (!$this->_id) {
            echo "Failed to open the dictionary ($fname)\n";
            exit;
        }
    }
    // Return the frequency of a word, or -1 if the word is not in the dictionary
    function find($word) {
        $freq = dba_fetch($word, $this->_id);
        if (is_bool($freq)) $freq = -1; // dba_fetch() returns false when the key is missing
        return $freq;
    }
}
// Word segmentation class (reverse)
// First cut the input string into sentences in the forward direction, then segment each sentence and return an array of words.
class ch_word_split {
    var $_mb_mark_list; // full-width punctuation commonly used to split sentences
    var $_word_maxlen;  // maximum possible length of a single word (Chinese characters)
    var $_dic;          // dictionary ...
    var $_ignore_mark;  // true or false
    function __construct() {
        $this->_mb_mark_list = array('，', '。', '！', '？', '：', '……');
        $this->_word_maxlen = 12; // 12 Chinese characters
        $this->_dic = NULL;
        $this->_ignore_mark = true;