Simple Chinese word segmentation based on RMM

Source: Internet
Author: User
Tags: ereg, explode, ord
This program is a simple Chinese word segmenter based on the RMM (reverse maximum matching) approach. It still has many flaws, and corrections from more experienced developers are welcome. This version also improves the handling of garbled (mis-encoded) output.
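Before the full listing, the core idea of reverse maximum matching can be shown in a few lines. This is a minimal sketch and not the author's class: the function name `rmmSegment`, the in-memory `$dict` array, and the `$maxLen` parameter are assumptions made for illustration, and it works on UTF-8 characters (PHP 7 or later) rather than on GBK bytes as the listing below does.

```php
<?php
/**
 * Reverse maximum matching over an in-memory dictionary (illustrative only).
 * $dict maps word => true; $maxLen is the longest word length in characters.
 */
function rmmSegment(string $text, array $dict, int $maxLen): array
{
    // Split into UTF-8 characters so that offsets count characters, not bytes.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $words = [];
    $end = count($chars);
    while ($end > 0) {
        $len = min($maxLen, $end);
        // Try the longest candidate ending at $end, then shrink it from the left.
        while ($len > 1) {
            $candidate = implode('', array_slice($chars, $end - $len, $len));
            if (isset($dict[$candidate])) {
                break;
            }
            $len--;
        }
        // If $len reached 1, no dictionary word matched: emit a single character.
        array_unshift($words, implode('', array_slice($chars, $end - $len, $len)));
        $end -= $len;
    }
    return $words;
}

$dict = ['研究' => true, '研究生' => true, '生命' => true, '起源' => true];
echo implode(' ', rmmSegment('研究生命起源', $dict, 3)), "\n"; // prints "研究 生命 起源"
```

Scanning from the end of the string is what separates this from forward matching: with the same dictionary, a forward scan would first take 研究生 and leave 命 stranded as a single character.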
<?php
/**
 * Chinese word segmentation based on RMM (reverse matching method)
 * @author Tangpan
 * @date 2013-10-12
 * @version 1.0.0
 **/
class SplitWord {
    public $Tag_dic = array();   // dictionary words, keyed by word
    public $Rec_dic = array();   // dictionary regrouped by word length
    public $Split_char = ' ';    // separator
    public $Source_str = '';     // source string
    public $Result_str = '';     // segmentation result string
    public $limit_lenght = 2;    // minimum byte length of a Chinese chunk to segment
    public $Dic_maxlen = 28;     // maximum length of a dictionary word, in bytes
    public $Dic_minlen = 2;      // minimum length of a dictionary word, in bytes
    public function SplitWord() { // PHP 4-style constructor, kept for compatibility; delegates to __construct()
        $this->__construct();
    }
    public function __construct() {
        $dic_path = dirname(__FILE__) . '/words.csv'; // preload the dictionary to speed up segmentation
        $fp = fopen($dic_path, 'r'); // read words from the dictionary file
        while (($line = fgets($fp, 256)) !== false) {
            $ws = explode(' ', $line); // split the dictionary line into fields
            $ws[0] = trim(iconv('utf-8', 'GBK', $ws[0])); // encoding conversion
            $this->Tag_dic[$ws[0]] = true; // indexed by word; the value is only a flag
            $this->Rec_dic[strlen($ws[0])][$ws[0]] = true; // two-dimensional index: byte length first, then the word
        }
        fclose($fp); // close the dictionary file
    }
    /**
     * Set the source string
     * @param string $str the string to segment
     */
    public function setSourceStr($str) {
        $str = iconv('utf-8', 'GBK', $str); // convert the UTF-8 input to GBK
        $this->Source_str = $this->dealStr($str); // preprocess the string
    }
    /**
     * Check whether a string starts with a Chinese character
     * @param string $str source string
     * @return bool
     */
    public function checkStr($str) {
        if (trim($str) == '') return false; // empty string
        if (ord($str[0]) > 0x80) return true; // lead byte above 0x80: a Chinese (GBK) character
        else return false; // not a Chinese character
    }
    /**
     * RMM segmentation algorithm
     * @param string $str the string to segment
     * @return string the segmented string, UTF-8 encoded
     */
    public function splitRMM($str = '') {
        if (trim($str) == '') return ''; // nothing to segment
        else $this->setSourceStr($str); // set the source string when it is not empty
        if ($this->Source_str == '') return ''; // preprocessing left nothing
        $this->Result_str = ''; // start empty so repeated calls do not accumulate
        $split_words = explode(' ', $this->Source_str); // split the string on spaces
        $lenght = count($split_words); // number of chunks
        for ($i = $lenght - 1; $i >= 0; $i--) {
            if (trim($split_words[$i]) == '') continue; // skip empty chunks
            if ($this->checkStr($split_words[$i])) { // the chunk starts with a Chinese character
                if (strlen($split_words[$i]) >= $this->limit_lenght) { // the chunk is at least limit_lenght bytes long
                    // reverse-match the chunk
                    $this->Result_str = $this->pregRmmSplit($split_words[$i]) . $this->Split_char . $this->Result_str;
                }
            } else {
                $this->Result_str = $split_words[$i] . $this->Split_char . $this->Result_str;
            }
        }
        $this->clear($split_words); // free memory
        return iconv('GBK', 'utf-8', $this->Result_str);
    }
    /**
     * Split a Chinese string by reverse matching
     * @param string $str GBK-encoded string
     * @return string $retStr the segmented string
     */
    public function pregRmmSplit($str) {
        if ($str == '') return;
        $splen = strlen($str);
        $Split_result = array();
        for ($j = $splen - 1; $j >= 0; $j--) { // scan from the last byte backwards
            if ($splen <= $this->Dic_minlen) { // the string is no longer than the shortest dictionary word
                if ($j == 1) { // exactly one two-byte character
                    $Split_result[] = substr($str, 0, 2);
                } else {
                    $w = trim(substr($str, 0, $this->Dic_minlen + 1)); // take the leading bytes
                    if ($this->isWord($w)) { // is it in the dictionary?
                        $Split_result[] = $w; // yes: store it
                    } else {
                        $Split_result[] = substr($str, 2, 2); // store in reverse order
                        $Split_result[] = substr($str, 0, 2);
                    }
                }
                $j = -1; // end the loop
                break;
            }
            // candidates are $k + 1 bytes long, so cap $k at Dic_maxlen - 1
            if ($j >= $this->Dic_maxlen) $max_len = $this->Dic_maxlen - 1;
            else $max_len = $j;
            for ($k = $max_len; $k >= 0; $k = $k - 2) { // shrink by one Chinese character (2 bytes) each step
                $w = trim(substr($str, $j - $k, $k + 1));
                if ($this->isWord($w)) {
                    $Split_result[] = $w; // save the word
                    $j = $j - $k; // the loop's $j-- then moves to just before the matched word
                    break; // matched: leave the inner loop and continue with the remaining text
                }
            }
        }
        $retStr = $this->resetWord($Split_result); // reassemble the words and return the result
        $this->clear($Split_result); // free memory
        return $retStr;
    }
    /**
     * Re-check the words and join them back together
     * @param array $Split_result words to reassemble (in reverse order)
     * @return string $ret_str the reassembled string
     */
    public function resetWord($Split_result) {
        if (empty($Split_result) || trim($Split_result[0]) == '') return '';
        $Len = count($Split_result) - 1;
        $ret_str = '';
        $spc = $this->Split_char;
        for ($i = $Len; $i >= 0; $i--) {
            if (trim($Split_result[$i]) != '') {
                $Split_result[$i] = iconv('GBK', 'utf-8', $Split_result[$i]);
                $ret_str .= $spc . $Split_result[$i] . ' ';
            }
        }
        $ret_str = preg_replace('/^' . preg_quote($spc, '/') . '/', '', $ret_str); // strip the leading separator
        $ret_str = iconv('utf-8', 'GBK', $ret_str);
        return $ret_str;
    }
    /**
     * Check whether a word exists in the dictionary
     * @param string $okWord the word to check
     * @return bool
     */
    public function isWord($okWord) {
        $len = strlen($okWord);
        if ($len > $this->Dic_maxlen + 1) return false;
        else { // look the word up in the two-dimensional index (length, then word)
            return isset($this->Rec_dic[$len][$okWord]);
        }
    }
    /**
     * Preprocess a string (replace special characters with the separator)
     * @param string $str the GBK-encoded source string
     * @return string $okstr the preprocessed string
     */
    public function dealStr($str) {
        $spc = $this->Split_char; // copy the separator
        $slen = strlen($str); // length in bytes
        if ($slen == 0) return ''; // nothing to process
        $okstr = ''; // output buffer
        $prechar = 0; // type of the previous character (0 blank, 1 English, 2 Chinese, 3 symbol)
        for ($i = 0; $i < $slen; $i++) {
            $str_ord = ord($str[$i]);
            if ($str_ord < 0x81) { // single-byte (ASCII) character
                if ($str_ord < 33) { // ASCII whitespace or control character
                    if ($str[$i] != "\r" && $str[$i] != "\n")
                        $okstr .= $spc;
                    $prechar = 0;
                    continue;
                } else if (preg_match('/[@.%#:^&_-]/', $str[$i])) { // special symbol
                    if ($prechar == 0) { // previous character was blank
                        $okstr .= $str[$i];
                        $prechar = 3;
                    } else {
                        $okstr .= $spc . $str[$i]; // previous character was not blank: insert a separator first
                        $prechar = 3;
                    }
                } else if (preg_match('/[0-9a-zA-Z]/', $str[$i])) { // split letter/digit combinations
                    $prev = $i > 0 ? $str[$i - 1] : ''; // guard: $str[-1] would read the last byte on PHP 7.1+
                    if ((preg_match('/[0-9]/', $prev) && preg_match('/[a-zA-Z]/', $str[$i]))
                        || (preg_match('/[a-zA-Z]/', $prev) && preg_match('/[0-9]/', $str[$i]))) {
                        $okstr .= $spc . $str[$i];
                    } else {
                        $okstr .= $str[$i];
                    }
                }
            } else { // lead byte of a two-byte Chinese (GBK) character
                if ($prechar != 0 && $prechar != 2) // previous character is neither Chinese nor blank: add a separator
                    $okstr .= $spc;
                if (isset($str[$i + 1])) { // a second byte exists
                    $c = $str[$i] . $str[$i + 1]; // join the two bytes into one Chinese character
                    $n = hexdec(bin2hex($c)); // numeric value of the two-byte code
                    if ($n > 0xA13F && $n < 0xAA40) { // Chinese punctuation
                        $okstr .= $spc; // replace Chinese punctuation with the separator
                        $prechar = 3;
                    } else { // not Chinese punctuation
                        $okstr .= $c;
                        $prechar = 2;
                    }
                    $i++; // skip the second byte, so each step moves one Chinese character
                }
            }
        }
        return $okstr;
    }
    /**
     * Free memory
     * @param mixed $data temporary data
     */
    public function clear($data) {
        unset($data); // drops the local copy of the temporary data
    }
}
?>
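The constructor above indexes the dictionary twice, and the second index (byte length first, then the word) is the one that `isWord` consults. A minimal sketch of that structure follows, assuming the dictionary lines are already in memory instead of being read from `words.csv`; the function names `buildLengthIndex` and `inLengthIndex` and the sample lines are hypothetical, and the byte lengths in the comments assume a UTF-8 source file.

```php
<?php
/**
 * Build an index keyed by byte length, then by word, from lines of the
 * form "word<space>anything". Illustrative only; no file I/O or iconv.
 */
function buildLengthIndex(array $lines): array
{
    $index = [];
    foreach ($lines as $line) {
        $word = trim(explode(' ', $line)[0]); // first field is the word
        if ($word === '') {
            continue; // skip blank lines
        }
        $index[strlen($word)][$word] = true; // the value is only a flag
    }
    return $index;
}

/** Look a candidate up by its byte length first, then by the word itself. */
function inLengthIndex(array $index, string $word): bool
{
    return isset($index[strlen($word)][$word]);
}

$index = buildLengthIndex(['中文 n', '分词 v', '', '搜索引擎 n']);
var_dump(inLengthIndex($index, '分词')); // bool(true)
var_dump(inLengthIndex($index, '文分')); // bool(false)
```

Grouping by length keeps the words of each size together, which makes it easy to see which word lengths the dictionary actually contains.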