Introduction to the word segmenter: PHPAnalysis uses a Unicode dictionary and segments text by reverse matching. In theory it is compatible with a wide range of encodings, and it is particularly convenient for UTF-8. Because PHPAnalysis is not a compiled component, it is somewhat slower than component-based segmenters. When segmenting large amounts of text, however, the dictionary is loaded while segmentation is in progress, so the more content there is, the faster it appears to run; this is normal. The PHPAnalysis dictionary uses a hash-like data structure, so segmenting a short string needs only minimal resources, which is far more efficient than loading all entries at once, and the size of the dictionary does not affect segmentation speed.
PHPAnalysis segments text by string matching, a method also known as mechanical word segmentation. Following a given strategy, it matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary; if a string is found in the dictionary, the match succeeds (a word is identified). By scanning direction, string-matching methods divide into forward matching and reverse matching; by which length takes priority, into maximum (longest) matching and minimum (shortest) matching; and by whether they are combined with part-of-speech tagging, into pure segmentation methods and integrated methods that combine segmentation with tagging. Several commonly used mechanical segmentation methods are:
1) Forward maximum matching (scanning from left to right);
2) Reverse maximum matching (scanning from right to left);
3) Minimum segmentation (minimizing the number of words cut out of each sentence).
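The first two methods in the list can be sketched in a few lines of PHP. This is only an illustration under stated assumptions, not PHPAnalysis code: the function names (`utf8Chars`, `forwardMaxMatch`, `reverseMaxMatch`), the toy dictionary, and the maximum word length of 3 are all made up for the example, and an unmatched single character is simply emitted as a word.

```php
<?php
// Illustrative sketch only; not taken from PHPAnalysis. The dictionary is a toy.

/** Split a UTF-8 string into an array of characters. */
function utf8Chars(string $s): array
{
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

/** Forward maximum matching: scan left to right, longest candidate first. */
function forwardMaxMatch(string $text, array $dict, int $maxLen): array
{
    $chars = utf8Chars($text);
    $n = count($chars);
    $words = [];
    $i = 0;
    while ($i < $n) {
        // Try the longest window first, then shrink it one character at a time.
        for ($len = min($maxLen, $n - $i); $len >= 1; $len--) {
            $cand = implode('', array_slice($chars, $i, $len));
            // A single character is always accepted as a fallback.
            if ($len === 1 || isset($dict[$cand])) {
                $words[] = $cand;
                $i += $len;
                break;
            }
        }
    }
    return $words;
}

/** Reverse maximum matching: scan right to left, longest candidate first. */
function reverseMaxMatch(string $text, array $dict, int $maxLen): array
{
    $chars = utf8Chars($text);
    $words = [];
    $end = count($chars);
    while ($end > 0) {
        for ($len = min($maxLen, $end); $len >= 1; $len--) {
            $cand = implode('', array_slice($chars, $end - $len, $len));
            if ($len === 1 || isset($dict[$cand])) {
                array_unshift($words, $cand); // keep the words in reading order
                $end -= $len;
                break;
            }
        }
    }
    return $words;
}

// Dictionary entries become array keys so that isset() is a hash lookup.
$dict = array_flip(['研究', '研究生', '生命', '起源']);
$text = '研究生命的起源';

$fmm = forwardMaxMatch($text, $dict, 3);
$rmm = reverseMaxMatch($text, $dict, 3);
echo implode(' / ', $fmm), "\n"; // 研究生 / 命 / 的 / 起源
echo implode(' / ', $rmm), "\n"; // 研究 / 生命 / 的 / 起源
```

With this toy dictionary the two scans disagree: the forward scan greedily takes 研究生 and leaves 命 stranded, while the reverse scan recovers 研究 and 生命.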
These methods can be combined with one another; for example, forward maximum matching and reverse maximum matching together form bidirectional matching. Because of the way Chinese words are formed, forward minimum matching and reverse minimum matching are seldom used. Generally speaking, reverse matching is slightly more accurate than forward matching and produces fewer ambiguities. Statistics show an error rate of 1/169 for pure forward maximum matching and 1/245 for pure reverse maximum matching. This accuracy is still far from meeting practical needs. Real-world segmentation systems use mechanical segmentation only as a first pass and draw on various other kinds of linguistic information to improve accuracy further. One approach is to improve the scanning method, known as feature scanning or marker segmentation: first identify and cut out words with obvious features in the string to be analyzed, then use these words as breakpoints to divide the original string into shorter strings for mechanical segmentation, which reduces the matching error rate. Another approach is to combine segmentation with part-of-speech tagging: the part-of-speech information helps the segmentation decisions, and the tagging process in turn checks and adjusts the segmentation results, which greatly improves segmentation accuracy.
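A bidirectional matcher of the kind mentioned above might look like the following sketch. The selection rule is an assumption made for illustration (fewer words wins, then fewer non-dictionary words, and a remaining tie goes to the reverse scan); the text does not say which rule any real system uses. The names `maxMatch`, `countUnknown`, and `bidirectionalMatch` are hypothetical.

```php
<?php
// Illustrative sketch only; the tie-breaking rule below is an assumption.

/** Maximum matching over a character array, forward or reverse. */
function maxMatch(array $chars, array $dict, int $maxLen, bool $reverse): array
{
    $words = [];
    $lo = 0;
    $hi = count($chars);
    while ($lo < $hi) {
        for ($len = min($maxLen, $hi - $lo); $len >= 1; $len--) {
            $start = $reverse ? $hi - $len : $lo;
            $cand = implode('', array_slice($chars, $start, $len));
            // Accept a dictionary hit, or fall back to a single character.
            if ($len === 1 || isset($dict[$cand])) {
                if ($reverse) {
                    array_unshift($words, $cand);
                    $hi -= $len;
                } else {
                    $words[] = $cand;
                    $lo += $len;
                }
                break;
            }
        }
    }
    return $words;
}

/** Number of words that are not dictionary entries (single-character fallbacks). */
function countUnknown(array $words, array $dict): int
{
    $n = 0;
    foreach ($words as $w) {
        if (!isset($dict[$w])) {
            $n++;
        }
    }
    return $n;
}

/** Run both scans and keep the result that looks better under the assumed rule. */
function bidirectionalMatch(string $text, array $dict, int $maxLen): array
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $fwd = maxMatch($chars, $dict, $maxLen, false);
    $rev = maxMatch($chars, $dict, $maxLen, true);
    if (count($fwd) !== count($rev)) {
        return count($fwd) < count($rev) ? $fwd : $rev;
    }
    // Same word count: prefer fewer unknown words; on a tie, prefer the reverse scan.
    return countUnknown($fwd, $dict) < countUnknown($rev, $dict) ? $fwd : $rev;
}

$dict = array_flip(['研究', '研究生', '生命', '起源']);
$result = bidirectionalMatch('研究生命的起源', $dict, 3);
echo implode(' / ', $result), "\n"; // 研究 / 生命 / 的 / 起源
```

Here both scans produce four words, but the forward result contains two non-dictionary words (命, 的) against one (的) for the reverse result, so the reverse segmentation is kept.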
PHPAnalysis first performs a rough split of the text to be segmented, then applies a second pass of reverse maximum matching (RMM) to each of the resulting short sentences, then optimizes the segmentation, and only then returns the final result.
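The two-stage flow described above (rough split, then RMM on each short piece) can be sketched as follows. This is a simplified illustration, not PHPAnalysis's implementation: the rough-split rule (runs of characters in U+4E00 to U+9FFF, runs of ASCII letters and digits, and single other symbols), the toy dictionary, and the function names are all assumptions, and the optimization step is left out.

```php
<?php
// Illustrative sketch only. The rough-split rule and the toy dictionary are
// assumptions; PHPAnalysis's own rules and its optimization step are not shown.

/** Rough split: runs of CJK ideographs, runs of ASCII letters/digits, single other symbols. */
function roughSplit(string $text): array
{
    $pattern = '/[\x{4e00}-\x{9fff}]+|[A-Za-z0-9]+|[^\x{4e00}-\x{9fff}A-Za-z0-9\s]/u';
    preg_match_all($pattern, $text, $m);
    return $m[0];
}

/** Reverse maximum matching over one run of CJK characters. */
function reverseMaxMatch(string $run, array $dict, int $maxLen): array
{
    $chars = preg_split('//u', $run, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $words = [];
    $end = count($chars);
    while ($end > 0) {
        for ($len = min($maxLen, $end); $len >= 1; $len--) {
            $cand = implode('', array_slice($chars, $end - $len, $len));
            // Accept a dictionary hit, or fall back to a single character.
            if ($len === 1 || isset($dict[$cand])) {
                array_unshift($words, $cand);
                $end -= $len;
                break;
            }
        }
    }
    return $words;
}

/** Rough split first, then RMM on each CJK run; other runs pass through unchanged. */
function segment(string $text, array $dict, int $maxLen): array
{
    $out = [];
    foreach (roughSplit($text) as $run) {
        if (preg_match('/^[\x{4e00}-\x{9fff}]+$/u', $run) === 1) {
            $out = array_merge($out, reverseMaxMatch($run, $dict, $maxLen));
        } else {
            $out[] = $run;
        }
    }
    return $out;
}

$dict = array_flip(['研究', '研究生', '生命', '起源', '分词']);
$result = segment('研究生命的起源,PHP分词', $dict, 3);
echo implode(' / ', $result), "\n"; // 研究 / 生命 / 的 / 起源 / , / PHP / 分词
```

Splitting at punctuation and script boundaries first keeps each dictionary scan short and prevents a match from straddling, for example, a Chinese word and an adjacent English token.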
PHPAnalysis class API documentation
I. Important member variables

$resultType = 1: type of segmentation result produced (1: everything; 2: dictionary words, single CJK characters, and English; 3: dictionary words and English). This variable is usually set through the SetResultType($rstype) method.
$notSplitLen = 5: minimum sentence length for splitting
$toLower = false: convert all English words to lowercase
$differMax = false: use maximum-segmentation mode to disambiguate two-element (bigram) words
$unitWord = true: try to merge single characters (that is, new-word recognition)
$differFreq = false: use popular-word-first mode for disambiguation

II. Main member functions

1. public function __construct($source_charset = 'utf-8', $target_charset = 'utf-8', $load_all = true, $source = '')
Description: constructor
Parameters:
$source_charset: encoding of the source string
$target_charset: encoding of the target string
$load_all: whether to load the whole dictionary (this parameter is deprecated)
$source: source string
If both input and output are UTF-8, you can create the object without any parameters and set the text to process with the SetSource method.

2. public function SetSource($source, $source_charset = 'utf-8', $target_charset = 'utf-8')
Description: set the source string
Parameters:
$source: source string
$source_charset: encoding of the source string
$target_charset: encoding of the target string
Return value: bool

3. public function StartAnalysis($optimize = true)
Description: start the segmentation
Parameters:
$optimize: whether to try to optimize the result after segmentation
Return value: void
A basic segmentation run:

    $pa = new PhpAnalysis();
    $pa->SetSource('String to be segmented');
    // Set the segmenter properties
    $pa->resultType = 2;
    $pa->differMax = true;
    $pa->StartAnalysis();
    // Fetch the result you want
    $pa->GetFinallyIndex();

4. public function SetResultType($rstype)
Description: set the type of the returned result; this simply assigns the member variable $resultType
Parameters:
$rstype: 1 for everything; 2 for dictionary words, single CJK characters, and English; 3 for dictionary words and English
Return value: void

5. public function GetFinallyKeywords($num = 10)
Description: get the specified number of highest-frequency entries (typically used to extract document keywords)
Parameters:
$num = 10: number of entries to return
Return value: a list of keywords separated by ","

6. public function GetFinallyResult($spword = ' ')
Description: get the final segmentation result
Parameters:
$spword: separator between entries
Return value: string

7. public function GetSimpleResult()
Description: get the rough-split result
Return value: array

8. public function GetSimpleResultAll()
Description: get the rough-split result together with attribute information (1: Chinese words; 2: ANSI words, including full-width; 3: ANSI punctuation, including full-width; 4: digits, including full-width; 5: Chinese punctuation or unrecognized characters)
Return value: array

9. public function GetFinallyIndex()
Description: get the hash index array
Return value: array('word' => count, ...), sorted by occurrence frequency

10. public function MakeDict($source_file, $target_file = '')
Description: compile a text-file word list into a dictionary
Parameters:
$source_file: source text file
$target_file: target file (the current dictionary if not specified)
Return value: void

11. public function ExportDict($targetfile)
Description: export all entries of the current dictionary to a text file
Parameters:
$targetfile: target file
Return value: void
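To make the return shapes described for GetFinallyIndex and GetFinallyKeywords concrete, here is a small standalone sketch that builds the same kind of data from an already segmented word list. It does not call PHPAnalysis and is not how the library computes its index internally; `buildIndex` and `topKeywords` are hypothetical helper names.

```php
<?php
// Illustrative sketch only; it mimics the documented return shapes and does
// not reproduce PHPAnalysis's internal logic.

/** Count occurrences per word and sort by descending frequency. */
function buildIndex(array $words): array
{
    $index = array_count_values($words);
    arsort($index); // highest count first, keys (the words) preserved
    return $index;
}

/** Join the $num most frequent words with commas. */
function topKeywords(array $index, int $num = 10): string
{
    return implode(',', array_slice(array_keys($index), 0, $num));
}

$words = ['分词', '方法', '分词', '匹配', '分词', '方法'];
$index = buildIndex($words);
print_r($index);                   // 分词 => 3, 方法 => 2, 匹配 => 1
echo topKeywords($index, 2), "\n"; // 分词,方法
```

Words with equal counts may come out in either order depending on the PHP version, so the sample data avoids ties.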
Test code:
    <!DOCTYPE html>
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
    <title>test</title>
    <body>
    <?php
    // Load the PHPAnalysis class
    require_once 'phpanalysis2.0/phpanalysis.class.php';

    $pa = new PhpAnalysis();
    // Set the text to segment
    $pa->SetSource("PHPAnalysis分词系统是基于字符串匹配的分词方法进行分词的,这种方法又叫做机械分词方法,它是按照一定的策略将待分析的汉字串与 一个“充分大的”机器词典中的词条进行配,若在词典中找到某个字符串,则匹配成功(识别出一个词)。按照扫描方向的不同,串匹配分词方法可以分为正向匹配 和逆向匹配;按照不同长度优先匹配的情况,可以分为最大(最长)匹配和最小(最短)匹配;按照是否与词性标注过程相结合,又可以分为单纯分词方法和分词与 标注相结合的一体化方法。常用的几种机械分词方法如下: ");
    // Result type 2, with maximum-segmentation disambiguation enabled
    $pa->resultType = 2;
    $pa->differMax = true;
    $pa->StartAnalysis();
    // Fetch the index array and print it
    $arr = $pa->GetFinallyIndex();
    echo "<pre>";
    print_r($arr);
    echo "</pre>";
    ?>
    </body>
The script prints the array of words and their occurrence counts returned by GetFinallyIndex().