A good PHP word segmentation library

Source: Internet
Author: User

PHPAnalysis source download and demo: PHP word segmenter V2.0 download | PHP segmentation system demo | PHPAnalysis class API documentation

Original link: http://www.phpbone.com/phpanalysis/

Introduction to the word segmentation system: PHPAnalysis uses a Unicode-based dictionary and segments text by reverse matching. In theory this makes it compatible with a wider range of encodings, and it is especially convenient for UTF-8. Because PHPAnalysis is a component-free system, it is slightly slower than a component-based segmenter. When segmenting a large amount of text, however, the dictionary is loaded while segmentation proceeds, so the more content there is, the faster it feels; this is normal. The PHPAnalysis dictionary uses a hash-like data structure, so segmenting a short string needs only minimal resources, which is far more efficient than loading every entry up front, and the size of the dictionary does not affect segmentation speed.
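The on-demand loading described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the class name LazyDict, the bucket count, and the in-memory array standing in for the dictionary file are all hypothetical and do not reflect the real PHPAnalysis dictionary format.

```php
<?php
// Sketch only: LazyDict, the bucket count and the in-memory "disk" are
// assumptions for illustration, not the PHPAnalysis dictionary format.

final class LazyDict
{
    private $disk = [];    // stands in for the dictionary file: bucket => [word => true]
    private $cache = [];   // buckets that have already been loaded into memory
    private $buckets;
    public $loads = 0;     // how many buckets had to be read so far

    public function __construct(array $words, int $buckets = 64)
    {
        $this->buckets = $buckets;
        foreach ($words as $w) {
            $this->disk[$this->bucketOf($w)][$w] = true;
        }
    }

    /** Map a word to a bucket number with a hash of its bytes. */
    private function bucketOf(string $word): int
    {
        return (crc32($word) & 0x7fffffff) % $this->buckets;
    }

    /** Look a word up, reading only the bucket it hashes to. */
    public function has(string $word): bool
    {
        $b = $this->bucketOf($word);
        if (!isset($this->cache[$b])) {
            // Only this bucket is read; the rest of the dictionary stays on "disk".
            $this->cache[$b] = $this->disk[$b] ?? [];
            $this->loads++;
        }
        return isset($this->cache[$b][$word]);
    }
}

$dict = new LazyDict(['分词', '系统', '词典']);
$first = $dict->has('分词');        // reads one bucket
$loadsAfterFirst = $dict->loads;
$second = $dict->has('分词');       // same bucket, served from memory
```

A short string touches only a few buckets, and repeated lookups on a long text increasingly hit buckets that are already cached, which matches the behavior the paragraph describes.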
PHPAnalysis is based on string-matching segmentation, also called mechanical segmentation. The Chinese character string to be analyzed is matched, according to some strategy, against the entries of a "sufficiently large" machine dictionary; if a string is found in the dictionary, the match succeeds and a word is identified. By scanning direction, string-matching methods divide into forward matching and reverse matching; by which length is preferred, into maximum (longest) matching and minimum (shortest) matching; and by whether segmentation is combined with part-of-speech tagging, into pure segmentation methods and integrated segmentation-and-tagging methods. Several commonly used mechanical segmentation methods are:
1) Forward maximum matching (left to right);
2) Reverse maximum matching (right to left);
3) Minimum segmentation (minimizing the number of words cut from each sentence).
These methods can be combined; for example, forward maximum matching and reverse maximum matching together form bidirectional matching. Because single characters can themselves be words in Chinese, forward and reverse minimum matching are rarely used. In general, reverse matching is slightly more accurate than forward matching and produces fewer ambiguities. Statistics show an error rate of 1/169 for pure forward maximum matching and 1/245 for pure reverse maximum matching. That accuracy is still far from what real applications need. Practical segmentation systems therefore use mechanical segmentation only as a first pass and draw on other linguistic information to improve accuracy further. One approach is to improve the scanning method, known as feature scanning or marker segmentation: words with obvious features are identified and cut out of the string first, and with these words as breakpoints the original string is divided into smaller strings that are then segmented mechanically, which reduces the matching error rate. Another approach is to combine segmentation with part-of-speech tagging: the rich part-of-speech information helps the segmentation decisions, and the tagging process in turn checks and adjusts the segmentation results, which greatly improves accuracy.
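The difference between forward and reverse maximum matching is easiest to see on a classic ambiguous sentence. The following is a minimal sketch, not PHPAnalysis code: the toy dictionary, the function names, and the tie-break rule in bidirectional() (fewer words, then fewer single characters, otherwise the reverse result) are assumptions chosen for illustration.

```php
<?php
// Illustrative sketch only: toy dictionary and hypothetical helper names.

/** Split a UTF-8 string into an array of characters. */
function chars(string $text): array
{
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

/** Forward maximum matching: scan left to right, longest dictionary word first. */
function forwardMaxMatch(string $text, array $dict, int $maxLen): array
{
    $c = chars($text);
    $n = count($c);
    $words = [];
    $i = 0;
    while ($i < $n) {
        $len = min($maxLen, $n - $i);
        // Shrink the window until it is a dictionary word or a single character.
        while ($len > 1 && !isset($dict[implode('', array_slice($c, $i, $len))])) {
            $len--;
        }
        $words[] = implode('', array_slice($c, $i, $len));
        $i += $len;
    }
    return $words;
}

/** Reverse maximum matching: scan right to left, longest dictionary word first. */
function reverseMaxMatch(string $text, array $dict, int $maxLen): array
{
    $c = chars($text);
    $end = count($c);
    $words = [];
    while ($end > 0) {
        $len = min($maxLen, $end);
        while ($len > 1 && !isset($dict[implode('', array_slice($c, $end - $len, $len))])) {
            $len--;
        }
        array_unshift($words, implode('', array_slice($c, $end - $len, $len)));
        $end -= $len;
    }
    return $words;
}

/** Assumed tie-break: fewer words, then fewer single characters, else reverse. */
function bidirectional(array $fmm, array $rmm): array
{
    $singles = function (array $w): int {
        return count(array_filter($w, function ($x) {
            return count(chars($x)) === 1;
        }));
    };
    if (count($fmm) !== count($rmm)) {
        return count($fmm) < count($rmm) ? $fmm : $rmm;
    }
    return $singles($fmm) < $singles($rmm) ? $fmm : $rmm;
}

$dict = array_flip(['研究', '研究生', '生命', '命', '的', '起源']);
$text = '研究生命的起源';
$fmm = forwardMaxMatch($text, $dict, 3);
$rmm = reverseMaxMatch($text, $dict, 3);
echo implode(' / ', $fmm), "\n";   // 研究生 / 命 / 的 / 起源
echo implode(' / ', $rmm), "\n";   // 研究 / 生命 / 的 / 起源
echo implode(' / ', bidirectional($fmm, $rmm)), "\n";
```

On this input the forward scan greedily takes 研究生 and leaves a stray 命, while the reverse scan recovers 研究 and 生命, which illustrates why reverse matching tends to produce fewer errors.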
PHPAnalysis first splits the text to be segmented coarsely, then runs a second pass over the resulting short sentences using reverse maximum matching (RMM), and finally optimizes the segmentation to produce the final result.

PHPAnalysis class API documentation

I. Important member variables

$resultType = 1       Type of result data the segmenter produces: 1 for everything; 2 for dictionary words plus single CJK characters and English; 3 for dictionary words and English. This variable is usually set through the SetResultType($rstype) method.
$notSplitLen = 5      Minimum sentence length for splitting.
$toLower = false      Convert all English words to lowercase.
$differMax = false    Use maximum segmentation mode to disambiguate two-character words.
$unitWord = true      Try to merge single characters into words (new word recognition).
$differFreq = false   Use popular-word-first mode for disambiguation.

II. Key member functions

1. public function __construct($source_charset='utf-8', $target_charset='utf-8', $load_all=true, $source='')
Description: constructor.
Parameters:
    $source_charset    encoding of the source string
    $target_charset    encoding of the target string
    $load_all          whether to load the whole dictionary (this parameter is deprecated)
    $source            source string
If both input and output are UTF-8, you can create the object without any parameters and set the text to process with the SetSource method.

2. public function SetSource($source, $source_charset='utf-8', $target_charset='utf-8')
Description: set the source string.
Parameters:
    $source            source string
    $source_charset    encoding of the source string
    $target_charset    encoding of the target string
Return value: bool

3. public function StartAnalysis($optimize=true)
Description: start the segmentation.
Parameters:
    $optimize          whether to try to optimize the result after segmentation
Return value: void
A basic segmentation run:

    $pa = new PhpAnalysis();
    $pa->SetSource('String to be segmented');
    // Set segmentation properties
    $pa->resultType = 2;
    $pa->differMax = true;
    $pa->StartAnalysis();
    // Get the results you want
    $pa->GetFinallyIndex();

4. public function SetResultType($rstype)
Description: set the type of the returned result; this simply sets the member variable $resultType.
Parameters:
    $rstype            1 for everything; 2 for dictionary words plus single CJK characters and English; 3 for dictionary words and English
Return value: void

5. public function GetFinallyKeywords($num = 10)
Description: get the specified number of highest-frequency entries (typically used to extract document keywords).
Parameters:
    $num = 10          number of entries to return
Return value: a list of keywords separated by ","

6. public function GetFinallyResult($spword=' ')
Description: get the final segmentation result.
Parameters:
    $spword            separator placed between entries
Return value: string

7. public function GetSimpleResult()
Description: get the coarse segmentation result.
Return value: array

8. public function GetSimpleResultAll()
Description: get the coarse segmentation result together with attribute information (1 Chinese word; 2 ANSI word, including full-width; 3 ANSI punctuation, including full-width; 4 number, including full-width; 5 Chinese punctuation or unrecognized character).
Return value: array

9. public function GetFinallyIndex()
Description: get the hash index array.
Return value: array('word' => count, ...), sorted by occurrence frequency

10. public function MakeDict($source_file, $target_file='')
Description: compile a text-file word list into a dictionary.
Parameters:
    $source_file       source text file
    $target_file       target file (the current dictionary if not specified)
Return value: void

11. public function ExportDict($targetfile)
Description: export all entries of the current dictionary to a text file.
Parameters:
    $targetfile        destination file
Return value: void
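To make the documented return shapes of GetFinallyIndex and GetFinallyKeywords concrete, here is a small stand-alone sketch. It does not call PHPAnalysis and is not the library's code: buildIndex() and topKeywords() are hypothetical helpers that only imitate the shapes described above, starting from an already segmented word list.

```php
<?php
// Sketch only: hypothetical helpers that imitate the documented return shapes.

/** Build ['word' => count, ...] sorted by descending frequency. */
function buildIndex(array $words): array
{
    $index = array_count_values($words);
    arsort($index);   // sort by count, highest first, keeping the word keys
    return $index;
}

/** Return the $num most frequent words as a comma-separated string. */
function topKeywords(array $words, int $num = 10): string
{
    return implode(',', array_slice(array_keys(buildIndex($words)), 0, $num));
}

$words = ['分词', '系统', '分词', '词典', '分词', '系统'];
$index = buildIndex($words);
$keywords = topKeywords($words, 2);
echo $keywords, "\n";   // 分词,系统
```

In practice the word list would come from a segmenter, and very common function words would normally be filtered out before counting so that they do not dominate the keyword list.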
