A good PHP word segmentation system: PHPAnalysis, a component-free word segmentation system
While building a model image library, I needed to perform word segmentation on the image titles. After a long search, I finally found a good word segmentation system.
Introduction to the word segmentation system: PHPAnalysis segments text using a Unicode dictionary and reverse matching. In theory this makes it compatible with a wide range of encodings, and it is especially convenient for UTF-8. Because PHPAnalysis requires no compiled component (PHP extension), it is somewhat slower than component-based segmenters. When segmenting large amounts of text, however, throughput rises as the amount of content grows, because the dictionary-loading cost has already been paid; this is normal. PHPAnalysis stores its dictionary in a hash-like data structure, so segmenting a short string needs only a small amount of resources, which is far more efficient than loading every entry at once, and the size of the dictionary does not affect segmentation speed.
PHPAnalysis performs word segmentation based on string matching. This approach, also called mechanical word segmentation, matches the Chinese character string being analyzed against the entries of a "sufficiently large" machine dictionary according to some strategy. If a string is found in the dictionary, the match succeeds and a word is recognized. By scanning direction, string-matching methods are divided into forward matching and reverse matching. By which length is tried first, they are divided into maximum (longest) matching and minimum (shortest) matching. By whether they are combined with part-of-speech tagging, they are divided into pure segmentation methods and integrated methods that combine segmentation with tagging. Several common mechanical word segmentation methods are as follows:
1) forward maximum matching (from left to right);
2) reverse maximum matching (from right to left);
3) minimum segmentation (minimize the number of words cut from each sentence).
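The first two methods in the list can be sketched as follows. This is a minimal illustration, not PHPAnalysis source code: the function names, the toy dictionary, and the maximum word length of 3 are all assumptions made for the example. The dictionary is stored as array keys so that each lookup is a hash lookup.

```php
<?php
// Illustrative sketch only; not PHPAnalysis source code.

/** Split a UTF-8 string into an array of characters (no mbstring needed). */
function splitChars(string $text): array
{
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

/** Forward maximum matching: scan left to right, longest candidate first. */
function forwardMaxMatch(string $text, array $dict, int $maxLen): array
{
    $chars = splitChars($text);
    $n = count($chars);
    $words = [];
    $i = 0;
    while ($i < $n) {
        $matched = 1; // fall back to a single character
        for ($len = min($maxLen, $n - $i); $len > 1; $len--) {
            if (isset($dict[implode('', array_slice($chars, $i, $len))])) {
                $matched = $len;
                break;
            }
        }
        $words[] = implode('', array_slice($chars, $i, $matched));
        $i += $matched;
    }
    return $words;
}

/** Reverse maximum matching: scan right to left, longest candidate first. */
function reverseMaxMatch(string $text, array $dict, int $maxLen): array
{
    $chars = splitChars($text);
    $words = [];
    $end = count($chars); // exclusive end of the part not yet segmented
    while ($end > 0) {
        $matched = 1; // fall back to a single character
        for ($len = min($maxLen, $end); $len > 1; $len--) {
            if (isset($dict[implode('', array_slice($chars, $end - $len, $len))])) {
                $matched = $len;
                break;
            }
        }
        array_unshift($words, implode('', array_slice($chars, $end - $matched, $matched)));
        $end -= $matched;
    }
    return $words;
}

// Hypothetical toy dictionary; words are keys so isset() is a hash lookup.
$dict = array_fill_keys(['研究', '研究生', '生命', '起源'], true);
$text = '研究生命的起源';

echo implode(' / ', forwardMaxMatch($text, $dict, 3)), "\n"; // 研究生 / 命 / 的 / 起源
echo implode(' / ', reverseMaxMatch($text, $dict, 3)), "\n"; // 研究 / 生命 / 的 / 起源
```

On this sentence the forward scan greedily takes 研究生 and leaves a stray 命, while the reverse scan recovers 研究 and 生命, which is the kind of ambiguity where the scanning direction matters.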
These methods can also be combined. For example, forward maximum matching and reverse maximum matching can be combined into a bidirectional matching method. Because Chinese forms words from individual characters, forward minimum matching and reverse minimum matching are rarely used. In general, reverse matching is slightly more accurate than forward matching and produces fewer ambiguities. Statistics show that the error rate of forward matching is 1/169, while the error rate of reverse matching is 1/245. Even so, this accuracy is far from sufficient in practice. Real word segmentation systems use mechanical segmentation only as an initial pass and draw on other linguistic information to improve accuracy further.
One approach is to improve the scanning method, which is known as feature scanning or mark segmentation: words with obvious features are recognized and cut out of the string first, and these words then serve as breakpoints that divide the original string into shorter strings for mechanical segmentation, which lowers the matching error rate. Another approach is to combine word segmentation with part-of-speech tagging: rich part-of-speech information helps the segmentation decisions, and the tagging process in turn checks and adjusts the segmentation results, which greatly improves segmentation accuracy.
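The bidirectional matching idea mentioned above can be sketched like this. The selection rule used here (fewer words wins, then fewer single-character words, then the reverse result) is a common heuristic chosen for the example and is an assumption, not something the text or PHPAnalysis specifies. All function names and the toy dictionary are hypothetical.

```php
<?php
// Illustrative sketch only; not PHPAnalysis source code.

/** Join a slice of the scan sequence back into a word in reading order. */
function pieceToWord(array $seq, int $start, int $len, bool $reverse): string
{
    $piece = array_slice($seq, $start, $len);
    return implode('', $reverse ? array_reverse($piece) : $piece);
}

/** Maximum matching; $reverse = true scans from right to left. */
function maxMatch(array $chars, array $dict, int $maxLen, bool $reverse): array
{
    // Reverse scanning is forward scanning over the reversed character list.
    $seq = $reverse ? array_reverse($chars) : $chars;
    $n = count($seq);
    $words = [];
    $i = 0;
    while ($i < $n) {
        $matched = 1; // fall back to a single character
        for ($len = min($maxLen, $n - $i); $len > 1; $len--) {
            if (isset($dict[pieceToWord($seq, $i, $len, $reverse)])) {
                $matched = $len;
                break;
            }
        }
        $words[] = pieceToWord($seq, $i, $matched, $reverse);
        $i += $matched;
    }
    return $reverse ? array_reverse($words) : $words;
}

/** Count words that consist of exactly one character. */
function countSingleChars(array $words): int
{
    $count = 0;
    foreach ($words as $word) {
        if (count(preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY)) === 1) {
            $count++;
        }
    }
    return $count;
}

/** Bidirectional matching with an assumed selection heuristic. */
function bidirectionalMatch(string $text, array $dict, int $maxLen): array
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $forward = maxMatch($chars, $dict, $maxLen, false);
    $reverse = maxMatch($chars, $dict, $maxLen, true);
    if (count($forward) !== count($reverse)) {
        return count($forward) < count($reverse) ? $forward : $reverse;
    }
    // Same word count: prefer fewer single characters, then the reverse result.
    return countSingleChars($forward) < countSingleChars($reverse) ? $forward : $reverse;
}

// Hypothetical toy dictionary, stored as keys for hash lookup.
$dict = array_fill_keys(['研究', '研究生', '生命', '起源'], true);
echo implode(' / ', bidirectionalMatch('研究生命的起源', $dict, 3)), "\n"; // 研究 / 生命 / 的 / 起源
```

Both directions yield four words here, so the tie is broken by the number of single characters: the forward result has two (命, 的) and the reverse result has one (的), so the reverse result is returned.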
PHPAnalysis first performs a rough segmentation of the text, then applies a second pass of reverse matching (RMM) to the short sentences produced by the rough pass. After segmentation, the result is optimized to produce the final segmentation result.
I. Important member variables

$resultType = 1      Type of data included in the segmentation result (1 = everything; 2 = dictionary words plus single Chinese/Japanese/Korean characters and English; 3 = dictionary words and English words). This variable is usually set with SetResultType($rstype).
$notSplitLen = 5     Minimum sentence length for splitting.
$toLower = false     Convert all English words to lower case.
$differMax = false   Use maximum-splitting mode to disambiguate two-character words.
$unitWord = true     Try to merge words (that is, new word recognition).
$differFreq = false  Give priority to popular (high-frequency) words.

II. Main member functions

1. public function __construct($source_charset = 'utf-8', $target_charset = 'utf-8', $load_all = true, $source = '')
   Description: constructor.
   Parameters:
     $source_charset  encoding of the source string
     $target_charset  encoding of the target string
     $load_all        whether to load the whole dictionary (this parameter is no longer used)
     $source          source string
   If both input and output are UTF-8, you can initialize the object without any parameters and set the text to process with SetSource.

2. public function SetSource($source, $source_charset = 'utf-8', $target_charset = 'utf-8')
   Description: sets the source string.
   Parameters:
     $source          source string
     $source_charset  encoding of the source string
     $target_charset  encoding of the target string
   Return value: bool

3. public function StartAnalysis($optimize = true)
   Description: starts word segmentation.
   Parameters:
     $optimize  whether to optimize the result after segmentation
   Return value: void
   A basic word segmentation process:

     $pa = new PhpAnalysis();
     $pa->SetSource('string to be segmented');
     // set the word segmentation properties
     $pa->resultType = 2;
     $pa->differMax = true;
     $pa->StartAnalysis();
     // get the result you want
     $pa->GetFinallyIndex();

4. public function SetResultType($rstype)
   Description: sets the type of the returned result; in effect it operates on the member variable $resultType.
   Parameters:
     $rstype  1 = everything; 2 = dictionary words plus single Chinese/Japanese/Korean characters and English; 3 = dictionary words and English
   Return value: void

5. public function GetFinallyKeywords($num = 10)
   Description: gets the specified number of most frequent entries (usually used to extract document keywords).
   Parameters:
     $num = 10  number of entries to return
   Return value: list of keywords separated by ","

6. public function GetFinallyResult($spword = '')
   Description: gets the final word segmentation result.
   Parameters:
     $spword  separator placed between entries
   Return value: string

7. public function GetSimpleResult()
   Description: gets the rough segmentation result.
   Return value: array

8. public function GetSimpleResultAll()
   Description: gets the rough segmentation result together with its attributes (1 = Chinese words; 2 = ANSI words, including fullwidth; 3 = ANSI punctuation, including fullwidth; 4 = digits, including fullwidth; 5 = Chinese punctuation or unrecognized characters).
   Return value: array

9. public function GetFinallyIndex()
   Description: gets the hash index array.
   Return value: array('word' => count, ...), sorted by frequency

10. public function MakeDict($source_file, $target_file = '')
    Description: compiles a text-file dictionary into the dictionary format.
    Parameters:
      $source_file  source text file
      $target_file  target file (if not specified, the current dictionary is used)
    Return value: void

11. public function ExportDict($targetfile)
    Description: exports all entries in the current dictionary to a text file.
    Parameters:
      $targetfile  target file
    Return value: void
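To make the documented return shapes of GetFinallyIndex and GetFinallyKeywords more concrete, here is a minimal sketch that builds the same kind of values from a plain list of words. It does not call PHPAnalysis and is not its implementation; buildIndex, topKeywords, and the sample word list are hypothetical names and data used only for illustration.

```php
<?php
// Illustrative sketch only; not PHPAnalysis source code.

/** Build a 'word' => count array sorted by descending frequency. */
function buildIndex(array $words): array
{
    $index = array_count_values($words); // count occurrences of each word
    arsort($index);                      // highest count first, keys preserved
    return $index;
}

/** Return the $num most frequent words as a comma-separated string. */
function topKeywords(array $index, int $num = 10): string
{
    return implode(',', array_slice(array_keys($index), 0, $num));
}

// Hypothetical segmentation output with distinct frequencies.
$words = ['分词', '系统', '分词', '词典', '分词', '系统'];

$index = buildIndex($words);
print_r($index);                     // 分词 => 3, 系统 => 2, 词典 => 1
echo topKeywords($index, 2), "\n";   // 分词,系统
```

Words with equal counts may come out in a different relative order depending on the PHP version, so the sample data above uses distinct counts.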