PHPAnalysis: a simple and convenient PHP word segmentation class

The PHPAnalysis word segmentation program uses a Unicode dictionary and segments text by reverse matching. In theory this makes it compatible with a wide range of encodings, and it is especially convenient for UTF-8. Because PHPAnalysis is written in pure PHP and does not rely on a compiled extension, it is somewhat slower than extension-based segmenters. When segmenting a large amount of text, however, dictionary entries are loaded as segmentation proceeds, so the more content there is, the faster it appears to run. This is normal.

The PHPAnalysis dictionary is stored in a hash-like data structure. As a result, segmenting a relatively short string needs very few resources, which is far more efficient than loading every entry up front, and the size of the dictionary does not affect segmentation speed.
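To illustrate why a hash-style dictionary keeps lookups cheap, here is a minimal sketch that uses a plain PHP array (itself a hash table) as a stand-in. The entries and the `isWord` helper are hypothetical and do not reflect the actual dictionary file format used by PHPAnalysis.

```php
<?php
// Hypothetical in-memory dictionary: word => frequency.
// PHP arrays are hash tables, so looking up one key does not scan the others.
$dict = [
    '分词' => 1200,
    '系统' => 3400,
    '词典' => 800,
];

// Returns true when $candidate is a dictionary entry.
function isWord(array $dict, string $candidate): bool
{
    return isset($dict[$candidate]);
}

var_dump(isWord($dict, '词典'));   // bool(true)
var_dump(isWord($dict, '词库'));   // bool(false)
```

The lookup cost depends on the key being tested, not on how many entries the array holds, which is the property the paragraph above relies on.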

PHPAnalysis is based on the string-matching approach to word segmentation, also called mechanical word segmentation. Following a chosen strategy, the Chinese character string to be analyzed is matched against the entries of a sufficiently large machine dictionary. If a string is found in the dictionary, the match succeeds and a word has been identified.

By scanning direction, string-matching segmentation can be divided into forward matching and reverse matching. By length preference, it can be divided into maximum (longest) matching and minimum (shortest) matching. By whether it is combined with part-of-speech (POS) tagging, it can be divided into pure segmentation and integrated segmentation plus tagging. Several commonly used mechanical segmentation methods are:

Forward maximum matching (scanning from left to right).
Reverse maximum matching (scanning from right to left).
Minimum segmentation (minimizing the number of words cut from each sentence).

These methods can also be combined. For example, forward maximum matching and reverse maximum matching can be combined into a bidirectional matching method. Because of the way Chinese words are formed, forward minimum matching and reverse minimum matching are seldom used. Generally speaking, reverse matching is slightly more accurate than forward matching and produces fewer ambiguities.
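The matching strategies above can be sketched in a few lines of PHP. This is an illustrative sketch only, not the code PHPAnalysis uses: the function names, the four-entry dictionary, and the `$maxLen` window are assumptions, and the tie-breaking rule in `biMatch` is just one common heuristic.

```php
<?php
// Split a UTF-8 string into an array of characters (no mbstring required).
function chars(string $s): array
{
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

// Forward maximum matching: scan left to right, always taking the longest
// dictionary word that starts at the current position.
function forwardMaxMatch(string $text, array $dict, int $maxLen = 4): array
{
    $c = chars($text);
    $n = count($c);
    $out = [];
    for ($i = 0; $i < $n;) {
        $len = min($maxLen, $n - $i);
        // Shrink the window until it is a dictionary word or a single character.
        while ($len > 1 && !isset($dict[implode('', array_slice($c, $i, $len))])) {
            $len--;
        }
        $out[] = implode('', array_slice($c, $i, $len));
        $i += $len;
    }
    return $out;
}

// Reverse maximum matching: scan right to left, always taking the longest
// dictionary word that ends at the current position.
function reverseMaxMatch(string $text, array $dict, int $maxLen = 4): array
{
    $c = chars($text);
    $out = [];
    for ($end = count($c); $end > 0;) {
        $len = min($maxLen, $end);
        while ($len > 1 && !isset($dict[implode('', array_slice($c, $end - $len, $len))])) {
            $len--;
        }
        array_unshift($out, implode('', array_slice($c, $end - $len, $len)));
        $end -= $len;
    }
    return $out;
}

// Bidirectional matching (assumed heuristic): run both directions and keep
// the result with fewer words; on a tie, prefer the reverse result.
function biMatch(string $text, array $dict, int $maxLen = 4): array
{
    $f = forwardMaxMatch($text, $dict, $maxLen);
    $r = reverseMaxMatch($text, $dict, $maxLen);
    return count($f) < count($r) ? $f : $r;
}

// Hypothetical toy dictionary stored as a hash for constant-time lookups.
$dict = array_fill_keys(['研究', '研究生', '生命', '起源'], true);

echo implode('/', forwardMaxMatch('研究生命起源', $dict)), "\n"; // 研究生/命/起源
echo implode('/', reverseMaxMatch('研究生命起源', $dict)), "\n"; // 研究/生命/起源
echo implode('/', biMatch('研究生命起源', $dict)), "\n";         // 研究/生命/起源
```

With this toy dictionary the forward scan greedily takes 研究生 and strands the single character 命, while the reverse scan yields the intended reading, which matches the observation that reverse matching tends to produce fewer ambiguities.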

Statistics show that forward maximum matching used on its own has an error rate of 1/169, while reverse maximum matching used on its own has an error rate of 1/245. This level of accuracy is still far from sufficient in practice. Real-world segmentation systems use mechanical segmentation as a first pass and then draw on various other kinds of linguistic information to improve accuracy further.

One approach is to improve the scanning step, known as feature scanning or marker-based splitting. Words with obvious features are identified and cut out of the string first and used as breakpoints. The original string is thereby divided into smaller strings, which are then passed to mechanical segmentation, reducing the matching error rate.
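A minimal sketch of the breakpoint idea follows, assuming that punctuation and whitespace are the only markers used. The helper name `splitAtBreakpoints` is hypothetical, and a real system would also treat distinctive words as breakpoints.

```php
<?php
// Cut a UTF-8 string at punctuation and whitespace so that each piece can be
// segmented independently. \p{P} matches any Unicode punctuation character,
// which covers both ASCII and full-width Chinese punctuation.
function splitAtBreakpoints(string $text): array
{
    $parts = preg_split('/[\p{P}\s]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : $parts;
}

print_r(splitAtBreakpoints('研究生命起源，然后分词。结果很好！'));
```

Each returned piece is shorter than the original string, so a matching error in one piece cannot propagate across a punctuation boundary into the next.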

Another approach is to combine word segmentation with part-of-speech tagging. Rich part-of-speech information helps with segmentation decisions, and the tagging process in turn checks and adjusts the segmentation result, which greatly improves accuracy.

PHPAnalysis first performs a rough split of the text to be segmented. It then segments the resulting short sentences in a second pass using the reverse maximum matching method (RMM), optimizes that result, and only then produces the final segmentation result.

API documentation

Member variables

Variable        Default   Description
$resultType     1         Data type of the generated segmentation result:
                          1 = all results
                          2 = dictionary words, single CJK characters (simplified and traditional), and English
                          3 = dictionary words and English
                          This variable is normally set with the SetResultType($rstype) method.
$notSplitLen    5         Minimum sentence length for splitting
$toLower        false     Convert all English words to lowercase
$differMax      false     Use maximum segmentation mode to disambiguate two-character words
$unitWord       true      Try to merge words (that is, new word recognition)
$differFreq     false     Use high-frequency-word priority mode for disambiguation

Member functions

__construct()

public function __construct($source_charset='utf-8', $target_charset='utf-8', $load_all=true, $source='')
Description: constructor

Parameters:
$source_charset  Encoding of the source string
$target_charset  Encoding of the target (output) string
$load_all        Whether to load the whole dictionary (this parameter is no longer used)
$source          Source string

If both input and output are UTF-8, you can construct the object without any arguments and set the text to process with the SetSource method instead.

SetSource()

public function SetSource($source, $source_charset='utf-8', $target_charset='utf-8')
Description: sets the source string

Parameters:
$source          Source string
$source_charset  Encoding of the source string
$target_charset  Encoding of the target (output) string

Return value: bool

StartAnalysis()

public function StartAnalysis($optimize=true)
Description: starts the segmentation

Parameters:
$optimize  Whether to try to optimize the result after segmentation

Return value: void

A basic segmentation workflow:

$pa = new PhpAnalysis();
$pa->SetSource('String to be segmented');
// Set segmentation properties (property names are case-sensitive in PHP)
$pa->resultType = 2;
$pa->differMax = true;
$pa->StartAnalysis();
// Get the result you need
$pa->GetFinallyIndex();

SetResultType()

public function SetResultType($rstype)
Description: sets the type of result returned. This simply assigns the member variable $resultType.

Values of $rstype:

1 = all results
2 = dictionary words, single CJK characters (simplified and traditional), and English
3 = dictionary words and English

Return value: void

GetFinallyKeywords()

public function GetFinallyKeywords($num = 10)
Description: gets the specified number of most frequently occurring entries (typically used to extract document keywords)

Parameters:
$num = 10  Number of entries to return

Return value: a list of keywords separated by ","
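The idea behind frequency-based keyword extraction can be sketched as follows. This is not the library's implementation: `topKeywords` is a hypothetical helper that works on an already segmented word list and applies none of the filtering a real extractor would need.

```php
<?php
// Count how often each word occurs, then return the $num most frequent
// words joined by commas.
function topKeywords(array $words, int $num = 10): string
{
    $counts = array_count_values($words);  // word => number of occurrences
    arsort($counts);                        // highest count first, keys kept
    return implode(',', array_keys(array_slice($counts, 0, $num, true)));
}

$words = ['分词', '词典', '分词', '系统', '分词', '词典'];
echo topKeywords($words, 2), "\n";  // 分词,词典
```

The intermediate `$counts` array has the same word => count shape that GetFinallyIndex() is documented to return.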

GetFinallyResult()

public function GetFinallyResult($spword=' ')
Description: gets the final segmentation result

Parameters:
$spword  Separator placed between entries

Return value: string

GetSimpleResult()

public function GetSimpleResult()
Description: gets the rough segmentation result

Return value: array

GetSimpleResultAll()

public function GetSimpleResultAll()
Description: gets the rough segmentation result together with attribute information

Attributes: 1 = Chinese words, 2 = ANSI words (including full-width), 3 = ANSI punctuation (including full-width), 4 = numbers (including full-width), 5 = Chinese punctuation or unrecognized characters

Return value: array

GetFinallyIndex()

public function GetFinallyIndex()
Description: gets the hash index array

Return value: array('word' => count, ...), sorted by frequency of occurrence

MakeDict()

public function MakeDict($source_file, $target_file='')
Description: compiles a plain-text word list into a dictionary

Parameters:
$source_file  Source text file
$target_file  Target file (the current dictionary if not specified)

Return value: void

ExportDict()

public function ExportDict($targetfile)
Description: exports all entries of the current dictionary to a text file

Parameters:
$targetfile  Target file

Return value: void

Simple example

require_once 'phpanalysis2.0/phpanalysis.class.php';
$pa = new PhpAnalysis();

// The text is single-quoted so that the inner double quotes need no escaping.
$pa->SetSource('PHPAnalysis is based on the string-matching approach to word segmentation, also called mechanical word segmentation. Following a chosen strategy, the Chinese character string to be analyzed is matched against the entries of a "sufficiently large" machine dictionary. If a string is found in the dictionary, the match succeeds (a word is identified). By scanning direction, string-matching segmentation can be divided into forward matching and reverse matching. By length preference, it can be divided into maximum (longest) matching and minimum (shortest) matching. By whether it is combined with POS tagging, it can be divided into pure segmentation and integrated segmentation plus tagging. Several commonly used mechanical segmentation methods are:');
$pa->resultType = 2;
$pa->differMax = true;
$pa->StartAnalysis();
// Fetch the word => count index and print it
$arr = $pa->GetFinallyIndex();
echo "<pre>";
print_r($arr);
echo "</pre>";
