Automatic segmentation algorithm based on statistics

Source: Internet
Author: User

Introduction: Statistical segmentation uses the co-occurrence frequencies between characters and between words as the basis for word segmentation, so a pre-built dictionary is not strictly necessary. Large-scale training text is required to train the model parameters.
Pros and cons: The method is not restricted to a particular application domain, but the choice of training text affects the segmentation results.

Maximum-probability statistical word segmentation algorithm

First, the main principle

For any sentence, first list, in the order in which they occur in the sentence, all the words from the corpus that appear in it. Treat each of these words as a vertex, add a start vertex and an end vertex, and connect the vertices into a directed graph that follows the order of the sentence. Then assign a weight to each edge between two directly connected vertices: for an edge A→B, the weight is the cost of B (if B is the end vertex, the weight is 0). The original problem thus becomes a single-source shortest-path problem, whose optimal solution can be found by dynamic programming.
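
The shortest-path view can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the original author's code: it assumes the cost of a word is the negative logarithm of its probability, treats the gaps between characters as graph nodes (so each candidate word becomes an edge, which is equivalent to the word-as-vertex graph described above), and uses a hypothetical `word_prob` table plus a hypothetical `unknown_prob` fallback for single characters missing from the dictionary.

```python
import math


def shortest_path_segment(sentence, word_prob, unknown_prob=1e-8):
    """Segment `sentence` by finding the cheapest path through its word graph.

    Positions 0..n between characters are the nodes; every candidate word
    sentence[i:j] is an edge i -> j whose cost is -log P(word) (an assumption).
    """
    n = len(sentence)
    best_cost = [math.inf] * (n + 1)  # cheapest known cost to reach each position
    back = [0] * (n + 1)              # start index of the last word on that path
    best_cost[0] = 0.0

    for j in range(1, n + 1):
        for i in range(j):
            word = sentence[i:j]
            if word in word_prob:
                cost = -math.log(word_prob[word])
            elif j - i == 1:
                # Single characters are always allowed as a fallback.
                cost = -math.log(unknown_prob)
            else:
                continue  # multi-character strings must be in the dictionary
            if best_cost[i] + cost < best_cost[j]:
                best_cost[j] = best_cost[i] + cost
                back[j] = i

    # Walk the back-pointers from the end of the sentence to the start.
    words = []
    j = n
    while j > 0:
        i = back[j]
        words.append(sentence[i:j])
        j = i
    return words[::-1]
```

Because the logarithm turns products into sums, minimizing the total cost here selects the same path as maximizing the product of the word probabilities.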

Second, explanation of the idea

1. Get candidate words

Collect every word that may appear in the sentence as a candidate, subject to the following condition: a candidate longer than one character must appear in the dictionary, whereas a candidate of length 1 need not appear in the dictionary.
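
A minimal sketch of this candidate-collection rule is shown below. The function name `candidate_words`, the `(start, end, word)` triple representation, and the `max_len` cap on word length are assumptions made for illustration only.

```python
def candidate_words(sentence, dictionary, max_len=4):
    """Return (start, end, word) triples for every candidate in `sentence`.

    A multi-character substring is kept only if it is in `dictionary`;
    a single character is always kept, whether or not it is in the dictionary.
    """
    candidates = []
    for start in range(len(sentence)):
        last_end = min(start + max_len, len(sentence))
        for end in range(start + 1, last_end + 1):
            word = sentence[start:end]
            if end - start == 1 or word in dictionary:
                candidates.append((start, end, word))
    return candidates
```

Note that under this rule every single character becomes a candidate, so the list is usually longer than the set of dictionary words found in the sentence.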

2. Construct the predecessor words

Suppose the string is scanned from left to right, yielding candidate words w1, w2, ..., wi-1, wi, .... If the last character of wi-1 is adjacent to the first character of wi, then wi-1 is called a predecessor word of wi. In the example sentence "there are differences of opinion" (有意见分歧), the candidate "have" (有) is a predecessor of "opinion" (意见), and both "opinion" and "see" (见) are predecessors of "disagreement" (分歧). The leftmost word of the string has no predecessor.
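
The adjacency rule can be sketched as a lookup keyed on character positions. The representation of candidates as `(start, end, word)` triples and the function name `predecessors` are hypothetical choices for this illustration: a word is a predecessor of another when it ends exactly where the other begins.

```python
def predecessors(candidates):
    """Map each candidate's index to the indices of its predecessor words.

    `candidates` is a list of (start, end, word) triples; candidate j is a
    predecessor of candidate i when j's end position equals i's start position.
    """
    by_end = {}
    for idx, (_start, end, _word) in enumerate(candidates):
        by_end.setdefault(end, []).append(idx)
    return {
        idx: list(by_end.get(start, []))
        for idx, (start, _end, _word) in enumerate(candidates)
    }


# Dictionary-word candidates of the example sentence 有意见分歧.
example = [(0, 1, "有"), (0, 2, "有意"), (1, 3, "意见"), (2, 3, "见"), (3, 5, "分歧")]
example_preds = predecessors(example)
```

In `example_preds`, the two leftmost words have empty predecessor lists, and the last candidate (分歧) has both 意见 and 见 as predecessors.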

3. Find the best predecessor word

If a candidate word wi has several predecessor words wj, wk, ..., the one with the largest cumulative probability is called the best predecessor word of wi. For example, the candidate "opinion" has only one predecessor, "have", so "have" is also its best predecessor. The candidate "disagreement" has two predecessors, "opinion" and "see"; the cumulative probability of "opinion" is greater than that of "see", so "opinion" is the best predecessor of "disagreement".

4. Determine the optimal path

Backtrack: starting from the end of the string, follow the best-predecessor links backward to recover the optimal path.

Third, the algorithm description

① For a string S to be segmented, extract all candidate words w1, w2, ..., wi, ..., wn from left to right.
② Look up the probability P(wi) of each candidate in the dictionary, and record all left-neighbor words of each candidate (the candidates that immediately precede it).
③ Compute the cumulative probability of each candidate with Formula 1, P'(wi) = P'(wi-1) * P(wi), where wi-1 is the best left neighbor of wi, and at the same time record the best left-neighbor word of each candidate.
④ If the current word wn is a tail word of the string S and its cumulative probability P'(wn) is the largest, then wn is the end word of S.
⑤ Starting from wn, output the best left-neighbor words in turn from right to left; the resulting sequence is the segmentation of S.
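
The five steps above might be put together as in the following sketch. It is an illustration under assumptions rather than a reference implementation: `word_prob` is a hypothetical dictionary of word probabilities, `unknown_prob` is an assumed fallback for single characters not in the dictionary, and `max_len` is an assumed cap on candidate length.

```python
def max_prob_segment(sentence, word_prob, unknown_prob=1e-8, max_len=4):
    """Maximum-probability segmentation following steps 1-5 above."""
    n = len(sentence)
    if n == 0:
        return []

    # Step 1: collect candidates as (start, end, word), scanning left to right.
    cands = []
    for start in range(n):
        for end in range(start + 1, min(start + max_len, n) + 1):
            word = sentence[start:end]
            if word in word_prob or end - start == 1:
                cands.append((start, end, word))

    # Step 2: look up P(w) for each candidate.
    prob = [word_prob.get(word, unknown_prob) for _, _, word in cands]

    # Step 3: cumulative probability and best left neighbor of each candidate.
    # Left neighbors always start earlier, so they are already processed.
    cum = [0.0] * len(cands)
    best_left = [None] * len(cands)
    for i, (start, _end, _word) in enumerate(cands):
        if start == 0:
            cum[i] = prob[i]  # leftmost words have no left neighbor
            continue
        for j, (_j_start, j_end, _j_word) in enumerate(cands):
            if j_end == start and cum[j] * prob[i] > cum[i]:
                cum[i] = cum[j] * prob[i]
                best_left[i] = j

    # Step 4: the end word is the tail candidate with the largest P'.
    tails = [i for i, (_s, end, _w) in enumerate(cands) if end == n]
    i = max(tails, key=lambda k: cum[k])

    # Step 5: follow the best left neighbors from right to left.
    words = []
    while i is not None:
        words.append(cands[i][2])
        i = best_left[i]
    return words[::-1]
```

With the made-up probabilities `{"有": 0.018, "有意": 0.0005, "意见": 0.001, "见": 0.0002, "分歧": 0.0001}`, calling `max_prob_segment("有意见分歧", ...)` returns `["有", "意见", "分歧"]`. For long sentences the product of probabilities can underflow, so summing log-probabilities is a common alternative.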

Fourth, a demonstration example
① Scanning the sentence "there are differences of opinion" (有意见分歧) from left to right yields all the candidates: "have" (有), "intentional" (有意), "opinion" (意见), "see" (见), and "disagreement" (分歧).
② For each candidate, record its probability and initialize its cumulative probability to 0.
③ Compute the cumulative probability of each candidate in order, and record the best left-neighbor word of each candidate:
P'(have) = P(have),
P'(intentional) = P(intentional),
P'(opinion) = P'(have) * P(opinion) (the best left neighbor of "opinion" is "have"),
P'(see) = P'(intentional) * P(see) (the best left neighbor of "see" is "intentional"),
P'(opinion) > P'(see)
"Disagreement" is the tail word, and "opinion" is its best left-neighbor word. The segmentation process ends with the output: have/opinion/disagreement/

Fifth, disadvantages of the algorithm

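① The maximum-probability method cannot resolve every overlapping (intersection-type) ambiguity. For the sentence "this matter really cannot be settled" (这事的确定不下来):
W1 = 这/事/的确/定/不/下来/ (this/matter/really/settle/not/down/)
W2 = 这/事/的/确定/不/下来/ (this/matter/[particle]/confirm/not/down/)
P(W1) < P(W2), so the algorithm chooses W2 even though W1 is the intended reading.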

② It cannot resolve combinatorial ambiguity. For the sentence "only after finishing homework can one watch TV" (做完作业才能看电视):
W1 = 做/完/作业/才能/看/电视/ (才能 kept as a single word)
W2 = 做/完/作业/才/能/看/电视/ (才能 split into 才 "only then" and 能 "can")
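
The first drawback can be illustrated numerically. The sketch below takes the first example sentence to be 这事的确定不下来 and uses entirely made-up probabilities in `word_prob`. Its point is only that the score of a segmentation is the product of independent word probabilities, so a very frequent word such as 的 can outweigh the reading that context would favor.

```python
import math


def sequence_prob(words, word_prob):
    """Probability of a segmentation: the product of its word probabilities."""
    return math.prod(word_prob[w] for w in words)


# Hypothetical probabilities, chosen only to illustrate the effect.
word_prob = {
    "这": 0.01, "事": 0.002, "的": 0.06, "的确": 0.0003,
    "确定": 0.0008, "定": 0.0005, "不": 0.02, "下来": 0.001,
}

w1 = ["这", "事", "的确", "定", "不", "下来"]  # intended reading
w2 = ["这", "事", "的", "确定", "不", "下来"]  # reading the model prefers

p1 = sequence_prob(w1, word_prob)
p2 = sequence_prob(w2, word_prob)
chosen = w2 if p2 > p1 else w1
```

With these numbers the two readings differ only in the factors P(的确) * P(定) versus P(的) * P(确定), and the second product is larger, so `chosen` is `w2`.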
