On Word Segmentation Algorithms (2): Dictionary-Based Word Segmentation


[TOC]

Objective

In On Word Segmentation Algorithms (1): Basic Problems in Word Segmentation, we discussed the basic problems in word segmentation and also mentioned dictionary-based segmentation methods. Dictionary-based word segmentation is the more traditional approach, and there are many methods of this kind, such as forward maximum matching (FMM), backward maximum matching (BMM), bidirectional scanning, word-by-word traversal, the N-shortest-path method, and word-based N-gram language-model segmentation. For this family of methods, the construction and selection of the dictionary plays a very important role. This article mainly introduces the N-gram-based word segmentation method, which is common in everyday segmentation tools and performs well.

Directory

On Word Segmentation Algorithms (1): Basic Problems in Word Segmentation
On Word Segmentation Algorithms (2): Dictionary-Based Word Segmentation
On Word Segmentation Algorithms (3): Word Segmentation Method (HMM)
On Word Segmentation Algorithms (4): Word-Based Word Segmentation Method (CRF)
On Word Segmentation Algorithms (5): Word Segmentation Method (LSTM)

Fundamentals: the Bayes formula

When talking about probabilistic N-gram-based word segmentation, we first have to mention the great Bayesian theory, and when talking about Bayesian theory the first thing to write down is the Bayes formula:

$$P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)}$$

The Bayes formula is a cornerstone of probability theory, so we will not rehash it here; instead we recommend the article "The Ordinary and Magical Bayesian Method" [1], which explains it very well. Below we focus on how the Bayesian principle is used in word segmentation.

Bayes in word segmentation

We know that $P(Y)$ is usually a constant, so when applying the Bayes formula we use the proportional form:

$$P(X \mid Y) \propto P(Y \mid X)\,P(X)$$

When Bayesian theory is applied to discrete data sets, frequencies can be used as probabilities. In the word segmentation algorithm, given a training corpus, we take the word as the unit of statistics and count how often each word appears. When there is a sentence to be segmented, we enumerate all of its possible segmentations, compute the probability of each, and take the one with the maximum probability as the segmentation result. Described formally:

Suppose the training corpus is given, the dictionary set is D, and $w_i$ is the i-th word in a sentence of length n; then the joint probability of the sentence can be expressed as:

$$P(S) = P(w_1, w_2, \ldots, w_n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2)\cdots P(w_n \mid w_1, \ldots, w_{n-1})$$

In other words, the probability of each word in a sentence is a conditional probability that depends on all of the words before it. At this point comes the usual routine: this quantity obviously cannot be estimated directly, so what do we do? As is common when applying Bayesian theory, we make some conditional-independence assumptions, and this is exactly where the N in N-gram comes from.

    • 1-gram (unigram): a unary model in which every word in the sentence is independent of the others; the formula above simplifies to $P(S) \approx P(w_1)\,P(w_2)\cdots P(w_n)$.
    • 2-gram (bigram): a binary model in which every word depends only on the single word before it: $P(S) \approx P(w_1)\,P(w_2 \mid w_1)\cdots P(w_n \mid w_{n-1})$.
    • 3-gram (trigram): a ternary model in which every word depends on the two words before it: $P(S) \approx P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2)\cdots P(w_n \mid w_{n-2}, w_{n-1})$.

Generally speaking we only look at the preceding one or two words; studies have shown that models beyond 4-gram bring no further improvement (obviously, the larger N is, the rarer each N-tuple of words becomes, which leads directly to data-sparsity problems), so in practice the 2-gram model is used most of the time.
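
To make the counting concrete, here is a minimal sketch of how 1-gram and 2-gram probabilities could be estimated from a tiny hand-segmented corpus by maximum likelihood; the corpus, the English stand-in tokens, and the helper names are all invented for illustration and are not how jieba builds its dictionary:

    from collections import Counter

    # Toy corpus of already-segmented sentences (hypothetical data, English stand-ins).
    corpus = [
        ["research", "biology"],
        ["research", "life"],
        ["postgraduate", "study", "biology"],
    ]

    unigrams = Counter(w for sent in corpus for w in sent)
    bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
    total = sum(unigrams.values())

    def p_unigram(w):
        # 1-gram: the word's relative frequency in the corpus
        return unigrams[w] / total

    def p_bigram(w, prev):
        # 2-gram: P(w | prev) = count(prev, w) / count(prev)
        return bigrams[(prev, w)] / unigrams[prev]

    print(p_unigram("research"))            # 2/7
    print(p_bigram("biology", "research"))  # 1/2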

A 2-gram segmentation example

Suppose the sentence to be segmented is "research biology". How should we split it? Intuitively we can see that it contains words such as "research", "postgraduate", "biology" and a few others, so we have the following candidate segmentations:

    • Research/Biology
    • Postgraduate/material/study
    • Research/Biology/Learning
    • Research/study/health/physical/learning

We organize these candidate segmentations into a directed graph in which the nodes are words and the edges carry conditional probabilities.

(Excerpt from [4])
Then, according to the maximum-likelihood principle, the segmentation process turns into the problem of finding the best path in this graph, and we only need to pick a suitable search algorithm; for example, jieba uses dynamic programming to find the maximum-probability path.
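
To sketch the idea before looking at real code, the snippet below scores each candidate path of the word graph with a 2-gram model and keeps the maximum-likelihood one. The conditional probabilities, the "<s>" start symbol, and the English stand-ins for the Chinese words are all made up for illustration; a real segmenter searches the graph with dynamic programming rather than enumerating every path, as shown in the next section.

    from math import log

    # Hypothetical 2-gram (conditional) probabilities on the edges of the word graph;
    # "<s>" marks the sentence start. All numbers are invented for illustration.
    p_cond = {
        ("<s>", "research"): 0.01, ("<s>", "postgraduate"): 0.002,
        ("research", "biology"): 0.05, ("research", "study"): 0.001,
        ("biology", "learning"): 0.001, ("postgraduate", "material"): 0.0005,
        ("material", "study"): 0.01, ("study", "health"): 0.0008,
        ("health", "physical"): 0.0005, ("physical", "learning"): 0.001,
    }

    candidates = [
        ["research", "biology"],
        ["postgraduate", "material", "study"],
        ["research", "biology", "learning"],
        ["research", "study", "health", "physical", "learning"],
    ]

    # Score of a path: log P(w1 | <s>) + log P(w2 | w1) + ...
    def score(path):
        return sum(log(p_cond[(prev, w)]) for prev, w in zip(["<s>"] + path, path))

    best = max(candidates, key=score)
    print(best)  # ['research', 'biology'] under these made-up numbers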

A 1-gram example

After all this discussion, nothing is as concrete as code, so let's take 1-gram as an example and walk through it; here we mainly follow jieba. The core steps in the implementation are: building a prefix dictionary (a trie simulated with a dict), building a DAG (directed acyclic graph) over the sentence, and using dynamic programming to find the maximum-probability path.

Building a prefix dictionary tree

The code is as follows:

    with open(dict_path, "rb") as f:
        count = 0
        for line in f:
            try:
                line = line.strip().decode('utf-8')
                word, freq = line.split()[:2]
                freq = int(freq)
                self.wfreq[word] = freq
                for idx in range(len(word)):
                    wfrag = word[:idx + 1]
                    if wfrag not in self.wfreq:
                        self.wfreq[wfrag] = 0  # trie: record char in word path
                self.total += freq
                count += 1
            except Exception as e:
                print("%s add error!" % line)
                print(e)
                continue

We use a Python dict to simulate the prefix trie: whenever a word is read, every prefix along the word's path is also recorded in the dictionary (with a frequency of 0 if the prefix is not itself a word). (Admittedly this storage scheme keeps a large number of redundant prefixes, but it makes lookups very convenient.)
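
A minimal sketch of what this prefix dict ends up containing, using a couple of made-up English words and frequencies in place of jieba's real dict.txt:

    # Toy illustration of the prefix-dict idea (hypothetical words and frequencies).
    wfreq, total = {}, 0
    for word, freq in [("research", 50), ("biology", 20)]:
        wfreq[word] = freq
        for idx in range(len(word)):      # record every prefix on the word's path
            wfrag = word[:idx + 1]
            if wfrag not in wfreq:
                wfreq[wfrag] = 0          # 0 marks "prefix only, not a word"
        total += freq

    print(wfreq["res"])       # 0  -> just a prefix
    print(wfreq["research"])  # 50 -> a real dictionary word with its frequency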

Building a DAG

The code is as follows:

    def get_DAG(self, sentence):
        DAG = {}
        N = len(sentence)
        for k in range(N):
            tmplist = []
            i = k
            frag = sentence[k]
            while i < N and frag in self.wfreq:
                if self.wfreq[frag]:
                    tmplist.append(i)
                i += 1
                frag = sentence[k:i + 1]
            if not tmplist:
                tmplist.append(k)
            DAG[k] = tmplist
        return DAG

Because every word and all of its prefixes were added to the dictionary when it was loaded, as soon as frag is no longer in wfreq we can conclude that neither frag nor any dictionary word starting with frag exists, so we can safely exit the inner loop.
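
A standalone version of the same routine on a toy alphabet makes the DAG structure easy to see; the wfreq contents below are invented for illustration (0 again means "prefix only"):

    # Hypothetical prefix dict: 0 means "prefix only, not a real word".
    wfreq = {"a": 0, "ab": 5, "abc": 3, "b": 2, "bc": 0, "c": 1}

    def get_DAG(sentence, wfreq):
        DAG = {}
        N = len(sentence)
        for k in range(N):
            tmplist = []
            i = k
            frag = sentence[k]
            while i < N and frag in wfreq:
                if wfreq[frag]:              # frag is a real dictionary word ending at i
                    tmplist.append(i)
                i += 1
                frag = sentence[k:i + 1]
            if not tmplist:                  # no word starts at k; keep the single char
                tmplist.append(k)
            DAG[k] = tmplist
        return DAG

    print(get_DAG("abc", wfreq))  # {0: [1, 2], 1: [1], 2: [2]}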

Using dynamic programming to get the maximum probability path

It is worth noting that every node of the DAG carries a weight; for a word in the dictionary the weight is its word frequency, i.e. wfreq[word]. We want to find the route $= (w_1, w_2, w_3, \ldots, w_n)$ that maximizes $\sum_i \mathrm{weight}(w_i)$.

Dynamic Programming Solution Method

A problem can be solved with dynamic programming when it satisfies two conditions:

    • Overlapping subproblems
    • Optimal substructure

Let's analyze the maximum probability path problem.

Overlapping subproblems
For a node $w_i$ and its possible successor nodes $w_j$ and $w_k$, we have:

    1. For any path that reaches $w_j$ via $w_i$, the weight of that path is the path weight of $w_i$ plus the weight of $w_j$: $R_{i \to j} = R_i + \mathrm{weight}(j)$;
    2. For any path that reaches $w_k$ via $w_i$, the weight of that path is the path weight of $w_i$ plus the weight of $w_k$: $R_{i \to k} = R_i + \mathrm{weight}(k)$;

In other words, the best path weight $R_i$ computed once for $w_i$ is reused by every one of its successors, which is exactly the overlapping-subproblem property.

Optimal substructure
Let $R_{\max}$ be the weight of the optimal path for the whole sentence and $w_x$ its end node. $w_x$ may have several predecessors $w_i, w_j, w_k, \ldots$, whose maximum path weights are $R_{\max_i}, R_{\max_j}, R_{\max_k}, \ldots$ respectively; then:

$$R_{\max} = \max(R_{\max_i}, R_{\max_j}, R_{\max_k}, \ldots) + \mathrm{weight}(w_x)$$

So the problem turns into solving for $R_{\max_i}, R_{\max_j}, R_{\max_k}, \ldots$, which are smaller instances of the same problem.
This is the optimal substructure: the optimal solution of each subproblem is part of the globally optimal solution.
The state-transition equation follows directly:

$$R_{\max} = \max_{p \in \{i, j, k, \ldots\}}\left(R_{\max_p}\right) + \mathrm{weight}(w_x)$$
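
A tiny numerical illustration of this recurrence (graph shape and weights are made up for this sketch): each node's best-path weight needs only the already-computed best-path weights of its predecessors.

    # Toy word graph: each node lists its predecessors; weights are invented numbers.
    preds = {"w1": [], "w2": ["w1"], "w3": ["w1"], "w4": ["w2", "w3"]}
    weight = {"w1": 2.0, "w2": 1.0, "w3": 3.0, "w4": 1.5}

    rmax = {}
    for node in ["w1", "w2", "w3", "w4"]:               # nodes in topological order
        best_prev = max((rmax[p] for p in preds[node]), default=0.0)
        rmax[node] = best_prev + weight[node]           # Rmax(x) = max over predecessors + weight(x)

    print(rmax["w4"])  # 6.5, reached via w1 -> w3 -> w4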

Code

The code is as follows:

    def get_route(self, DAG, sentence, route):
        N = len(sentence)
        route[N] = (0, 0)
        logtotal = log(self.total)
        for idx in range(N - 1, -1, -1):
            route[idx] = max((log(self.wfreq.get(sentence[idx:x + 1]) or 1) -
                              logtotal + route[x + 1][0], x) for x in DAG[idx])

It is worth noting that log frequencies (via the log function from Python's math module) are used in the calculation; this turns the division into a subtraction and prevents floating-point underflow when many small probabilities are multiplied together.
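
Once route has been filled in, producing the final segmentation is just a matter of walking it from left to right. The helper below is a simplified sketch, not jieba's exact cut implementation, and the seg object in the usage comment stands for any class holding the methods shown above:

    # route[idx] = (log-prob of the best path starting at idx, end index of the best word)
    def walk_route(sentence, route):
        words = []
        idx, N = 0, len(sentence)
        while idx < N:
            end = route[idx][1] + 1       # the best word starting at idx is sentence[idx:end]
            words.append(sentence[idx:end])
            idx = end
        return words

    # Usage (hypothetical seg object exposing the methods above):
    #   DAG = seg.get_DAG(sentence)
    #   route = {}
    #   seg.get_route(DAG, sentence, route)
    #   print(walk_route(sentence, route))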

Full code

For the sentence "I am Chinese", the segmentation result is as shown:

I have put the complete code on GitHub; the dictionary used there is jieba's dictionary, and much of the code is reused from jieba. Take a look if you are interested:
https://github.com/xlturing/machine-learning-journey/tree/master/seg_ngram

Reference documents
    1. The Beauty of Mathematics: The Ordinary and Magical Bayesian Method
    2. The N-gram Model in Natural Language Processing
    3. Word Segmentation with Probabilistic Language Models
    4. Statistical Natural Language Processing, Zong Chengqing
    5. jieba Chinese word segmentation (Python)
    6. Jieba Word Segmentation Study Notes (3)
