A Brief Discussion on the Basic Problems in Word Segmentation Algorithms (1)


[TOC]

Objective

Word segmentation, or word cutting, is a classic and fundamental problem in natural language processing. I have run into it repeatedly in everyday work, applying different models and methods in various domains, so I want to give the topic a systematic review. Word segmentation mainly concerns languages such as Chinese, Japanese, and Korean, in which there is no natural delimiter between words; English sentences, by contrast, come with natural segmentation, although word-level processing still arises there in tasks such as entity recognition and part-of-speech tagging. This series discusses Chinese word segmentation. We start from the basic problems, then move from traditional dictionary-based segmentation to segmentation as a sequence labeling problem, and finally to the latest approaches that incorporate deep learning, trying out and introducing the models roughly in the order they appeared. All the code is on my GitHub: xlturing.

Directory

A Brief Discussion on the Basic Problems in Word Segmentation Algorithms (1)
On Word Segmentation Algorithms (2): Dictionary-Based Segmentation
On Word Segmentation Algorithms (3): Character-Based Segmentation (HMM)
On Word Segmentation Algorithms (4): Character-Based Segmentation (CRF)
On Word Segmentation Algorithms (5): Character-Based Segmentation (LSTM)

Basic Problems in Word Segmentation

Simply put, automatic Chinese word segmentation means having the computer automatically insert spaces or other boundary markers between the words in Chinese text. There are three basic problems in word segmentation: the segmentation standard, ambiguous segmentation, and the recognition of out-of-vocabulary (unlisted) words.

The Segmentation Standard

When we start learning Chinese, the basic progression is character, then word, then sentence, yet the boundary between "character" and "word" is far less clear-cut than it seems. According to an expert survey, native speakers agree on what counts as a word in Chinese text only about 70% of the time, and in a strict computational sense automatic word segmentation is an ill-defined problem [Huang Changning, 2003]. A simple example:
Xiao Ming saw the flowers and grass on the lake shore, and an unknown little flower caught his attention.

Depending on how we define a word, items in this sentence such as "lake shore", "flowers and grass", and "unknown" can be segmented differently. For example, we could cut the sentence in any of the following ways:

    1. "Xiao Ming / saw / lake shore / on / flowers and grass / , / a / unknown / little flower / caught / his / attention"
    2. "Xiao Ming / saw / lake / shore / on / flowers / grass / , / a / not / well-known / little / flower / caught / his / attention"
    3. "Xiao Ming / saw / lake shore / on / flowers / grass / , / one / unknown / little flower / caught / his / attention"

As we can see, different definitions of "word" can be combined into many different segmentation results, so segmentation can be viewed as a search problem without a single well-defined answer. Therefore, when we evaluate a word segmentation model, we first need to fix a unified standard, the so-called gold data: all of our models are trained and evaluated on the same data set, so that comparisons between them are meaningful.
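Once a gold standard is fixed, a model's output is typically scored by word-level precision, recall, and F1 against it. The following is a minimal sketch of that computation; the function names are my own, not from the article:

```python
def to_spans(words):
    """Convert a list of words into the set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def prf(gold, pred):
    """Word-level precision, recall and F1 between two segmentations
    of the same sentence: a predicted word counts as correct only if
    its character span exactly matches a gold word."""
    g, p = to_spans(gold), to_spans(pred)
    correct = len(g & p)
    precision = correct / len(p)
    recall = correct / len(g)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Because matching is on exact character spans, a single wrong boundary penalizes both of the words it touches, which is why segmentation F1 is a fairly strict metric.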

Segmentation Ambiguity

Ambiguous fields are ubiquitous in Chinese and are one of the major difficulties in Chinese word segmentation. Liang Nanyuan first gave two basic definitions of ambiguous fields:

    • Intersection (overlapping) ambiguity: a Chinese string AJB has intersection ambiguity if both AJ and JB are words (A, J, and B are each Chinese strings). The string J is called the intersection string. Examples: 大学生 (大学/学生), 研究生物 (研究生/生物), 结合成 (结合/合成).
    • Combination ambiguity: a Chinese string AB has combination ambiguity if A, B, and AB are all words. For example, 起身: in 他站起身来 ("he stood up") 起 and 身 belong to separate words, while in 明天起身去北京 ("he sets off for Beijing tomorrow") 起身 is one word; likewise 学生会: 我在学生会帮忙 ("I help at the student union") vs. 我的学生会来帮忙 ("my students will come to help").

As we can see, ambiguous fields cause a great deal of trouble for word segmentation. To make the correct segmentation decision, we must take the surrounding context into account, and sometimes even rhythm, intonation, stress, and pauses.
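One way to see how ambiguous fields arise from a dictionary is to enumerate every dictionary-consistent segmentation of a string; an intersection-ambiguous string yields more than one. This is a small illustrative sketch of my own, not code from the article:

```python
def all_segmentations(text, vocab):
    """Enumerate every way to split `text` into a sequence of words
    that all appear in `vocab` (exhaustive recursion; fine for short
    strings, exponential in the worst case)."""
    if not text:
        return [[]]
    results = []
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        if prefix in vocab:
            for rest in all_segmentations(text[i:], vocab):
                results.append([prefix] + rest)
    return results
```

With the dictionary {研究, 研究生, 生物, 物}, the string 研究生物 comes back with both 研究/生物 and 研究生/物, which is exactly the intersection ambiguity described above; only context can decide between them.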

Out-of-Vocabulary Word Recognition

An out-of-vocabulary word is, in one sense, a word not included in the existing dictionary; in another sense, it is a word that has never appeared in the training corpus. The latter is usually called an OOV (out-of-vocabulary) word, i.e., a word outside the training set. In practice the two notions usually amount to the same thing, and we do not distinguish them here.
OOV words can be roughly divided into the following types:

    • Newly coined common words, such as new words arising in Internet slang. These are a big challenge for a segmentation system; large-scale systems generally integrate a dedicated new-word discovery module that mines candidate new words and, after verification, adds them to the dictionary.
    • Proper nouns. Segmentation systems usually include a dedicated module for named entity recognition (NER) that identifies person names, place names, and organization names.
    • Technical terms and domain-specific vocabulary, which appear relatively rarely in general-domain segmentation; when a new specialized field emerges, the profession produces a new batch of vocabulary.
    • Other special nouns, including newly appearing product names, film titles, book titles, and so on.

Statistics show that most errors in Chinese word segmentation are caused by OOV words, so a model's handling of OOV words is an important index for measuring the quality of a segmentation system.
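When comparing systems on this index, it helps to first measure how much OOV a test set actually contains. A minimal sketch (my own helper, not from the article) computes the fraction of test tokens whose word form never occurs in the training data:

```python
def oov_rate(train_words, test_words):
    """Fraction of test tokens whose surface form never appears
    in the training corpus (token-level OOV rate)."""
    vocab = set(train_words)
    oov = sum(1 for w in test_words if w not in vocab)
    return oov / len(test_words)
```

Benchmark reports often pair overall F1 with recall measured only on these OOV tokens, since a dictionary-heavy system can score well overall while failing on exactly the words it has never seen.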

Commonly Used Chinese Word Segmentation Methods

Dictionary-Based Word Segmentation

The dictionary-based approach is the classic traditional segmentation method, and it is very intuitive: we extract a dictionary from a large-scale training corpus, collect word frequency statistics at the same time, and then segment sentences using methods such as forward/reverse maximum matching, the N-shortest-paths method, and N-gram based matching. The dictionary-based method is also easy to control: we can adjust the final segmentation simply by adding or removing dictionary entries; for example, when we find that a new noun cannot be segmented correctly, we can add it directly to the dictionary. On the other hand, this heavy dependence on the dictionary means the method handles OOV words poorly, and when dictionary words share common substrings, ambiguous segmentation appears, which in turn requires a sufficiently large corpus so that the frequency of each word is well estimated.
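As one concrete instance of the matching methods named above, here is a minimal forward maximum matching sketch; the dictionary contents and function name are illustrative, not from the article:

```python
def forward_max_match(text, vocab):
    """Forward maximum matching: at each position greedily take the
    longest dictionary word starting there; fall back to a single
    character when no entry matches."""
    max_len = max((len(w) for w in vocab), default=1)
    words, pos = [], 0
    while pos < len(text):
        # Try candidate lengths from longest to shortest.
        for n in range(min(max_len, len(text) - pos), 0, -1):
            cand = text[pos:pos + n]
            if n == 1 or cand in vocab:
                words.append(cand)
                pos += n
                break
    return words
```

On 大学生活 with the dictionary {大学, 学生, 大学生, 生活}, forward matching greedily takes 大学生 and leaves a dangling 活, whereas reverse maximum matching over the same dictionary would give 大学/生活, the classic illustration of why the two directions are often compared or combined.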

Character-Based Word Segmentation

Unlike dictionary-based methods, which rely on a pre-compiled dictionary and make the final segmentation decision by dictionary lookup, character-based methods treat segmentation as a classification problem over characters: each character is assumed to occupy a definite word-formation position (word position) when it forms part of a word [1]. This approach was first proposed by Nianwen Xue and others in 2002, and it has achieved good results in a variety of segmenters; in particular, its recall on OOV words is very high.
In general, we consider each character to occupy one of four word positions: B (begin), M (middle), E (end), or S (single). Segmenting a sentence then becomes the process of labeling each character in it, for example:

    • Sentence: 自然语言处理 / 可以 / 应用 / 在 / 诸多 / 领域 ("natural language processing can be applied in many fields")
    • Tags: 自/B 然/M 语/M 言/M 处/M 理/E 可/B 以/E 应/B 用/E 在/S 诸/B 多/E 领/B 域/E

We assign each character in the sentence a tag from the BMES set, and in doing so accomplish the segmentation.
The character-based method transforms the traditional linguistic problem into a sequence labeling problem that is much easier to model. We can label each character with a maximum entropy model, treat it as a decoding problem with an HMM, or, taking the sequential relations within the sentence into account, model it with the discriminative CRF; the LSTM from the currently hot field of deep learning can also be applied here.
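The mapping between a segmented sentence and its BMES tags, as in the example above, is mechanical in both directions; a minimal sketch (function names are my own):

```python
def words_to_tags(words):
    """Map a segmented sentence to per-character B/M/E/S tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def tags_to_words(chars, tags):
    """Recover words from a character string and its B/M/E/S tags:
    a word ends at every E or S tag."""
    words, buf = [], ""
    for ch, t in zip(chars, tags):
        buf += ch
        if t in ("E", "S"):
            words.append(buf)
            buf = ""
    if buf:  # tolerate a dangling B/M produced by an imperfect tagger
        words.append(buf)
    return words
```

The forward direction turns a gold-segmented corpus into sequence-labeling training data; the reverse direction turns a tagger's output back into a segmentation.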

Summary

In this post we briefly introduced the word segmentation problem itself, its focus and difficulties, and the methods and models that have been applied to it; in follow-up posts we will introduce and implement the commonly used segmentation models one by one. One special note: this series introduces each model separately, but in a real production environment we tend to combine multiple methods to improve precision and recall. For example, the jieba segmenter, often mentioned on GitHub, combines N-gram dictionary-based segmentation with HMM-based segmentation. When using such tools, we should choose and combine them according to the actual environment, retraining and tuning the models when necessary.

If there are any mistakes, please point them out.

References
    1. Statistical Natural Language Processing, 2nd Edition
