Chinese word segmentation based on CRF

Source: http://biancheng.dnbcw.info/java/341268.html

About CRF

Conditional Random Field (CRF): a machine learning technique (model).

CRF was first introduced to NLP by John Lafferty and is used primarily for text labeling, with a variety of application scenarios, such as:

    • Word segmentation (label each character with its position in a word, then form words from the labels)
    • POS tagging (label each word's part of speech, e.g. noun, verb, auxiliary)
    • Named entity recognition (identify entity nouns that have certain internal regularities, such as person names, place names, organization names, and product names)

This article mainly describes how to use CRF for Chinese word segmentation.

CRF vs. dictionary-based segmentation
    • Dictionary-based segmentation relies on dictionaries and rules, so its ability to recognize ambiguous words and out-of-vocabulary words is low; its advantages are speed and efficiency.
    • CRF represents a new generation of machine-learning segmentation. Its basic idea is to label each Chinese character with its position in a word (thereby forming words), considering not only word-frequency information in the text but also the context, so it has good learning ability and handles ambiguous words and out-of-vocabulary words well. Its drawbacks are a longer training cycle and a large amount of computation at runtime; its raw speed is inferior to dictionary-based segmentation.
CRF vs. HMM and MEMM
    • CRF, HMM (hidden Markov model), and MEMM (maximum-entropy Markov model) are all commonly used to model sequence labeling tasks such as segmentation, POS tagging, and named entity recognition.
    • One of the biggest drawbacks of the HMM is its output-independence assumption, which makes it unable to take contextual features into account and restricts the choice of features.
    • The MEMM solves this problem of the HMM and can use arbitrary features, but because it normalizes at each node it can only find a local optimum, and it also suffers from the label bias problem: anything not seen in the training corpus is ignored entirely.
    • CRF solves these problems well: instead of normalizing at every node, it normalizes all features globally, so the global optimum can be obtained.
The principle of CRF segmentation

1. CRF treats word segmentation as a character classification problem, usually defining the character-position information as follows:

    • Word-initial character, usually denoted B
    • Word-middle character, usually denoted M
    • Word-final character, usually denoted E
    • Single-character word, usually denoted S

2. The CRF segmentation process: after the characters are tagged, the characters from each B through the following E form a word, and each S forms a single-character word (a decoding sketch follows the example below).

3. An example of CRF segmentation:

    • Original sentence: 我爱北京天安门 (I love Beijing Tiananmen)
    • After CRF labeling: 我/S 爱/S 北/B 京/E 天/B 安/M 门/E
    • Segmentation result: 我 / 爱 / 北京 / 天安门 (I / love / Beijing / Tiananmen)
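
To make step 2 concrete, here is a minimal Python sketch (my own illustration, not part of any CRF toolkit) that reassembles words from a B/M/E/S-tagged character sequence:

def bmes_to_words(chars, tags):
    # Join characters into words by their B/M/E/S tags: a word runs
    # from a B tag through the following E tag; each S is a word by itself.
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            words.append(ch)
        elif tag == "B":
            current = ch
        elif tag == "M":
            current += ch
        else:  # "E" closes the current word
            words.append(current + ch)
            current = ""
    return words

# The example above: prints ['我', '爱', '北京', '天安门']
print(bmes_to_words(list("我爱北京天安门"), ["S", "S", "B", "E", "B", "M", "E"]))
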
CRF segmentation toolkits

The above describes the ideas behind CRF and how they apply to segmentation; the following describes how to use CRF for segmentation in actual development. Commonly used CRF toolkits include Pocket CRF, FlexCRF, and CRF++, and comparison reports of the three exist. Personally I feel CRF++ offers the best overall ease of use, stability, and accuracy, and it has been in use in our company's project development, so here is an overview of how CRF++ is used (details can be found on the CRF++ official homepage, http://crfpp.sourceforge.net/).
1. Installation
Compiler requirement: a C++ compiler (gcc 3.0 or higher)
Commands:
% ./configure
% make
% su
# make install
Note: only users with the root account can install successfully.
2. Usage
2.1 Format of training and test files
Training and test files consist of many tokens, and each token consists of several columns. What a token stands for depends on the task: a word, a part of speech, and so on. Each token must be written on one line, with the columns separated by spaces or tabs. A sequence of tokens forms a sentence, and sentences are separated by a single empty line.
The last column is the correct label that the CRF is trained against.
For example:
iphone ASCII S
是 CN S
一 CN S >> current token
部 CN S
不 CN B
错 CN E
的 CN S
手 CN B
机 CN E
， PUNC S
也 CN S
能 CN S
听 CN B
歌 CN E
。 PUNC S
In the example above, each token contains 3 columns: the character itself, the character type (English/numeric ASCII, Chinese character, punctuation, etc.), and the character-position tag.
Note: if the number of columns is inconsistent across tokens, the system will not run correctly. (A sketch for generating this format follows.)
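
In practice these files are generated from a pre-segmented corpus (words separated by spaces). Below is a minimal Python sketch of such a converter; the character-type function and the choice to keep ASCII tokens whole are my own assumptions, not requirements of CRF++:

def char_type(ch):
    # Crude character typing; the categories are illustrative only.
    if ch.isascii():
        return "ASCII"
    if ch in "，。！？、；：":
        return "PUNC"
    return "CN"

def word_to_tags(word):
    # B/M/E/S tags for a single word.
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]

def to_crfpp_lines(segmented_sentence):
    # Convert "iphone 是 一 部 不错 ..." into CRF++ training lines.
    lines = []
    for word in segmented_sentence.split():
        if word.isascii():  # keep an ASCII token whole, as in the example above
            lines.append(f"{word}\tASCII\tS")
            continue
        for ch, tag in zip(word, word_to_tags(word)):
            lines.append(f"{ch}\t{char_type(ch)}\t{tag}")
    return "\n".join(lines) + "\n"  # sentences are separated by an empty line

print(to_crfpp_lines("iphone 是 一 部 不错 的 手机 ， 也 能 听歌 。"))
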
2.2 Preparing feature templates
Users of CRF++ must define the feature templates themselves.
1) Basic templates and macros
Each row in the template file represents one template. Within a template, the dedicated macro %x[row,col] picks out a token in the input data: row is the row offset relative to the current token, and col is the absolute column number.
Given the following input data:
iphone ASCII S
是 CN S
一 CN S >> current token
部 CN S
不 CN B
错 CN E
的 CN S
手 CN B
机 CN E
the feature template form is:
# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
U07:%x[-1,0]/%x[1,0]
U08:%x[0,1]
U09:%x[-1,1]/%x[0,1]
# Bigram
B
2) Template types
There are two types of templates, distinguished by the first character of the template.
The first is the Unigram template: the first character is U.
This template describes unigram features. Given a template "U02:%x[0,0]", CRF++ automatically generates a set of feature functions (func1 ... funcN), such as:
func1 = if (output = B and feature = "U02:一") return 1 else return 0
func2 = if (output = M and feature = "U02:一") return 1 else return 0
func3 = if (output = E and feature = "U02:一") return 1 else return 0
func4 = if (output = S and feature = "U02:一") return 1 else return 0
...
funcX = if (output = B and feature = "U02:的") return 1 else return 0
funcY = if (output = S and feature = "U02:的") return 1 else return 0
...
The total number of feature functions generated by one template is L*N, where L is the number of output classes and N is the number of unique strings obtained by expanding the template over the training data. For example, with the four tags B/M/E/S (L = 4), a template that expands to 1,000 unique strings yields 4,000 unigram feature functions.
The second is the Bigram template: the first character is B.
This template describes bigram features. With it, the system automatically generates combinations of the current output token and the previous output token (a bigram). The total number of distinguishable features is L*L*N, where L is the number of output classes and N is the number of unique features generated by the template. When the number of classes is large, this type produces a great many distinguishable features, which makes training and testing inefficient.
3) Using identifiers to distinguish relative positions
Identifiers can be used when you need to distinguish the relative positions of tokens.
For example, in the following case, the macros "%x[-2,0]" and "%x[1,0]" both produce "北", but they are two different "北":
北 CN B
京 CN E
的 CN S >> current token
北 CN S
部 CN S
To distinguish between them, add a unique identifier (U00: or U03:) to the template, i.e.:
U00:%x[-2,0]
U03:%x[1,0]
Under these conditions, the two features are considered different, as they expand to "U00:北" and "U03:北" respectively. You can use any identifier you like, but numeric ordinals are handier because they simply correspond to feature numbers (see the sketch below).
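
To make the expansion concrete, here is a small Python sketch of my own (not CRF++ code) that applies template macros to the data above; the out-of-range fallback symbol is an assumption modeled on CRF++'s boundary handling:

# Rows of the example above: (character, type, tag)
data = [("北", "CN", "B"), ("京", "CN", "E"), ("的", "CN", "S"),
        ("北", "CN", "S"), ("部", "CN", "S")]

def expand(template_id, offsets, pos):
    # Expand one template at position pos; offsets are (row, col)
    # pairs as in %x[row,col].
    parts = []
    for row, col in offsets:
        i = pos + row
        parts.append(data[i][col] if 0 <= i < len(data) else f"_B{row:+d}")
    return f"{template_id}:" + "/".join(parts)

pos = 2                                       # ">> current token" = 的
print(expand("U00", [(-2, 0)], pos))          # U00:北
print(expand("U03", [(1, 0)], pos))           # U03:北 -- same string, distinct feature
print(expand("U05", [(-1, 0), (0, 0)], pos))  # U05:京/的
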
3. Training (encoding)
Use the crf_learn command:
% crf_learn template_file train_file model_file
Here template_file and train_file must be prepared by the user in advance. crf_learn generates the trained model and stores it in model_file.
In general, crf_learn prints information like the following to stdout, along with other information related to the LBFGS iterations:
% crf_learn template_file train_file model_file
CRF++: Yet Another CRF Tool Kit
Copyright (C) 2005 Taku Kudo, All rights reserved.
reading training data:
done! 0.32 s
Number of sentences: 77
Number of features: 32856
Freq: 1
eta: 0.0001
C (sigma^2): 10
iter=0 terr=0.7494725738 serr=1 obj=2082.968899 diff=1
iter=1 terr=0.1671940928 serr=0.8831168831 obj=1406.329356 diff=0.3248438053
iter=2 terr=0.1503164557 serr=0.8831168831 obj=626.9159973 diff=0.5542182244
where:
iter: number of iterations
terr: tag error rate (number of wrong tags / total number of tags)
serr: sentence error rate (number of wrong sentences / total number of sentences)
obj: value of the current objective function; when this value converges, the CRF model stops iterating
diff: relative difference from the previous objective value
There are two main parameters for controlling the training conditions:
-c float: with this option you can change the hyper-parameter of the CRF. With a larger C value, the CRF tends to overfit the training data. This parameter adjusts the balance between overfitting and underfitting, and it can significantly affect the results. You can find an optimal value using held-out data or more general model-selection methods such as cross-validation.
-f NUM: this parameter sets the cut-off threshold for features. CRF++ training uses only features that occur at least NUM times. The default value is 1. When applying CRF++ to large-scale data, the number of unique features can reach millions, and this option is useful in that case.
Here is an example of using these two parameters:
% crf_learn -f 3 -c 1.5 template_file train_file model_file
4. Testing (decoding)
Use the crf_test command:
% crf_test -m model_file test_files ...
Here model_file is the one created by crf_learn. You do not need to specify a template file for testing, because the model file already contains the template information. test_file is the test corpus you want to label with sequence tags, written in the same format as the training file.
Here is an example of crf_test output:
% crf_test -m model test.data
Rockwell NNP B B
International NNP I I
Corp. NNP I I
's POS B B
Tulsa NNP I I
unit NN I I
...
The last column holds the tags estimated by the model. If the third column contains the gold-standard tags, you can compute the accuracy simply by comparing the third and fourth columns (see the sketch below).
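
For instance, tag accuracy can be computed from crf_test output with a short script such as this sketch of my own (it assumes the gold tag is the second-to-last column and the prediction the last):

import sys

def tag_accuracy(lines):
    # Compare the gold column (second to last) with the predicted
    # column (last); blank lines are sentence separators.
    correct = total = 0
    for line in lines:
        cols = line.split()
        if len(cols) < 2:
            continue
        total += 1
        correct += cols[-2] == cols[-1]
    return correct / total if total else 0.0

# Usage: crf_test -m model_file test_file | python accuracy.py
print(f"tag accuracy: {tag_accuracy(sys.stdin):.4f}")
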
Verbose level
-v option: sets the verbose level. The default value is 0. By raising the level, you can obtain extra information from CRF++.
Level 1:
You get the marginal probability of each tag (a confidence measure for that output tag) and the conditional probability of the output (a confidence measure for the whole output sequence).
For example:
% crf_test -v1 -m model test.data | head
# 0.478113
Rockwell NNP B B/0.992465
International NNP I I/0.979089
Corp. NNP I I/0.954883
's POS B B/0.986396
Tulsa NNP I I/0.991966
...
The first line "# 0.478113" is the conditional probability of the output, and each output tag carries its own probability, in the form "B/0.992465" (a parsing sketch follows).
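
Here is a small parsing sketch of my own for the -v1 format, separating the conditional-probability header from the per-token tag/marginal column:

def parse_v1(lines):
    # One sentence: a leading "# <cond prob>" line, then one token per
    # line whose last column has the form "tag/marginal".
    cond_prob, tokens = None, []
    for line in lines:
        line = line.strip()
        if not line:
            break  # empty line ends the sentence
        if line.startswith("#"):
            cond_prob = float(line.split()[1])
            continue
        *cols, tagged = line.split()
        tag, marginal = tagged.rsplit("/", 1)
        tokens.append((cols[0], tag, float(marginal)))
    return cond_prob, tokens

sample = ["# 0.478113", "Rockwell NNP B B/0.992465",
          "International NNP I I/0.979089"]
print(parse_v1(sample))
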
Level 2:
You can also see the marginal probabilities of all the other candidate tags.
For example:
% crf_test -v2 -m model test.data
# 0.478113
Rockwell NNP B B/0.992465 B/0.992465 I/0.00144946 O/0.00608594
International NNP I I/0.979089 B/0.0105273 I/0.979089 O/0.0103833
Corp. NNP I I/0.954883 B/0.00477976 I/0.954883 O/0.040337
's POS B B/0.986396 B/0.986396 I/0.00655976 O/0.00704426
Tulsa NNP I I/0.991966 B/0.00787494 I/0.991966 O/0.00015949
unit NN I I/0.996169 B/0.00283111 I/0.996169 O/0.000999975
...
N-best outputs
-n option: with this option you can obtain the N best results, sorted by the conditional probabilities computed by CRF++. With N-best output, CRF++ inserts a line of the form "# N prob" before each result, where N is the rank of the output (starting at 0) and prob is the conditional probability of that output.
Note that if CRF++ cannot find enough paths, it stops enumerating the N-best results. This often happens when the given sentence is very short.
CRF++ uses a combination of forward Viterbi and backward A* search; this combination meets the demand for N-best results.
Here is an example of an N-best result:
% crf_test -n 20 -m model test.data
# 0 0.478113
Rockwell NNP B B
International NNP I I
Corp. NNP I I
's POS B B
...
# 1 0.194335
Rockwell NNP B B
International NNP I I
...
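
To consume N-best output programmatically, the stream can be split on the "# N prob" header lines; below is a minimal sketch of my own, under the format described above:

def parse_nbest(lines):
    # Split crf_test -n output into (rank, cond_prob, token_rows) blocks
    # using the "# N prob" header that precedes each candidate.
    blocks, current = [], None
    for line in lines:
        line = line.rstrip()
        if line.startswith("#"):
            _, rank, prob = line.split()
            current = (int(rank), float(prob), [])
            blocks.append(current)
        elif line and current is not None:
            current[2].append(line.split())
    return blocks

sample = ["# 0 0.478113", "Rockwell NNP B B",
          "# 1 0.194335", "Rockwell NNP B B"]
for rank, prob, rows in parse_nbest(sample):
    print(rank, prob, len(rows), "tokens")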
