CRF++ Usage Summary


1. Brief Introduction

I will be applying CRF models to a sequence labeling task in the near future, and I have chosen the CRF++ toolkit. Specifically, I use the Windows version of CRF++ from a C# environment in VS2008. This article summarizes information related to the CRF++ toolkit.

The introductory reference is the official CRF++ site: Yet Another CRF Toolkit. Many blog posts about CRF++ on the Internet reproduce all or part of that page, and this article also translates portions of it.

2. Download the Toolkit

First, select a version. The latest is CRF++ 0.54, updated on May 16, but when I used it once it seemed to run into problems during execution, and others online have reported the same, so here we use CRF++ 0.53. A report on the execution errors is at http://ir.hit.edu.cn/bbs/viewthread.php?action=printable&tid=7945.

Second, download the files. Only the latest 0.54 release can be found via the official homepage, and other sources are scarce. I downloaded a CRF++ 0.53 package from CSDN that includes both the Linux and Windows versions, which cost 10 points. Since I have not found a more stable, long-term, free link than the original poster's, I have re-uploaded the file: CRF++ 0.53, Linux and Windows versions.

3. Toolkit Files

doc directory: a copy of the official homepage content.
example directory: training data, test data, and template files for four tasks.
sdk directory: the CRF++ header file and static link library.
crf_learn.exe: the CRF++ training program.
crf_test.exe: the CRF++ prediction (test) program.
libcrfpp.dll: the dynamic link library required by the training and prediction programs.

In practice, only the crf_learn.exe, crf_test.exe, and libcrfpp.dll files are needed.

4. Command Line Format

4.1 Training Program

Command line:
% crf_learn template_file train_file model_file
The training time, number of iterations, and other information are printed to the console (apparently crf_learn writes this information to standard output). To save it, redirect standard output to a file. The command format is as follows:
% crf_learn template_file train_file model_file > train_info_file

There are four basic parameters that can be adjusted:
-a CRF-L2 or CRF-L1
Regularization algorithm selection. The default is CRF-L2. Generally, L2 performs slightly better than L1, although the number of non-zero feature weights under L1 is much smaller than under L2.
-c float
Sets the hyperparameter C of the CRF. The larger C is, the more closely the CRF fits the training data. This parameter trades off overfitting against underfitting; a good value can be found through cross-validation.
-f NUM
Sets the feature cut-off threshold. CRF++ only uses features that occur at least NUM times in the training data. The default is 1. When CRF++ is applied to large-scale data, features that occur only once may number in the millions, and this option is useful in that case.
-p NUM
If the computer has multiple CPUs, training can be sped up with multiple threads; NUM is the number of threads.

Example command line with two parameters:
% crf_learn -f 3 -c 1.5 template_file train_file model_file
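A sketch combining all four options (the values here are only illustrative):
% crf_learn -a CRF-L1 -c 1.5 -f 3 -p 4 template_file train_file model_file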

4.2 Test Program

Command line:
% crf_test -m model_file test_files
There are two options, -v and -n, that display extra information: -v shows the probability of each predicted label, and -n shows the probabilities of the N most likely label sequences. Neither affects accuracy, recall, or execution efficiency.
As with crf_learn, the results go to standard output; the most important part is the prediction results (the content of the test file plus a column of predicted labels). Redirection can likewise be used to save the results. The command line is as follows:
% crf_test -m model_file test_files > result_file
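For example, with -v1 a line holding the conditional probability of the whole output sequence is printed first, and each token line gains the probability of its predicted label; the output looks roughly like this (format as in the official documentation, numbers illustrative):

    % crf_test -v1 -m model test.data
    # 0.478113
    Rockwell        NNP  B  B/0.992465
    International   NNP  I  I/0.979089
    Corp.           NNP  I  I/0.954883

With -n N, the N most probable sequences are printed in turn, each preceded by a header line of the form "# i probability".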

5. File Format

5.1 Training Files

The following is an example of a training file:
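A sketch of the format, using CoNLL-2000-style chunking data (word, part of speech, chunk label):

    He        PRP  B-NP
    reckons   VBZ  B-VP
    the       DT   B-NP
    deficit   NN   I-NP
    will      MD   B-VP
    narrow    VB   I-VP
    .         .    O

    Rockwell  NNP  B-NP
    said      VBD  B-VP
    it        PRP  B-NP
    signed    VBD  B-VP
    .         .    O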


The training file consists of several sentences (which can be understood as several training examples) separated by blank lines; the example above shows two sentences. Each sentence has one token per line, and each token has several columns; every token must have the same number of columns. The last column is the label to be predicted. In the example above there are three columns: the first two (the word and its part of speech) are the known data, and the third is the label to be predicted from them.

Of course, the choice of tagging scheme is itself a question studied in many papers; for named entity recognition, for example, many different tag sets exist. That is beyond the scope of this article.

5.2 Test File

The format of the test file is the same as that of the training file; anyone who has used a machine learning toolkit will find this natural.

Unlike SVM toolkits, CRF++ does not write a separate result file; the prediction results are sent to standard output, which is why the command line in section 4.2 redirects them to a file. The result file has one more column than the test file, namely the predicted label. By comparing the last two columns line by line (the gold label and the predicted label), we can compute the tag prediction accuracy, as the sketch below shows.
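A minimal Python sketch (my own helper, not part of CRF++; it assumes plain crf_test output without -v probability suffixes) that computes this accuracy:

    # tag_acc.py -- compare the last two columns of a crf_test result file.
    import sys

    correct = total = 0
    with open(sys.argv[1], encoding="utf-8") as f:
        for line in f:
            cols = line.split()
            if len(cols) < 2:      # blank lines separate sentences
                continue
            gold, pred = cols[-2], cols[-1]
            correct += (gold == pred)
            total += 1

    print("tag accuracy: %d / %d = %.4f" % (correct, total, correct / total))

Run it as: python tag_acc.py result_file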

5.3 Template File

5.3.1 Template Basics

Each line in the template file is a template. A template uses %x[row,col] to refer to a token in the input data: row is the row offset relative to the current token (negative values look backwards), and col is the column index, which starts from 0. For example, %x[0,0] is the current word itself, and %x[-2,1] is the column-1 element of the row two lines before the current token.
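The worked example from the official documentation makes this concrete. With the current token marked below:

    He        PRP  B-NP
    reckons   VBZ  B-VP
    the       DT   B-NP   << current token
    current   JJ   I-NP
    account   NN   I-NP

the macros expand as follows:

    %x[0,0]           => the
    %x[0,1]           => DT
    %x[-1,0]          => reckons
    %x[-2,1]          => PRP
    %x[0,0]/%x[0,1]   => the/DT
    ABC%x[0,1]123     => ABCDT123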

5.3.2 Template Types

There are two types of templates. The template type is specified by the first character.

Unigram template: first character 'U'
Given a template "U01:%x[0,1]", CRF++ automatically generates a set of feature functions (func1 ... funcN), like those shown below.
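A sketch of the generated functions, following the official documentation (B-NP, I-NP, O are labels from the chunking tag set):

    func1 = if (output = B-NP and feature = "U01:DT") return 1 else return 0
    func2 = if (output = I-NP and feature = "U01:DT") return 1 else return 0
    func3 = if (output = O    and feature = "U01:DT") return 1 else return 0
    ...
    funcX = if (output = B-NP and feature = "U01:NN") return 1 else return 0
    funcY = if (output = O    and feature = "U01:NN") return 1 else return 0
    ...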

These are the features produced by %x[0,1]: they predict the label (column 2) from the part of speech (column 1) of each word (column 0). Each function reflects a situation seen in the training data: func1 reflects "the part of speech is DT and the label is B-NP", func2 reflects "the part of speech is DT and the label is I-NP", and so on.
A unigram template generates L * N feature functions, where L is the number of labels in the tag set and N is the number of distinct strings the template expands to over the training data.

Bigram template: first character 'B'
This kind of template describes bigram features over the output: it automatically generates combinations of the current output token and the previous output token. Note that such a template produces L * L * N distinct features; with 10 labels and 100,000 expanded strings, that is already 10 * 10 * 100,000 = 10 million features, so bigram templates can get expensive.

What is the difference between unigram features and bigram features?
The terms are easy to confuse, because a unigram template can also combine several input tokens, e.g. %x[-1,0]/%x[0,0] looks like a word bigram. Here, however, unigram and bigram refer to the output tags:
Unigram: |output tags| x |all possible strings expanded from the macro|
Bigram: |output tags| x |output tags| x |all possible strings expanded from the macro|
That is, the uni/bi refers to the output tags. I have not seen a detailed example of bigram templates; the four tasks in the example directory use only unigram templates (plus at most a bare 'B' line), so unigram features generally seem to be sufficient. A minimal sketch of the expansion mechanism follows.
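To make the expansion mechanism concrete, here is a small Python sketch (my own illustration, not CRF++ code; the boundary pseudo-tokens _B-1/_B+1 mimic CRF++'s padding) showing how a %x[row,col] macro is expanded for a given current token:

    # expand.py -- illustrate how a CRF++ %x[row,col] template expands.
    import re

    def expand(template, rows, i):
        # Expand one template for the token at sentence position i.
        # rows is a list of token rows, each a list of column strings.
        def repl(m):
            off, col = int(m.group(1)), int(m.group(2))
            j = i + off
            if j < 0:                          # before the sentence start
                return "_B-%d" % -j
            if j >= len(rows):                 # past the sentence end
                return "_B+%d" % (j - len(rows) + 1)
            return rows[j][col]
        return re.sub(r"%x\[(-?\d+),(\d+)\]", repl, template)

    sent = [["He", "PRP", "B-NP"],
            ["reckons", "VBZ", "B-VP"],
            ["the", "DT", "B-NP"],
            ["deficit", "NN", "I-NP"]]

    print(expand("U01:%x[0,1]", sent, 2))           # -> U01:DT
    print(expand("U05:%x[-1,0]/%x[0,0]", sent, 2))  # -> U05:reckons/the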

5.3.3 Template Example

This is the template for the base-NP chunking task of CoNLL 2000. Only one bigram template ('B') is used, which means that only combinations of the previous output token and the current output token are used as bigram features. Lines starting with '#' are comments, and empty lines are ignored. The template is reproduced below.
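The template shipped in CRF++'s example/basenp directory is essentially the following:

    # Unigram
    U00:%x[-2,0]
    U01:%x[-1,0]
    U02:%x[0,0]
    U03:%x[1,0]
    U04:%x[2,0]
    U05:%x[-1,0]/%x[0,0]
    U06:%x[0,0]/%x[1,0]

    U10:%x[-2,1]
    U11:%x[-1,1]
    U12:%x[0,1]
    U13:%x[1,1]
    U14:%x[2,1]
    U15:%x[-2,1]/%x[-1,1]
    U16:%x[-1,1]/%x[0,1]
    U17:%x[0,1]/%x[1,1]
    U18:%x[1,1]/%x[2,1]

    U20:%x[-2,1]/%x[-1,1]/%x[0,1]
    U21:%x[-1,1]/%x[0,1]/%x[1,1]
    U22:%x[0,1]/%x[1,1]/%x[2,1]

    # Bigram
    B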


6. Example Data

The example directory contains four tasks: basenp, chunking, japanesene, and seg. The first two use English data and the last two Japanese data. Judging by the directory names, basenp is base noun-phrase chunking, chunking is text chunking, japanesene is Japanese named entity recognition, and seg is Japanese word segmentation. I ran the first two tasks here; I did not attempt the last two because I cannot read the Japanese data.

Following the Linux run file in each task directory, I wrote a simple Windows batch file (which also saves the output via redirection), named for example exec.bat, placed in the folder of the task to run. The content of the batch file is:
..\..\crf_learn -c 10.0 template train.data model > train-info.txt
..\..\crf_test -m model test.data > test-info.txt

A brief note on the batch file: while a batch file runs, the current folder is the folder containing the batch file (at least it is on my machine; if not, you can use the command cd %~dp0, where %~dp0 expands to the drive letter and path of the batch file). The crf_learn and crf_test programs sit two folder levels above the task folder, hence the ..\..\ prefix.

7. Summary

Command line (format, parameters, redirection)

Parameters (mainly the C value during training)

Tag set (very important, and research-related)

Template file (also very important, and research-related)

The unigram and bigram features in the template file refer, as explained earlier, to unigrams/bigrams of the output tags. I am not particularly familiar with when each should be applied; that requires reading some papers.

 

From: http://www.cnblogs.com/pangxiaodong/archive/2011/11/21/2256264.html

