1. Brief Introduction
Recently I have been using the CRF model for sequence labeling, and I chose the CRF++ toolkit. Specifically, I used the Windows version of CRF++ from a C# environment in VS2008. This article summarizes information related to the CRF++ toolkit.
The reference material is the official CRF++ website: Yet Another CRF toolkit. Many blog posts about CRF++ on the Internet are full or partial translations of that page, and this article also translates parts of it.
2. Download the Toolkit
First, select a version. The latest version is CRF++ 0.54, which was updated on February 16. However, when I tried this version it seemed to have some problems at runtime, and others on the Internet reported the same, so here we use version CRF++ 0.53. Information about the runtime errors is at http://ir.hit.edu.cn/bbs/viewthread.php?action=printable&tid=7945.
Second, download the files. Only the latest version, 0.54, can be found on the official homepage, and there are not many other sources. I downloaded a copy of CRF++ 0.53 from CSDN, containing both the Linux and Windows versions; it costs 10 CSDN points. Since I have not found a stable, long-term, free link, I upload the file here: CRF++ 0.53 Linux and Windows.
3. Toolkit File
Doc Folder: the content of the official homepage.
Example Folder: contains training data, test data, and template files for four tasks.
SDK Folder: CRF++ header files and static link library.
crf_learn.exe: the CRF++ training program.
crf_test.exe: the CRF++ prediction program.
libcrfpp.dll: the dynamic link library required by the training and prediction programs.
These three files are what we use: crf_learn.exe, crf_test.exe, and libcrfpp.dll.
4. Command Line Format
4.1 Training Procedure
Command line:
% crf_learn template_file train_file model_file
Information such as elapsed time and number of iterations is printed to the console during training (it appears that crf_learn writes this information to the standard output stream). If you want to save this information, you can redirect standard output to a file. The command format is as follows:
% crf_learn template_file train_file model_file > train_info_file
There are four main parameters to adjust:
-a CRF-L2 or CRF-L1
Selects the regularization algorithm. The default is CRF-L2. Generally, the L2 algorithm performs slightly better than the L1 algorithm, although the number of non-zero features under L1 is drastically smaller than under L2.
-c float
Sets the hyper-parameter C of the CRF. The larger the value of C, the more tightly the CRF fits the training data. This parameter adjusts the balance between overfitting and underfitting; the optimal value can be found through cross-validation.
-f NUM
Sets the feature frequency cut-off. CRF++ uses only features that occur at least NUM times in the training data. The default value is 1. When CRF++ is applied to large-scale data, the number of features that occur only once can reach several million, and this option becomes useful in that case.
-p NUM
If the computer has multiple CPUs, training can be sped up with multithreading; NUM is the number of threads.
Command line example with two of these parameters:
% crf_learn -f 3 -c 1.5 template_file train_file model_file
4.2 Test Procedure
Command line:
% crf_test -m model_file test_files
The two parameters -v and -n both display extra information: -v displays the probability of each predicted tag, and -n displays the probabilities of the N-best candidate sequences. Neither affects precision, recall, or running efficiency.
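For example (following the usage shown in the CRF++ documentation, where -v takes a verbose level such as -v1 or -v2, and -n takes the number of candidate sequences to output):

% crf_test -v1 -m model_file test_files
% crf_test -n 20 -m model_file test_files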
Like crf_learn, crf_test writes its results to the standard output stream, the most important part being the prediction results (the test file content plus the predicted tags). You can likewise use redirection to save the results. The command line is as follows:
% crf_test -m model_file test_files > result_file
5. File Format
5.1 Training Files
The following is an example of a training file:
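(A snippet in the style of the CoNLL 2000 chunking data shown in the CRF++ documentation; the exact content of the original example may differ.)

He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
to TO B-PP
only RB B-NP
# # I-NP
1.8 CD I-NP
billion CD I-NP
in IN B-PP
September NNP B-NP
. . O

He PRP B-NP
reckons VBZ B-VP
...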
The training file consists of several sentences (which can be understood as training samples); sentences are separated by blank lines, and two sentences are shown above (the second one truncated). Each line of a sentence corresponds to one token and can have several columns, and the last column is the tag. Here there are three columns: the first and second columns are known data, and the third column is the tag to be predicted. In the example above, the tag in the third column is predicted from the word in the first column and the part of speech in the second column.
Of course, there are tag-set design issues involved here; this is something many papers study. For example, there are many different tag sets for named entity recognition. That is beyond the scope of this article.
5.2 Test File
The format of the test file is the same as that of the training file; anyone who has used machine learning toolkits will find this natural.
Unlike SVM toolkits, CRF++ has no separate result file; the prediction results are written to the standard output stream. That is why the command line in section 4.2 redirects the results to a file. The result file has one more column than the test file, namely the predicted tag. By comparing the last two columns (the gold tag and the predicted tag), we can compute the tag prediction accuracy.
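A minimal sketch of this comparison, assuming the result file is whitespace-separated with the gold tag in the second-to-last column and the predicted tag in the last column ("result_file" is a placeholder name):

# accuracy.py: compute tag accuracy from a CRF++ result file
correct = 0
total = 0
for line in open("result_file"):
    cols = line.split()
    if len(cols) < 2:          # skip the blank lines between sentences
        continue
    total += 1
    if cols[-1] == cols[-2]:   # predicted tag equals gold tag
        correct += 1
print("accuracy: %f" % (correct / float(total)))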
5.3 Template File
5.3.1 Template Basics
Each row in the template file is a template. Each template uses %x[row,col] to specify a token in the input data: row specifies the row offset relative to the current token, and col specifies the column position.
For example, suppose the current token is the word "the" in the snippet below; then %x[-2,1] refers to the element two rows before the current token, in column 1 (note that columns are numbered from column 0), i.e. "PRP".
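A sketch of the expansion, following the example in the CRF++ documentation (<< CURRENT TOKEN marks the current position):

He       PRP B-NP
reckons  VBZ B-VP
the      DT  B-NP << CURRENT TOKEN
current  JJ  I-NP
account  NN  I-NP

template             expanded feature
%x[0,0]              the
%x[0,1]              DT
%x[-1,0]             reckons
%x[-2,1]             PRP
%x[0,0]/%x[0,1]      the/DT
ABC%x[0,1]123        ABCDT123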
5.3.2 Template Types
There are two types of templates. The template type is specified by the first character.
Unigram template: first character, 'U'
When a "U01: % x []" template is provided, CRF ++ generates the following feature functions (func1... funcN ).
The preceding example shows the features generated from %x[0,1], i.e., features that predict the tag (3rd column) from the part of speech (2nd column) of each word. These functions reflect the cases found in the training samples: func1 reflects "in the training sample, the part of speech is DT and the tag is B-NP", and func2 reflects "in the training sample, the part of speech is DT and the tag is I-NP".
The number of feature functions generated by a template is L * N, where L is the number of output classes in the tag set and N is the number of distinct strings expanded from the template (for example, with 3 output classes and a template that expands to, say, 40 distinct strings, 120 feature functions are generated).
Bigram template: first character, 'B'
This template describes bigram features. It automatically generates combinations of the current output token and the previous output token. Note that this type of template produces L * L * N distinct features, where L is the number of output classes and N is the number of distinct strings expanded from the template.
What is the difference between Unigram feature and Bigram feature?
Unigram and bigram are easy to confuse here, because unigram features can also be written as word-level bigrams, such as %x[-1,0]/%x[0,0]. Here, unigram and bigram refer to uni/bigrams of the output tags:
Unigram: | output tag | x | all possible strings expanded with a macro |
Bigram: | output tag | x | output tag | x | all possible strings expanded with a macro |
Again, unigram/bigram here refers to the output tags. I have not yet seen a concrete example of such bigram features; the four tasks in the example folder use only unigram feature templates (plus a single bare 'B' line), so in general unigram features are enough.
5.3.3 Template Example
Below is the template for the Base-NP chunking task of CoNLL 2000. Only one bigram template ('B') is used, which means that only the combination of the previous output token and the current output token is used as bigram features. Lines starting with "#" are comments, and empty lines have no meaning.
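(The template, following the one shipped with the CRF++ basenp example; the exact feature set may differ slightly.)

# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
U10:%x[-2,1]
U11:%x[-1,1]
U12:%x[0,1]
U13:%x[1,1]
U14:%x[2,1]
U15:%x[-2,1]/%x[-1,1]
U16:%x[-1,1]/%x[0,1]
U17:%x[0,1]/%x[1,1]
U18:%x[1,1]/%x[2,1]
U20:%x[-2,1]/%x[-1,1]/%x[0,1]
U21:%x[-1,1]/%x[0,1]/%x[1,1]
U22:%x[0,1]/%x[1,1]/%x[2,1]

# Bigram
B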
6. Sample Data
The example folder contains four tasks: basenp, chunking, JapaneseNE, and seg. The first two use English data and the last two use Japanese data: the first is base noun phrase recognition, the second is chunking, the third is Japanese named entity recognition, and I am not sure what the fourth is. Here I ran the first two tasks; I did not try the last two because I do not understand Japanese.
Based on the Linux script file under each task directory, I wrote a simple Windows batch file (adding redirection to save the information), named it, for example, exec.bat, put it in the directory of the task to be run, and ran it. The content of the batch file is as follows:
..\..\crf_learn -c 10.0 template train.data model > train-info.txt
..\..\crf_test -m model test.data > test-info.txt
A brief explanation of the batch file: the current directory when a batch file runs is the directory containing the batch file (at least that is the case for me; if not, you can use the command cd %~dp0, where %~dp0 expands to the drive letter and path of the batch file). The crf_learn and crf_test programs are two directory levels above the current directory, hence the ..\..\ prefix.
7. Summary
Command line (command line format, parameters, redirection)
Parameter tuning (usually the -c value for training)
Tag set (this is important and research-related)
Template file (this is also important and research-related)
As for the Unigram and Bigram features in the template file: as mentioned earlier, these refer to unigrams/bigrams of the output tags. I am not yet particularly familiar with their application; some papers still need to be read to understand them.