http://blog.csdn.net/heavendai/article/details/7228524
1. Brief description
I recently applied a CRF model to sequence recognition. I chose the CRF++ toolkit, specifically the Windows build of CRF++, used from a C# project in VS2008. This article summarizes the information relevant to the CRF++ toolkit.
The main reference is the official CRF++ website, CRF++: Yet Another CRF toolkit. Many blog posts about CRF++ are full or partial translations of that page, and this article also translates parts of it.
2. Toolkit download
First, version selection. The current version is CRF++ 0.54, updated 2010-05-16, but it seemed to have some problems when I ran it earlier, and others online have reported problems as well, so I use CRF++ 0.53 (2009-05-06) here. Information about the runtime error is at http://ir.hit.edu.cn/bbs/viewthread.php?action=printable&tid=7945.
Second, downloading the files. The home page offers only the latest 0.54 version. Version 0.53 can be found by searching online, but there are not many sources. I downloaded a CRF++ 0.53 package from CSDN that includes both the Linux and Windows versions; it costs 10 points. Because I did not find a more stable, long-term, free link, I uploaded a copy of the file here: CRF++ 0.53 Linux and Windows version.
3. Toolkit files
doc folder: the content of the official home page.
example folder: training data, test data, and template files for four tasks.
sdk folder: the CRF++ header file and static link library.
crf_learn.exe: the CRF++ training program.
crf_test.exe: the CRF++ prediction program.
libcrfpp.dll: the dynamic link library that both the training program and the prediction program need.
In practice, only three files are needed: crf_learn.exe, crf_test.exe, and libcrfpp.dll.
4. command-line format
4.1 Training Procedures
Command line:
% crf_learn template_file train_file model_file
Information such as the elapsed time and the number of iterations is printed to the console during training (the crf_learn output appears to go to the standard output stream). To save this information, redirect standard output to a file with the following command:
% crf_learn template_file train_file model_file >> train_info_file
There are four main parameters that can be adjusted:
-a CRF-L2 or CRF-L1
Selects the regularization algorithm. The default is CRF-L2. Generally speaking, L2 performs slightly better than L1, although the number of non-zero features with L1 is significantly smaller than with L2.
-c float
Sets the hyper-parameter of the CRF. The larger the value of C, the more closely the CRF fits the training data. This parameter controls the balance between overfitting and underfitting. A good value can be found by cross-validation or similar methods.
-f NUM
Sets the cut-off threshold for features. CRF++ uses only features that occur at least NUM times in the training data. The default value is 1. When CRF++ is applied to large-scale data, the number of features that occur only once can run into the millions, and this option is useful in that case.
-p NUM
If your computer has multiple CPUs, you can speed up training with multithreading. NUM is the number of threads.
Command-line example with two parameters:
% crf_learn -f 3 -c 1.5 template_file train_file model_file
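The command lines above can also be assembled from a script. The following is a minimal Python sketch, not part of CRF++ itself: the helper names build_crf_learn_cmd and run_and_log are hypothetical, and it assumes crf_learn is on the PATH and accepts the -a, -c, -f, and -p options as described in this section.

```python
import subprocess


def build_crf_learn_cmd(template_file, train_file, model_file,
                        algorithm=None, min_freq=None, c=None, threads=None,
                        exe="crf_learn"):
    """Assemble a crf_learn argument list from the options described above."""
    cmd = [exe]
    if algorithm is not None:   # -a CRF-L2 or CRF-L1
        cmd += ["-a", algorithm]
    if min_freq is not None:    # -f NUM: feature cut-off threshold
        cmd += ["-f", str(min_freq)]
    if c is not None:           # -c float: hyper-parameter C
        cmd += ["-c", str(c)]
    if threads is not None:     # -p NUM: number of threads
        cmd += ["-p", str(threads)]
    cmd += [template_file, train_file, model_file]
    return cmd


def run_and_log(cmd, log_path):
    """Run the command and append its standard output to log_path (like >>)."""
    with open(log_path, "a", encoding="utf-8") as log:
        return subprocess.run(cmd, stdout=log, check=True)


cmd = build_crf_learn_cmd("template_file", "train_file", "model_file",
                          min_freq=3, c=1.5)
print(" ".join(cmd))  # crf_learn -f 3 -c 1.5 template_file train_file model_file
```

Calling run_and_log(cmd, "train_info_file") would then start training and append the console output to the file, which mirrors the >> redirection shown earlier.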
4.2 Test procedure
Command line:
% crf_test -m model_file test_files
There are two parameters, -v and -n, that display additional information: -v shows the probability of the predicted label, and -n shows the probabilities of the different candidate sequences. They have no effect on accuracy, recall, or running efficiency, so they are not explained here.
As with crf_learn, the output goes to the standard output stream, and this output is the most important part: the prediction results (the content of the test file plus the predicted labels). Redirection can likewise be used to save the results, with the following command line.
% crf_test -m model_file test_files >> result_file
5. File format
5.1 Training Files
Here is an example of a training file:
The training file consists of several sentences (which can be understood as training samples), with an empty line separating one sentence from the next; the example shows two sentences. Each line of a sentence describes one token and can have several columns, the last of which is the label. Here there are three columns: the first and second are known data and the third is the label to predict. In the example, the label in the third column is predicted from the word in the first column and the part of speech in the second column.
Of course, the choice of annotation scheme is an issue of its own and is studied in many papers; named entity recognition, for example, has many different annotation sets. This is beyond the scope of this document.
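To make the file layout concrete, here is a minimal Python sketch that splits data in this format into sentences and columns. The function name read_sentences and the SAMPLE text are illustrative assumptions, not content taken from the toolkit's example files.

```python
def read_sentences(text):
    """Split CRF++-style data into sentences; each token is a list of columns."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():        # an empty line ends the current sentence
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(line.split())   # columns are whitespace-separated
    if current:                     # the last sentence may lack a trailing blank line
        sentences.append(current)
    return sentences


# Hypothetical sample: word, part of speech, label
SAMPLE = """\
He PRP B-NP
reckons VBZ B-VP
the DT B-NP

He PRP B-NP
reckons VBZ B-VP
"""

sentences = read_sentences(SAMPLE)
print(len(sentences))     # 2
print(sentences[0][2])    # ['the', 'DT', 'B-NP']
```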
5.2 Test Files
The test file has the same format as the training file, as anyone who has used a machine learning toolkit would expect.
Unlike SVM toolkits, CRF++ does not write a separate result file; predictions are written to the standard output stream, which is why the command line in section 4.2 redirects the results to a file. The result file has one more column than the test file: the predicted label. Comparing the last two columns, the true label and the predicted label, gives the accuracy of the label prediction.
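The accuracy calculation described above can be sketched as follows. This is a rough illustration that assumes the result file was produced without -v or -n, so that every non-empty line is a token whose last two columns are the true label and the predicted label; the function name label_accuracy and the sample text are hypothetical.

```python
def label_accuracy(result_text):
    """Fraction of tokens whose predicted label (last column) equals the true label."""
    correct = total = 0
    for line in result_text.splitlines():
        cols = line.split()
        if len(cols) < 2:           # skip empty lines between sentences
            continue
        total += 1
        if cols[-1] == cols[-2]:    # last two columns: true label, predicted label
            correct += 1
    return correct / total if total else 0.0


# Hypothetical crf_test output: word, part of speech, true label, predicted label
RESULT = """\
He PRP B-NP B-NP
reckons VBZ B-VP B-VP
the DT B-NP I-NP

He PRP B-NP B-NP
"""

print(label_accuracy(RESULT))   # 0.75
```

This measures per-token accuracy only; chunk-level precision and recall would need a separate evaluation.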
5.3 Template files
5.3.1 Template Basics
Each line in the template file is one template. Each template uses the macro %x[row,col] to specify a token in the input data: row is the row offset relative to the current token, and col is the absolute column position.
In the example, the current token is the word "the". %x[-2,1] refers to the token two rows before the current one, column 1 (note that columns are numbered from 0), which is PRP.
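The macro expansion can be illustrated with a short Python sketch. It is not the CRF++ implementation: the function name expand, the sample sentence, and the "_PAD" placeholder used for rows outside the sentence are assumptions made for illustration.

```python
import re

# Matches %x[row,col], where row may be negative
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, sentence, pos, pad="_PAD"):
    """Replace each %x[row,col] with the column value at offset row from pos."""
    def lookup(match):
        row = pos + int(match.group(1))   # row is relative to the current token
        col = int(match.group(2))         # col is an absolute, 0-based column
        if 0 <= row < len(sentence):
            return sentence[row][col]
        return pad                        # hypothetical out-of-range placeholder
    return MACRO.sub(lookup, template)


sentence = [["He", "PRP", "B-NP"],
            ["reckons", "VBZ", "B-VP"],
            ["the", "DT", "B-NP"]]

print(expand("%x[-2,1]", sentence, 2))           # PRP
print(expand("%x[-1,0]/%x[0,0]", sentence, 2))   # reckons/the
```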
5.3.2 Template Types
There are two types of templates, and the template type is specified by the first character.
Unigram template: first character 'U'
When a template such as "U01:%x[0,1]" is given, CRF++ generates a set of feature functions (func1 ... funcN).
To explain these functions: in the earlier example, the feature %x[0,1] means predicting a word's label (3rd column) from its part of speech (2nd column), where the word itself is in the 1st column. Each function reflects one case in the training samples: func1 covers "the part of speech is DT and the label is B-NP", func2 covers "the part of speech is DT and the label is I-NP", and so on.
The number of feature functions generated by one template is L*N, where L is the number of classes in the label set and N is the number of distinct strings expanded from the template.
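As a rough illustration of the L*N count, the sketch below enumerates one (label, feature string) pair per feature function. The function name unigram_feature_functions, the label set, and the list of part-of-speech tags are hypothetical, and the "U01:DT" string form is assumed from the description above rather than taken from CRF++ output.

```python
from itertools import product


def unigram_feature_functions(template_id, labels, observed_strings):
    """Return one (label, feature) pair per feature function of a unigram template."""
    # N distinct strings expanded from the template over the training data
    features = sorted({f"{template_id}:{s}" for s in observed_strings})
    # L labels x N strings
    return list(product(labels, features))


labels = ["B-NP", "I-NP", "O"]            # L = 3
pos_tags = ["DT", "NN", "DT", "VBZ"]      # 3 distinct strings, so N = 3

funcs = unigram_feature_functions("U01", labels, pos_tags)
print(len(funcs))   # 9
print(funcs[0])     # ('B-NP', 'U01:DT')
```

Each pair stands for an indicator function that returns 1 when the output label and the expanded feature string both match, and 0 otherwise.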
Bigram template: first character 'B'
This template is used to describe bigram features. It automatically generates a combination of the current output token and the previous output token. Note that this type of template produces L * L * N distinct features.
What is the difference between a Unigram feature and a Bigram feature?
Unigram and bigram are easy to confuse, because a Unigram template can also express a word-level bigram, such as %x[-1,0]%x[0,0]. Here, Unigram and Bigram features refer to uni/bigrams of the output labels.
Unigram: |output tag| x |all possible strings expanded with a macro|
Bigram: |output tag| x |output tag| x |all possible strings expanded with a macro|
That is, unigram/bigram here refers to the output labels. I have not seen a concrete example of this; as far as I could tell, the four examples in the example folder rely only on Unigram templates and make no real use of Bigram ones, so I feel that Unigram features are generally enough.
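The two counts above can be compared with a small Python sketch. It only enumerates label and string combinations under assumed inputs; the function name bigram_feature_functions, the "B01" identifier, and the sample labels and strings are hypothetical, not the CRF++ internals.

```python
from itertools import product


def bigram_feature_functions(template_id, labels, observed_strings):
    """Return (previous label, current label, feature) triples for a bigram template."""
    features = sorted({f"{template_id}:{s}" for s in observed_strings})
    # |output tag| x |output tag| x |expanded strings|
    return list(product(labels, labels, features))


labels = ["B-NP", "I-NP", "O"]      # L = 3
strings = ["DT", "NN", "VBZ"]       # N = 3

bigram_funcs = bigram_feature_functions("B01", labels, strings)
unigram_count = len(labels) * len(strings)

print(unigram_count)        # 9
print(len(bigram_funcs))    # 27
```

Because the bigram count grows with the square of the label set size, a large label set combined with many expanded strings can make Bigram templates much more expensive than Unigram ones.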
5.3.3 Template Examples
This is a template example for the CoNLL 2000 base-NP chunking task. Only one Bigram template ('B') is used, which means that only the combination of the previous output token and the current output token is used as a bigram feature. Lines that start with "#" are comments, and empty lines have no meaning.
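The rules for reading a template file (first character selects the type, "#" starts a comment, empty lines are ignored) can be sketched in Python as follows. The TEMPLATE text is a hypothetical example written for this sketch, not the template shipped with the toolkit, and parse_template is an illustrative helper name.

```python
def parse_template(text):
    """Split template lines into unigram ('U') and bigram ('B') templates."""
    unigram, bigram = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):   # empty lines and comments are skipped
            continue
        if line[0] == "U":
            unigram.append(line)
        elif line[0] == "B":
            bigram.append(line)
        else:
            raise ValueError(f"unknown template type: {line!r}")
    return unigram, bigram


# Hypothetical template content
TEMPLATE = """\
# Unigram
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[0,1]
U03:%x[-1,0]/%x[0,0]

# Bigram
B
"""

unigram, bigram = parse_template(TEMPLATE)
print(len(unigram), len(bigram))   # 4 1
```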
6. Sample Data
There are four tasks in the example folder: basenp, chunking, JapaneseNE, and seg. The first two use English data and the last two use Japanese data. My guess is that the first is named entity recognition, the second is word segmentation, and the third is Japanese named entity recognition; the fourth is unclear to me. I mainly ran the first two tasks, because I do not understand the Japanese data in the last two.
Based on the Linux script files in each task folder, I wrote a simple Windows batch file (which saves the output with redirection), for example exec.bat, and ran it. The batch file is placed in the folder of the task you want to run, and its contents are as follows:
..\..\crf_learn -c 10.0 template train.data model >> train-info.txt
..\..\crf_test -m model test.data >> test-info.txt
A brief note on the batch file: when it runs, the current directory is the directory that contains the batch file (at least it was for me; if not, use the command cd %~dp0, where %~dp0 expands to the drive letter and path of the batch file). The crf_learn and crf_test programs are two directory levels above the current directory, hence the ..\..\ prefix.
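The ..\..\ prefix can be checked with a few lines of Python. This sketch only computes the path; the install directory C:\CRF++-0.53 and the helper name tool_path are assumptions for illustration, and the real location depends on where the package was unpacked.

```python
from pathlib import PureWindowsPath


def tool_path(task_dir, tool_name):
    """Locate a CRF++ program two directory levels above the task directory."""
    return PureWindowsPath(task_dir).parent.parent / tool_name


# Hypothetical install location
print(tool_path(r"C:\CRF++-0.53\example\chunking", "crf_learn.exe"))
# C:\CRF++-0.53\crf_learn.exe
```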
7. Summary
Command line (command-line format, parameters, redirection)
Tuning parameters (usually the -c value of the training process)
Annotation sets (this is important and research related)
Template files (this is also important and research related)
Unigram features and Bigram features in the template file: as mentioned earlier, unigram/bigram here refers to the output labels. I do not fully understand when each applies and may need to read some papers to find out.
Transferred from: http://www.cnblogs.com/pangxiaodong/archive/2011/11/21/2256264.html
CRF++ usage summary (repost)