Using the Windows version of CRF++
doc folder: the contents of the official homepage.
example folder: training data, test data, and template files for four tasks.
sdk folder: CRF++ header file and static link library.
crf_learn.exe: CRF++'s training program.
crf_test.exe: CRF++'s prediction program.
libcrfpp.dll: the dynamic-link library that both the training and prediction programs need at run time.
In fact, only three files are needed: crf_learn.exe, crf_test.exe, and libcrfpp.dll.
You can copy these three files (crf_learn.exe, crf_test.exe, libcrfpp.dll) into the folder containing the corpus to be trained, for example one of the task folders under the example directory, and run the training program there.
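For example, a sketch assuming the chunking task folder that ships with CRF++ (adjust the paths to your own layout):
% copy crf_learn.exe example\chunking\
% copy crf_test.exe example\chunking\
% copy libcrfpp.dll example\chunking\
% cd example\chunking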
Command line:
% crf_learn template_file train_file model_file
Information such as the elapsed time and the number of iterations is printed to the console (crf_learn writes this information to standard output). If you want to save this output to a file, the command format is as follows:
% crf_learn template_file train_file model_file >> train_info_file
The information printed during training looks like this:
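For example, a line of this output might look like the following (the numbers are made up for illustration, and the exact fields can vary slightly between CRF++ versions); the meaning of each field is explained further below:
iter=18 terr=0.02314 serr=0.13372 obj=1264.48925 diff=0.00729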
There are 4 main parameters that can be adjusted:
-a CRF-L2 or CRF-L1
Regularization algorithm selection. The default is CRF-L2. In general, the L2 algorithm performs slightly better than L1, although the number of non-zero features with L1 is significantly smaller than with L2.
-c FLOAT
This parameter sets the hyperparameter C of the CRF. The larger the value of C, the more closely the CRF fits the training data. This parameter trades off between overfitting and underfitting; a good value can be found through cross-validation.
-f NUM
This parameter sets the cut-off threshold for features. CRF++ only uses features that occur at least NUM times in the training data. The default value is 1. When applying CRF++ to large-scale data, the number of features that occur only once can reach millions; this option is useful in that case.
-p NUM
If your computer has multiple CPUs, you can speed up training with multithreading. NUM is the number of threads.
Command-line example with two parameters:
% crf_learn -f 3 -c 1.5 template_file train_file model_file
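For instance, the following line (file names are placeholders) additionally selects the L1 regularizer and uses 4 threads, using the -a and -p options described above:
% crf_learn -a CRF-L1 -p 4 -f 3 -c 1.5 template_file train_file model_file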
a) To train on a corpus, you can use the command (in a terminal or at the DOS command line): crf_learn <template file> <training corpus> <model file>
The template file and training corpus are prepared in advance; the model file is generated when training is completed.
Attention:
1) If a corpus format error is reported, check the encoding used to store the data; CRF++ misreads some encodings.
2) Make sure the file paths are correct; if a file is not in the current directory, use its absolute path.
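For example, with absolute paths on Windows (the paths below are made up for illustration):
% crf_learn C:\crfpp\example\template C:\crfpp\example\train.data C:\crfpp\example\model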
b) Description of the values printed during training:
iter: number of iterations
terr: tag error rate
serr: sentence error rate
obj: current value of the objective function; when this value converges, training is done
diff: relative difference from the objective value of the previous iteration
Command line:
% crf_test -m model_file test_files
There are two optional parameters: -v, which displays the probability of the predicted labels, and -n, which outputs the N most probable label sequences together with their probabilities. Since they have no impact on accuracy, recall, or running efficiency, they are not explained further here.
Similar to crf_learn, the output is written to standard output, and this output is the main prediction result (the contents of the test file plus the predicted labels). Redirection can likewise be used to save the results; the command line is as follows.
% crf_test -m model_file test_files >> result_file
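For reference, the optional flags mentioned above are typically invoked as follows (file names are placeholders):
% crf_test -v1 -m model_file test_files
% crf_test -n 20 -m model_file test_files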
Usage of the CoNLL 2000 evaluation tool
Before using the evaluation tool, convert all tab characters in the file to be evaluated into spaces, otherwise the evaluation tool will report an error.
The evaluation command is: perl conlleval.pl < <file to be evaluated>
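A minimal sketch of both steps (tab-to-space conversion followed by evaluation), assuming Perl is installed and the prediction results were saved to result_file:
% perl -p -e "s/\t/ /g" result_file > result_space_file
% perl conlleval.pl < result_space_file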
File format
Training files
The training file consists of several sentences (which can be understood as training samples), separated from one another by blank lines. Each token in a sentence occupies one line with several columns of tags; the last column is the label to be predicted. For example, with three columns, the first column (the word) and the second column (its part of speech) are known data, and the third column is the label to be predicted from the first two.
Training Corpus Format:
a) The training corpus should have at least two columns, with columns separated by spaces or tab stops; all rows (except empty lines) must have the same number of columns. Sentences are separated by blank lines (see the example after the label list below).
b) The entity label types in the last column mainly include:
Person Name: PER
Place Name: LOC
Organization Name: ORG
Each entity type starts with B-; for example, the beginning of a place name is labeled B-LOC, and if a place name spans more than one word, the following words are labeled I-LOC.
If a word does not belong to any of the three entity types, it is labeled N.
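An illustrative (made-up) sentence in this format, with a word column, a part-of-speech column, and the entity label column:
John    NNP  B-PER
Smith   NNP  I-PER
visited VBD  N
New     NNP  B-LOC
York    NNP  I-LOC
.       .    N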
The selection of features and the preparation of templates
a) In feature selection, rows are relative and columns are absolute. In general, one selects the M rows before and after the current row and the first n-1 columns (assuming the corpus has n columns; the last column is the label and is not used as a feature). A feature is written as %x[row, column], where column numbering starts at 0.
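For example, a small template in CRF++'s template syntax might look like the following (the window size and column choice here are only an illustration):
# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[0,1]

# Bigram
B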
Test files
The test file has the same format as the training file; this should feel natural to anyone who has used a machine-learning toolkit.
Unlike SVM tools, CRF++ does not write a separate result file; the predictions are written to standard output, which is why the earlier command line redirected the results to a file. The result file has one more column than the test file, namely the predicted label. From the last two columns, the gold label and the predicted label, we can compute the accuracy of the label prediction.
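As a rough sketch of that calculation (assuming the result file is whitespace-separated, the gold label is the second-to-last column, and the prediction is the last column), the tag accuracy could be computed with a short Python script:
# accuracy.py: per-tag accuracy from a CRF++ result file (file name is a placeholder)
correct = 0
total = 0
with open("result_file", encoding="utf-8") as f:
    for line in f:
        cols = line.split()
        if len(cols) < 2:  # skip blank lines between sentences
            continue
        gold, pred = cols[-2], cols[-1]
        total += 1
        if gold == pred:
            correct += 1
print("tag accuracy: %.4f" % (correct / total))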
Summary
Command line (command-line format, parameters, redirection)
Tuning parameters (usually the C value of the training process)
Annotation set (this is important and research-related)
Template file (this is also important and research-related)