CRF++ Named Entity Recognition


Using the Windows version of CRF++

doc folder: the contents of the official homepage.

example folder: training data, test data, and template files for four tasks.

sdk folder: CRF++ header files and static link library.

crf_learn.exe: CRF++'s training program

crf_test.exe: CRF++'s prediction program

libcrfpp.dll: a dynamic-link library required by both the training and prediction programs.

In practice, only three files are needed: crf_learn.exe, crf_test.exe, and libcrfpp.dll.

You can copy these three files (crf_learn.exe, crf_test.exe, libcrfpp.dll) into the folder containing the corpus to be trained, such as one of the subfolders of example, and run the training program there.

Command line:

% crf_learn template_file train_file model_file

Information such as elapsed time and iteration count is printed to the console during training (crf_learn writes to the standard output stream). To save this output to a file, use the following command:

% crf_learn template_file train_file model_file >> train_info_file


The training process prints per-iteration information (iter, terr, serr, obj, diff); these fields are explained below.

There are four main parameters that can be tuned:

-a CRF-L2 or CRF-L1

Regularization algorithm selection. The default is CRF-L2. In general, L2 performs slightly better than L1, while L1 yields far fewer non-zero feature weights.

-c float

This sets the CRF's hyperparameter. The larger c is, the more closely the CRF fits the training data; the parameter thus balances overfitting against underfitting. A good value can be found via cross-validation.

-f NUM

This sets the feature cut-off threshold: CRF++ uses only features that occur at least NUM times in the training data (the default is 1). When applying CRF++ to large-scale data, features that occur only once can number in the millions, and this option is useful in that case.

-p NUM

If your computer has multiple CPUs, multithreading can speed up training; NUM is the number of threads.

Command-line example with two parameters:

% crf_learn -f 3 -c 1.5 template_file train_file model_file
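The effect of -f above can be illustrated with a small sketch. The counting logic below is illustrative only, not CRF++'s actual implementation, and the sample word list is made up:

```python
from collections import Counter

def surviving_features(observations, cutoff):
    """Count how often each observation occurs and keep only those seen
    at least `cutoff` times -- a sketch of what -f does to the feature set."""
    counts = Counter(observations)
    return {feat: n for feat, n in counts.items() if n >= cutoff}

# Toy observation column from a training file.
words = ["the", "the", "the", "city", "city", "rarely"]

print(len(surviving_features(words, 1)))  # 3: every feature kept (default)
print(len(surviving_features(words, 3)))  # 1: only "the" occurs >= 3 times
```

On real corpora the cutoff trades a smaller model against losing rare but possibly informative features.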


a) To train on a corpus, run (in a terminal or at the DOS command line): crf_learn <template file> <training corpus> <model file>

The template and training corpus are prepared in advance; the model file is generated when training completes.

Attention:

1) If a corpus format error is reported, check the file's encoding; CRF++ misreads some encodings.

2) Check that file paths are correct; if a file is not in the current directory, use an absolute path.

b) Meaning of some values printed during training:

iter: number of iterations

terr: tag error rate (fraction of incorrectly tagged tokens)

serr: sentence error rate (fraction of sentences containing at least one tagging error)

obj: current value of the objective function; training ends when this value converges

diff: relative difference from the objective value of the previous iteration
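As a sketch, a per-iteration line of this log can be parsed into a dictionary. The sample line below only approximates typical crf_learn console output, so the exact field layout is an assumption:

```python
def parse_iteration_line(line):
    """Parse one 'key=value' iteration line from a crf_learn-style log
    into a dict of floats."""
    fields = {}
    for token in line.split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = float(value)
    return fields

# Approximation of a crf_learn iteration line (fields assumed).
sample = "iter=23 terr=0.05034 serr=0.31000 obj=1264.40271 diff=0.00120"
stats = parse_iteration_line(sample)
print(stats["terr"])  # 0.05034
```

Tracking the diff field this way makes it easy to see when the objective has converged.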

Command line:

% crf_test -m model_file test_files

Two useful parameters are -v, which displays the probability of the predicted tags, and -n, which outputs the N most probable tag sequences. Neither affects precision, recall, or runtime efficiency, so they are not covered here.

Like crf_learn, crf_test writes to the standard output stream, and this output is the key prediction information (the test file's contents plus the predicted tags). Redirection can likewise be used to save the results:

% crf_test -m model_file test_files >> result_file

Using the CoNLL-2000 evaluation tool

Before using the evaluation tool, convert all tab characters in the file to be evaluated into spaces; otherwise the tool reports an error.
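The tab-to-space conversion is a one-liner; a minimal sketch (the sample line is made up):

```python
def tabs_to_spaces(text):
    """Replace every tab with a single space so conlleval.pl can parse
    the columns (the script errors out on tab separators)."""
    return text.replace("\t", " ")

line = "Washington\tNNP\tB-LOC\tB-LOC"
print(tabs_to_spaces(line))  # Washington NNP B-LOC B-LOC
```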

The evaluation command is: perl conlleval.pl < <result file>

File format

Training file

The training file consists of several sentences (which can be understood as training samples) separated by blank lines. Each line holds one token together with its columns of tags, and the last column is the tag to be predicted. In the example (originally shown as an image) there are three columns: the first (the word) and second (its part of speech) are known data, and the third column is the tag predicted from them.

Training Corpus Format:

a) The training corpus must have at least two columns, separated by spaces or tabs, and every non-blank row must have the same number of columns. Sentences are separated by blank lines.
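These format rules can be checked programmatically; a minimal sketch with a made-up two-sentence corpus:

```python
def check_corpus(lines):
    """Verify the CRF++ corpus rules stated above: blank lines separate
    sentences, and every non-blank line has the same number of
    whitespace-separated columns. Returns the column count or raises."""
    ncols = None
    for i, line in enumerate(lines, 1):
        if not line.strip():
            continue  # blank line: sentence boundary
        cols = len(line.split())
        if ncols is None:
            ncols = cols
        elif cols != ncols:
            raise ValueError(f"line {i}: {cols} columns, expected {ncols}")
    return ncols

corpus = [
    "Confidence NN B-NP",
    "in IN B-PP",
    "",
    "Rockwell NNP B-NP",
]
print(check_corpus(corpus))  # 3
```

Running such a check before crf_learn catches the "corpus format error" mentioned earlier.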

b) Annotation set. The entity types mainly include:

Person Name: PER

Place Name: LOC

Organization Name: ORG

The first word of each entity is tagged with a B- prefix; for example, the first word of a place name is tagged B-LOC, and if the place name spans multiple words, the subsequent words are tagged I-LOC.

If a word belongs to none of the three entity types, it is tagged N.
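The tagging scheme above can be sketched as a small function that converts entity spans to B-/I-/N tags (the sentence and spans are made-up examples):

```python
def tag_sentence(words, entities):
    """Assign B-/I- tags from (start, end, type) entity spans, using N
    for non-entity words as described above. `end` is exclusive."""
    tags = ["N"] * len(words)
    for start, end, etype in entities:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags

words = ["John", "visited", "New", "York", "yesterday"]
entities = [(0, 1, "PER"), (2, 4, "LOC")]
print(tag_sentence(words, entities))
# ['B-PER', 'N', 'B-LOC', 'I-LOC', 'N']
```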

The selection of features and the preparation of templates

a) In feature selection, rows are relative and columns are absolute. Typically the m rows before and after the current row are selected, together with the first n-1 columns (assuming the corpus has n columns; the tag column is excluded). A feature is written %x[row, column], with column numbering starting at 0.
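How %x[row, column] expands can be sketched as follows. This is a simplified re-implementation for illustration, not CRF++'s code; in particular the _B boundary marker is an assumption:

```python
import re

def expand_feature(template, sentence, pos):
    """Expand a CRF++-style feature macro such as 'U00:%x[-1,0]' for the
    token at index `pos`. %x[row,col] picks column `col` of the token
    `row` lines away from the current one; out-of-range rows map to a
    boundary marker, here '_B' (simplified)."""
    def repl(match):
        row, col = int(match.group(1)), int(match.group(2))
        i = pos + row
        if 0 <= i < len(sentence):
            return sentence[i][col]
        return "_B"
    return re.sub(r"%x\[(-?\d+),(\d+)\]", repl, template)

# Three-column sentence: word, part of speech, tag (tag unused here).
sentence = [
    ["He", "PRP", "N"],
    ["reckons", "VBZ", "N"],
    ["the", "DT", "N"],
]
print(expand_feature("U01:%x[0,0]", sentence, 1))   # U01:reckons
print(expand_feature("U02:%x[-1,1]", sentence, 1))  # U02:PRP
```

Each template line is expanded this way at every token position, which is why adding template lines multiplies the feature count.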

Test file

The test file has the same format as the training file, as is usual with machine-learning toolkits.

Unlike SVM tools, CRF++ has no separate result file; predictions go to standard output, which is why the earlier command redirects them to a file. The result file has one more column than the test file, namely the predicted tag. Comparing the last two columns (the gold tag and the predicted tag) gives the tag-prediction accuracy.
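Computing that accuracy from the last two columns can be sketched as follows (the sample result lines are made up):

```python
def tag_accuracy(result_lines):
    """Compare the last two columns of crf_test output: the gold tag
    (second-to-last column) and the predicted tag (last column),
    skipping blank sentence-separator lines."""
    correct = total = 0
    for line in result_lines:
        cols = line.split()
        if len(cols) < 2:
            continue
        total += 1
        if cols[-1] == cols[-2]:
            correct += 1
    return correct / total

result = [
    "London NNP B-LOC B-LOC",
    "calling VBG N N",
    "",
    "Paris NNP B-LOC N",
]
print(tag_accuracy(result))  # 2 correct out of 3
```

For entity-level precision and recall, use the conlleval.pl script described above instead of token accuracy.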

Summary

Command line (command-line format, parameters, redirection)

Tuning parameters (usually the -c value during training)

Annotation sets (this is important and research-related)

Template files (this is also important and research-related)
