1 Introduction to GATE
GATE (General Architecture for Text Engineering) is an open infrastructure that is widely used for information extraction. It provides a graphical development environment and is used by many natural language processing projects, especially information extraction research projects. The system supports every stage of language processing, from corpus collection and annotation to reuse and system evaluation.
GATE was designed with three main goals:
1) To provide an infrastructure for language processing software and an overall organisational structure for text processing.
2) To provide reusable components and class libraries for natural language processing that can be embedded in applications that process different languages.
3) To provide a language engineering development environment: a convenient graphical environment for researching and developing language processing software, with comprehensive development support and visual debugging.
1.1 CREOLE
The core of the GATE platform is its set of reusable components, CREOLE (a Collection of REusable Objects for Language Engineering). CREOLE components are implemented as JavaBeans and come in three types:
Language Resources (LRs): an LR can be understood as the text that IE processes. In GATE, Document objects represent the texts to be processed; XML, HTML, PDF, and other formats are currently supported. A Corpus is a collection of Documents that can be processed as a whole.
Processing Resources (PRs): a PR is a language processing module in GATE. Different PRs perform different specific tasks, such as tokenisation and pattern matching.
Visual Resources (VRs): a VR is a visualisation and editing component in the GUI.
1.2 ANNIE
GATE's reusable resources are brought together in ANNIE (A Nearly-New IE system), a rule-based English information extraction system. Put simply, ANNIE is a reusable and extensible set of components whose task is to extract and annotate information.
In the GATE GUI, ANNIE corresponds to an application. It chains a set of PRs into a pipeline that runs over a corpus or a document to produce text annotations. Specifically, each document passes through the pipeline in strict order: English tokenisation, English gazetteer lookup, sentence splitting, part-of-speech tagging, application of the English extraction rules, English named entity recognition, and English coreference resolution. Together these steps extract information from the whole document.
The following simple example illustrates ANNIE's information extraction process.
ANNIE annotates the text "July 31, 2000" in three steps:
1) Tokeniser: splits the text into "July", "31", ",", and "2000".
2) Gazetteer: looks up "July" in the date dictionary and finds that it is a month.
3) Named entity grammar: applies the grammar rules for dates (defined in JAPE) and recognises "July 31, 2000" as a date.
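The three steps above can be sketched in a few lines of plain Python. This is only a toy illustration of the idea, not GATE or ANNIE code: the function names, the month list, the annotation dictionaries, and the single date pattern are all assumptions made for this example.

```python
import re

# Toy month dictionary standing in for a gazetteer list (assumption).
MONTHS = {"January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"}

def tokenise(text):
    """Step 1: split text into (string, start, end) tokens."""
    pattern = r"[A-Za-z]+|\d+|[^\sA-Za-z\d]"
    return [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]

def gazetteer(tokens):
    """Step 2: emit a Lookup-style record for every token found in MONTHS."""
    return [{"start": s, "end": e, "majorType": "date", "minorType": "month"}
            for (word, s, e) in tokens if word in MONTHS]

def date_rule(text, tokens, lookups):
    """Step 3: match the pattern  month  number  ','  four-digit year."""
    month_starts = {l["start"] for l in lookups if l["minorType"] == "month"}
    dates = []
    for i in range(len(tokens) - 3):
        month, day, comma, year = tokens[i:i + 4]
        if (month[1] in month_starts and day[0].isdigit()
                and comma[0] == "," and re.fullmatch(r"\d{4}", year[0])):
            dates.append({"type": "Date", "start": month[1], "end": year[2],
                          "text": text[month[1]:year[2]]})
    return dates

text = "July 31, 2000"
tokens = tokenise(text)
lookups = gazetteer(tokens)
dates = date_rule(text, tokens, lookups)
print([t[0] for t in tokens])  # ['July', '31', ',', '2000']
print(dates[0]["text"])        # July 31, 2000
```

Each stage only reads what the previous stage produced, which is why the order of the PRs in the pipeline matters.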
1.3 JAPE
JAPE (Java Annotation Patterns Engine) is used to build rule bases. Its rules use regular-expression-style patterns to match information in the text and annotate it, which supports tokenisation, sentence splitting, and more accurate named entity recognition. JAPE rules are written as a grammar file, and the JAPE compiler provided in GATE converts this file into a standard PR.
A JAPE grammar file contains several phases, each phase consists of several rules, and each rule has a left part and a right part. The left-hand side (LHS) is an annotation pattern that may contain regular expression operators (*, ?, +). The right-hand side (RHS) contains the annotation manipulation statements. Annotations matched by the LHS are processed according to the operations on the RHS. The following is an example:
Rule: GazLocation
(
 {Lookup.majorType == location}
)
:location -->
 :location.Enamex = {kind = "location", rule = GazLocation}
Here GazLocation is the name of the rule; the part to the left of --> is the LHS and the part to the right is the RHS. The parentheses in the LHS delimit a match, and the label location that follows them passes the matched span to the RHS for annotation. Enamex in location.Enamex is the annotation created in GATE. Each annotation has an associated feature map; kind and rule are features that describe the annotation's attributes.
The built-in object Lookup is actually an annotation produced by the Gazetteer in ANNIE. The Gazetteer PR mainly performs dictionary lookup: a Lookup annotation indicates that the word was found in a dictionary, and its feature majorType is the class assigned to the word in that dictionary. Another important object used in the LHS is Token, the annotation produced by the English Tokeniser PR in ANNIE. It carries the textual information of each word, and Token.string is the text of the word the token represents.
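To make the LHS/RHS split of the example rule concrete, here is a minimal Python sketch of what such a rule does conceptually. It is not how GATE executes JAPE; the dictionary-based annotation layout and the function name gaz_location are assumptions made for illustration.

```python
def gaz_location(annotations):
    """Toy equivalent of the example rule.

    LHS: match each Lookup annotation whose majorType feature is "location".
    RHS: create an Enamex annotation over the same span with the features
         kind and rule.
    """
    created = []
    for ann in annotations:
        is_lookup = ann["type"] == "Lookup"
        if is_lookup and ann["features"].get("majorType") == "location":
            created.append({
                "type": "Enamex",
                "start": ann["start"],
                "end": ann["end"],
                "features": {"kind": "location", "rule": "GazLocation"},
            })
    return created

text = "She flew to London"
annotations = [
    # Produced by a tokeniser: carries the string of the word.
    {"type": "Token", "start": 12, "end": 18, "features": {"string": "London"}},
    # Produced by a gazetteer: the word was found in a location list.
    {"type": "Lookup", "start": 12, "end": 18, "features": {"majorType": "location"}},
]

result = gaz_location(annotations)
print(text[result[0]["start"]:result[0]["end"]], result[0]["features"])
```

The Token annotation is ignored because the pattern only asks for Lookup; a rule that tested Token.string would read the string feature in the same way.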
The RHS can also contain Java code for more complex annotation operations; for the APIs required, see Chapter 5 of [1]. For example, to resolve ambiguities you may need to delete annotations that have already been created and replace them with the correct ones. For detailed usage of JAPE, refer to Chapter 7 of [2].
To use JAPE in GATE, create a JAPE Transducer PR and pass the path of the JAPE file as its grammarURL parameter. This PR then performs the annotation operations defined in the grammar file.
1.4 GATE-based development
There are two main ways to use GATE to develop an information extraction system. The first is to add the appropriate PRs in the GATE GUI (either existing plugins or self-written ones that conform to the CREOLE standard) to form a pipeline application, and then run it on a document LR or a corpus LR. This method depends on the GATE GUI, but it is very convenient for quickly building a prototype and debugging at the early stage of development.
The second is to use GATE as a library and build a standalone program that does not depend on the GATE GUI. In this case, the general procedure is to initialise the application and its PRs, corpus, and other LRs through the GATE API (these PRs and LRs are essentially JavaBeans), then run the application and process the output. For sample code, see the Goldfish example in [3].
2 The Machine Learning PR in GATE
2.1 Introduction to the ML API
One important way to implement information extraction is the learning-based approach, which requires statistical machine learning algorithms and a large amount of labelled training data.
Machine learning in GATE is provided mainly by the Machine Learning PR, which wraps several existing learning tools such as WEKA, MAXENT, and SVM Light. It converts linguistic features in LRs, such as annotation features, into the input format of each learning algorithm, calls the algorithm, and then converts the results back into annotations. The performance of the algorithm can also be evaluated.
SVM-based named entity recognition in GATE machine learning is mainly based on [4]. The SVM Light wrapper requires the SVM Light [5] executables svm_learn and svm_classify, which are not included in the GATE release package and must be downloaded separately from the project homepage. Annotating text with the SVM Light wrapper takes two stages. In the first stage, svm_learn generates an SVM model from the labelled training data. In the second stage, this SVM model is used to make predictions on new text. These two stages correspond to the two modes of the ML PR: training and application.
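As a rough sketch of the two stages, the snippet below only assembles the two command lines, following the usage documented for SVM Light (svm_learn takes an example file and a model file; svm_classify takes an example file, a model file, and an output file). The file names are hypothetical, and how GATE's wrapper actually invokes the executables is not covered here.

```python
def svm_learn_command(train_file, model_file, options=()):
    """Stage 1 (training): svm_learn [options] example_file model_file."""
    return ["svm_learn", *options, train_file, model_file]

def svm_classify_command(test_file, model_file, output_file):
    """Stage 2 (application): svm_classify example_file model_file output_file."""
    return ["svm_classify", test_file, model_file, output_file]

# Hypothetical file names, chosen only for this example.
train_cmd = svm_learn_command("train.dat", "model.dat", options=("-c", "0.7"))
apply_cmd = svm_classify_command("test.dat", "model.dat", "predictions.txt")

print(" ".join(train_cmd))
print(" ".join(apply_cmd))
# With the executables on PATH, each list could be handed to
# subprocess.run(cmd, check=True) to actually run the stage.
```

The model file written in the first stage is the only thing the second stage needs from training, which is why the two modes can be run at different times.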
The ML API converts annotated Document objects in GATE into the linguistic features and feature vectors that the SVM learner requires as input. The files containing the feature vectors are stored separately so that they can be used elsewhere. To use the ML API, you first need to provide a set of well-annotated documents, and then pre-process these documents to obtain the linguistic features used for learning. The various JAPE-based PRs in GATE make this pre-processing easier. Finally, you need to provide a configuration file for the ML API.
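The conversion from linguistic features to feature vectors can be pictured with the sketch below, which writes lines in the sparse text format that SVM Light reads (a target followed by index:value pairs with increasing indices). The one-hot encoding, the attribute names, and the sample instances are assumptions for illustration; the article does not describe the exact encoding GATE uses.

```python
def build_feature_index(instances):
    """Give every distinct 'attribute=value' string a 1-based integer id."""
    index = {}
    for features, _label in instances:
        for name, value in sorted(features.items()):
            key = f"{name}={value}"
            if key not in index:
                index[key] = len(index) + 1
    return index

def to_svmlight_line(features, label, index):
    """Encode one instance as '<target> <id>:1 <id>:1 ...' with ascending ids."""
    keys = (f"{name}={value}" for name, value in features.items())
    ids = sorted(index[k] for k in keys if k in index)
    target = "+1" if label else "-1"
    return " ".join([target] + [f"{i}:1" for i in ids])

# Hypothetical token instances: POS tag of the token and of its left neighbour,
# labelled True when the token is part of a location name.
instances = [
    ({"pos(0)": "NNP", "pos(-1)": "IN"}, True),
    ({"pos(0)": "IN", "pos(-1)": "VBD"}, False),
    ({"pos(0)": "NNP", "pos(-1)": "VBD"}, True),
]

index = build_feature_index(instances)
lines = [to_svmlight_line(f, y, index) for f, y in instances]
print("\n".join(lines))
```

The feature index has to be kept alongside the model, because the same attribute=value pair must map to the same id at application time.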
2.2 Configuration file of the ML PR
The ML PR uses an external XML configuration file to specify the learning algorithm and the dataset it needs. The configuration file has the following format:
<?xml version="1.0" encoding="windows-1252"?>
<ML-CONFIG>
  <DATASET>...</DATASET>
  <ENGINE>...</ENGINE>
</ML-CONFIG>
The dataset and engine formats are as follows:
<DATASET>
  <INSTANCE-TYPE>Token</INSTANCE-TYPE>
  <ATTRIBUTE>
    <NAME>POS_category(0)</NAME>
    <TYPE>Token</TYPE>
    <FEATURE>category</FEATURE>
    <POSITION>0</POSITION>
    <VALUES>
      <VALUE>NN</VALUE>
      <VALUE>NNP</VALUE>
      <VALUE>NNPS</VALUE>
      ...
    </VALUES>
    [<CLASS/>]
  </ATTRIBUTE>
  ...
</DATASET>
<ENGINE>
  <WRAPPER>gate.creole.ml.svmlight.SVMLightWrapper</WRAPPER>
  <OPTIONS>
    <CLASSIFIER-OPTIONS>-c 0.7 -t 0 -m 100 -tau 0.4</CLASSIFIER-OPTIONS>
  </OPTIONS>
</ENGINE>
Note that the configuration file definitions in GATE V3 and V4 differ considerably, especially in the <ENGINE> section. In <DATASET>, <INSTANCE-TYPE> names a specific annotation type. An <ATTRIBUTE> can be either a feature of that annotation or another annotation, and its <POSITION> child node gives the position of the <ATTRIBUTE> relative to the <INSTANCE-TYPE>, so the definition can express the relative arrangement of several annotations. An <ATTRIBUTE> has one of three types: boolean, nominal, or numeric. A nominal attribute must list all of its possible values in <VALUE> nodes.
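The role of the position and value-list nodes can be sketched as follows. The snippet parses a small dataset fragment with Python's standard XML parser and computes the attribute values for one instance. The upper-case tag names, the second attribute at position -1, and the choice to return None for out-of-range or unlisted values are assumptions for this example, not documented GATE behaviour.

```python
import xml.etree.ElementTree as ET

# Hypothetical dataset fragment with two nominal attributes.
CONFIG = """
<DATASET>
  <INSTANCE-TYPE>Token</INSTANCE-TYPE>
  <ATTRIBUTE>
    <NAME>POS_category(-1)</NAME>
    <TYPE>Token</TYPE>
    <FEATURE>category</FEATURE>
    <POSITION>-1</POSITION>
    <VALUES><VALUE>NN</VALUE><VALUE>NNP</VALUE><VALUE>IN</VALUE></VALUES>
  </ATTRIBUTE>
  <ATTRIBUTE>
    <NAME>POS_category(0)</NAME>
    <TYPE>Token</TYPE>
    <FEATURE>category</FEATURE>
    <POSITION>0</POSITION>
    <VALUES><VALUE>NN</VALUE><VALUE>NNP</VALUE><VALUE>IN</VALUE></VALUES>
  </ATTRIBUTE>
</DATASET>
"""

def read_attributes(xml_text):
    """Read name, feature, relative position, and allowed values per attribute."""
    root = ET.fromstring(xml_text)
    return [{
        "name": node.findtext("NAME").strip(),
        "feature": node.findtext("FEATURE").strip(),
        "position": int(node.findtext("POSITION")),
        "values": [v.text.strip() for v in node.findall("VALUES/VALUE")],
    } for node in root.findall("ATTRIBUTE")]

def attribute_values(tokens, i, attrs):
    """Attribute values for the instance at index i, using relative positions."""
    row = {}
    for a in attrs:
        j = i + a["position"]
        value = tokens[j].get(a["feature"]) if 0 <= j < len(tokens) else None
        # A nominal attribute only admits values listed in the configuration.
        row[a["name"]] = value if value in a["values"] else None
    return row

tokens = [{"string": "in", "category": "IN"},
          {"string": "London", "category": "NNP"}]
attrs = read_attributes(CONFIG)
print(attribute_values(tokens, 1, attrs))
```

For the token "London", the attribute at position -1 reads the category of "in" and the attribute at position 0 reads the category of "London" itself, which is how one instance can see its context.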
References:
[1] GATE Programmer's Guide [EB/OL]. http://gate.ac.uk/sale/pg/pg.pdf.
[2] Developing Language Processing Components with GATE Version 4 (a User Guide) [EB/OL]. http://gate.ac.uk/sale/tao/index.html.
[3] GATE Example Code [EB/OL]. http://gate.ac.uk/gate-examples/doc/index.html.
[4] Li Y, Bontcheva K, Cunningham H. SVM Based Learning System for Information Extraction [J]. Deterministic and Statistical Methods in Machine Learning, 2005: 319~339.
[5] SVM Light [EB/OL]. http://svmlight.joachims.org.
[6] Development and Implementation of a GATE-based Chinese Information Extraction System [Master's thesis]. Beijing: Graduate School of the Chinese Academy of Sciences (Documentation and Information Center), 2006.
Posted by: Tully