Introduction to GATE-Based Information Extraction Systems

GATE: gate.ac.uk/

 

1 Introduction to GATE

GATE (General Architecture for Text Engineering) is a widely used open infrastructure for information extraction. It provides a graphical development environment and is used by many natural language processing projects, especially information extraction research projects. The system supports all stages of language processing, from corpus collection and annotation through reuse to system evaluation.

GATE was designed with three main goals:

1) Provide an infrastructure for language processing software and an overall organizational structure for text processing.

2) Provide reusable components and class libraries for natural language processing, so that they can be embedded in different language processing applications.

3) Provide a language engineering development environment: a convenient graphical environment for researching and developing language processing software, with comprehensive development support and visual debugging.

1.1 CREOLE

The core of the GATE platform is its set of reusable components, CREOLE (a Collection of REusable Objects for Language Engineering). CREOLE components are implemented as Java Beans and come in three types:

Language Resources (LRs): an LR can be understood as the text that the IE system processes. In GATE, Document objects represent the texts to be processed; XML, HTML, PDF, and other formats are currently supported. A Corpus is a collection of Documents that can be processed as a whole.

Processing Resources (PRs): a PR is a language processing module in GATE. Different PRs perform different specific tasks, such as tokenisation or pattern matching.

Visual Resources (VRs): a VR is a visualization and editing component in the GUI.

1.2 ANNIE

GATE's set of reusable resources is used in ANNIE (A Nearly-New IE system), a rule-based English information extraction system. Put simply, ANNIE is a reusable and extensible set of components whose task is information extraction and annotation.

In the GATE GUI, ANNIE appears as an application. It chains a set of PRs into a pipeline and runs it over a corpus or a document to produce annotations on the text. Specifically, a document passes through the pipeline stages in turn: English tokenisation, English gazetteer lookup, English sentence splitting, English part-of-speech tagging, application of the English extraction rules, English named entity recognition, and English coreference resolution. The result is the information extracted from the whole document.
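The pipeline idea can be sketched in a few lines. The following is a minimal illustration in Python, not the GATE API (GATE itself is a Java library): the names Annotation, Document, whitespace_tokeniser, sentence_splitter, and run_pipeline are all hypothetical, and the two PRs are deliberately naive stand-ins for the real ANNIE components.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Annotation:
    type: str                      # e.g. "Token" or "Sentence"
    start: int                     # start character offset (inclusive)
    end: int                       # end character offset (exclusive)
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)


# In this sketch a "PR" is simply a function that adds annotations to a document.
ProcessingResource = Callable[[Document], None]


def whitespace_tokeniser(doc: Document) -> None:
    """Add one Token annotation per whitespace-separated chunk."""
    for m in re.finditer(r"\S+", doc.text):
        doc.annotations.append(
            Annotation("Token", m.start(), m.end(), {"string": m.group()}))


def sentence_splitter(doc: Document) -> None:
    """Naive splitter: a sentence ends at a Token whose last character is . ! or ?"""
    tokens = [a for a in doc.annotations if a.type == "Token"]
    start = None
    for tok in tokens:
        if start is None:
            start = tok.start
        if tok.features["string"][-1] in ".!?":
            doc.annotations.append(Annotation("Sentence", start, tok.end))
            start = None


def run_pipeline(prs: List[ProcessingResource], corpus: List[Document]) -> None:
    """Apply every PR, in order, to every document in the corpus."""
    for doc in corpus:
        for pr in prs:
            pr(doc)


corpus = [Document("GATE is open. It runs pipelines.")]
run_pipeline([whitespace_tokeniser, sentence_splitter], corpus)
sentences = [corpus[0].text[a.start:a.end]
             for a in corpus[0].annotations if a.type == "Sentence"]
print(sentences)
```

The order of the PRs matters: the sentence splitter in this sketch reads the Token annotations left by the tokeniser, which mirrors how later stages in a pipeline build on the annotations of earlier ones.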

The following simple example illustrates ANNIE's information extraction process.

ANNIE annotates the text "July 31, 2000" in the following three steps:

1) Tokeniser: splits the text into "July", "31", ",", and "2000".

2) Gazetteer: looks up "July" in the date gazetteer and finds that it is a month.

3) Named entity grammar: using the grammar rules for dates (defined in JAPE), "July 31, 2000" is recognized as a date.
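The three steps above can be mimicked with a short standalone sketch. It is written in Python and does not use GATE; the function names (tokenise, gazetteer, find_dates), the month list, and the date pattern (month, day, optional comma, four-digit year) are assumptions made for illustration and are much simpler than ANNIE's real tokeniser, gazetteer lists, and JAPE date grammar.

```python
import re

# Tiny stand-in for a date gazetteer list (hypothetical contents).
MONTHS = {"January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"}


def tokenise(text):
    """Step 1: split into words, numbers, and single punctuation marks."""
    return re.findall(r"[A-Za-z]+|\d+|[^\w\s]", text)


def gazetteer(tokens):
    """Step 2: attach a lookup of 'month' to tokens found in the month list."""
    return [{"string": t, "lookup": "month" if t in MONTHS else None}
            for t in tokens]


def find_dates(annotated):
    """Step 3: match Month Day [,] Year and return (start, end) token spans."""
    spans = []
    i, n = 0, len(annotated)
    while i < n:
        j = i
        if (annotated[j]["lookup"] == "month" and j + 1 < n
                and re.fullmatch(r"\d{1,2}", annotated[j + 1]["string"])):
            j += 2
            if j < n and annotated[j]["string"] == ",":
                j += 1          # the comma is optional
            if j < n and re.fullmatch(r"\d{4}", annotated[j]["string"]):
                spans.append((i, j + 1))   # end index is exclusive
                i = j + 1
                continue
        i += 1
    return spans


tokens = tokenise("July 31, 2000")
dates = find_dates(gazetteer(tokens))
print(tokens, dates)
```

Note that the rule in step 3 never looks at the raw text: it only consumes the token strings and the lookup results produced by the first two steps.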

1.3 JAPE

JAPE (Java Annotation Patterns Engine) is used to build a rule base that matches and annotates information in the text with regular-expression-style patterns, supporting tasks such as tokenisation and accurate named entity recognition. A JAPE grammar is a set of rule files, which the JAPE compiler provided by GATE converts into a standard PR.

A JAPE grammar file contains several phases, each of which consists of several rules, and each rule has a left-hand side and a right-hand side.

The left-hand side (LHS) is an annotation pattern that may contain regular expression operators (*, ?, +). The right-hand side (RHS) describes operations on the annotation set. The annotations matched by the LHS are processed according to the operations on the RHS. The following is an example:

Rule: GazLocation
(
  {Lookup.majorType == location}
)
:location -->
  :location.Enamex = {kind = "location", rule = GazLocation}

Here, GazLocation is the name of the rule. The LHS is to the left of --> and the RHS is to the right. On the LHS, the parentheses enclose a match pattern, and location is the label bound to the match; the label passes the matched span from the left side to the right side for annotation. In location.Enamex, Enamex is the annotation type to be created in GATE. Each annotation carries a map of features; kind and rule are features of Enamex that describe the annotation's attributes.

The built-in object Lookup is itself an annotation, created by the gazetteer PR in ANNIE. The gazetteer PR mainly performs dictionary lookup: a Lookup annotation indicates that the word was found in a gazetteer list, and its majorType feature is the category assigned to the word in that list. Another important annotation used on the LHS is Token, created by the English tokeniser PR in ANNIE. It carries information about each word, and Token.string holds the text of the word.
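To make the LHS/RHS split concrete, here is a small sketch of what a rule like GazLocation does, written in Python rather than JAPE. It is not how GATE executes JAPE; the function gaz_location and the dictionary layout of an annotation are assumptions for illustration only.

```python
def gaz_location(annotations):
    """Apply a GazLocation-like rule to a list of annotation dicts."""
    created = []
    for ann in annotations:
        # LHS: match a Lookup annotation whose majorType feature is "location".
        if (ann["type"] == "Lookup"
                and ann["features"].get("majorType") == "location"):
            # RHS: create an Enamex annotation over the matched span.
            created.append({
                "type": "Enamex",
                "start": ann["start"],
                "end": ann["end"],
                "features": {"kind": "location", "rule": "GazLocation"},
            })
    return annotations + created


text = "She moved from Paris to IBM."
# Hypothetical gazetteer output: character offsets into `text`.
lookups = [
    {"type": "Lookup", "start": 15, "end": 20,
     "features": {"majorType": "location"}},
    {"type": "Lookup", "start": 24, "end": 27,
     "features": {"majorType": "organization"}},
]
result = gaz_location(lookups)
```

The existing Lookup annotations are kept and the new Enamex annotation is added alongside them, so later rules can still match on either type.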

The RHS can also contain arbitrary Java code for more complex annotation operations. For example, to disambiguate, you may need to delete annotations that have already been created and replace them with the correct ones. For the required APIs, see Chapter 5 of [1]. For detailed JAPE usage, see Chapter 7 of [2].

To use JAPE in GATE, create a JAPE Transducer PR and pass the path of the JAPE file as its grammarURL parameter. The PR then performs the annotation operations defined in the grammar file.

1.4 GATE-based development

There are two main ways to develop an information extraction system with GATE. The first is to add suitable PRs (existing plugins or self-written PRs that comply with the CREOLE standard) in the GATE GUI to form a pipeline application, and then run it over a document LR or a corpus LR. This approach depends on the GATE GUI, but it is very convenient for quickly building a prototype and for debugging in the early stages of development.

The second is to use GATE as a library and build a standalone program that is independent of the GATE GUI. The usual procedure is to use the GATE API to initialize, in order, the GATE framework, the ANNIE application, the PRs, and the corpus and other LRs (these PRs and LRs are essentially JavaBeans), and then run the application and process the output. For sample code, see the goldfish example in [3].

2 Machine learning PR in GATE

2.1 API Introduction

An important approach to information extraction is the learning-based approach, which requires statistical machine learning algorithms and a large amount of labeled training data.

In GATE, machine learning is provided mainly by the Machine Learning PR, which wraps several existing learning packages such as WEKA, MAXENT, and SVM Light. Its main job is to convert linguistic features in the LRs, such as annotation features, into the input format of the chosen learning algorithm, call that algorithm, and convert its output back into the annotation format used in GATE. The results of the algorithm can then be evaluated.

SVM-based named entity recognition in GATE machine learning is mainly based on [4]. The SVM Light wrapper requires the SVM Light executables svm_learn and svm_classify [5]. They are not included in the GATE release package and must be downloaded separately from the SVM Light project homepage. The SVM Light wrapper annotates text in two stages. In the first stage, svm_learn builds an SVM model from the labeled training data. In the second stage, this model is used to classify preprocessed feature vectors. The two stages correspond to the two modes of the ML PR: training and application.
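As a rough illustration of the two stages, the sketch below builds the command lines for the two SVM Light executables in Python. The argument order follows the standard SVM Light usage (svm_learn takes options, an example file, and a model file; svm_classify takes an example file, a model file, and an output file). The file names are hypothetical, and this is not how the GATE wrapper itself invokes the tools. Of the options shown, -c is the trade-off parameter, -t 0 selects a linear kernel, and -m sets the kernel cache size in MB.

```python
import shlex


def svm_learn_command(options, train_file, model_file, exe="svm_learn"):
    """Stage 1 (training): svm_learn [options] example_file model_file."""
    return [exe, *shlex.split(options), train_file, model_file]


def svm_classify_command(test_file, model_file, output_file, exe="svm_classify"):
    """Stage 2 (application): svm_classify example_file model_file output_file."""
    return [exe, test_file, model_file, output_file]


# Hypothetical file names, for illustration only.
train_cmd = svm_learn_command("-c 0.7 -t 0 -m 100", "train.dat", "ner.model")
apply_cmd = svm_classify_command("test.dat", "ner.model", "predictions.txt")
print(" ".join(train_cmd))
print(" ".join(apply_cmd))
```

Once the executables are installed and on the PATH, each list could be passed to subprocess.run(cmd, check=True); the sketch stops short of that so it can be read without SVM Light installed.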

The ML API converts annotated Document objects in GATE into the linguistic features and feature vectors that the SVM learner takes as input. The files containing the feature vectors are stored separately so that they can be used elsewhere. To use the ML API, first provide a set of well-annotated documents, then preprocess them to obtain the linguistic features for learning; the various JAPE-based PRs in GATE make this preprocessing easier. Finally, provide a configuration file for the ML API.
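To give a feel for what such a feature vector file looks like, the following Python sketch turns a nominal feature (a token's part-of-speech category) into lines in the SVM Light input format, where each line is a target (+1 or -1) followed by feature:value pairs with feature numbers in ascending order, starting at 1. The one-hot encoding, the function names, and the sample tokens are assumptions for illustration; the encoding GATE actually produces may differ.

```python
def build_feature_index(values):
    """Map each nominal value to a feature number; SVM Light numbers start at 1."""
    return {value: number for number, value in enumerate(values, start=1)}


def to_svmlight_line(is_positive, active_values, index):
    """Format one instance as '<target> <feature>:<value> ...' in ascending order."""
    numbers = sorted({index[v] for v in active_values if v in index})
    target = "+1" if is_positive else "-1"
    return " ".join([target] + [f"{n}:1" for n in numbers])


index = build_feature_index(["NN", "NNP", "NNPS"])
# Hypothetical training tokens: (POS category, is the token part of an entity?)
tokens = [("NNP", True), ("NN", False), ("VBZ", False)]
lines = [to_svmlight_line(label, [pos], index) for pos, label in tokens]
print("\n".join(lines))
```

A value that is not in the index (VBZ here) simply contributes no feature, which is one reason a nominal attribute needs its full value list declared up front.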

2.2 PR configuration file

The ML PR uses an external XML configuration file to specify the algorithm and the dataset it requires. The configuration file format is as follows:

<?xml version="1.0" encoding="windows-1252"?>
<ML-CONFIG>
  <DATASET> ... </DATASET>
  <ENGINE> ... </ENGINE>
</ML-CONFIG>

The dataset and engine formats are as follows:

<DATASET>
  <INSTANCE-TYPE>Token</INSTANCE-TYPE>
  <ATTRIBUTE>
    <NAME>POS_category(0)</NAME>
    <TYPE>Token</TYPE>
    <FEATURE>category</FEATURE>
    <POSITION>0</POSITION>
    <VALUES>
      <VALUE>NN</VALUE>
      <VALUE>NNP</VALUE>
      <VALUE>NNPS</VALUE>
      ...
    </VALUES>
    [<CLASS/>]
  </ATTRIBUTE>
  ...
</DATASET>

<ENGINE>
  <WRAPPER>gate.creole.ml.svmlight.SVMLightWrapper</WRAPPER>
  <OPTIONS>
    <CLASSIFIER-OPTIONS>-c 0.7 -t 0 -m 100 -tau 0.4</CLASSIFIER-OPTIONS>
  </OPTIONS>
</ENGINE>

Note that the configuration file formats in GATE v3 and v4 differ considerably, especially in the <ENGINE> section. In <DATASET>, <INSTANCE-TYPE> names the annotation type that serves as the learning instance. An <ATTRIBUTE> can be a feature of that annotation or of another annotation, and its <POSITION> child node gives the position of the attribute's annotation relative to the instance. In this way the definition can capture the relationship between several neighboring annotation instances. There are three types of <ATTRIBUTE>: boolean, nominal, and numeric. A nominal attribute must list all of its possible values in <VALUE> nodes.
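As a sketch of how such a file can be read, the Python snippet below parses a small, complete configuration with the standard library's xml.etree.ElementTree. It assumes uppercase tag names and a well-formed file; the "..." placeholders and the optional [<CLASS/>] marker from the template are left out, the XML declaration is omitted because the text is parsed from a string, and the function name parse_config and the returned dictionary layout are hypothetical. This is not the parser GATE uses.

```python
import xml.etree.ElementTree as ET

CONFIG = """
<ML-CONFIG>
  <DATASET>
    <INSTANCE-TYPE>Token</INSTANCE-TYPE>
    <ATTRIBUTE>
      <NAME>POS_category(0)</NAME>
      <TYPE>Token</TYPE>
      <FEATURE>category</FEATURE>
      <POSITION>0</POSITION>
      <VALUES>
        <VALUE>NN</VALUE>
        <VALUE>NNP</VALUE>
        <VALUE>NNPS</VALUE>
      </VALUES>
    </ATTRIBUTE>
  </DATASET>
  <ENGINE>
    <WRAPPER>gate.creole.ml.svmlight.SVMLightWrapper</WRAPPER>
    <OPTIONS>
      <CLASSIFIER-OPTIONS>-c 0.7 -t 0 -m 100 -tau 0.4</CLASSIFIER-OPTIONS>
    </OPTIONS>
  </ENGINE>
</ML-CONFIG>
"""


def parse_config(xml_text):
    """Extract the instance type, attribute definitions, and engine settings."""
    root = ET.fromstring(xml_text.strip())
    dataset = root.find("DATASET")
    attributes = []
    for node in dataset.findall("ATTRIBUTE"):
        attributes.append({
            "name": node.findtext("NAME").strip(),
            "type": node.findtext("TYPE").strip(),
            "feature": (node.findtext("FEATURE") or "").strip(),
            # Position of the attribute's annotation relative to the instance.
            "position": int(node.findtext("POSITION")),
            # Nominal attributes list every possible value.
            "values": [v.text.strip() for v in node.findall("VALUES/VALUE")],
            # An attribute containing <CLASS/> is the one to be learned.
            "is_class": node.find("CLASS") is not None,
        })
    return {
        "instance_type": dataset.findtext("INSTANCE-TYPE").strip(),
        "attributes": attributes,
        "wrapper": root.findtext("ENGINE/WRAPPER").strip(),
        "classifier_options":
            root.findtext("ENGINE/OPTIONS/CLASSIFIER-OPTIONS").split(),
    }


config = parse_config(CONFIG)
print(config["instance_type"], config["attributes"][0]["values"])
```

Here the attribute POS_category(0) has position 0, so it describes the instance token itself; an attribute with a position of -1 or 1 would, under the same reading, describe the token before or after it.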

References:

[1] GATE Programmer's Guide [EB/OL]. http://gate.ac.uk/sale/pg/pg.pdf

[2] Developing Language Processing Components with GATE Version 4 (a User Guide) [EB/OL]. http://gate.ac.uk/sale/tao/index.html

[3] GATE example code [EB/OL]. http://gate.ac.uk/gate-examples/doc/index.html

[4] Li Y, Bontcheva K, Cunningham H. SVM based learning system for information extraction [J]. Deterministic and Statistical Methods in Machine Learning, 2005: 319-339.

[5] SVM light [EB/OL]. http://svmlight.joachims.org

[6] Development and Implementation of a GATE-Based Chinese Information Extraction System [Master's thesis]. Beijing: Graduate School of the Chinese Academy of Sciences (Document and Intelligence Center), 2006.

 

