Research on Key Techniques of Chinese Event Extraction (Dr. Tan Hongye)

Definition of the event extraction task

ACE 2005 defines this task as identifying events of specific types and determining and extracting the relevant information, chiefly the event type and subtype and the event argument roles. Under this definition, event extraction divides into two core subtasks: (1) event detection and type recognition; (2) event argument role extraction. In addition, because most argument roles are filled by entities, entity recognition is also a basic task of event extraction.

Definition of Information Extraction

The definition proposed by Andrew McCallum has general currency. He defines information extraction as (A. McCallum. Information extraction: distilling structured data from unstructured text. ACM Queue, 2005: 49-57):

Filling in database fields from unstructured or loosely structured text to form records. Two subtasks are involved: ① segmentation, which identifies the boundaries of the text segments that can fill database fields; ② classification, which decides which database field each segment should fill. In most systems, segmentation and classification are performed simultaneously.

Limitations of Information Extraction

Information extraction systems still have some limitations: 1. low accuracy; 2. poor portability; 3. limited control of uncertainty.

Information Extraction Methods

Main Methods:

1. Knowledge-engineering methods: linguists and domain experts observe a set of relevant documents and manually write rules for the extraction task to extract the relevant information (dominant through the early 1990s; best suited to well-formatted text).

2. Statistical and machine learning methods

2.1 Limitations of rule-based methods (e.g., decision tree rules): the expressive power of the patterns is limited, and complex or cross-sentence patterns are difficult to obtain.

2.2 Statistics-based methods: a large number of simple features is used in place of a few elaborate hand-crafted ones. Typical models include HMM, CRF, MEMM, and NB (Naive Bayes).

2.3 Hybrids of multiple machine learning methods.

Notable information extraction research teams:

Cymfony, Bhasha, Linguamatics, RevSolutions, New York University, the University of California, the University of Utah, the University of Washington, etc. Institutions in the UK, Germany, Italy, Ireland, and other countries are also conducting this research.

Well-known systems:

AutoSlog, CRYSTAL, Proteus, WIEN, SoftMealy, STALKER, WHISK, SRV, RAPIER

Extraction precision by text format

Highly regular text (databases, web pages generated from databases): almost perfect performance.

Regular text (news, etc.): about 95%.

Irregular text: the precision of link extraction is generally around 60%.

Analysis of research trends

In the future, research should focus on machine learning, so that systems can be adapted to new domains and new data formats with minimal manual intervention, and can quickly process large document collections unrestricted by format or domain.

(1) simple training and semi-supervised learning.

(2) Interactive Extraction.

(3) uncertainty estimation and management of multiple assumptions.

Core tasks of event extraction

Recognition of event mentions, determination of event attributes, and identification of argument roles.

Event attribute information: type and subtype [the most important], modality, polarity, genericity, and tense.

Argument role fillers: entities, values, and times.

Main event extraction methods

1. Hybrid methods based on multiple machine learning models (one per subtask)

2. Semi-supervised and unsupervised learning methods

Entity Recognition Method

(1) Rule-based methods. Most early named entity recognition systems adopted this approach, specifically: decision tree methods, transformation-based methods, and grammar-based methods.

(2) Discriminative methods based on geometric feature spaces, including SVM, Fisher discriminant analysis, and neural networks. (I am more interested in these.)

(3) Probabilistic and statistical methods, the mainstream technology for named entity recognition, specifically: Bayesian discriminant methods, n-gram models, HMM, ME (maximum entropy), MEMM, and CRF models.

Main methods of semi-supervised learning

Self-training, co-training, transductive SVM, and graph-based methods.

Self-training (also called self-teaching or bootstrapping)

Its main idea: first train an initial classifier on a small amount of labeled data (or an initial seed set); then use it to classify unlabeled data and add the most confidently labeled items to the labeled data; repeat on the expanded labeled dataset until a more accurate classifier is obtained.

Limitations: (1) different initial seeds lead to different classifier performance and convergence speeds; (2) classification errors made during bootstrapping are gradually amplified by the self-training process and can cause it to fail. The evaluation and selection of the initial seeds and of newly labeled instances are therefore the key to this algorithm. (Back when I was doing unsupervised work, I did not realize this was a mature method; it seems I had read too little.)
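A minimal self-training loop might look like the following sketch; the nearest-centroid base classifier, the fixed number of rounds, and the "one most confident item per round" policy are all illustrative assumptions:

```python
# Self-training sketch: a nearest-centroid classifier stands in for the base
# learner; thresholds and round counts are illustrative assumptions.

def train(labeled):
    """Fit per-class centroids from (features, label) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums.setdefault(y, [0.0] * len(x))
        counts[y] = counts.get(y, 0) + 1
        sums[y] = [a + b for a, b in zip(sums[y], x)]
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(model, x):
    """Return (label, confidence); confidence is negated centroid distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, x)) ** 0.5
    label = min(model, key=lambda y: dist(model[y]))
    return label, -dist(model[label])

def self_train(labeled, unlabeled, rounds=5, per_round=1):
    """Repeatedly label the most confident unlabeled points and retrain."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        model = train(labeled)
        scored = sorted(((predict(model, x), x) for x in unlabeled),
                        key=lambda t: t[0][1], reverse=True)
        for (label, _), x in scored[:per_round]:
            labeled.append((x, label))
            unlabeled.remove(x)
    return train(labeled)
```

A different seed set or a noisier confidence score changes both the result and the convergence speed, which is exactly the limitation noted above.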

Seed selection: instance seeds or pattern seeds.

Scoring function: the simplest is a count or a probability.

Patterns

A pattern in information extraction is a linguistic expression that conveys relation and event information in a specific domain.

A pattern consists of multiple items, or slots: extraction items, trigger items, and constraint items. An extraction item is also called a target item. A constraint item, also called a constraint condition, is mainly used to locate the information relevant to the target item in the text and to ensure that the extracted information is accurate; constraints are chiefly syntactic and semantic. A trigger item triggers the matching of the pattern against a text fragment.

(Are the contents extracted by these patterns then fed into a classifier?)

Patterns differ mainly in the following respects:

(1) Extraction granularity. Some patterns extract the exact target item directly, while others extract a syntactic constituent that contains the target item.

(2) Constraint strength. The more constraints a pattern has, especially semantic constraints, the stronger its constraint strength. Greater constraint strength makes the pattern more rigorous, which ensures the accuracy of the extracted target item but reduces the pattern's expressiveness, or coverage.

(3) Extraction efficiency. Some patterns extract multiple target items at a time (multi-slot extraction patterns), while others extract only one target item at a time (single-slot extraction patterns). For example, Pattern 1 is a single-slot extraction pattern; with such patterns, the system must generate a corresponding pattern for each target item.
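Since Pattern 1 itself is not reproduced here, a hypothetical pair of regular expressions can illustrate the single-slot / multi-slot distinction (the sentence and both patterns are invented examples):

```python
import re

sentence = "ACME Corp. appointed Jane Doe as CEO."

# Single-slot pattern: one target item per match (the appointee only).
single_slot = re.compile(r"appointed (?P<person>[A-Z][a-z]+ [A-Z][a-z]+)")

# Multi-slot pattern: several target items extracted in one match.
multi_slot = re.compile(
    r"(?P<org>[A-Z]\w+ Corp\.) appointed "
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) as (?P<post>[A-Z]+)"
)

m1 = single_slot.search(sentence)
m2 = multi_slot.search(sentence)
```

With single-slot patterns, a separate pattern would be needed for the organization and for the post.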

Bootstrapping method

(1) Manually create the initial seed set S_seed; set the candidate pattern set P_cand = NULL and the accepted pattern set P_accepted = NULL.

(2) Build the candidate pattern set P_cand: based on the seed set S_seed, extract context patterns with a window size of L from the training corpus and add them to P_cand.

(3) Add the selected patterns to the accepted pattern set P_accepted: use an evaluation function f_pattern to score each pattern in P_cand, sort the patterns by score, and add those that meet certain conditions to P_accepted.

(4) Use the accepted pattern set P_accepted to identify relevant entities, forming the candidate instance set I_cand.

(5) Decide whether the iteration ends: if the candidate instance set is stable (no new entity names are identified), or a set number of iterations has been reached, or the accepted pattern set has reached a certain size, stop; otherwise, execute (6).

(6) Identify new seeds from trusted instances: first set S_seed = NULL, then use an evaluation function f_instance to score each instance in I_cand and sort the instances by score; instances that meet certain conditions are trusted and are added to the seed set S_seed.

(7) Return to step (2) and continue the loop.
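The seven steps can be sketched as the loop below; `extract_patterns`, `apply_patterns`, and the default scoring functions are toy stand-ins (left-context windows over whitespace-tokenized sentences), not the thesis's actual routines:

```python
# Toy stand-ins for the corpus-specific routines (assumptions, not the thesis's):
def extract_patterns(seeds, corpus, window):
    """Return left-context word windows of each seed occurrence as 'patterns'."""
    patterns = set()
    for sent in corpus:
        for seed in seeds:
            if seed in sent:
                left = tuple(sent.split(seed)[0].split()[-window:])
                if left:
                    patterns.add(left)
    return patterns

def apply_patterns(patterns, corpus):
    """Return the word following each pattern occurrence as a candidate instance."""
    found = set()
    for sent in corpus:
        words = sent.split()
        for pat in patterns:
            n = len(pat)
            for i in range(len(words) - n):
                if tuple(words[i:i + n]) == pat:
                    found.add(words[i + n])
    return found

def bootstrap(seed_instances, corpus, window=3, f_pattern=len, f_instance=len,
              max_iter=10, top_patterns=5, top_instances=5):
    s_seed = set(seed_instances)                          # (1) initial seeds
    p_accepted = set()
    known = set(s_seed)
    for _ in range(max_iter):
        p_cand = extract_patterns(s_seed, corpus, window)     # (2) candidates
        ranked = sorted(p_cand, key=f_pattern, reverse=True)  # (3) score, sort
        p_accepted |= set(ranked[:top_patterns])
        i_cand = apply_patterns(p_accepted, corpus)           # (4) instances
        new = i_cand - known                                  # (5) stop test
        if not new:
            break
        s_seed = set(sorted(new, key=f_instance,              # (6) new seeds
                            reverse=True)[:top_instances])
        known |= s_seed                                       # (7) loop
    return p_accepted, known
```

On a toy corpus such as "the city of Paris is large" etc., seeding with one city name recovers the shared context pattern and then the remaining city names.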

Formula (2-1) evaluates a pattern. NumCommWord(P_j) is the number of common words extracted by pattern P_j, where a common word is one indexed in the dictionary; TotalNumTerm(P_j) is the total number of target items extracted by P_j. The formula scores a pattern by the ratio of the common words it extracts to all the items it extracts: the more common words a pattern extracts, the weaker an indicator of the target items it is, i.e., the lower its recognition accuracy for target items.

Formula (2-2) evaluates an instance. P_i is any pattern that extracted entity ne_j in this iteration, and N is the total number of patterns that extracted ne_j in this iteration. The formula evaluates the reliability of an instance from the reliability of the patterns that extracted it.
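Since formulas (2-1) and (2-2) are not reproduced here, the following is only a reconstruction from the prose descriptions; the exact functional forms (the `1 - ratio` penalty and the mean aggregation) are assumptions:

```python
def pattern_score(pattern_extractions, common_words):
    """Score a pattern as described for formula (2-1): the larger the fraction
    of its extracted items that are common dictionary words, the weaker the
    pattern, so return 1 minus that ratio (the exact form is an assumption)."""
    total = len(pattern_extractions)
    if total == 0:
        return 0.0
    num_common = sum(1 for w in pattern_extractions if w in common_words)
    return 1.0 - num_common / total

def instance_score(extracting_pattern_scores):
    """Score an instance as described for formula (2-2): aggregate the
    reliability of the N patterns that extracted it (mean used here as an
    assumption)."""
    if not extracting_pattern_scores:
        return 0.0
    return sum(extracting_pattern_scores) / len(extracting_pattern_scores)
```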

Pattern generalization

It is generally achieved by relaxing the pattern's constraints, for example by shortening the pattern length or by replacing word forms with part-of-speech or semantic tags.

Hard patterns and hard matching: if a pattern's form is fixed and exact matching is required during pattern matching, the pattern is called a hard pattern and the exact matching is called hard matching. For example, the pattern set extracted earlier is a hard pattern set. (Like a regular expression?)

Soft patterns and soft matching: if a pattern's form is flexible and exact matching is not required during pattern matching, it is called a soft pattern and the corresponding matching is called soft matching. The form of a soft pattern is:

<token_{-L,i}, w_{-L,i}> ... <token_{-1,i}, w_{-1,i}> INTEREST_CLASS <token_{+1,i}, w_{+1,i}> ... <token_{+L,i}, w_{+L,i}>

Here token_{-l,i} denotes any information that may appear in slot -l, such as the word form, part of speech, or semantic category, and w_{-l,i} is a weight indicating the importance of token_{-l,i}.

Like a hard pattern, a soft pattern consists of multiple slots, and the token_{l,i} information is similar to that of hard patterns.

The main differences between soft patterns and hard patterns are:

(1) Each slot carries weight information w_{l,i}, which indicates the importance of token_{l,i}. The definition of w_{l,i} varies with the application; it can be a probability, a similarity, an error rate, and so on.

(2) Each slot of a hard pattern is extended into a bag of words (BOW) in the soft pattern: multiple words may appear in each slot, each with a different weight.

(3) Pattern matching differs. A hard pattern requires hard matching: every slot must match exactly. Because a soft pattern carries weights, soft matching, i.e., fuzzy matching, can be achieved through similarity or probability calculations.

w_{l,i} = P(token_{l,i}) = Num(token_{l,i}) / TotalNum(token_in_slot_l)

Here Num(token_{l,i}) is the number of occurrences of token_{l,i} in slot l, and TotalNum(token_in_slot_l) is the total number of tokens in slot l. These parameters can be estimated from the hard pattern set.
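Estimating the slot weights from a hard pattern set, as the formula above describes, can be sketched as follows (the example patterns are invented):

```python
from collections import Counter

def slot_weights(hard_patterns, slot_index):
    """Estimate soft-pattern slot weights w_{l,i} = Num(token) / TotalNum
    from a set of hard patterns. Each hard pattern is a tuple of tokens;
    slot_index picks the slot l."""
    counts = Counter(p[slot_index] for p in hard_patterns)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

patterns = [("the", "city", "of"), ("the", "town", "of"), ("a", "city", "of")]
w0 = slot_weights(patterns, 0)  # weights for the first slot
```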

Formula (2-4) computes the uni-gram joint probability; formula (2-5) computes the bi-gram joint probability.


Conflict Arbitration

(1) Prefer the candidate with the much higher joint probability. If the joint probability of candidate entity A is much greater than that of candidate entity B, then A is the final recognition result. When formula (2-4) is used to compute the joint probability, the negative logarithm of the probability value is taken; if the two values differ by more than 2, one is considered "much greater".

(2) Prefer the larger sum of the joint probability and the binary co-occurrence probability. If rule (1) does not decide, compute for candidate entities A and B the sum of the co-occurrence probability and the joint probability; the candidate with the larger sum is the final recognition result.

(3) Prefer the longer entity. If rules (1) and (2) do not decide, the longer candidate is the final recognition result.
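The three rules can be sketched as a cascade; the candidate field names (`neg_log_joint`, `bigram`, `length`) and the way rule (2) combines its scores are illustrative assumptions:

```python
def arbitrate(a, b, margin=2.0):
    """Arbitrate between candidate entities a and b, each a dict with
    'neg_log_joint' (negative log of the joint probability, so SMALLER
    means more probable), 'bigram' (binary co-occurrence probability),
    and 'length'. Field names are assumptions for illustration."""
    # Rule (1): prefer the much larger joint probability
    # ("much larger" = the negative logs differ by more than 2).
    diff = b["neg_log_joint"] - a["neg_log_joint"]
    if abs(diff) > margin:
        return a if diff > 0 else b
    # Rule (2): prefer the larger sum of joint and co-occurrence scores.
    score_a = -a["neg_log_joint"] + a["bigram"]
    score_b = -b["neg_log_joint"] + b["bigram"]
    if score_a != score_b:
        return a if score_a > score_b else b
    # Rule (3): prefer the longer entity.
    return a if a["length"] >= b["length"] else b
```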

Converting soft patterns into feature vectors

Conflict Arbitration

(1) Prefer the candidate with the higher similarity. If the similarity of candidate entity A is greater than that of candidate entity B, A is the final recognition result.

(2) Prefer the longer entity. If rule (1) does not decide, the longer candidate is the final recognition result.

Cosine is used as the similarity measure.
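Cosine similarity over the feature vectors can be computed as:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```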

Unbalanced data

Because the ACE corpus is small and its category distribution is unbalanced, the proposed event detection and classification method must cope with this imbalance. Many researchers have tackled the data skew problem: some reduce the number of negative examples to obtain more balanced data (Z. Zheng, X. Wu, R. Srihari. Feature selection for text categorization on imbalanced data. SIGKDD Explorations, 2004, 6(1): 80-89); some convert the problem into a classification problem unaffected by the category distribution (Su Jinshu, Zhang Bofeng, Xu Xin. Research progress of machine-learning-based text classification technology. Journal of Software, 2006, 17(9): 1848-1859); and some argue that, on unbalanced data, feature selection matters more than the classification algorithm (G. Forman. A pitfall and solution in multi-class feature selection for text classification. Proceedings of the 21st International Conference on Machine Learning (ICML 2004), Banff, Canada, Morgan Kaufmann, 2004: 38-46). Here we try to use a good feature selection policy to overcome data imbalance and complete event detection and classification.
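The "reduce the number of counterexamples" strategy mentioned above can be sketched as simple random undersampling; the target ratio and the fixed random seed are illustrative choices:

```python
import random

def undersample(examples, ratio=1.0, seed=0):
    """Randomly drop negatives until negatives ~= ratio * positives.
    `examples` is a list of (features, label) with label 1 (event) or 0.
    A simple stand-in for the counterexample-reduction strategy."""
    rng = random.Random(seed)
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    keep = min(len(neg), int(ratio * len(pos)))
    balanced = pos + rng.sample(neg, keep)
    rng.shuffle(balanced)
    return balanced
```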

Sentence Representation

In natural language processing, text representation models mainly include the Boolean model, the vector space model, the latent semantic model, probability models, and n-gram models.

If the feature items are words, the vector corresponding to a text is also called a bag of words (BOW). Many studies have shown that representations more complex than BOW (for example, using phrases as feature items) do not effectively improve classifier performance, so BOW has become the standard text representation in NLP.
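A bag-of-words representation over a toy two-document collection (the documents are invented):

```python
def bow_vectors(documents):
    """Build bag-of-words count vectors over a shared sorted vocabulary."""
    vocab = sorted({w for doc in documents for w in doc.split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for doc in documents:
        v = [0] * len(vocab)
        for w in doc.split():
            v[index[w]] += 1
        vectors.append(v)
    return vocab, vectors

vocab, vecs = bow_vectors(["the bomb exploded", "the attack the city"])
```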

Feature Selection

In practice, the dimensionality of the feature space is usually very high. Excessive dimensionality not only slows the classifier but can also lead to overfitting, and not every feature in the feature space contributes significantly to classification. Reducing the dimensionality of the feature space by an effective method is therefore particularly important. The main dimensionality reduction approaches are feature selection and feature extraction. Feature selection chooses features from the original feature set by some method to form a new feature subset; feature extraction generates new features from the original feature set by some strategy to form a new feature set. This thesis uses feature selection to reduce dimensionality and improve classifier performance.

Feature selection methods divide into wrapper and filter approaches; wrapper methods perform best. Popular filter methods in text applications include document frequency (DF), information gain (IG), chi-square (CHI), mutual information (MI), correlation coefficient (CC), and odds ratio (OR).
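As an example of a filter-style selector, the chi-square statistic can be computed from a 2x2 contingency table per (feature, class) pair and used to rank features (the counts below are invented):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for one (feature, class) pair from a 2x2 table:
    n11 = docs with the feature in the class, n10 = with the feature outside
    the class, n01 = without the feature in the class, n00 = neither."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0

def select_features(tables, k):
    """Rank features by chi-square and keep the top k (a filter method)."""
    ranked = sorted(tables, key=lambda f: chi_square(*tables[f]), reverse=True)
    return ranked[:k]
```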

Global feature selection (GFS) selects a common feature set for all categories, shared during recognition. Local feature selection (LFS) performs feature selection for each category separately: different categories use different feature sets, and the selection methods may also differ.

A positive feature (PF) strongly predicts that a sample belongs to a certain category: samples containing the feature largely belong to that category. A negative feature (NF) predicts that a sample is unrelated to a certain category: samples containing it largely do not belong to that category.

Accuracy of lexical, phrase, and syntactic analysis in English

The accuracies of lexical analysis, phrase analysis, and syntactic analysis are 99%, 92%, and 90%, respectively (S. Zhao, R. Grishman. Extracting relations with integrated information using kernel methods. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), Ann Arbor, 2005: 419-426).

Extended features

(1) Cilin (synonym forest) features. (Mei Jiaju. Tongyici Cilin (Synonym Forest). Shanghai Dictionary Publishing House, 1996; HIT Information Retrieval Lab. Tongyici Cilin (Extended Version). 2006, http://ir.hit.edu.cn/)

(2) HowNet features. This feature refers to the definition of the current word in HowNet; it aims to cover similar words through the sememe definitions HowNet provides. Each sememe is assigned a code, yielding a sememe-definition code for each word. (Dong Zhendong, Dong Qiang. HowNet 2005. http://www.keenage.com, 2005)

