Foundations of Natural Language Processing (Part 2)

Last Update:2017-05-15 Source: Internet

Author: User

Tags manual writing svm knowledge base

Named entity recognition

The notion of a named entity arose from the problem of extracting structured information, for example about corporate activities or defense-related activities, from unstructured text such as newspapers. Person names, place names, organization names, times, and numeric expressions are the key elements of that structured information, so they need to be identified and classified in the text. This task is called named entity recognition and classification.

Since the start of the 21st century, statistical methods based on large-scale corpora have become the mainstream of natural language processing. The following summarizes named entity recognition methods based on statistical models.

Named entity recognition based on CRF

CRF-based named entity recognition is simple and convenient and achieves good performance. It is widely used to recognize many types of named entities, such as person names, place names, and organization names, and is arguably the most successful method in named entity recognition.

The basic idea is to preprocess the given text first, then recognize person names, simple place names, and simple organization names, and finally recognize compound place names and compound organization names. Here "compound" means that entities are nested inside one another.

CRF-based named entity recognition is a supervised learning method, so the parameters of the CRF model must be trained on a large-scale annotated corpus.

In the training stage, the first step is to convert the annotations of the word-segmented corpus into named entity sequence labels. The next step is to choose the feature templates. A feature template generally uses the strings and labels within two or three positions of the current position as the model's features. Because different kinds of named entities generally appear in different contexts, different feature templates are generally used for different kinds of entities (such as Chinese, Japanese, European, and Russian names). Feature functions are derived from the features, and different features can be combined.

Once the feature functions are determined, the remaining work is to train the CRF model parameters.

Named entity recognition based on multiple features

Whatever method is used, named entity recognition tries to discover and exploit the contextual features of an entity and the entity's internal features. The features differ only in granularity: some are coarse-grained and some are fine-grained (word-form features). Because coarse-grained and fine-grained features complement each other, both should be taken into account, which led to a multi-feature fusion method for Chinese named entity recognition.

This method recognizes named entities on top of word segmentation and part-of-speech tagging. It consists of four sub-models: the word-form context model, the part-of-speech context model, the word-form entity model, and the part-of-speech entity model.

The word-form context model estimates the probability that an entity is generated in a given word-form context. The part-of-speech context model estimates the probability that an entity is generated in a given part-of-speech context. The word-form entity model estimates the probability that a word string forms an entity, given the entity type. The part-of-speech entity model estimates the probability that a part-of-speech string forms an entity, given the entity type.

System performance is mainly measured by three indicators: precision, recall, and the F-measure. Precision and recall were described in the previous article. The F-measure is defined as F = (β² + 1) × P × R / (β² × P + R), where P is precision, R is recall, and β weights recall relative to precision (β = 1 gives the familiar F1).

The F-measure combines precision and recall into a single score.
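As a concrete illustration, the sketch below computes precision, recall, and the F-measure for a set of predicted entities against a gold standard. It assumes the common definition F = (β² + 1)PR / (β²P + R); the function names and the (start, end, type) tuple representation of an entity are choices made for this example, not something taken from the article.

```python
def precision_recall(predicted, gold):
    """Precision and recall of a set of predicted entities against a gold set."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def f_measure(precision, recall, beta=1.0):
    """F = (beta^2 + 1) * P * R / (beta^2 * P + R); returns 0.0 when undefined."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denominator


# Entities are (start, end, type) tuples; the values are made up for illustration.
gold = {(0, 1, "PER"), (4, 5, "LOC"), (6, 8, "ORG")}
predicted = {(0, 1, "PER"), (2, 3, "LOC")}

p, r = precision_recall(predicted, gold)
print(p, r, f_measure(p, r))
```

Only exact matches on span and type count as correct here; partial-match scoring would need a different comparison.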

POS tagging

Part of speech (POS) is a basic grammatical attribute of a word, often simply called its word class. POS tagging is the process of determining the grammatical category of each word in a given sentence and labeling the word with its part of speech. It is an important basic problem in Chinese language processing.

POS tagging based on statistical models

POS tagging can be implemented with an HMM. The key problem in HMM-based POS tagging is parameter estimation, which is the third basic problem of HMMs. We could initialize all HMM parameters randomly, but that would leave the tagging problem too weakly constrained.

Therefore, dictionary information is usually used to constrain the model parameters. Suppose the output symbol table consists of words (that is, the word sequence is the HMM's observation sequence). If a "word, POS tag" pair is not in the dictionary, the probability that the tag generates the word is set to 0; otherwise, it is set to the reciprocal of the number of POS tags the word can take. Writing T(wl) for the number of tags allowed for word wl, the initial value is b*j.l = 0 if tag tj is not allowed for wl, and b*j.l = 1 / T(wl) otherwise.

Next, the generation probabilities are estimated from the training corpus. For the probability bj.l that POS tag tj generates word wl, the numerator is the initial value b*j.l multiplied by C(wl), the number of occurrences of wl in the training corpus. The denominator is the sum, over all words wm in the training corpus, of b*j.m multiplied by C(wm). That is, bj.l = b*j.l C(wl) / Σm b*j.m C(wm).

Why not use the word counts alone? My personal understanding is this. First, as noted above, some words cannot be generated by a particular POS tag, so their product is 0 and they drop out of the sum. Second, taking into account the number of POS tags a word may have means that, when the generation probability is computed, words with fewer possible tags are favored and receive a larger share of the probability.
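The dictionary-constrained estimate can be sketched in a few lines. The sketch assumes the usual form of this initialization: the starting value for a (tag, word) pair is 0 if the dictionary does not allow the pair and 1 / T(w) otherwise, where T(w) is the number of tags allowed for w, and the estimate is that value times the word count C(w), normalized over all words for the tag. The function name, the toy dictionary, and the counts are hypothetical.

```python
def init_emissions(lexicon, word_counts):
    """Estimate P(word | tag) from a tag dictionary and raw word counts.

    lexicon:     word -> set of tags the dictionary allows for that word
    word_counts: word -> C(w), occurrences of the word in the training corpus
    """
    tags = sorted({tag for allowed in lexicon.values() for tag in allowed})
    emissions = {}
    for tag in tags:
        # Starting value times C(w): C(w) / T(w) when the dictionary allows the
        # pair; words the tag cannot generate are left out (probability 0).
        weighted = {
            word: count / len(lexicon[word])
            for word, count in word_counts.items()
            if tag in lexicon.get(word, ())
        }
        total = sum(weighted.values())
        emissions[tag] = (
            {word: value / total for word, value in weighted.items()} if total else {}
        )
    return emissions


# Toy data, made up for illustration.
lexicon = {"the": {"DT"}, "book": {"NN", "VB"}, "dog": {"NN"}}
word_counts = {"the": 10, "book": 4, "dog": 6}

emissions = init_emissions(lexicon, word_counts)
print(emissions["NN"])  # "book" has two possible tags, so only half its count goes to NN
```

In this toy case "dog", which can only be a noun, ends up with a larger share of the NN probability than its raw count alone would give it relative to "book".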

Another strategy is to partition the vocabulary into a number of equivalence classes and estimate parameters per class. This avoids adjusting parameters for individual words and greatly reduces the total number of parameters.

Once initialization is complete, the model can be trained with the HMM forward-backward algorithm.

Note also that, because probabilities differ between corpora from different domains, the HMM parameters should change with the corpus.

This raises a problem: after a new corpus is added to the original training corpus, the model parameters need to be readjusted. Under classical HMM theory, an HMM that has already been trained is of little further use at that point, so we would like the old and new corpora to contribute together when new data arrives.

Only a small modification to the forward-backward algorithm is needed. Recall that the forward-backward algorithm computes expected values (such as the expected number of transitions) under the current model and then re-estimates the model parameters π, aij, and bj(k) from those expected values, repeating until convergence. Instead of saving only the trained π, aij, and bj(k), we also save the result of the preceding step, namely the expected values. When a new corpus is introduced, we add the expected values of the original model to those obtained by training on the new corpus. The combined values reflect both the old and new corpora, and the new model is computed from them. This also solves the problem of reusing the old corpus.
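A minimal sketch of the bookkeeping involved, assuming the expected transition counts have already been produced by the expectation step of forward-backward on each corpus (that step is not shown). Only the transition matrix aij is handled; π and bj(k) would be treated the same way. The function names and the numbers are hypothetical.

```python
def merge_expected_counts(old_counts, new_counts):
    """Add the saved expected counts of the old corpus to those of the new corpus."""
    return [
        [old + new for old, new in zip(old_row, new_row)]
        for old_row, new_row in zip(old_counts, new_counts)
    ]


def reestimate_transitions(expected_counts):
    """Turn expected transition counts into probabilities by normalizing each row."""
    transitions = []
    for row in expected_counts:
        total = sum(row)
        transitions.append([count / total if total else 0.0 for count in row])
    return transitions


# expected_counts[i][j]: expected number of transitions from state i to state j.
# The values are made up for illustration.
old_counts = [[3.0, 1.0], [2.0, 2.0]]
new_counts = [[1.0, 3.0], [0.0, 4.0]]

merged = merge_expected_counts(old_counts, new_counts)
print(reestimate_transitions(merged))
```

Because the counts are additive, the old corpus never has to be reprocessed; only its saved expected counts are needed.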

Rule-based POS tagging method

Rule-based POS tagging is an early approach. Its basic idea is to construct POS disambiguation rules from the collocation between multi-category words (words with several possible parts of speech) and their context. Early rules were generally written by hand.

However, as corpora grew larger, extracting rules manually became unrealistic, so automatic rule extraction based on machine learning was proposed.

The basic idea of rule-based, error-driven machine learning is to first tag the unlabeled text with an initial-state tagger, producing annotated text. The annotated text is then compared with the correctly annotated reference text, and the learner learns rules from the errors. This yields an ordered rule set that corrects the annotated text and brings the tagging result closer to the reference.

In each round, the learner searches all candidate rules for the one that most reduces the number of errors in the annotated text, adds it to the rule set, uses it to adjust the annotated text, and then re-scores the annotated corpus (counts the errors). The process repeats until no new rule reduces the number of errors in the annotated corpus. The final rule set is the learned result.

This approach is much faster than writing rules by hand, but learning still takes a long time. An improvement is to adjust, in each iteration, only the small subset of rules that are affected, instead of searching all transformation rules. Each time a newly obtained rule is applied to the training corpus, only a few POS tags in the corpus change, and only at those positions are the scores of the related rules affected.
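The basic learning loop can be sketched as follows. This is a deliberately small illustration under stated assumptions: it uses a single hypothetical rule template ("change tag a to b when the previous tag is z"), searches all candidate rules exhaustively in every round, and does not implement the incremental speed-up described above. The tag names and data are made up.

```python
def apply_rule(tags, rule):
    """Apply one rule (from_tag, to_tag, prev_tag) to a tag sequence."""
    from_tag, to_tag, prev_tag = rule
    result = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            result[i] = to_tag
    return result


def count_errors(tags, gold):
    return sum(tag != reference for tag, reference in zip(tags, gold))


def learn_rules(initial_tags, gold, max_rules=10):
    """Greedily learn an ordered rule list that moves initial_tags toward gold."""
    current = list(initial_tags)
    tagset = sorted(set(initial_tags) | set(gold))
    learned = []
    for _ in range(max_rules):
        best_rule, best_errors = None, count_errors(current, gold)
        for from_tag in tagset:
            for to_tag in tagset:
                if from_tag == to_tag:
                    continue
                for prev_tag in tagset:
                    rule = (from_tag, to_tag, prev_tag)
                    errors = count_errors(apply_rule(current, rule), gold)
                    if errors < best_errors:
                        best_rule, best_errors = rule, errors
        if best_rule is None:  # no rule reduces the error count any further
            break
        learned.append(best_rule)
        current = apply_rule(current, best_rule)
    return learned, current


# Output of a hypothetical initial tagger versus the reference annotation.
initial_tags = ["TO", "NN", "DT", "NN"]
gold_tags = ["TO", "VB", "DT", "NN"]

rules, corrected = learn_rules(initial_tags, gold_tags)
print(rules, corrected)
```

The learned list is ordered: at tagging time the rules would be applied in the same order in which they were learned.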

POS tagging combining statistical and rule-based methods

One proposed approach works as follows. The initial tagging result for a Chinese sentence first goes through rule-based disambiguation, which removes the most common and most obvious ambiguities in Chinese. It then goes through statistical disambiguation, which handles the remaining multi-category words and infers tags for unknown words. Finally, manual proofreading produces the correct tagging result. Manual proofreading can also reveal problems in the system so that they can be fixed.

However, this approach has a serious problem: the reliability of the statistical results is unknown, so all statistical disambiguation results must always be proofread manually. This led to a new way of combining statistical and rule-based POS tagging.

The new method computes the probability of a word being tagged with each of its possible parts of speech and uses it to assign a confidence to each statistical tagging result. The whole corpus is first tagged statistically. Then only the results whose confidence is below a threshold, or whose error probability is above a threshold, are proofread manually and disambiguated with rules.

Consistency check and automatic proofreading for POS tagging

In corpus construction, consistency checking and automatic proofreading of POS tags are indispensable steps.

In general, there are two kinds of POS inconsistency in a corpus. In the first, the word is not a multi-category word in the lexicon and has only one POS tag there, but it is tagged with different parts of speech in the corpus. In the second, the word is a multi-category word in the lexicon, so different POS tags are allowed, but it is tagged differently in the same context within the annotated corpus.

The first case is easy to handle. For the second, a consistency check based on clustering and classification can be used. Its basic premise is that the same word in similar contexts should have the same part of speech. From the training corpus, we compute, for each multi-category word and each of its possible POS tags, the mean VA of the context vectors of the word's occurrences with that tag. We then compute the distance between the context vector of an occurrence being checked and the corresponding VA. If the distance from VA is greater than a threshold H, the POS tag is considered possibly inconsistent.
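A small sketch of this check for a single multi-category word. The article does not say how context vectors are built or which distance is used, so the two-dimensional vectors and the Euclidean distance here are assumptions made only for illustration, as are the function names and the threshold value.

```python
import math


def tag_centroids(training_occurrences):
    """Mean context vector (VA) per POS tag, from (tag, context_vector) pairs."""
    by_tag = {}
    for tag, vector in training_occurrences:
        by_tag.setdefault(tag, []).append(vector)
    return {
        tag: [sum(column) / len(vectors) for column in zip(*vectors)]
        for tag, vectors in by_tag.items()
    }


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def flag_inconsistent(occurrences, centroids, threshold):
    """Indices of occurrences whose context is farther than threshold from VA."""
    return [
        index
        for index, (tag, vector) in enumerate(occurrences)
        if euclidean(vector, centroids[tag]) > threshold
    ]


# Occurrences of one multi-category word; the vectors are made up.
training = [("NN", [1.0, 0.0]), ("NN", [1.0, 0.2]), ("VB", [0.0, 1.0]), ("VB", [0.2, 1.0])]
to_check = [("NN", [0.9, 0.1]), ("NN", [0.0, 1.0]), ("VB", [0.1, 0.9])]

centroids = tag_centroids(training)
suspects = flag_inconsistent(to_check, centroids, threshold=0.5)
print(suspects)
```

The second occurrence is tagged NN but sits in a context that looks like the VB occurrences, so it is the one flagged for review.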

Next is automatic proofreading of POS tags. When a corpus is tagged automatically by computer, errors generally fall into two kinds. In the first, the same situation is handled wrongly everywhere: if one occurrence is wrong, every occurrence in the text is wrong. In the second, only some occurrences are wrong. This is the consistency problem discussed above, for which a solution has already been given.

For the first kind of error, the idea is similar. POS information for multi-category words in specific contexts is extracted from a large-scale training corpus to form a POS proofreading decision table. Averages are no longer used here. For the corpus being proofread, the context of each multi-category word is matched against the contexts in the decision table. If a match is found, the word's context is considered to satisfy the condition in the table, and its POS tag should agree with the table.

Syntactic analysis

The basic task of syntactic analysis is to determine the grammatical structure of a sentence or the dependencies between the words in a sentence. Syntactic analysis is not the ultimate goal of a natural language processing task, but it is often a key step toward that goal.

Syntactic analysis is divided into two types: syntactic structure analysis and dependency analysis. Analysis that aims at the syntactic structure of the whole sentence is called full parsing, analysis that aims only at local constituents is called partial parsing, and dependency analysis is called dependency parsing.

In general, there are three syntactic analysis tasks:

Determine whether the input string belongs to a given language
Resolve lexical and structural ambiguity in the input sentence
Analyze the internal structure of the input sentence, such as its constituents and contextual relations

The second and third tasks are usually the main tasks of syntactic analysis.

In general, building a syntactic parser involves two parts. The first is the formal representation of the grammar and the description of lexical entry information: the formalized grammar rules make up the rule base, the entry information is provided by a dictionary or thesaurus, and together the rule base and the dictionary or thesaurus form the knowledge base for syntactic analysis. The second is the parsing algorithm built on that knowledge base.

Grammar formalization belongs to the field of syntactic theory. The formalisms widely used in natural language processing include context-free grammar (CFG) and constraint-based grammar, the latter also called unification grammar.

Simply put, syntactic structure analysis methods fall into two categories: rule-based methods and statistics-based methods.

The basic idea of rule-based syntactic structure analysis is to build a grammar knowledge base from manually written rules and to resolve syntactic structure ambiguity through conditional constraints and checks.

Depending on the direction in which the parse tree is built, these methods are usually divided into three types: top-down, bottom-up, and a combination of the two. A top-down algorithm carries out rule derivation: the parse tree grows from the root node and finally forms the leaf nodes of the sentence being analyzed. A bottom-up algorithm works the opposite way: it starts from the sentence's symbol string, performs successive reductions, and finally forms the root node.

Rule-based syntactic structure analysis can use hand-written rules to analyze all possible syntactic structures of an input sentence. For specific domains and purposes, targeted rules can handle some of the ambiguity and some of the extra-grammatical phenomena in sentences.

However, for a medium-length input sentence, analyzing all possible structures with a broad-coverage grammar is very difficult, and even when it can be done, it is hard to disambiguate effectively and choose the most likely analysis. Hand-written rules are somewhat subjective and must also generalize, so accuracy is hard to guarantee in complex contexts. Writing rules by hand is itself laborious, and the rules are closely tied to the domain they were written for, which makes it hard to port a parser to other domains.

Rule-based parsing algorithms handle the compilation of programming languages successfully, but have always struggled with natural language. One reason is that programming languages are strictly restricted to subclasses of context-free grammar, whereas the formal descriptions used in natural language processing systems go far beyond context-free grammar. Another is that when people use a programming language, every expression must obey the machine's requirements: it is a process of people obeying the machine, a mapping from an infinite set of language onto a finite set. Natural language processing is the opposite: the machine must follow and obey human language, a deduction from a finite set toward an infinite set.

Full parsing: a basic analysis method based on PCFG

Phrase structure analysis based on probabilistic context-free grammar (PCFG) is arguably the most successful grammar-driven statistical parsing method. It can be seen as a combination of rule-based and statistical methods.

PCFG is an extension of CFG in which every production rule carries a probability.

(Figure: an example PCFG.)

The probabilities of all productions that expand the same symbol sum to 1. NP is a noun phrase, VP is a verb phrase, and PP is a prepositional phrase.

A PCFG-based parsing model satisfies the following three conditions:

Position invariance: the probability of a subtree does not depend on where in the sentence the words it dominates are located
Context independence: the probability of a subtree does not depend on words outside the span the subtree dominates
Ancestor independence: the probability of a subtree does not depend on the ancestor nodes from which the subtree was derived

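According to the grammar above, "He met Jenny with flowers" has two possible syntactic structures, which differ in whether the prepositional phrase "with flowers" attaches to the verb phrase or to the noun phrase "Jenny".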

Multiplying all the rule probabilities in each tree gives the overall probability of each of the two trees, and the tree with the higher probability is chosen as the best structure.
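To make the computation concrete, the sketch below scores the two readings of "He met Jenny with flowers" under a toy grammar. The grammar and its probabilities are hypothetical, chosen only so that the rules for each left-hand symbol sum to 1; they are not the grammar from the original figure. Trees are written as nested tuples, which is a representation chosen for this example.

```python
from math import isclose

# Hypothetical toy PCFG: (left-hand side, right-hand side) -> probability.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("VP", "PP")): 0.3,
    ("NP", ("NP", "PP")): 0.2,
    ("NP", ("He",)): 0.3,
    ("NP", ("Jenny",)): 0.3,
    ("NP", ("flowers",)): 0.2,
    ("PP", ("P", "NP")): 1.0,
    ("V", ("met",)): 1.0,
    ("P", ("with",)): 1.0,
}


def is_normalized(rules):
    """Check that the rule probabilities of every left-hand symbol sum to 1."""
    totals = {}
    for (lhs, _), probability in rules.items():
        totals[lhs] = totals.get(lhs, 0.0) + probability
    return all(isclose(total, 1.0) for total in totals.values())


def tree_probability(tree, rules):
    """Multiply the probabilities of all rules used in a tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return rules[(label, (children[0],))]  # lexical rule such as NP -> He
    rhs = tuple(child[0] for child in children)
    probability = rules[(label, rhs)]
    for child in children:
        probability *= tree_probability(child, rules)
    return probability


PP = ("PP", ("P", "with"), ("NP", "flowers"))
# Reading 1: "with flowers" modifies the meeting (attached to the VP).
vp_attach = ("S", ("NP", "He"), ("VP", ("VP", ("V", "met"), ("NP", "Jenny")), PP))
# Reading 2: "with flowers" modifies "Jenny" (attached to the NP).
np_attach = ("S", ("NP", "He"), ("VP", ("V", "met"), ("NP", ("NP", "Jenny"), PP)))

print(is_normalized(RULES))
print(tree_probability(vp_attach, RULES), tree_probability(np_attach, RULES))
```

Under these made-up probabilities the VP-attachment reading scores higher, so it would be chosen as the best structure; different probabilities could reverse the choice.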

Like HMM, PCFG has three basic problems:

Given a sentence W = w1w2…wn and a grammar G, how can the probability P(W|G) be computed quickly?
Given a sentence W = w1w2…wn and a grammar G, how can the best structure of the sentence be chosen? That is, which syntax tree T has the maximum probability?
Given a PCFG G and a sentence W = w1w2…wn, how can the probability parameters of G be adjusted to maximize the probability of the sentence?

Start with the first problem. For HMMs, the forward and backward algorithms compute the probability of the observation sequence O. Similarly, here the inside algorithm and the outside algorithm compute P(W|G).

First define the inside variable αij(A), which resembles the forward variable but is not the same: αij(A) is the probability that nonterminal A derives the substring wiw(i+1)…wj of W. P(W|G) is then simply α1n(S), where S is the start symbol: the probability that the whole sentence W = w1w2…wn is derived from the start symbol S.

So P(W|G) can be computed once we have a recursive formula for αij(A). The recursion is: αii(A) = P(A → wi), and for i < j, αij(A) = ΣB,C Σ(i ≤ k < j) P(A → B C) αik(B) α(k+1)j(C).

By definition, αii(A) equals the probability that symbol A outputs wi. The idea behind computing αij(A) is that the substring wiw(i+1)…wj can be split into two parts: the first part wiw(i+1)…wk is generated by nonterminal B, the second part w(k+1)w(k+2)…wj is generated by nonterminal C, and B C is generated by A. Multiplying these probabilities in turn splits a large problem into two smaller ones, which can be split further until they can no longer be divided, after which the recursion returns the result.
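The recursion for the inside (inward) variable can be sketched as a bottom-up table fill. The sketch assumes a grammar in Chomsky normal form (only rules A → B C and A → w) and uses a hypothetical toy grammar, not the one from the article's figure. Positions are 1-based to match the αij(A) notation.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form.
BINARY = {  # (A, B, C) -> P(A -> B C)
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 0.7,
    ("VP", "VP", "PP"): 0.3,
    ("NP", "NP", "PP"): 0.2,
    ("PP", "P", "NP"): 1.0,
}
LEXICAL = {  # (A, w) -> P(A -> w)
    ("NP", "He"): 0.3,
    ("NP", "Jenny"): 0.3,
    ("NP", "flowers"): 0.2,
    ("V", "met"): 1.0,
    ("P", "with"): 1.0,
}


def inside_probabilities(words, binary, lexical):
    """Return alpha[(i, j, A)] = P(A derives words i..j), with 1-based positions."""
    n = len(words)
    alpha = defaultdict(float)
    # Base case: alpha_ii(A) = P(A -> w_i)
    for i, word in enumerate(words, start=1):
        for (a, w), probability in lexical.items():
            if w == word:
                alpha[(i, i, a)] += probability
    # Recursive case: sum over rules A -> B C and split points k
    for span in range(2, n + 1):
        for i in range(1, n - span + 2):
            j = i + span - 1
            for k in range(i, j):
                for (a, b, c), probability in binary.items():
                    left = alpha.get((i, k, b), 0.0)
                    right = alpha.get((k + 1, j, c), 0.0)
                    if left and right:
                        alpha[(i, j, a)] += probability * left * right
    return alpha


sentence = "He met Jenny with flowers".split()
alpha = inside_probabilities(sentence, BINARY, LEXICAL)
sentence_probability = alpha[(1, len(sentence), "S")]
print(sentence_probability)
```

With this toy grammar the sentence probability is the sum of the probabilities of its two parse trees, which is what the inside variable for the start symbol over the whole sentence represents.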

(Figure: computing the inside variable.)

This problem can also be solved with the outside algorithm.

First define the outside variable βij(A): the probability that, in the process of deriving the sentence W = w1w2…wn, the start symbol S generates the symbol string w1w2…w(i-1) A w(j+1)…wn (implying that A will generate wiw(i+1)…wj). In other words, βij(A) is the probability of S deriving everything except the subtree rooted at node A.

I believe the book "Statistical Natural Language Processing (2nd edition)" is wrong here, so I give my own understanding below. (Figure: the algorithm steps as given in the book.)

The error is obvious: the final result is already set in the initialization step. What is the algorithm for, then? It might as well just return 1 and stop.

This comes down to how the book's author understands the definition of the outside variable. The definition contains the phrase "implying that A will generate wiw(i+1)…wj". The question is whether "A will generate wiw(i+1)…wj" is a condition or a conclusion.

Look at what the initialization means. It says that β1n(A) is 1 when A = S and 0 when A ≠ S. What does that mean? It means the phrase "implying that A will generate wiw(i+1)…wj" is treated as a condition: β1n(S) already presupposes that S generates W = w1w2…wn, so the surrounding string w1w2…w(i-1) A w(j+1)…wn reduces to nothing but S itself, leaving only S → S, whose probability is naturally 1.

But what does the book's author take it to mean in the third step? There the phrase "implying that A will generate wiw(i+1)…wj" is treated as a conclusion: in β1n(S), that S generates W = w1w2…wn is the conclusion, which happens to be exactly the result we are asking for. That is how the final result ended up being set in the initialization step.

My understanding is as follows. The β1n(S) computed by this formula is in fact correct, in the sense that it treats "implying that A will generate wiw(i+1)…wj" as a conclusion, while the β1n(S) that appears on the right-hand side during the recursion treats the same phrase as a condition. So the calculation itself has no problem.

I would add an asterisk to β1n(S) in the third step to mark the difference in meaning.

The book also gives a diagram for computing the outside variable, which I find baffling. (Figure: computing the outside variable, as given in the book.)

The book says that βij(A) is the sum of the probabilities of two cases. We know that j is larger than i, yet in the figure k is both smaller than i and larger than j, which is absurd. The only sensible reading is that the two C's are not the same C and the two k's are not the same k.

Why do I read it that way? Besides the reuse of the same letters, the book earlier says "a rule of the form B->AC or B->CA must be used" and "the two cases of using the rules B->AC or B->CA", which clearly invites the misreading that only the order has been swapped.

In addition, the notation for the inside variables is used inconsistently. The book's explanation of the outside algorithm is, frankly, unsuccessful. Moreover, computing the outside variables still requires the recursion of the inside algorithm, so one may as well use the inside algorithm directly; the outside algorithm also needs more variables to be defined.
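For reference, the outside recursion in its commonly cited textbook form can be written as below, with α the inside variable and β the outside variable as defined above. This is the usual formulation for a grammar in Chomsky normal form, not a transcription of the book's figure. Note that the two sums run over different ranges of k (one to the right of j, one to the left of i), which matches the reading that the two k's are not the same k.

```latex
\beta_{1n}(A) =
  \begin{cases} 1 & \text{if } A = S \\ 0 & \text{if } A \neq S \end{cases}

\beta_{ij}(A) =
  \sum_{B,C} \sum_{k=j+1}^{n} P(B \to A\,C)\; \alpha_{(j+1)k}(C)\; \beta_{ik}(B)
  \;+\;
  \sum_{B,C} \sum_{k=1}^{i-1} P(B \to C\,A)\; \alpha_{k(i-1)}(C)\; \beta_{kj}(B)

P(W \mid G) = \sum_{A} \beta_{ii}(A)\, P(A \to w_i)
  \qquad \text{for any } 1 \le i \le n
```

In this form the termination step does not return β1n(S) itself; it combines the outside variable of a single word position with the lexical rule probabilities.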

Then there is the second problem: choosing the best structure of a sentence. That is, given a sentence W = w1w2…wn and a grammar G, select the syntax tree with the maximum probability. This problem is similar to its HMM counterpart and is likewise solved by dynamic programming. In the end, the CYK algorithm is used to generate the syntax tree with the maximum probability.
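The dynamic-programming idea can be sketched as a probabilistic (Viterbi-style) variant of CYK: the same table as the inside computation, but keeping the maximum instead of the sum, plus back-pointers to rebuild the tree. The grammar is a hypothetical toy grammar in Chomsky normal form, and the function and variable names are choices made for this example.

```python
# Hypothetical toy grammar in Chomsky normal form.
BINARY = {  # (A, B, C) -> P(A -> B C)
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 0.7,
    ("VP", "VP", "PP"): 0.3,
    ("NP", "NP", "PP"): 0.2,
    ("PP", "P", "NP"): 1.0,
}
LEXICAL = {  # (A, w) -> P(A -> w)
    ("NP", "He"): 0.3,
    ("NP", "Jenny"): 0.3,
    ("NP", "flowers"): 0.2,
    ("V", "met"): 1.0,
    ("P", "with"): 1.0,
}


def best_parse(words, binary, lexical, start="S"):
    """Return (probability, tree) of the most probable parse, or (0.0, None)."""
    n = len(words)
    best = {}  # (i, j, A) -> (probability, back-pointer), 1-based positions
    for i, word in enumerate(words, start=1):
        for (a, w), probability in lexical.items():
            if w == word:
                best[(i, i, a)] = (probability, word)
    for span in range(2, n + 1):
        for i in range(1, n - span + 2):
            j = i + span - 1
            for k in range(i, j):
                for (a, b, c), probability in binary.items():
                    if (i, k, b) in best and (k + 1, j, c) in best:
                        candidate = probability * best[(i, k, b)][0] * best[(k + 1, j, c)][0]
                        if candidate > best.get((i, j, a), (0.0, None))[0]:
                            best[(i, j, a)] = (candidate, (k, b, c))

    def build(i, j, a):
        _, back = best[(i, j, a)]
        if i == j:
            return (a, back)  # leaf: back-pointer is the word itself
        k, b, c = back
        return (a, build(i, k, b), build(k + 1, j, c))

    if (1, n, start) not in best:
        return 0.0, None
    return best[(1, n, start)][0], build(1, n, start)


probability, tree = best_parse("He met Jenny with flowers".split(), BINARY, LEXICAL)
print(probability)
print(tree)
```

With these made-up probabilities the parser returns the reading in which "with flowers" attaches to the verb phrase.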

The third problem: given a PCFG G and a sentence W = w1w2…wn, how can the probability parameters of G be adjusted to maximize the probability of the sentence? Corresponding to the HMM case, the algorithm used for PCFG is called the inside-outside algorithm. Like the forward-backward algorithm, it is an EM algorithm. The basic idea is to assign random probability values to the productions of G (satisfying the normalization condition) to obtain grammar G0. From G0 and the training data, the expected number of times each rule is used can be computed. Maximum likelihood estimation on these expected counts gives new parameter values for G, and the new grammar is denoted G1. As this process is repeated, the parameters of G converge toward a maximum likelihood estimate.

PCFG is just a special kind of context-free grammar model. Actually parsing a sentence with a PCFG model to produce a syntax tree relies on the CYK algorithm. The CYK algorithm is an algorithm for deciding whether any given string W belongs to a context-free grammar.

PCFG-based parsing models have many problems. For example, because PCFG does not model words, it is insensitive to lexical information. Lexicalized phrase-structure parsers were therefore proposed, which effectively improve on PCFG-based parsers.

Furthermore, the three independence assumptions of PCFG mentioned above also leave the rules without structural dependencies on one another (just as the three assumptions of HMM are not entirely reasonable). In natural language, however, the probability of generating each nonterminal is often related to its structural context. So a method of refining nonterminals was proposed: each nonterminal is annotated with the syntactic tag of its parent node.

D. Klein proposed a context-free grammar with latent annotations (PCFG with latent annotations, PCFG-LA), which automates the refinement of nonterminals. To keep optimization with the EM algorithm from getting stuck in a local optimum, a hierarchical "split-merge" strategy was proposed to obtain an accurate and compact PCFG-LA model. The Berkeley Parser, based on PCFG-LA, is a representative unlexicalized parser and, at the time of writing, the best open-source phrase-structure parser in both accuracy and running speed.

(Figure: an ordinary syntax tree compared with a PCFG-LA syntax tree.)

Here x is the latent annotation. The range of values of xi is generally set by hand, typically an integer between 1 and 16. PCFG-LA resembles an HMM: the original nonterminals correspond to the observed outputs of the HMM, and the latent annotations correspond to its hidden states.

The PCFG-LA training procedure is not covered in detail here.

Shallow parsing (partial parsing)

Full parsing must determine all the syntactic information contained in a sentence and the relations between its components, which is a very difficult task. So far, parsers have struggled to reach a satisfactory level in every respect. Shallow parsing arose to reduce the complexity of the problem while still obtaining some syntactic structure information.

Shallow parsing only requires identifying certain structurally simple, relatively independent components of a sentence, such as non-recursive noun phrases and verb phrases. These components are usually called chunks.

Shallow parsing decomposes syntactic analysis into two main subtasks: recognizing and analyzing chunks, and analyzing the dependency relations between chunks. Of these, chunk recognition and analysis is the main task. To some extent, shallow parsing simplifies the parsing task, and it also helps parsing systems be applied quickly in large-scale real-text processing systems.

The base noun phrase (base NP) is an important category of chunk. It refers to a simple, non-nested noun phrase that contains no other sub-phrases, and base NPs are structurally independent of one another.

Base NP recognition means identifying all base NPs in a sentence. Under this view, the components of a sentence fall into just two classes, base NP and non-base NP, so base NP recognition becomes a classification problem.

There are two ways to represent base NPs: the bracket method and the IOB tagging method. The bracket method delimits base NPs with square brackets: what is inside the brackets is a base NP, and what is outside is not. In IOB tagging, B marks the beginning of a base NP, I marks a word inside a base NP, and O marks a word outside any base NP.
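The two representations carry the same information, as a short conversion sketch shows. It assumes the brackets appear as separate whitespace-delimited tokens and follows the convention described above, where every base NP starts with B; the function name and the example sentence are made up for illustration.

```python
def brackets_to_iob(tokens):
    """Convert a bracketed token sequence into (word, IOB tag) pairs."""
    tagged = []
    inside = False          # currently inside a base NP
    at_start = False        # next word is the first word of a base NP
    for token in tokens:
        if token == "[":
            inside, at_start = True, True
        elif token == "]":
            inside = False
        elif inside:
            tagged.append((token, "B" if at_start else "I"))
            at_start = False
        else:
            tagged.append((token, "O"))
    return tagged


bracketed = "[ The cat ] sat on [ the mat ]".split()
print(brackets_to_iob(bracketed))
```

With the IOB form, base NP recognition becomes a per-word tagging task, which is what makes classifiers such as SVM, Winnow, or CRF directly applicable.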

Base NP recognition based on SVM

Because base NP recognition is a multi-class classification problem, while the basic SVM algorithm solves binary classification, a pairwise strategy or a one-vs-rest strategy is generally adopted.

SVM generally extracts features from the words, parts of speech, and base NP tags in the context to make its decision. A word window of length 5 (the current word plus the two words before and after it) is commonly used and gives the best recognition results.

Base NP recognition method based on Winnow

Winnow is an error-driven machine learning method for binary classification that can learn quickly in the presence of a large number of irrelevant features.

SNoW (Sparse Network of Winnows) is a multi-class classifier architecture designed for large-scale learning tasks. The Winnow algorithm can handle high-dimensional, independent feature spaces, and feature vectors in natural language processing have exactly this property, so Winnow is often used for POS tagging, spelling error detection, text categorization, and similar tasks.

The basic idea of simple Winnow is as follows. Given a feature vector, a parameter (weight) vector, and a real-valued threshold θ, first initialize all weights to 1. For each training sample, compute the inner product of the feature vector and the weight vector and compare it with θ: if it is greater than θ, the sample is judged positive; if it is less than θ, it is judged negative. The weights are then changed according to whether the judgment was correct.

If a positive example is judged negative, the weights of the features x whose value is 1 are increased. If a negative example is judged positive, the weights of the features x whose value is 1 are decreased. The updated weights are then used for the next estimate, and this repeats until training is complete.
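The update rule can be sketched as follows. The article does not give the update factor or the threshold, so the multiplicative factor of 2 and the default threshold of half the number of features are assumptions, as are the function names and the toy data; a score exactly equal to θ is treated as negative here.

```python
def predict(weights, theta, features):
    """Positive if the inner product of weights and features exceeds theta."""
    return sum(w * x for w, x in zip(weights, features)) > theta


def train_winnow(samples, n_features, theta=None, factor=2.0, epochs=10):
    """Train on (binary feature vector, boolean label) pairs."""
    theta = n_features / 2 if theta is None else theta
    weights = [1.0] * n_features  # all weights start at 1
    for _ in range(epochs):
        mistakes = 0
        for features, label in samples:
            if predict(weights, theta, features) == label:
                continue
            mistakes += 1
            # Promote active features on a missed positive, demote on a false positive.
            scale = factor if label else 1.0 / factor
            weights = [w * scale if x == 1 else w for w, x in zip(weights, features)]
        if mistakes == 0:
            break
    return weights, theta


# Toy target concept: positive when feature 0 or feature 1 is active.
samples = [
    ([1, 0, 0, 0], True),
    ([0, 1, 0, 0], True),
    ([0, 0, 1, 1], False),
    ([0, 0, 1, 0], False),
    ([1, 1, 0, 0], True),
    ([0, 0, 0, 1], False),
    ([0, 0, 0, 0], False),
]

weights, theta = train_winnow(samples, n_features=4)
print(weights, theta)
```

Only the weights of features that were active in a misclassified sample ever change, which is why irrelevant features have little effect on learning.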

This reminds me of logistic regression (LR), which also takes the inner product of the feature vector and the parameter vector and then feeds it to the sigmoid function to obtain the decision: an output above 0.5 is a positive example and an output below 0.5 is a negative example. Read in reverse, the sigmoid input at which the output equals 0.5 plays the role of the real-valued threshold θ in Winnow. The difference is that Winnow only compares magnitudes and does not produce a probability, whereas LR uses the sigmoid function to give one. LR uses this probability to adjust its parameters by maximizing the likelihood of the training set, while Winnow simply increases or decreases the relevant parameters whenever it makes a mistake. It would seem that LR, because it uses gradient descent, converges faster than Winnow, while Winnow's advantage is that it can handle a very large number of features.

A method of base NP recognition based on CRF

The CRF-based base NP recognition method performs almost as well as the SVM-based method and better than the Winnow-based, MEMM-based, and perceptron-based methods. It also has a clear advantage over the other methods in running speed.

Dependency grammar theory

In natural language processing, we sometimes do not need the complete phrase structure tree of a sentence, or we need more than that: we also want to know the dependency relations between the words of the sentence. The framework that describes the structure of a language through dependency relations between words is called dependency grammar, also known as subordination grammar. Syntactic analysis with dependency grammar is one of the important means of natural language understanding.

Some hold that all structural syntactic phenomena can be reduced to three core notions: connection, junction, and transfer. Syntactic connection establishes the dependency relation between words; each connection links a governing word and a subordinate word. The verb in the predicate is the center of the sentence: it governs the other elements and is itself governed by no other element.

Dependency grammar is in essence a structural grammar. It mainly studies the conditions under which a deep semantic structure, centered on the predicate, is realized as a surface syntactic structure, as well as the relations between the predicate and nominals, and it classifies verbs accordingly.

There are three commonly used ways to represent a dependency structure:

The computational linguist J. Robinson proposed four axioms of dependency grammar:

Only one element of a sentence is independent.
Every other element of the sentence depends directly on some element.
No element may depend directly on two or more elements.
If element A depends directly on element B, and element C lies between A and B in the sentence, then C depends on A, on B, or on some element between A and B.

These four axioms amount to formal constraints on dependency graphs and dependency trees: single-headedness, connectedness, acyclicity, and projectivity. Together they guarantee that the dependency analysis of a sentence is a rooted tree.

A note on projectivity: if the dependency arcs between the words can be drawn without any crossing, the structure is projective (see the two figures above).
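
These constraints are easy to check mechanically. The sketch below assumes a common encoding that is not taken from the original text: `heads[i - 1]` holds the position of the head of word `i` (words are numbered from 1), and 0 marks the single independent element. For simplicity the crossing test looks only at arcs between words and ignores the arc from the artificial root.

```python
def is_well_formed(heads):
    """Single root, single head per word, and no cycles."""
    n = len(heads)
    if sum(1 for h in heads if h == 0) != 1:
        return False                  # exactly one independent element
    if any(h < 0 or h > n for h in heads):
        return False
    for start in range(1, n + 1):
        seen, node = set(), start
        while node != 0:              # walk up toward the root
            if node in seen:
                return False          # cycle: the walk never reaches the root
            seen.add(node)
            node = heads[node - 1]
    return True


def is_projective(heads):
    """True if no two dependency arcs cross when drawn above the sentence."""
    arcs = [(min(h, d), max(h, d))
            for d, h in enumerate(heads, start=1) if h != 0]
    return not any(a < c < b < d for a, b in arcs for c, d in arcs)


projective = [2, 0, 4, 2]      # e.g. "She saw the dog": saw is the root
crossing = [3, 4, 0, 3]        # arcs 1-3 and 2-4 cross
cyclic = [2, 1, 0]             # words 1 and 2 depend on each other
```

Because every word stores exactly one head, the single-head axiom holds by construction; the functions only need to test the root, cycle, and crossing conditions.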

For ease of understanding, Chinese scholars have proposed five conditions that a dependency tree should satisfy:

Pure node condition: the tree has only terminal nodes and no non-terminal nodes.
Single parent condition: apart from the root, which has no parent, every node has exactly one parent.
Single root condition: a dependency tree has exactly one root node, which governs all the other nodes.
Non-crossing condition: the branches of a dependency tree must not cross one another.
Mutual exclusion condition: the top-down dominance relation and the left-to-right precedence relation are mutually exclusive; if two nodes stand in a dominance relation, they cannot also stand in a precedence relation.

These five conditions overlap to some extent, but they are stated entirely in terms of the spatial structure of the dependency representation, which makes them more intuitive and practical than the four axioms.

Gaifman (1965) gave a formal representation of dependency grammar and proved that dependency grammar is no different from context-free grammar.

Formalisms such as context-free grammar restrict the language being analyzed to projective structures, so they have difficulty handling languages with non-projective phenomena directly. In the 1990s, constraint grammar and the corresponding constraint-satisfaction-based dependency analysis methods were developed, which can handle such non-projective languages.

The constraint-satisfaction-based analysis method is built on constraint dependency grammar. It treats dependency parsing as a finite construction problem that can be described as a constraint satisfaction problem.

Constraint dependency grammar uses a series of formal, descriptive constraints to eliminate dependency analyses that violate them, until a valid dependency tree remains.

Generative dependency analysis, discriminative dependency analysis, and deterministic dependency analysis are three representative methods in data-driven statistical dependency parsing.

Generative Dependency Analysis method

The generative dependency analysis method uses a joint probability model to generate a series of dependency trees and assign them probability scores, then uses a search algorithm to find the analysis with the highest probability score as the final output.

The generative dependency analysis model is convenient to use: training its parameters only requires counting the relevant components in the corpus and computing prior probabilities. However, because the method uses a joint probability model, approximating assumptions and estimates must be made when the probability is decomposed into a product. Moreover, because it uses global search, the algorithm has high complexity and therefore low efficiency, although this kind of algorithm has a certain advantage in accuracy. Inference similar to the CYK algorithm also makes it hard for this kind of model to handle non-projective problems.

Discriminative dependency analysis method

The discriminative dependency analysis method uses a conditional probability model, which avoids the independence assumptions required by a joint probability model (compare how the discriminative model CRF drops the independence assumptions of the generative model HMM). Training looks for the parameters θ that maximize the objective function (the probability of the training samples), similar to logistic regression and CRF.

The discriminative method not only performs exhaustive search during inference but also has a globally optimal training algorithm. Training has to parse the training instances repeatedly to iterate the parameters, so the training process is itself an inference process, and training and parsing have the same time complexity.

Deterministic dependency analysis method

The deterministic dependency analysis method takes the words to be analyzed one at a time in a fixed direction and produces a single analysis result for each word it reads, until the last word of the sequence.

At every step, this kind of algorithm makes a decision based on the current analysis state (for example, whether the current word depends on the previous word), so it is also called decision-based analysis.

The basic idea of deterministic parsing is to obtain a unique syntactic representation, the dependency graph, through one definite sequence of parsing actions (sometimes with backtracking and repair).
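
To make the idea of an action sequence concrete, here is a minimal sketch that replays a fixed list of actions over a stack and a buffer. The action names (`SHIFT`, `LEFT-ARC`, `RIGHT-ARC`) follow the widely used arc-standard transition system, which is chosen here as an assumption; the text does not name a specific system. In a real parser a classifier would pick each action from the current state, whereas here the sequence is simply given.

```python
def parse_with_actions(words, actions):
    """Replay arc-standard actions; return ({dependent: head}, stack, buffer)."""
    stack, buffer, heads = [], list(range(len(words))), {}
    for action in actions:
        if action == "SHIFT":          # move the next word onto the stack
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":     # second-from-top depends on top
            dep = stack.pop(-2)
            heads[dep] = stack[-1]
        elif action == "RIGHT-ARC":    # top depends on second-from-top
            dep = stack.pop()
            heads[dep] = stack[-1]
        else:
            raise ValueError(f"unknown action: {action}")
    return heads, stack, buffer


words = ["She", "saw", "the", "dog"]
actions = ["SHIFT", "SHIFT", "LEFT-ARC",      # She <- saw
           "SHIFT", "SHIFT", "LEFT-ARC",      # the <- dog
           "RIGHT-ARC"]                       # dog <- saw
heads, stack, buffer = parse_with_actions(words, actions)
```

Each word is read once and each action is applied once, so the sequence yields exactly one dependency graph; the word left on the stack at the end is the root.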

The relationship between phrase structure and dependent structure

A phrase structure tree can be converted into a dependency tree in a one-to-one fashion, but not the other way around, because one dependency tree may correspond to several phrase structure trees.

Semantic analysis

For different language units, the task of semantic analysis is different.

At the word level, the basic task of semantic analysis is word sense disambiguation (WSD).

At the sentence level, it is semantic role labeling (SRL).

At the discourse level, it is reference disambiguation, also called coreference resolution.

Word sense disambiguation

Words are the smallest language units that can be used independently. The meaning of each word in a sentence, together with the way the words interact in a particular context, makes up the meaning of the whole sentence. Word sense disambiguation is therefore the basis of sentence-level and discourse-level semantic understanding. It is sometimes also called word sense tagging. Its task is to determine the specific sense of a polysemous word in a given context.

Word sense disambiguation methods are divided into supervised and unsupervised methods. In supervised methods the training data is labeled, that is, the sense of each word occurrence is annotated; in unsupervised methods the training data is unlabeled.

Identifying the sense of a polysemous word is really a problem of classifying the word's context. Recall part-of-speech identification, which likewise judges a word's part of speech from its context.

Supervised word sense disambiguation performs a classification task based on the context and the sense labels. Unsupervised word sense disambiguation is usually treated as a clustering task: a clustering algorithm groups all the contexts of the same polysemous word into equivalence classes, and at recognition time the context of the word is compared with the equivalence class of each sense, and the sense is determined by the class that the context falls into. Besides supervised and unsupervised methods, there are also dictionary-based disambiguation methods.

Research on word sense disambiguation needs a large amount of test data. To avoid the difficulty of manual labeling, large-scale training and test data can be manufactured artificially. The basic idea is to merge two natural words into one pseudo-word and replace every occurrence of either original word in the corpus with it. The text containing the pseudo-word serves as the ambiguous text, and the original text serves as the disambiguated text.
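
The pseudo-word construction can be sketched as follows. The function name, the hyphenated pseudo-word form, and the toy sentences are assumptions made for illustration.

```python
def make_pseudoword_corpus(sentences, word_a, word_b):
    """Replace word_a and word_b by one pseudo-word; keep the originals as labels."""
    pseudo = f"{word_a}-{word_b}"
    ambiguous, answers = [], []
    for tokens in sentences:
        new_tokens = []
        for tok in tokens:
            if tok in (word_a, word_b):
                new_tokens.append(pseudo)   # the artificial ambiguous word
                answers.append(tok)         # the "sense" that was replaced
            else:
                new_tokens.append(tok)
        ambiguous.append(new_tokens)
    return ambiguous, answers


sentences = [["eat", "a", "banana"], ["open", "the", "door"]]
ambiguous, answers = make_pseudoword_corpus(sentences, "banana", "door")
```

The `ambiguous` text is what a disambiguation system sees, and `answers` provides the gold labels for free, in order of occurrence.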

Supervised word sense disambiguation methods

A supervised word sense disambiguation method builds a classifier and distinguishes the senses of a polysemous word by classifying the contexts in which it occurs.

Disambiguation methods based on mutual information

The basic idea of disambiguation based on mutual information is to find, for each polysemous word to be disambiguated, a contextual feature that reliably indicates which sense the word takes in a particular context.

Mutual information measures the association between two random variables X and Y: the more strongly X and Y are associated, the larger their mutual information.

Here is a brief introduction to the Flip-Flop algorithm used in machine translation. The algorithm applies to the following situation: a word in language A has two senses, and when translated into language B it has more than two translations.

We have the various translations of the word in language B, together with the contextual features that go with each translation.

What we need to find out is which translations in language B correspond to sense 1 and which correspond to sense 2.

What makes this problem complicated is the following. In ordinary word sense disambiguation of, say, a word with two senses, the word is always the same and only the contexts vary, so we just divide the contexts into two equivalence classes. In the cross-language case we must not only divide the contexts but also, before that, divide the several translations between the two senses.

The most troublesome part is to find, all at once, which translations belong to each of the two senses, which contextual features belong to each of them, and how the translations and the contextual features correspond.

Imagine two circles drawn on the ground, representing the two senses. Inside each circle are several balls, representing the translations of that sense, and several blocks, representing the contexts of that sense. Lines connect balls to blocks (never ball to ball or block to block); they are connected freely, so a ball may be connected to several blocks and a block to several balls. Now the circles disappear, and the balls and blocks of the two circles are mixed together in a jumble. How do you separate the balls and blocks that belonged to each circle?

The method given by the Flip-Flop algorithm is trial and error. Divide the blocks into two sets and the balls into two sets, and see how good the division is; if it is not good, keep trying until the best division is found. The remaining question is how to judge whether a division is good. The answer is mutual information.

If the mutual information between the two context sets (the block sets) and the two translation sets (the ball sets) is large, we consider them strongly related, and the division is close to the original two senses.

In effect, this mutual-information-based method directly divides the translations by sense.
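
The trial-and-error search can be sketched as an alternating procedure: hold the split of the translations fixed and pick the split of the contexts with the highest mutual information, then hold the contexts fixed and re-split the translations, and repeat until the score stops improving. This is a simplified illustration under stated assumptions: each side is re-split by brute-force enumeration (practical only for tiny sets), the search can stop at a local optimum, and the co-occurrence counts are invented toy data.

```python
from itertools import combinations
from math import log2


def mutual_information(counts, trans_set, ctx_set):
    """MI between 'translation in trans_set?' and 'context in ctx_set?'."""
    total = sum(counts.values())
    joint, p_t, p_c = {}, {}, {}
    for (t, c), n in counts.items():
        a, b = t in trans_set, c in ctx_set
        joint[(a, b)] = joint.get((a, b), 0) + n
        p_t[a] = p_t.get(a, 0) + n
        p_c[b] = p_c.get(b, 0) + n
    return sum((n / total) * log2(n * total / (p_t[a] * p_c[b]))
               for (a, b), n in joint.items() if n > 0)


def proper_subsets(items):
    items = sorted(items)
    for r in range(1, len(items)):
        for combo in combinations(items, r):
            yield frozenset(combo)


def flip_flop(counts, max_rounds=10):
    translations = {t for t, _ in counts}
    contexts = {c for _, c in counts}
    trans_set = frozenset(sorted(translations)[:1])   # arbitrary initial split
    ctx_set, best = None, -1.0
    for _ in range(max_rounds):
        ctx_set = max(proper_subsets(contexts),
                      key=lambda s: mutual_information(counts, trans_set, s))
        trans_set = max(proper_subsets(translations),
                        key=lambda s: mutual_information(counts, s, ctx_set))
        score = mutual_information(counts, trans_set, ctx_set)
        improved = score > best + 1e-12
        best = score
        if not improved:
            break
    return trans_set, ctx_set, best


# Invented counts of (translation, context word) co-occurrences.
counts = {
    ("take", "mesure"): 4, ("take", "note"): 3,
    ("grab", "note"): 2, ("grab", "mesure"): 1,
    ("speak", "parole"): 3, ("speak", "voix"): 1,
    ("rise", "parole"): 2, ("rise", "voix"): 2,
}
trans_set, ctx_set, best = flip_flop(counts)
```

On this toy data the procedure separates the translations `take` and `grab` from `speak` and `rise`, and the contexts `mesure` and `note` from `parole` and `voix`, which corresponds to putting the balls and blocks back into their two circles.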

A disambiguation method based on Bayesian classifier

The idea of the Bayesian-classifier-based disambiguation method is the same as the naive Bayes classification algorithm from basic machine learning. There it is used to decide between spam and normal mail; here it is used to decide between the different senses (there can be more than two). We only need to compute which sense of the word is most probable given the context.

According to Bayes' formula, the denominator can be ignored in both cases; we only need to compute the numerator and pick the largest. In spam identification the numerator is P(words in the current message | spam) P(spam), whose product is the joint probability of spam and the words of the current message; normal mail is handled the same way. Here the numerator is P(context of the current word | a given sense) P(that sense), which gives the joint probability of the sense and the context. Dividing by the denominator P(context of the current word) would give P(a given sense | context of the current word), so the most probable sense can be derived from the context.
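
A minimal sketch of this computation is shown below. It compares only the numerators, in log space to avoid underflow. The add-one smoothing, the decision to skip context words never seen in training, and the toy training data for the word "bank" are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import log


def train_nb(labeled_contexts):
    """labeled_contexts: iterable of (context_words, sense)."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for context_words, sense in labeled_contexts:
        sense_counts[sense] += 1
        word_counts[sense].update(context_words)
        vocab.update(context_words)
    return sense_counts, word_counts, vocab


def classify_nb(model, context_words):
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, n in sense_counts.items():
        score = log(n / total)                       # log P(sense)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in context_words:
            if w in vocab:                           # skip unseen words
                # log P(word | sense) with add-one smoothing
                score += log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


training = [
    (["river", "water", "fishing"], "shore"),
    (["money", "loan", "account"], "finance"),
    (["water", "mud"], "shore"),
    (["loan", "interest", "money"], "finance"),
]
model = train_nb(training)
```

Calling `classify_nb(model, ["money", "interest"])` picks the sense whose numerator is largest; the denominator P(context) is the same for every sense and is never computed.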

A method of Word sense disambiguation based on maximum entropy

The basic idea of using the maximum entropy model for word sense disambiguation is likewise to treat disambiguation as a classification problem: determine the sense of a word from its specific contextual conditions (represented as features).

Dictionary-based word sense disambiguation

Disambiguation based on dictionary sense definitions

M. Lesk argued that the dictionary definition of an entry is itself a good condition for judging its sense. Take the English word "cone", which has two definitions in the dictionary: one is "the cone of a pine tree", and the other is "a cone-shaped object for holding other things, such as the cone-shaped wafer that holds ice cream". If "tree" appears in the text, or "ice" appears, the sense of "cone" can be determined.

We can compute the similarity between the dictionary definition of each sense and the words in the context of the text, and choose the most similar sense.
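
One simple way to measure that similarity is to count the words shared by the definition and the context, as in the sketch below. The word-overlap measure, the sense names, and the wording of the two definitions are assumptions made for illustration, not quotations from a dictionary.

```python
def simplified_lesk(context_words, sense_definitions, stopwords=frozenset()):
    """Pick the sense whose definition shares the most words with the context."""
    context = {w.lower() for w in context_words} - stopwords
    best_sense, best_overlap = None, -1
    for sense, definition in sense_definitions.items():
        gloss = {w.lower() for w in definition.split()} - stopwords
        overlap = len(context & gloss)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


cone_senses = {
    "pine_cone": "fruit of certain evergreen tree",
    "ice_cream_cone": "crisp wafer for holding ice cream",
}
sense_a = simplified_lesk("the squirrel dropped a cone from the tree".split(),
                          cone_senses)
sense_b = simplified_lesk("she bought a cone filled with ice".split(),
                          cone_senses)
```

When no definition overlaps with the context, this sketch simply returns the first sense listed; a practical system would need a better fallback, such as the most frequent sense.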

Disambiguation methods based on semantic-type dictionaries

This is similar to the preceding method based on dictionary definitions, except that it uses not the definition text in the dictionary but the semantic class of each sense, such as ANIMAL or MACHINERY. Different semantic classes tend to co-occur with different context words, and this is used to disambiguate the senses of a polysemous word.

Unsupervised word sense disambiguation methods

Strictly speaking, a completely unsupervised method cannot perform word sense tagging, because sense tags need to carry some descriptive information about the semantic features. Word sense discrimination, however, can be achieved with completely unsupervised machine learning.

The key idea is context clustering: by clustering the context vectors of a polysemous word's occurrences according to their similarity, the different senses can be told apart.
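
As a rough illustration of context clustering, the sketch below represents each context as a bag-of-words vector and groups contexts whose cosine similarity to an existing cluster is high enough. The single-pass "leader" clustering, the threshold of 0.3, and the toy contexts are assumptions chosen for brevity; the text does not prescribe a particular clustering algorithm.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster_contexts(contexts, threshold=0.3):
    """Assign each context to the most similar cluster, or open a new one."""
    centroids, labels = [], []
    for tokens in contexts:
        vec = Counter(tokens)
        sims = [cosine(vec, c) for c in centroids]
        if sims and max(sims) >= threshold:
            k = sims.index(max(sims))
            centroids[k].update(vec)      # add the context to the cluster sum
        else:
            k = len(centroids)
            centroids.append(Counter(vec))
        labels.append(k)
    return labels


contexts = [
    ["river", "water", "fishing"],
    ["money", "loan", "account"],
    ["water", "river", "mud"],
    ["loan", "money", "interest"],
]
labels = cluster_contexts(contexts)
```

The resulting cluster numbers distinguish the senses but carry no description of them, which is exactly the limitation noted above: the output is sense discrimination, not sense tagging.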

An overview of semantic role labeling

Semantic role labeling is a shallow semantic analysis technique. It works at the sentence level and does not analyze the semantic information contained in the sentence in depth; it only analyzes the predicate-argument structure of the sentence. Specifically, the task of semantic role labeling is to take the predicate of the sentence as the center, study the relationship between each constituent of the sentence and the predicate, and describe those relationships with semantic roles. For example:

In effect it is slot filling: find the time, place, agent, patient, and core predicate in the sentence.

At present, semantic role labeling methods rely too heavily on the results of syntactic analysis, and their domain adaptability is poor.

Automatic semantic role labeling builds on syntactic analysis, and syntactic analysis includes phrase structure analysis, shallow parsing, and dependency analysis. Accordingly, semantic role labeling methods fall into three kinds: those based on phrase structure trees, those based on shallow parsing results, and those based on dependency parsing results.

Their basic processes are similar. Research generally assumes that the predicate is given, so all that remains is to find the arguments of the given predicate; in other words, the task is fixed, and the values of the slots that the task requires must be found. The process generally consists of four stages:

The goal of candidate argument pruning is to cut away, from the large number of candidates, those that are unlikely to be arguments, thereby reducing the number of candidates.

The task of the argument identification stage is to pick out the real arguments from the candidates that remain after pruning. Argument identification is usually solved as a binary classification problem, that is, judging whether or not a candidate is a real argument. This stage does not need to label the semantic role of the argument.

The argument labeling stage assigns a semantic role to each argument identified in the previous stage. Argument labeling is usually solved as a multi-class classification problem whose set of classes is the full set of semantic role tags.

Finally, the post-processing stage cleans up the labeling results from the previous stages, for example by removing arguments whose semantic roles are duplicated.

Semantic role labeling method based on phrase structure tree

The first step is candidate argument pruning. The concrete method is as follows:

Take the predicate as the current node and examine its sibling nodes in turn: if a sibling node is not coordinated with the current node in the syntactic structure, it becomes a candidate. If the syntactic tag of the sibling node is a prepositional phrase, all of its child nodes also become candidates.
Set the parent of the current node as the new current node and repeat the previous step until the current node is the root of the syntax tree.
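
The two pruning steps above can be sketched on a toy phrase structure tree. The `Node` class, the Penn-Treebank-style labels, the example sentence, and in particular the `is_coordinated` heuristic (same label under a parent that has a conjunction child) are hypothetical choices for illustration; the text does not say how coordination is detected.

```python
class Node:
    def __init__(self, label, children=(), word=None):
        self.label, self.word, self.parent = label, word, None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def is_coordinated(node, sibling):
    # Hypothetical heuristic: the parent has a conjunction (CC) child and
    # the sibling carries the same label as the current node.
    has_cc = any(c.label == "CC" for c in node.parent.children)
    return has_cc and sibling.label == node.label


def prune_candidates(predicate):
    candidates, current = [], predicate
    while current.parent is not None:          # stop once current is the root
        for sib in current.parent.children:
            if sib is current or is_coordinated(current, sib):
                continue
            candidates.append(sib)
            if sib.label == "PP":              # also take the PP's children
                candidates.extend(sib.children)
        current = current.parent
    return candidates


pred = Node("VBD", word="investigated")
tree = Node("S", [
    Node("NP", word="The police"),
    Node("VP", [
        pred,
        Node("NP", word="the accident"),
        Node("PP", [Node("IN", word="in"), Node("NP", word="the morning")]),
    ]),
])
candidates = prune_candidates(pred)
```

Only the nodes on or next to the path from the predicate to the root survive, which is far fewer than the full set of constituents in a realistic parse tree.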

For example, the candidate arguments are the circled nodes in the figure:

After pruning the candidate arguments, we move on to argument identification and select effective features for the classifier. Researchers have summarized a number of commonly effective features, such as the predicate itself, the path, the phrase type, the position, the voice, the head word, the subcategorization frame, the first and last words of the argument, and combined features.

Argument labeling follows, and it likewise needs suitable features. Post-processing is not mandatory.

Semantic role labeling method based on dependency tree

This semantic role labeling method is based on the dependency parse tree. Because a phrase structure tree differs from a dependency tree, the semantic role labeling methods built on them also differ.

In the method based on phrase structure trees, an argument is represented as a contiguous sequence of words plus a semantic role tag; for example, in "the cause of the accident" shown above, the words together form the argument A1. In the method based on dependency trees, an argument is represented as a head word plus a semantic role tag. For example, in the dependency tree "cause" is the head word governing "accident", so it is enough to mark "cause" as the A1 argument. In other words, a predicate-argument relation can be expressed as a relation between the predicate and the head word of the argument.

An example is given below:

Above the sentence is the original dependency tree; below the sentence are the relations between the predicate "investigate" and its arguments.

The first step is still argument pruning. The concrete method is as follows:

Take the predicate as the current node.
Add all child nodes of the current node to the candidates.
Set the parent of the current node as the new current node. If the new current node is the root of the dependency tree, pruning ends; otherwise, go back to the previous step.
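
These steps translate almost directly into code. The sketch assumes a representation that is not given in the text: `heads` maps each word position to the position of its head, and position 0 is an artificial root node above the main verb. The example sentence and its dependency arcs are invented.

```python
def prune_dependency_candidates(heads, predicate, root=0):
    """Collect the children of every node on the path from predicate to root."""
    children = {}
    for node, head in heads.items():
        children.setdefault(head, []).append(node)
    candidates, current = [], predicate
    while current != root:                 # pruning ends at the root
        for child in children.get(current, []):
            if child not in candidates:
                candidates.append(child)
        current = heads[current]           # move up to the parent
    return candidates


# 1 police, 2 said, 3 they, 4 investigated, 5 the, 6 cause
heads = {1: 2, 2: 0, 3: 4, 4: 2, 5: 6, 6: 4}
candidates = prune_dependency_candidates(heads, predicate=4)
```

Following the steps literally, the node just left behind is itself a child of the new current node, so the predicate ends up among the candidates once the walk moves up; the later identification stage is expected to reject it.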

Argument identification and argument labeling are again feature-based classification, and researchers have likewise summarized a set of commonly used features, which are not detailed here.

Semantic role labeling method based on chunks

As we know, the result of shallow parsing is a sequence of base NP tags. One common scheme is IOB notation: I marks a word inside a base NP, O marks a word outside any base NP, and B marks the first word of a base NP.
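
Reading the base NPs back out of an IOB sequence is a short loop, sketched below. The function name and the example sentence are assumptions; an `I` that appears without a preceding `B` is treated as starting a new chunk, which is a lenient choice rather than part of the notation.

```python
def iob_to_chunks(words, tags):
    """Group words into base NPs according to their I/O/B tags."""
    chunks, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B" or (tag == "I" and not current):
            if current:
                chunks.append(" ".join(current))
            current = [word]               # start a new base NP
        elif tag == "I":
            current.append(word)           # continue the current base NP
        else:                              # "O": outside any base NP
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


words = ["He", "reckons", "the", "current", "deficit", "will", "narrow"]
tags = ["B", "O", "B", "I", "I", "O", "O"]
chunks = iob_to_chunks(words, tags)
```

Each recovered chunk is a unit that a chunk-based semantic role labeler can work with directly.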

The chunk-based method treats semantic role labeling as a sequence labeling problem.

The chunk-based method generally has no argument pruning step, because the O tag has in effect already cut away the large amount of non-base-NP material that cannot be part of an argument. Argument identification is usually not needed either: every base NP can be regarded as an argument.

All that remains is argument labeling: assigning a semantic role to every base NP. Compared with the methods based on phrase structure trees or dependency trees, chunk-based semantic role labeling is a relatively simple process.

Of course, because there is no tree structure, only a flat sequence, some information, such as dependency relations, is lost compared with the first two structures.

A fusion method of semantic role labeling

Semantic role labeling depends heavily on the results of syntactic analysis, so errors made by the parser directly affect the labeling results. Fusing several semantic role labeling systems is an effective way to reduce the impact of parsing errors on semantic role labeling.

System fusion combines the results of multiple semantic role labeling systems, exploiting the differences and complementarity among their results to obtain the best overall result.

In this approach, several different semantic role labeling systems each label the sentence, and fusion techniques then combine the correct parts of each system's output into a single, fully correct labeling result.

Here we briefly introduce a fusion method based on an integer linear programming model. It requires every system being fused to output a probability for each argument. The basic idea is to treat fusion as an inference problem and set up a constrained optimization model. The optimization objective is usually to maximize the sum of the probabilities of all arguments in the final labeling result, and the constraints are usually drawn from linguistic rules and accumulated knowledge and experience.
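
The optimization can be illustrated on a tiny example without an ILP solver. The sketch below enumerates every subset of candidate arguments by brute force, which stands in for the solver and is feasible only for a handful of candidates. The two constraints (arguments must not overlap, and each core role may appear at most once), the set of core roles, and the candidate list are assumptions chosen as typical examples; the text does not list the actual constraints.

```python
from itertools import combinations


def overlaps(a, b):
    """Candidates are (start, end, role, probability); spans are inclusive."""
    return a[0] <= b[1] and b[0] <= a[1]


def is_consistent(selection, core_roles=("A0", "A1", "A2")):
    for a, b in combinations(selection, 2):
        if overlaps(a, b):
            return False                       # arguments must not overlap
        if a[2] == b[2] and a[2] in core_roles:
            return False                       # no duplicated core role
    return True


def fuse(candidates):
    """Return the consistent subset with the largest sum of probabilities."""
    best, best_score = (), 0.0
    for r in range(1, len(candidates) + 1):
        for selection in combinations(candidates, r):
            if is_consistent(selection):
                score = sum(c[3] for c in selection)
                if score > best_score:
                    best, best_score = selection, score
    return sorted(best), best_score


# Invented outputs of two systems for one predicate.
candidates = [
    (0, 0, "A0", 0.9), (3, 5, "A1", 0.8),                          # system 1
    (0, 1, "A0", 0.6), (3, 4, "A1", 0.5), (6, 7, "AM-TMP", 0.7),   # system 2
]
fused, score = fuse(candidates)
```

The fused result takes the two core arguments from the first system and the temporal argument from the second, which neither system produced on its own. In a real ILP formulation each candidate would become a 0/1 variable and each constraint a linear inequality over those variables.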

Besides the fusion method based on integer linear programming, other fusion methods have also been studied, such as minimum-error-weighted system fusion. Its basic idea is that the labeling results being fused should not all be treated equally: during fusion we should rely more on the systems whose overall results are better.

As follows:

System Fusion process with minimum error weighting

This concludes "On the foundation of Natural Language processing". It runs to more than 42,000 words; reading and writing it took a month at 8 hours a day. It is finally finished, and I am exhausted.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
