Description
The source code for this tutorial lives in Book/label_semantic_roles. If this is your first time using PaddlePaddle, please refer to the PaddlePaddle installation tutorial; for more background, please refer to this tutorial's video lesson.
Background Information
Natural language analysis techniques fall roughly into three levels: lexical analysis, syntactic analysis, and semantic analysis. Semantic Role Labeling (SRL) is one way to perform shallow semantic analysis. In a sentence, the predicate is a statement or description about the subject; it points out "what is done", "what it is", or "what state it is in", and represents the core of an event. The nouns associated with the predicate are called arguments. A semantic role is the role an argument plays in the event expressed by the predicate. The main roles include: Agent, Patient, Theme, Experiencer, Beneficiary, Instrument, Location, Goal, Source, and so on.
In the example below, "met" is the predicate (usually abbreviated "Pred"), "Xiao Ming" is the Agent, "Xiao Hong" is the Patient, "yesterday" and "evening" are the Time of the event, and "the park" is the Location where it happened.

$$\mbox{[Xiao Ming]}_{\mbox{Agent}}\mbox{[yesterday]}_{\mbox{Time}}\mbox{[evening]}_{\mbox{Time}}\mbox{ in [the park]}_{\mbox{Location}}\mbox{ [met]}_{\mbox{Predicate}}\mbox{ [Xiao Hong]}_{\mbox{Patient}}\mbox{.}$$

Semantic Role Labeling (SRL) is centered on the predicate of a sentence. It does not analyze all of the semantic information contained in the sentence; it only analyzes the relationship between each constituent and the predicate, namely the predicate-argument structure, and describes these structural relationships with semantic roles. SRL is an important intermediate step for many natural language understanding tasks, such as information extraction, discourse analysis, and deep question answering. In research it is generally assumed that the predicate is given, and what remains is to find the arguments of that predicate and label their semantic roles.

Most traditional SRL systems are built on syntactic analysis and usually consist of five steps:

1. Construct a parse tree; for example, Figure 1 is the dependency parse tree obtained by syntactically analyzing the sentence above.
2. Identify the candidate arguments of the given predicate on the parse tree.
3. Prune the candidate arguments: a sentence may contain many candidates, and pruning removes the candidates that are least likely to be arguments.
4. Argument identification: decide which of the remaining candidates are true arguments, usually treated as a binary classification problem.
5. For the result of step 4, obtain each argument's semantic role label through multi-class classification.

As can be seen, syntactic analysis is the foundation, and the subsequent steps often construct hand-crafted features that are derived from the syntactic analysis.
Figure 1. Example of a dependency parse tree
However, complete syntactic analysis needs to determine all of the syntactic information contained in a sentence and the relationships among all of its constituents, which is a very difficult task. The accuracy of current syntactic analysis techniques is not high, and even subtle parsing errors will lead to SRL errors. To reduce the complexity of the problem while still obtaining some syntactic structure information, the idea of "shallow syntactic analysis" was born. Shallow syntactic analysis is also known as partial parsing or chunking. Unlike complete syntactic analysis, which produces a full parse tree, shallow parsing only needs to identify some relatively simple, independent constituents of the sentence, such as verb phrases; these recognized structures are called chunks. To avoid the difficulty of "being unable to obtain a high-accuracy parse tree", some studies [1] have proposed chunk-based SRL methods.
The chunk-based SRL method treats SRL as a sequence tagging problem. Sequence tagging tasks generally use the BIO scheme to define the tag set, so we introduce this scheme first. In the BIO scheme, B denotes the beginning of a chunk, I denotes that a word is inside a chunk, and O denotes that a word does not belong to any chunk. Different chunks are distinguished by combining the B, I, and O tags with role labels. For example, for an argument with role A, the first chunk it contains is given the tag B-A, the other chunks it contains are given the tag I-A, and chunks that do not belong to any argument are given the tag O.
Continuing with the sentence above as an example, Figure 2 shows the BIO tagging scheme applied to it.
Figure 2. Example of the BIO tagging scheme
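To make the BIO scheme concrete, here is a minimal, framework-free Python sketch. The token list, argument spans, and the helper `spans_to_bio` are hypothetical and purely illustrative; the predicate is given its own B-V tag here, which is one common convention.

```python
# Expand argument spans into BIO tags; any word not covered by a span gets "O".
def spans_to_bio(num_tokens, spans):
    """spans: list of (start, end, role) with `end` exclusive."""
    tags = ["O"] * num_tokens
    for start, end, role in spans:
        tags[start] = "B-" + role              # first word of the argument
        for i in range(start + 1, end):
            tags[i] = "I-" + role              # remaining words of the argument
    return tags

# Toy version of the example sentence from above.
tokens = ["Xiao", "Ming", "yesterday", "evening", "in", "the", "park",
          "met", "Xiao", "Hong", "."]
spans = [(0, 2, "Agent"), (2, 3, "Time"), (3, 4, "Time"),
         (5, 7, "Location"), (7, 8, "V"), (8, 10, "Patient")]
for word, tag in zip(tokens, spans_to_bio(len(tokens), spans)):
    print(word, tag)
```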
From the example above we can see that deriving the final semantic role labeling result directly from the sequence tagging result is a relatively simple process. This simplicity is reflected in the following: (1) relying only on shallow syntactic analysis lowers the requirements on, and the difficulty of, parsing; (2) there is no candidate-pruning step; (3) argument identification and argument labeling are carried out at the same time. This approach, which unifies argument identification and argument labeling, simplifies the procedure, reduces the risk of error accumulation, and often achieves better results.
Similar to the chunk-based SRL method, in this tutorial we also treat SRL as a sequence tagging problem, except that we rely only on the input text sequence and do not rely on any extra parsing results or complex hand-crafted features; we use a deep neural network to build an end-to-end SRL system. Taking the open dataset of the SRL task from the CoNLL-2004 and CoNLL-2005 Shared Tasks as an example, we practice the following task: given a sentence and a predicate in that sentence, find the arguments of that predicate in the sentence by sequence tagging, and label their semantic roles at the same time.
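To make the input and output of this task concrete, a single training sample could conceptually look like the following. The field names and English tokens are purely illustrative assumptions, not the actual CoNLL-2005 data format.

```python
# One hypothetical (sentence, predicate) sample with per-word BIO role tags.
sample = {
    "words":     ["Xiao Ming", "yesterday", "evening", "in", "the park",
                  "met", "Xiao Hong", "."],
    "predicate": "met",
    "labels":    ["B-Agent", "B-Time", "B-Time", "O", "B-Location",
                  "B-V", "B-Patient", "O"],   # B-V marks the predicate itself
}
```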
Model Overview
A recurrent neural network (RNN) is an important model for sequence modeling and is widely used in natural language processing tasks. Unlike a feed-forward neural network, an RNN can handle dependencies among the inputs. LSTM is an important variant of the RNN and is often used to learn long-range dependencies in a sequence; we already introduced it in the sentiment analysis chapter, and here we again use LSTM to solve the SRL problem.
Stacked Recurrent Neural Network
A deep network helps to form hierarchical features: the upper layers of the network form more complex, higher-level features on top of the primitive features already learned by the lower layers. Although an LSTM is equivalent to a very "deep" feed-forward network when unrolled along the time axis, the parameters of the LSTM are shared across time steps, so the mapping from the state at time $t-1$ to the state at time $t$ always goes through only a single non-linear mapping. This means that a single-layer LSTM models state transitions in a "shallow" way. Stacking multiple LSTM units, so that the output of one LSTM at time $t$ becomes the input of the next LSTM unit at time $t$, helps us build a deep network; we call this the first version of the stacked recurrent neural network. A deep network improves the model's ability to fit complex patterns and can better model patterns across different time steps [2].
However, training a deep LSTM network is not easy. Stacking multiple LSTM units vertically may run into the problem of gradients failing to propagate through the vertical depth. In general, a stack of up to about 4 LSTM layers can be trained normally; when the depth reaches 4 to 8 layers, performance degrades, and some new structure must be considered to keep gradients flowing smoothly in the vertical direction. This is the problem that must be solved when training a deep LSTM network. We can borrow from the wisdom with which LSTM solves the "vanishing and exploding gradient" problem: there is no non-linear mapping on the path along which the memory cell propagates information, so the gradient neither decays nor explodes when propagated backwards along this path. A deep LSTM model can therefore likewise add a path that guarantees smooth gradient propagation in the vertical direction.
The operation of an LSTM unit can be divided into three parts:

1. Input-to-hidden mapping: at each time step, the input $x$ first passes through a matrix mapping and then serves as the input to the forget gate, the input gate, the memory cell, and the output gate; note that this mapping introduces no non-linear activation.
2. Hidden-to-hidden mapping: this step is the main body of the LSTM computation, including the forget gate, the input gate, the memory cell update, and the output gate computation.
3. Hidden-to-output mapping: usually a simple activation applied to the hidden-layer vector.

On top of the first version of the stacked network, we add a new path: besides the output of the previous LSTM layer, the input-to-hidden mapping of the previous LSTM layer also serves as a new input, and a linear mapping is added to learn a new transformation.
Figure 3 is a schematic diagram of the stacked recurrent neural network finally obtained.
Figure 3. Structural diagram of a stacked recurrent neural network based on LSTM
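The following is a minimal NumPy sketch of this idea, not the tutorial's actual PaddlePaddle implementation. In particular, the exact wiring of the extra linear path (here, a linear transform of the layer below's input added to the gate pre-activations) is an assumption made purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(gates_pre, h_prev, c_prev, W_h, b):
    """One LSTM step, given the already-computed (purely linear)
    input-to-hidden projection `gates_pre`."""
    z = gates_pre + h_prev @ W_h + b                # hidden-to-hidden mapping
    i, f, o, g = np.split(z, 4, axis=-1)            # gate pre-activations
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)                     # hidden-to-output part
    return h, c

T, d, depth = 6, 8, 3                               # toy sizes
x_seq = np.random.randn(T, d)

layer_input = x_seq
below_input = None                                  # input of the layer below
for layer in range(depth):
    W_x  = np.random.randn(d, 4 * d) * 0.1          # (1) input-to-hidden
    W_h  = np.random.randn(d, 4 * d) * 0.1          # (2) hidden-to-hidden
    W_ex = np.random.randn(d, 4 * d) * 0.1          # extra linear path (assumed form)
    b    = np.zeros(4 * d)
    h, c = np.zeros(d), np.zeros(d)
    outputs = []
    for t in range(T):
        gates_pre = layer_input[t] @ W_x
        if below_input is not None:
            # the new path: a purely linear transform of the layer below's
            # input at the same time step, added before the gates
            gates_pre = gates_pre + below_input[t] @ W_ex
        h, c = lstm_step(gates_pre, h, c, W_h, b)
        outputs.append(h)
    below_input = layer_input
    layer_input = np.stack(outputs)                 # hidden states feed the next layer

print(layer_input.shape)                            # (T, d): top-layer outputs
```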
Bidirectional Recurrent Neural Network
In an LSTM, the hidden-layer vector at time $t$ encodes all of the input information up to time $t$, so the LSTM at time $t$ can see the history but cannot see the future. In most natural language processing tasks, however, we almost always have the whole sentence available. In this case, if the future information could be obtained just like the historical information, it would be of great help to the sequence learning task.
To overcome this shortcoming, we can design a bidirectional recurrent network unit. Its idea is simple and direct: make a small modification to the stacked recurrent neural network of the previous section by stacking multiple LSTM units and letting each layer of LSTM units process the output sequence of the layer below in alternating directions: forward, backward, forward, and so on. Thus, from the second layer onward, our LSTM units can always see both historical and future information. Figure 4 is a schematic diagram of this bidirectional recurrent neural network based on LSTM.
Figure 4. Structural diagram of a bidirectional recurrent neural network based on LSTM
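A minimal sketch of this alternating-direction stacking is given below; a plain tanh recurrence stands in for the LSTM unit, and all names and sizes are illustrative assumptions rather than the tutorial's actual implementation.

```python
import numpy as np

def recurrent_layer(x_seq, W_x, W_h):
    """A plain tanh recurrence, standing in for one LSTM layer."""
    h, out = np.zeros(W_h.shape[0]), []
    for x in x_seq:
        h = np.tanh(x @ W_x + h @ W_h)
        out.append(h)
    return np.stack(out)

T, d, depth = 6, 8, 4
seq = np.random.randn(T, d)                    # e.g. a toy embedded word sequence

for layer in range(depth):
    W_x = np.random.randn(seq.shape[1], d) * 0.1
    W_h = np.random.randn(d, d) * 0.1
    if layer % 2 == 1:
        # odd layers scan right-to-left; the output is flipped back so
        # positions stay aligned with the layers below
        seq = recurrent_layer(seq[::-1], W_x, W_h)[::-1]
    else:
        # even layers scan left-to-right
        seq = recurrent_layer(seq, W_x, W_h)

print(seq.shape)                               # (T, d): from layer 2 on, each position sees past and future
```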
It should be noted that this bidirectional RNN structure is not the same as the bidirectional RNN structure used by Bengio et al. in the machine translation task [3, 4]; we will introduce that other bidirectional recurrent neural network in the subsequent machine translation chapter.
Conditional Random Field
The usual way of solving a problem with a neural network model is: the earlier layers of the network learn a feature representation of the input, and the last layer completes the final task on top of those features. In the SRL task, a deep LSTM network learns the feature representation of the input, and a conditional random field (CRF) completes the sequence tagging on top of those features, sitting at the very end of the whole network.
A CRF is a probabilistic structured model that can be viewed as a probabilistic undirected graphical model, in which nodes represent random variables and edges represent probabilistic dependencies between random variables. Simply put, a CRF learns the conditional probability $P(Y|X)$, where $X = (x_1, x_2, \ldots, x_n)$ is the input sequence and $Y = (y_1, y_2, \ldots, y_n)$ is the sequence of labels.