Reposting is welcome; please cite the source:
http://www.cnblogs.com/NeighborhoodGuo/p/4702932.html
Go Go Go
The tenth lecture has successfully wrapped up. True to the name "advanced Recursive NN", the content is indeed somewhat advanced, but if you follow the lecture carefully and then seriously read the papers after class, I believe it can still be fully understood.
Start summarizing ...
First of all, the teacher began the class with a review of RNNs and of the three main elements of using RNNs for NLP: 1. the training objective, of which there are mainly two, cross-entropy or max-margin; 2. the composition function, which is the main topic of this lecture; 3. the tree structure, covered in detail in the previous lecture, either a chain structure or a balanced tree.
This lecture covers four models:
The application of the second model, Matrix-Vector RNNs, to relation classification appears only in the lecture and not in the papers, so this talk is mainly about each model's application to paraphrase detection and sentiment analysis.
All right, let's go through them from the top.
1. Standard RNNs
Standard RNNs for paraphrase detection consist of two main parts: the first is the Recursive Autoencoder, and the second is a neural network for variable-sized input.
This model is described in detail in the second paper.
1. Recursive Autoencoder
First we have a parse tree; a reliable parse tree is important for paraphrase detection.
There are two ways to build the Recursive Autoencoder.
The first is the one on the left of the figure: each time the decoder reconstructs only one layer down, and the sum of the reconstruction errors over all non-terminal nodes is used as the loss function. The error at a non-terminal node is obtained by concatenating its two children's vectors and taking the Euclidean distance between that concatenation and its reconstruction.
In the two equations above, the set T in the second equation is the set of all non-terminal nodes, and c1 and c2 in the first equation are the two children of p (a reconstruction of these equations is sketched below).
Since the error could otherwise be made arbitrarily small simply by shrinking the norm of the hidden layer, the non-terminal node value p must be normalized.
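The equations referenced here are in the slides and the paper rather than in this note; below is my own sketch of the standard Recursive Autoencoder formulation (the symbols W_e, W_d, b_e, b_d for the encoder/decoder parameters are my assumption), not a quote from the original:

```latex
% Non-terminal (parent) representation from its two children c_1, c_2,
% normalized so the error cannot be shrunk by scaling down the hidden layer:
p = f\left(W_e [c_1; c_2] + b_e\right), \qquad p \leftarrow \frac{p}{\lVert p \rVert}

% One-layer reconstruction of the two children from the parent:
[c_1'; c_2'] = W_d\, p + b_d

% Reconstruction error at one non-terminal node:
E_{rec}(p) = \left\lVert\, [c_1; c_2] - [c_1'; c_2'] \,\right\rVert^2

% Total loss: sum over the set T of all non-terminal nodes:
L = \sum_{p \in T} E_{rec}(p)
```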
The second is the one on the right side of the figure above: reconstruct the entire spanned subtree underneath each node. That is, decode all the way down to the leaf nodes, concatenate all the reconstructed leaves, and take the Euclidean distance to the original leaves as the loss function.
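For this unfolding variant, a sketch of the corresponding loss in my own notation (x_{[i..j]} denotes the concatenation of the original leaf vectors spanned by node p, and x'_{[i..j]} the leaves reconstructed by unfolding the decoder from p all the way down):

```latex
E_{rec}(p) = \left\lVert\, x_{[i..j]} - x'_{[i..j]} \,\right\rVert^2, \qquad
L = \sum_{p \in T} E_{rec}(p)
```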
2. Neural Network for Variable-Sized Input
Once the model above has been trained well, we move on to the next stage.
These are the trees produced by the tuned model.
First, build a similarity matrix whose rows and columns are the nodes of the two sentences, ordered by the words from left to right followed by the hidden (non-terminal) nodes from the bottom of the tree upward.
The second step is pooling. The pooled layer is a fixed-size square matrix: a fixed value n_p is chosen for both the number of rows and the number of columns. The paper uses non-overlapping pooling, i.e. the pooling regions do not share rows or columns.
If #col > n_p and #row > n_p, every #col/n_p columns and every #row/n_p rows form one pooling region; the last region may end up with fewer than n_p rows or columns.
If #col < n_p or #row < n_p, first duplicate entries along the side that is smaller than n_p until that side has at least n_p entries.
Take the minimum value within each pooling region, then normalize the pooled entries so that they have mean 0 and variance 1.
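As a rough illustration of this dynamic pooling step (not the authors' code; the function name, the default n_p, and the exact way of duplicating entries are my own assumptions), a minimal numpy sketch:

```python
import numpy as np

def dynamic_min_pool(sim, n_p=15):
    """Min-pool a variable-sized similarity (distance) matrix down to n_p x n_p.

    sim : 2-D array of pairwise distances between the nodes of the two
          sentences (words left-to-right, then hidden nodes bottom-up).
    n_p : fixed side length of the pooled matrix.
    """
    def stretch(n):
        # If a side is shorter than n_p, duplicate its indices until it reaches n_p.
        idx = np.arange(n)
        if n >= n_p:
            return idx
        reps = int(np.ceil(n_p / n))
        return np.sort(np.tile(idx, reps))[:n_p]

    sim = sim[stretch(sim.shape[0])][:, stretch(sim.shape[1])]

    # Split rows and columns into n_p non-overlapping groups and take the min in each.
    row_groups = np.array_split(np.arange(sim.shape[0]), n_p)
    col_groups = np.array_split(np.arange(sim.shape[1]), n_p)
    pooled = np.empty((n_p, n_p))
    for i, rows in enumerate(row_groups):
        for j, cols in enumerate(col_groups):
            pooled[i, j] = sim[np.ix_(rows, cols)].min()

    # Normalize the pooled entries to mean 0 and variance 1.
    return (pooled - pooled.mean()) / (pooled.std() + 1e-8)
```

Whatever the sizes of the two sentences, the output is always an n_p x n_p matrix that can then be fed to the classifier described below.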
The paper also mentions an additional treatment for numbers, adding three binary features: the first is 1 if the numbers in the two sentences are exactly the same or there are no numbers at all, and 0 otherwise; the second is 1 if the two sentences contain the same numbers; the third is 1 if the numbers in one sentence are a strict subset of the numbers in the other sentence.
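A small sketch of how those three binary features might be computed (entirely my own reading; the helper name, the regex for extracting numbers, and the interpretation of the second feature as "the two sentences share at least one number" are assumptions, and the paper may handle edge cases such as sentences without numbers differently):

```python
import re

def number_features(sent1, sent2):
    """Sketch of the three binary number features described above."""
    nums1 = set(re.findall(r"\d+(?:\.\d+)?", sent1))
    nums2 = set(re.findall(r"\d+(?:\.\d+)?", sent2))
    f1 = 1.0 if nums1 == nums2 else 0.0                       # identical numbers, or none at all
    f2 = 1.0 if (nums1 & nums2) else 0.0                      # the sentences share a number (my reading)
    f3 = 1.0 if (nums1 < nums2) or (nums2 < nums1) else 0.0   # one side's numbers are a strict subset
    return [f1, f2, f3]
```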
There are two drawbacks to this approach: the first is that it simply compares the similarity of words or phrases and loses the grammatical structure; the second is that computing and pooling the similarity matrix discards some of the information.
Finally, the pooled similarity matrix is fed into a neural network or a softmax classifier, which defines the loss function that can then be optimized.
2. Matrix-Vector RNNs
The Matrix-Vector RNN model is relatively simple: when representing a word, instead of using only a vector, each word is represented by a matrix-vector pair.
The above is the matrix-vector composition method; the difference from the standard RNN is fairly small.
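The composition equations are in the figure, so here is my reconstruction of the Matrix-Vector composition as I understand it from the MV-RNN paper (a and b are the children's vectors, A and B their matrices, W and W_M the composition parameters; treat this as a sketch, not a quote):

```latex
% Parent vector: each child's vector is first transformed by the
% other child's matrix, then composed as in a standard Recursive NN:
p = g\left(W \begin{bmatrix} B\,a \\ A\,b \end{bmatrix}\right)

% Parent matrix: a linear map of the two children's matrices:
P = W_M \begin{bmatrix} A \\ B \end{bmatrix}
```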
In class it was said that this model works well for relation classification.
Relation classification, simply put, is a bit like the high-school exercise of extracting the key words from a sentence.
3. RNTN
The bag-of-words approach to sentiment detection is not very reliable, because bag-of-words cannot capture a sentence's parse tree or linguistic features.
Using a good corpus also improves accuracy, which is very tempting!
In fact, the overall change to the model is not very large and is easy to understand, yet it allows the model to capture the sentiment of a sentence much better.
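For reference, the RNTN composition as I recall it from the sentiment paper (c_1, c_2 are the children's vectors and V^{[1:d]} is the tensor; this is my reconstruction, not copied from the slides):

```latex
% RNTN composition for a parent with children c_1, c_2; the tensor
% V^{[1:d]} adds the multiplicative interaction between the children
% that the standard Recursive NN (the W term alone) cannot express:
p = f\left(
      \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}^{\top}
      V^{[1:d]}
      \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}
      + W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}
    \right)
```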
The optimization of this model is slightly different from the previous one:
This model is said to be the only one to date that can capture negation and its scope.
4. Tree LSTMs
The difference between Tree LSTMs and ordinary LSTMs is that Tree LSTMs build the LSTM over the structure of a tree.
An ordinary LSTM can also be seen as a special case of a Tree LSTM.
In a Tree LSTM, the hidden state of a leaf node is computed the same way as the ordinary hidden computation described before; only the computation at parent nodes differs slightly. See the concrete formulas (a reconstruction is sketched below).
At a parent node, the children's hidden states are summed and fed into the gates; a separate forget gate is computed for each child; and the final cell state is obtained by multiplying each child's cell state by its forget gate and summing the results, together with the input-gated candidate. Everything else is computed the same way as in an ordinary LSTM.
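Since the concrete formulas are in the slides, here is my reconstruction of the Child-Sum Tree-LSTM equations from the Tree-LSTM paper (C(j) denotes the set of children of node j; sigma is the sigmoid and \odot the element-wise product):

```latex
% Child-Sum Tree-LSTM at node j with children C(j):
\tilde{h}_j = \sum_{k \in C(j)} h_k                                      % sum of children's hidden states
i_j    = \sigma\left(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)}\right)  % input gate
f_{jk} = \sigma\left(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}\right)          % one forget gate per child k
o_j    = \sigma\left(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)}\right)  % output gate
u_j    = \tanh\left(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)}\right)   % candidate cell value
c_j    = i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k              % cell: gated sum over children
h_j    = o_j \odot \tanh(c_j)                                            % hidden state of node j
```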
At present this model is best suited to semantic similarity tasks.
CS224D Lecture 10 Notes