Semantic Understanding and Word Vectors Based on RNNs


Semantic Understanding Based on RNNs


1. Preface

The paper summarized (in translation) in this article is:

Grégoire Mesnil, Xiaodong He, Li Deng and Yoshua Bengio, "Investigation of Recurrent Neural Network Architectures and Learning Methods for Spoken Language Understanding".

The code homepage is https://github.com/mesnilgr/is13.

The paper describes several RNN architectures, including Elman-type RNNs, Jordan-type RNNs, and some variants, applied to semantic (spoken language) understanding. Because the authors learn an encoding for each word, the approach also yields word vectors as a by-product. Using Theano and the ATIS (Airline Travel Information System) benchmark, the authors find that a bi-directional Jordan-type RNN beats the F1 of a CRF, a relative improvement of about 14%.


2. The Spoken Language Understanding Task (Slot Filling)

In this task, each word is assigned an entity label: for example, this word is the beginning of a place name, that word is the end of a person's name, and so on. It can therefore be viewed as a sequence-labeling task, i.e. per-word classification over a sequence.
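As a concrete illustration, here is a hypothetical ATIS-style sentence with IOB slot labels (the sentence and the exact label names are made up for illustration, not taken from the paper):

```python
# A hypothetical ATIS-style example: one IOB slot label per word (illustrative only).
sentence = ["show", "flights", "from", "boston", "to", "new", "york", "tomorrow"]
labels = ["O", "O", "O", "B-fromloc.city_name", "O",
          "B-toloc.city_name", "I-toloc.city_name", "B-depart_date.date_relative"]

# Slot filling = predict one label per word, i.e. sequence labeling.
for word, tag in zip(sentence, labels):
    print(f"{word:10s} -> {tag}")
```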



The previous standard solution was the CRF, which of course requires hand-crafted feature engineering. The moment you hear that features have to be designed by hand, you know this is not the deep learning style.


3. Slot Filling with RNNs

3.1 Word Embedding

Word embeddings did not originate with deep learning, but the idea, as an alternative to n-gram language modeling, can significantly improve generalization in many NLP tasks. The idea is somewhat similar to transfer learning, and it also continues deep learning's theme of learning increasingly abstract representations of language. Word vectors can be trained with a variety of neural network models, including shallow feed-forward networks, convolutional networks, and recurrent networks.

H. Schwenk and J.-L. Gauvain, "Training neural network language models on very large corpora," in HLT/EMNLP 2005.

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," Journal of Machine Learning Research, vol. 12, 2011.

T. Mikolov, S. Kombrink, L. Burget, J. Cernocky, and S. Khudanpur, "Extensions of recurrent neural network based language model," in ICASSP 2011.
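Concretely, a word embedding is just a lookup table: one vector per vocabulary word. A minimal numpy sketch (vocabulary size and dimension are made-up values, and this is not the paper's code):

```python
import numpy as np

vocab_size, emb_dim = 10000, 100          # assumed sizes, for illustration
rng = np.random.default_rng(0)

# The embedding matrix: one row per word. Pre-trained vectors (e.g. SENNA)
# would simply replace this random initialization.
E = 0.2 * rng.standard_normal((vocab_size, emb_dim))

word_indices = [42, 7, 1337]              # a sentence encoded as word IDs
sentence_embeddings = E[word_indices]     # shape: (3, emb_dim)
print(sentence_embeddings.shape)
```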


3.2 Capturing Short-Distance Dependencies with Small Windows


So-called short-distance dependence means that, when modeling language, the attributes of a word are determined from the words near it (usually within a fixed window). This is the main approach of traditional NLP, and also its most important shortcoming, because such models have difficulty capturing long-distance dependence, i.e. what we usually call context. Take "I love apples": without more context we cannot tell from this sentence alone whether I like Apple phones or like eating apples.

Short-distance modeling is relatively easy and basically follows earlier work, especially R. Collobert et al., "Natural Language Processing (Almost) from Scratch."
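The is13 code builds such fixed windows with a small helper; the sketch below is my own reimplementation of the idea (odd window size, padding index -1), not necessarily identical to the repository's function:

```python
def context_window(word_ids, win):
    """Return, for each position, a window of `win` word IDs centered on it.
    Positions outside the sentence are padded with -1 (assumed padding index)."""
    assert win % 2 == 1, "window size must be odd"
    pad = win // 2
    padded = [-1] * pad + list(word_ids) + [-1] * pad
    return [padded[i:i + win] for i in range(len(word_ids))]

# Example: a 5-word sentence, window of 3
print(context_window([0, 1, 2, 3, 4], 3))
# [[-1, 0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, -1]]
```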



Since long-distance dependencies are useful for language modeling, how do we design a model that captures them? RNNs!


3.3 Two RNN Architectures

The architectures used in this paper are the most classic RNNs, much less complex than an LSTM. The basics of RNNs can be found in other blog posts. The key point is that an RNN has a notion of time delay: either the hidden layer (Elman type) or the output layer (Jordan type) is fed back as input at the next time step.

The following figure (omitted in this copy) shows a typical three-layer Elman-type RNN.


Written out as formulas:
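The original formula image is not reproduced here; in standard notation (my reconstruction, with f and g the hidden and output nonlinearities), the two recurrences are usually written as:

```latex
% Elman-type RNN: the hidden state is fed back into the next step
h(t) = f\big(U\,x(t) + V\,h(t-1)\big), \qquad y(t) = g\big(W\,h(t)\big)

% Jordan-type RNN: the output is fed back into the next step
h(t) = f\big(U\,x(t) + V\,y(t-1)\big), \qquad y(t) = g\big(W\,h(t)\big)
```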



3.4 Using RNNs to Capture Long-Distance Dependencies

To capture dependencies between words beyond the input window, we need the notion of delayed feedback. In general, though, the biggest problem in learning long-distance dependencies with an RNN is optimization, namely the vanishing gradient. Training an RNN is essentially the same as training a deep model: unrolling the RNN over time yields a deep neural network. To sidestep this problem, the paper applies a small trick at this point.

The paper feeds information from several past time steps directly into the input. Rather than taking one word per time step, it takes the past T steps into account at each step, which is in fact similar to the window idea. This may sound abstract; the formula is as follows:
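The formula itself is missing from this copy; a plausible reconstruction consistent with the description (a Jordan-style recurrence in which the previous T outputs are fed back, each lag with its own weight matrix) is:

```latex
% Hedged reconstruction, not copied from the paper: each lag \tau has its own
% weight matrix V_\tau, hence more parameters than the standard RNN.
h(t) = f\Big(U\,x(t) + \sum_{\tau=1}^{T} V_\tau\, y(t-\tau)\Big), \qquad
y(t) = g\big(W\,h(t)\big)
```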


Note the difference from the standard RNN above: the key is the sum over T time lags, one term per lag, which also increases the number of model parameters.

Among the extended variants of RNNs there is the bi-directional RNN, which exploits both past and future information: the input at time t includes not only the hidden layer (or output layer) from time t-1 but also the one from time t+1.
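A minimal numpy sketch of the idea behind a bi-directional Elman-style pass (toy dimensions, random weights; illustrative only, not the paper's exact architecture or code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
emb_dim, hid_dim, T = 8, 16, 5                   # toy sizes
X = rng.standard_normal((T, emb_dim))            # a sentence of T embedded words

U = 0.1 * rng.standard_normal((hid_dim, emb_dim))
Vf = 0.1 * rng.standard_normal((hid_dim, hid_dim))   # forward recurrence
Vb = 0.1 * rng.standard_normal((hid_dim, hid_dim))   # backward recurrence

# Forward pass: the hidden state at t depends on words 1..t (the "past").
h_fwd = np.zeros((T, hid_dim))
h = np.zeros(hid_dim)
for t in range(T):
    h = sigmoid(U @ X[t] + Vf @ h)
    h_fwd[t] = h

# Backward pass: the hidden state at t depends on words t..T (the "future").
h_bwd = np.zeros((T, hid_dim))
h = np.zeros(hid_dim)
for t in reversed(range(T)):
    h = sigmoid(U @ X[t] + Vb @ h)
    h_bwd[t] = h

# Each position is then classified from the concatenated past+future context.
context = np.concatenate([h_fwd, h_bwd], axis=1)  # shape (T, 2 * hid_dim)
print(context.shape)
```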



3.5 Model Training

3.5.1 Fine-Tuning the Word Vectors

The author compares randomly initialized word vectors against pre-trained word vectors obtained in different ways (different corpora, training models, and word-vector dimensions), and finds that pre-trained vectors such as the SENNA embeddings work better than random initialization.

3.5.2 Sentence-Level Versus Word-Level Gradients

This is really the question of how to set the mini-batch size in stochastic gradient learning; the author found that using one sentence per mini-batch performs best.
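In other words, the gradient is accumulated over a whole sentence and one update is made per sentence. A rough sketch of such a loop (the `forward_backward` function and parameter names are hypothetical, not the paper's code):

```python
# Sentence-level stochastic gradient descent: one update per sentence.
# `forward_backward` is a hypothetical function returning (loss, gradients),
# where `gradients` is a dict with the same keys as `params`.
def train_epoch(sentences, labels, params, forward_backward, lr=0.1):
    total_loss = 0.0
    for x_sentence, y_sentence in zip(sentences, labels):
        loss, grads = forward_backward(params, x_sentence, y_sentence)
        for name in params:                      # one SGD step per sentence
            params[name] -= lr * grads[name]
        total_loss += loss
    return total_loss / len(sentences)
```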

3.5.3 Dropout Regularization

Dropout regularization is not the focus of the paper, but the authors found that bi-directional RNNs often overfit during training and therefore perform poorly on the test set. The question is how to regularize the model so that it reaches its potential, rather than letting a promising model be buried because it cannot be trained well.
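Dropout here means randomly zeroing a fraction of the hidden units during training. A small numpy sketch (illustrative only):

```python
import numpy as np

def dropout(h, p_drop, rng, train=True):
    """Randomly zero a fraction p_drop of the units during training,
    scaling the survivors so the expected activation is unchanged."""
    if not train or p_drop == 0.0:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = rng.standard_normal(16)
print(dropout(h, 0.5, rng))
```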


4. Experimental results

The author compares logistic regression, an MLP, a CRF, and RNNs.

The article does not say much about what feature engineering the logistic regression uses, only that it uses lexical features. This is one point I find doubtful.

Logistic regression + Wiki: the author argues that when we use the RNN's word embeddings we are in fact exploiting exogenous knowledge from Wikipedia, which is unfair to the CRF and logistic regression. So the author clusters the Wikipedia-trained word vectors into 200 classes and assigns each word the ID of the class it belongs to; this class ID is then added to the CRF and logistic regression as a discrete feature, which of course helps those models.
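One way to realize this (my own sketch using scikit-learn; the paper may have used different clustering details) is to cluster the embedding matrix into 200 classes and use each word's cluster ID as a discrete feature:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
E = rng.standard_normal((10000, 100))   # stand-in for Wikipedia-trained embeddings

# Cluster the word vectors into 200 classes; each word gets its cluster ID,
# which can then be fed to a CRF / logistic regression as a discrete feature.
cluster_ids = KMeans(n_clusters=200, n_init=10, random_state=0).fit_predict(E)
print(cluster_ids[:10])
```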

Frame-NN should be an MLP, the simplest multi-layer perceptron.

CRF stands for conditional random field; its feature engineering here is also just simple lexical features, but it is a sequence model. Unlike the two models above, which use the same feature engineering, it can take sequence dependencies into account, so it has a natural advantage, and the experimental results confirm this.

Next comes a comparison among the RNN variants:

Elman-RNN (past): an Elman-type RNN that uses past information.

Elman-RNN (future): an Elman-type RNN that uses future information.

Jordan-RNN (past): a Jordan-type RNN that uses past information.

Jordan-RNN (future): a Jordan-type RNN that uses future information.

Bi-dir Jordan-RNN: a bi-directional Jordan-type RNN.

Overall, sequence models beat non-sequence models; among sequence models, RNNs beat the CRF; among RNNs, using past information works better than using future information, and the Jordan type beats the Elman type. The bi-directional Jordan RNN, which uses both past and future information, performs best, with F1 = 93.98. (Looks pretty good.)


5. Summary

Some questions remain, such as exactly which features LR and the CRF use; these can only be answered by reading the code.

RNNs, as a special kind of neural network, are used for sequence problems, just as convolutional networks are used for images. This is a direction worth exploring.
