Objective
The first article posted to Blog Park, 2017.10.2. Mark.
Since our lab works on NLP and medical-related content, I have started gnawing at the hard nut of NLP, hoping to learn something. Follow-up posts will cover knowledge graphs, deep reinforcement learning, and related topics.
To get to the point: this article is an introduction to using neural networks for NLP problems. Hopefully it will give the reader a basic picture of natural language processing with neural networks.
All text and images quoted in this article are from the paper "A Primer on Neural Network Models for Natural Language Processing" by Yoav Goldberg.
I. Terminology
Feature: a concrete linguistic input such as a word, a suffix, or a part-of-speech (POS) tag.
Input vector: the actual input that is fed to the neural-network classifier.
Input vector entry: a specific value of the input vector.
II. Introduction to two types of neural networks
1. Fully connected feed-forward neural networks offer advantages over traditional methods on classification tasks.
Application: a series of works obtained improved syntactic parsing results by simply replacing the linear model of a parser with a fully connected feed-forward network. Straightforward applications of a feed-forward network as a classifier replacement (usually coupled with pre-trained word vectors) also provide benefits for CCG supertagging, dialog state tracking, pre-ordering for statistical machine translation, and language modeling. Iyyer, Manjunatha, Boyd-Graber, and Daumé III demonstrate that multi-layer feed-forward networks can provide competitive results on sentiment classification and factoid question answering.
2. Convolutional neural networks (built mainly from convolution and pooling layers) can find key local features regardless of where they appear in the input.
Application: convolution-and-pooling architectures show promising results on many tasks, including document classification, short-text categorization, sentiment classification, relation type classification between entities, event detection, paraphrase identification, semantic role labeling, question answering, predicting box-office revenues of movies based on critic reviews, modeling text interestingness, and modeling the relation between character sequences and part-of-speech tags.
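To make the convolution-and-pooling idea concrete, here is a minimal numpy sketch (not code from the paper; the window size, dimensions, and random filters are illustrative assumptions): a 1-D convolution over word vectors followed by max pooling, which keeps the strongest response of each filter no matter where in the sentence it occurred.

```python
import numpy as np

def conv_and_pool(word_vectors, filters, window=3):
    """1-D convolution over a sentence followed by max pooling.

    word_vectors: (sentence_length, emb_dim) matrix, one row per word.
    filters:      (num_filters, window * emb_dim) matrix of filter weights.
    Returns a (num_filters,) vector: the strongest response of each filter
    over all window positions, i.e. position-independent key features.
    """
    n, d = word_vectors.shape
    windows = [word_vectors[i:i + window].reshape(-1)        # concatenate each window
               for i in range(n - window + 1)]
    convolved = np.tanh(np.stack(windows) @ filters.T)       # (positions, num_filters)
    return convolved.max(axis=0)                             # max pooling over positions

# toy usage with random embeddings and filters
rng = np.random.default_rng(0)
sentence = rng.normal(size=(7, 50))        # 7 words, 50-dim embeddings
filters = rng.normal(size=(100, 3 * 50))   # 100 filters over 3-word windows
print(conv_and_pool(sentence, filters).shape)  # (100,)
```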
3. Convolutional networks allow us to encode a sentence of arbitrary length as a fixed-size vector that reflects its most important features, but this comes at the expense of most of the sentence's structural information. Recurrent and recursive neural networks let us process sequences and trees while preserving much of that structure. Recurrent neural networks are designed for sequences; recursive neural networks are a generalization of recurrent networks that handle tree structures, and they can also be used to process stacks.
III. Feature representation
1. A neural network is generally viewed as a classifier: the input x has dimension d_in, and one of d_out output classes is chosen. The input x encodes words, part-of-speech tags, or other linguistic features. The biggest difference from a sparse-input linear model is that we stop using one-dimensional indicator encodings such as one-hot and use dense vector encodings instead; that is, each core feature is embedded as a vector in a d-dimensional space. These feature embeddings can be trained just like the other parameters of the neural network. Different features may use different dimensionalities: a word feature may require 100 dimensions while a part-of-speech feature may need only 20 (illustrated by a figure in the original paper).
The usual processing flow of an NLP classifier is as follows:
A. Extract the set of core features relevant to predicting the output class.
B. Look up the corresponding vector for each feature.
C. Combine the vectors into the input vector x (several combination methods are possible, such as concatenation or summation).
D. Feed the input vector x to the neural network.
Two points deserve attention here: one is the switch from a sparse feature representation to a dense one, and the other is that only the core features are extracted, without hand-crafted feature combinations. A minimal sketch of the whole pipeline follows.
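This is a sketch of steps A–D above, assuming made-up features and random embedding tables (the feature names, dimensions, and the single hidden layer are illustrative, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(1)

# dense embedding tables, one per feature type (dimensions are illustrative)
word_emb = {"dog": rng.normal(size=100), "barks": rng.normal(size=100)}
pos_emb  = {"NOUN": rng.normal(size=20),  "VERB": rng.normal(size=20)}

# A. core features extracted for one example (a word, its POS, the next word, its POS)
features = [("word", "dog"), ("pos", "NOUN"), ("word", "barks"), ("pos", "VERB")]

# B. look up the vector for each feature
vectors = [word_emb[v] if kind == "word" else pos_emb[v] for kind, v in features]

# C. combine the vectors into the input x (here: concatenation)
x = np.concatenate(vectors)          # 100 + 20 + 100 + 20 = 240 dimensions

# D. feed x into a small feed-forward network (one hidden layer, softmax output)
W1, b1 = rng.normal(size=(50, 240)), np.zeros(50)
W2, b2 = rng.normal(size=(3, 50)),   np.zeros(3)   # 3 output classes
h = np.tanh(W1 @ x + b1)
scores = W2 @ h + b2
probs = np.exp(scores - scores.max()); probs /= probs.sum()
print(probs)                         # a distribution over the 3 output classes
```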
2. One-hot encoding vs. dense encoding
One-hot encoding: (a) the dimensionality equals the total number of features; (b) features are completely independent of each other, with no notion of similarity.
Dense encoding: (a) the dimensionality is d, much smaller than the total number of features; (b) similar features end up with vectors that are close to each other.
The main benefit of dense encoding is that similar features are encoded as similar vectors, so the network can treat two similar inputs in a similar way and generalize between them.
3. Representing a variable number of features: CBOW and WCBOW
CBOW (continuous bag of words) simply averages the feature vectors, while WCBOW (weighted CBOW) takes a weighted average with one weight per feature.
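A quick sketch of the two combination rules, assuming each feature already has a vector (the weights in the WCBOW call are made up for illustration, e.g. TF-IDF-like scores):

```python
import numpy as np

def cbow(vectors):
    """Continuous bag of words: plain average of the feature vectors."""
    return np.mean(vectors, axis=0)

def wcbow(vectors, weights):
    """Weighted CBOW: weighted average, one weight per feature."""
    return np.average(vectors, axis=0, weights=np.asarray(weights, dtype=float))

vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
print(cbow(vecs))                   # [0.667 0.667]
print(wcbow(vecs, [0.2, 0.2, 5]))   # dominated by the heavily weighted third feature
```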
4. Distance and position features
In traditional NLP pipelines, the distance between two words is binned (for example 1, 2, ..., 10, and more than 10), and each bin corresponds to a one-hot indicator. In a neural network, the distance feature is handled like any other feature: each distance bin is assigned a d-dimensional vector, and these distance embeddings are trained together with the other parameters of the network.
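A small sketch of how a binned distance feature becomes a trainable vector (the bin boundaries and the 5-dimensional size are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

BINS = [1, 2, 3, 4, 5, 10]                        # distances 1..5, "6-10", then "10+"
dist_emb = rng.normal(size=(len(BINS) + 1, 5))    # one trainable 5-dim vector per bin

def distance_vector(distance):
    """Map a word-pair distance to the embedding vector of its bin."""
    for i, upper in enumerate(BINS):
        if distance <= upper:
            return dist_emb[i]
    return dist_emb[-1]                           # the "10+" bin

print(distance_vector(3))     # vector for bin "3"
print(distance_vector(42))    # vector for the "10+" bin
```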
5. Feature Combination
A neural network only needs the core features, whereas a traditional linear NLP system requires the designer to manually specify features and feature combinations. Because a linear model cannot capture interactions between features on its own, the designer must carefully craft combination features, which also causes the input to grow with every combination added. With a neural network, the designer can expect the non-linear network to discover the relevant feature interactions by itself, without hand-specifying them, which greatly reduces the feature-engineering workload.
Kernel methods, and polynomial kernels in particular, are similar to neural networks in that they also let the designer specify only the core features. However, the computational cost of kernel classifiers scales with the size of the training data, so they become very slow when the data set is large.
6. Dimensions
There is no principled method for choosing the dimensionality of the embeddings. In general, word embeddings use more dimensions than part-of-speech embeddings: words typically get 50 to a few hundred dimensions, sometimes reaching a thousand or more. The practical approach is to experiment with several sizes and pick the one that works best.
7. Vector sharing
Whether different occurrences of the same word should share one vector has to be judged from experience. If the same word plays clearly different roles in different contexts, it is better to take the context into account and assign a separate vector to each role.
8. Output of the NN
In general, the network is a d-class classifier. However, the final layer can also be viewed as a d×k matrix in which each of the d output classes receives a k-dimensional embedding; this means the outputs are no longer completely independent, since related classes can share similar embeddings.
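One way to read the d×k output matrix is that each of the d classes gets a k-dimensional embedding (a column of the matrix), so related classes can receive similar columns. The sketch below only illustrates that view, with random values standing in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(3)

k, d = 50, 6                      # hidden size k, number of output classes d
h = rng.normal(size=k)            # last hidden layer of the network
W_out = rng.normal(size=(k, d))   # column j is a k-dim embedding of class j

scores = h @ W_out                # one score per class
print(scores.argmax())            # predicted class

# classes whose column embeddings are similar get similar scores for any h,
# so the outputs are linked through the shared k-dimensional space
c0, c1 = W_out[:, 0], W_out[:, 1]
print(c0 @ c1 / (np.linalg.norm(c0) * np.linalg.norm(c1)))  # cosine similarity of two classes
```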
IV. Feed-forward Neural Networks
This section explains the fundamentals of neural networks without going into additional detail.
1. Common non-linear functions
A. Sigmoid: historically the most common activation function, but now rarely used in the inner layers of a network; the functions below are the usual alternatives.
B. Tanh.
C. Hard tanh: an approximation of tanh that is cheaper to compute.
D. ReLU: despite its simplicity, often the best-performing activation in practice.
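Definitions of the four non-linearities above, as a minimal numpy sketch (standard formulas, not code from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def hard_tanh(x):
    # piecewise-linear approximation of tanh: clip to [-1, 1]
    return np.clip(x, -1.0, 1.0)

def relu(x):
    # rectified linear unit: max(0, x)
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, hard_tanh, relu):
    print(f.__name__, f(x))
```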
2. Output layer
The output layer is usually a softmax: the class with the highest probability is selected as the output. A softmax is needed whenever the network must produce a probability distribution over the outputs, for example when training with a cross-entropy loss. When such a network contains no hidden layers, it is essentially the classical maximum-entropy model.
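A minimal sketch of the softmax output with a cross-entropy loss (standard formulas; the scores are made up):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(probs, gold_index):
    # negative log probability assigned to the correct class
    return -np.log(probs[gold_index])

scores = np.array([2.0, 0.5, -1.0])        # raw output-layer scores for 3 classes
probs = softmax(scores)
print(probs, probs.argmax())               # distribution and the predicted class
print(cross_entropy(probs, gold_index=0))  # loss when class 0 is correct
```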
3. Embedding layer
The embedding layer performs the word-embedding lookup, mapping each discrete feature to its dense vector. It is generally considered part of the neural network and is trained together with it.
V. Word Embedding
When there is enough supervised data, the word embeddings can simply be trained together with the network: they are initialized to random values and then updated during training. Some papers study the appropriate range for these random initial values, which generally depends on the dimensionality of the vectors.
In practice, commonly occurring features such as part-of-speech tags and individual letters are initialized with random values. Words that are expected to carry meaningful relationships are instead initialized from vectors pre-trained in a supervised or unsupervised manner. These pre-trained vectors can be kept fixed during network training or, more commonly, treated like the randomly initialized ones, i.e. further tuned during training.
In most cases there is not enough annotated data for good supervised pre-training, so unsupervised pre-training is more common. The goal of unsupervised training is to capture similarity between words, usually based on the criterion that words appearing in similar contexts should have similar vectors.
Pre-training on large amounts of unannotated text in this way improves the generalization ability of the model and helps it handle words that do not appear in the supervised training data.
Common word-embedding methods include word2vec, GloVe, and the Collobert and Weston embedding algorithm.
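For reference, a tiny sketch of training word2vec embeddings with the gensim library (assuming gensim 4.x is installed; the toy corpus and parameter values are illustrative, not a recommended configuration):

```python
from gensim.models import Word2Vec

# a toy corpus: one tokenized sentence per list entry
sentences = [
    ["the", "patient", "has", "a", "fever"],
    ["the", "patient", "was", "given", "medication"],
    ["fever", "and", "cough", "are", "common", "symptoms"],
]

# skip-gram (sg=1), 50-dim vectors; parameter names follow gensim 4.x
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["fever"].shape)                    # (50,)
print(model.wv.most_similar("patient", topn=3))   # nearest neighbours in the toy space
```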
Paper reading: A Primer on Neural Network Models for Natural Language Processing (1)