Contents: Introduction, Single Hidden Layer Neural Networks, Word Embeddings, Shared Representations, Recursive Neural Networks, Criticism, Conclusion, Acknowledgments
From: https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
Posted on July 7, 2014
Tags: neural networks, deep learning, representations, NLP, recursive neural networks
Introduction
In the past few years, deep neural networks have dominated pattern recognition. They surpassed the previous state of the art on many computer vision tasks. Speech recognition is also moving in this direction.
But, despite the results, we have to wonder why they work so well.
This article reviews some of the most significant results of applying deep neural networks to natural language processing (NLP).
In doing so, I hope to offer one promising answer as to why deep neural networks work. I think it is a very elegant perspective.
Single Hidden Layer Neural Networks
A neural network with one hidden layer is universal: given enough hidden units, it can approximate any function. This is a frequently cited (and even more frequently misunderstood and misapplied) theorem.
It is true, essentially, because the hidden layer can be used as a lookup table.
For simplicity, let's consider a perceptron network. A perceptron is a very simple neuron that fires if its input exceeds a certain threshold and does not fire otherwise. A perceptron network takes binary (0 and 1) inputs and gives binary outputs.
Note that there is only a finite number of possible inputs. For each possible input, we can construct a neuron in the hidden layer that fires for that input, and only for that specific input. We can then use the connection between that neuron and the output neuron to control the output for that specific case.
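The lookup-table construction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`step`, `build_lookup_network`) and the particular choice of weights and biases are assumptions made for this example, not something taken from the original article.

```python
from itertools import product


def step(x):
    """Perceptron activation: fire (1) if the input is positive, else 0."""
    return 1 if x > 0 else 0


def build_lookup_network(n_inputs, target):
    """Build a one-hidden-layer perceptron network that reproduces `target`.

    One hidden neuron is created for every possible binary input pattern
    (so 2**n_inputs neurons in total). The weights below are one possible
    choice, picked for this sketch.
    """
    hidden = []
    for pattern in product([0, 1], repeat=n_inputs):
        # Weight +1 where the pattern has a 1, and -1 where it has a 0.
        weights = [1 if bit else -1 for bit in pattern]
        # The weighted sum equals sum(pattern) only on an exact match and is
        # at least 1 smaller otherwise, so this bias makes the neuron fire
        # on that single pattern alone.
        bias = 0.5 - sum(pattern)
        # The output weight stores the table entry for this pattern.
        hidden.append((weights, bias, target(pattern)))

    def network(x):
        total = 0
        for weights, bias, out_weight in hidden:
            fired = step(sum(w * xi for w, xi in zip(weights, x)) + bias)
            total += out_weight * fired
        # Exactly one hidden neuron fires, so total is the stored table entry.
        return step(total - 0.5)

    return network


# Example: memorize 3-bit parity, a function no single perceptron can compute.
parity = lambda bits: sum(bits) % 2
parity_net = build_lookup_network(3, parity)
```

Note that the hidden layer needs one neuron per possible input, so its size grows exponentially with the number of inputs. The network stores the function rather than learning anything about its structure.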
So, it is true that one hidden layer neural networks are universal. But there is nothing particularly impressive or exciting about that. Saying that your model can do the same thing as a lookup table is not a very strong argument for it. It just means it is not impossible for your model to do the task.
Universality means that a network can fit any training data you give it. It does not mean that it will interpolate to new data points in a reasonable way.
No, universality is not an explanation for why neural networks work so well. The real reason seems to be something much more subtle... And, to understand it, we first need to understand some concrete results.
Word Embeddings
I would like to start by tracing a particularly interesting strand of deep learning research: word embeddings. In my personal opinion, word embeddings are still one of the most exciting areas of research in deep learning, although they were originally introduced by Bengio et al. more than a decade ago. Beyond that, I think they are one of the best places to gain intuition about why deep learning is so effective.
A word embedding $W: \mathrm{words} \to \mathbb{R}^n$ is a parameterized function mapping words in some language to high-dimensional vectors (perhaps 200 to 500 dimensions). For example, we might find:
$W(\text{"cat"}) = (0.2, -0.4, 0.7, \ldots)$
$W(\text{"mat"}) = (0.0, 0.6, -0.1, \ldots)$
(Typically, the function is a lookup table, parameterized by a matrix $\theta$, with one row for each word: $W_\theta(w_n) = \theta_n$.)
$W$ is initialized with a random vector for each word. It learns meaningful vectors in order to perform some task.
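As a rough illustration of the lookup-table view, the sketch below stores $\theta$ as a list of rows and implements $W_\theta(w_n) = \theta_n$ as a row lookup. The helper names (`init_embedding`, `W`), the toy vocabulary, the 4-dimensional vectors, and the initialization range of $\pm 0.1$ are all assumptions chosen for the example; a real embedding would use hundreds of dimensions and a far larger vocabulary.

```python
import random


def init_embedding(vocab, dim, seed=0):
    """Create a randomly initialized embedding matrix theta.

    Returns a word -> row index mapping and theta itself, stored as a list
    of rows with one row (a list of `dim` floats) per word.
    """
    rng = random.Random(seed)
    word_to_row = {word: n for n, word in enumerate(vocab)}
    # Small random values; the +/-0.1 range is an arbitrary choice here.
    theta = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocab]
    return word_to_row, theta


def W(word, word_to_row, theta):
    """The embedding function: look up the row of theta for `word`."""
    return theta[word_to_row[word]]


# Toy vocabulary for illustration only.
vocab = ["cat", "sat", "on", "the", "mat", "song"]
word_to_row, theta = init_embedding(vocab, dim=4)
cat_vector = W("cat", word_to_row, theta)
```

At this point the vectors carry no meaning. Training adjusts the entries of `theta` so that the vectors become useful for the task.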
For example, one task we might train a network for is predicting whether a 5-gram (a sequence of five words) is "valid." We can easily get lots of 5-grams from Wikipedia (for example, "cat sat on the mat") and then "break" half of them by replacing one word with a random word (for example, "cat sat song the mat"), since that will almost certainly make the 5-gram nonsensical.
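One way such a training set could be assembled is sketched below. The function name `make_training_examples`, the 1/0 labels, the rule of breaking every second 5-gram, and the requirement that the replacement word differ from the word it replaces are assumptions made for this sketch; the article only says that half of the 5-grams have a word swapped for a random one.

```python
import random


def make_training_examples(five_grams, vocab, seed=0):
    """Label 5-grams as valid (1) or broken (0).

    Every second 5-gram is "broken" by replacing one randomly chosen word
    with a different random word from `vocab`.
    """
    rng = random.Random(seed)
    examples = []
    for i, gram in enumerate(five_grams):
        words = list(gram)
        if i % 2 == 1:
            position = rng.randrange(len(words))
            # Exclude the original word so the 5-gram really changes.
            candidates = [w for w in vocab if w != words[position]]
            words[position] = rng.choice(candidates)
            examples.append((tuple(words), 0))
        else:
            examples.append((tuple(words), 1))
    return examples


# Toy data for illustration only.
vocab = ["cat", "sat", "on", "the", "mat", "song"]
five_grams = [
    ("cat", "sat", "on", "the", "mat"),
    ("the", "cat", "sat", "on", "mat"),
]
examples = make_training_examples(five_grams, vocab)
```

A randomly substituted word will occasionally still yield a plausible 5-gram, so a small fraction of the "broken" labels will be noisy. With a large corpus this is usually tolerated.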
The model we train will run each word in the 5-gram through $W$ to get a vector representing it, and feed those vectors into another "module" called $R$, which tries to predict whether the 5-gram is "valid" or "broken." So, we would like:
R (W (' Cat '), W (' Sat '), W ('