Over the past few days, I have read some background material around the paper A Neural Probabilistic Language Model, such as neural networks and gradient descent algorithms, and along the way brushed up on linear algebra, probability theory, and differentiation. In general, I learned a lot. Below are some notes.
I. Neural Networks
I had heard of neural networks countless times before, but had never actually studied them. Intuitively, a neural network is a system built by imitating biological neurons, created to solve problems that are hard to tackle with other methods.
For a single neuron, once the intensity of a biological stimulus reaches a certain level, the neuron is excited and produces a response. The following figure models this process:
x can be seen as a series of stimulus factors, and w as the weight attached to each factor. The weighted sum of the stimulus factors gives the total stimulus intensity. Note that x0 and wk0 together act as the bias (offset) term. Once we have the stimulus intensity, we respond according to its strength; this step is handled by the activation function in the figure, and its output is the final result.
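A minimal sketch of this weighted-sum-then-activate step (the function name and the choice of sigmoid are mine, not taken from the figure):

```python
import math

def neuron(x, w, b):
    """One neuron: weighted sum of the stimulus factors plus a bias, then an activation."""
    intensity = sum(wi * xi for wi, xi in zip(w, x)) + b  # total stimulus intensity
    return 1.0 / (1.0 + math.exp(-intensity))             # sigmoid activation (one common choice)

# Three stimulus factors with their weights (made-up numbers).
print(neuron([0.5, 1.0, -0.2], [0.4, 0.3, 0.9], b=0.1))
```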
Generally, there are three types of activation functions (A and B can be regarded as one):
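For illustration, here are three common choices (not necessarily the exact ones in the figure): a hard-limit (step) function, a linear function, and a sigmoid.

```python
import math

def hard_limit(n):   # outputs 0 or 1: "fires" once the intensity crosses a threshold
    return 1.0 if n >= 0 else 0.0

def linear(n):       # passes the intensity through unchanged
    return n

def sigmoid(n):      # squashes the intensity smoothly into (0, 1)
    return 1.0 / (1.0 + math.exp(-n))
```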
That covers the basic building block, the single-layer neural network. Simple single-layer networks can be combined to construct more complex networks that implement different functions, for example like this (note how the parameters are indexed in the figure below):
The variables in this figure can be expressed as vectors:
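Roughly, in my own notation (W for the weight matrix, x for the input vector, b for the biases, f for the activation applied element-wise; the figure may label them differently), one layer and the general layer-to-layer rule look like:

```latex
\mathbf{a} = f(\mathbf{W}\mathbf{x} + \mathbf{b}),
\qquad
\mathbf{a}^{(l)} = f\!\left(\mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\right)
\quad \text{for layer } l .
```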
Besides this relatively simple multi-layer neural network, networks can also be structured like this:
For more information about this part, see:
An Introduction to Neural Networks: popular-science material from IBM
Introduction to Neural Networks: this page is more formal.
And Professor Hagan's book: neural_network_design
II. The Gradient Descent Method
This method is easy to describe. A vivid metaphor: you are standing on a hill and want to get down as quickly as possible (assume your walking speed is constant and you cannot get hurt). How should you do it?
Look around, find the steepest downhill direction from where you stand, and take a step that way. That direction is exactly what the gradient gives us, which is where the name "gradient descent" comes from. Think it is simple and you have already mastered it? Haha, not so fast.
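A minimal sketch of the idea on a one-variable function (the quadratic and the step size are arbitrary choices of mine):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient -- the "steepest downhill" direction."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2; its gradient is 2 * (x - 3), and the minimum is at x = 3.
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))
```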
I will not go into detail here; instead, here are the two materials I studied:
The second lecture of Professor Ng's Machine Learning course (I had to listen to it more than once before it sank in; there was apparently a lot I had not really understood before);
Mathematics in Machine Learning: this blog post is also very good, and it is worth revisiting after finishing the course.
By the way, gradient descent has a well-known variant: stochastic gradient descent, which is what is actually used in many situations, as sketched below.
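The difference, as a hedged sketch with a toy model and data of my own: batch gradient descent uses the gradient over all samples per step, while stochastic gradient descent updates after each (randomly ordered) sample.

```python
import random

# Toy data generated from y = 2x; fit y_hat = w * x with squared error.
data = [(x, 2.0 * x) for x in range(1, 6)]
w, lr = 0.0, 0.01

for epoch in range(100):
    random.shuffle(data)
    for x, y in data:                 # one sample per update: the "stochastic" part
        grad = 2 * (w * x - y) * x    # d/dw of (w * x - y)^2
        w -= lr * grad

print(w)   # approaches 2.0
```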
III. Combining Gradient Descent and Neural Networks
Gradient descent and neural networks have both been described above; now let us see how they are used together.
We need sample data to train the neural network model, which to a large extent means adjusting its parameters, such as the weights shown above. How should we adjust them? We adjust them toward our goal, which usually means finding an extreme value of some objective, for example minimizing the prediction error. And as soon as an extreme value is mentioned, gradient descent comes into play: we can run gradient descent over many rounds of iteration, step by step, until we reach that goal. This is where the two come together. I hear that so-called deep learning follows basically the same idea. Sounds impressive all of a sudden, doesn't it?
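As a toy sketch of that combination (the task, data, and learning rate are my own illustrative choices): train a single sigmoid neuron on the OR function by repeatedly nudging its weights down the gradient of the squared error.

```python
import math

def sigmoid(n):
    return 1.0 / (1.0 + math.exp(-n))

# The OR function as training samples.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w, b, lr = [0.0, 0.0], 0.0, 0.5

for epoch in range(5000):
    for x, y in data:
        a = sigmoid(w[0] * x[0] + w[1] * x[1] + b)   # forward: stimulus intensity -> activation
        dn = (a - y) * a * (1 - a)                    # gradient of the squared error w.r.t. the net input
        w = [wi - lr * dn * xi for wi, xi in zip(w, x)]
        b -= lr * dn

print([round(sigmoid(w[0] * x[0] + w[1] * x[1] + b), 2) for x, _ in data])  # near [0, 1, 1, 1]
```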
Note that this part requires a fair amount of linear algebra and calculus, which you will need to brush up on yourself. I referred to the handouts of Professor Ng's machine learning course and the related chapters of neural_network_design.
A simple example: House Price
This is worked through in the second lecture of the machine learning course, and it is explained very well!
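That example is essentially linear regression fit by gradient descent. A hedged re-creation with made-up numbers (not data from the course):

```python
# Made-up (area, price) pairs; fit price ~ w * area + b by batch gradient descent.
data = [(50, 150), (80, 240), (100, 300), (120, 360)]
w, b, lr = 0.0, 0.0, 0.0001

for step in range(20000):
    dw = db = 0.0
    for area, price in data:            # accumulate the gradient of the mean squared error
        err = (w * area + b) - price
        dw += err * area
        db += err
    w -= lr * dw / len(data)
    b -= lr * db / len(data)

print(w, b)   # close to w = 3, b = 0 for this made-up data
```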
The following is a more complex example.
A multi-layer neural network can be divided into three parts: the input layer, the output layer, and the hidden layer(s) in the middle. For the hidden layer we cannot compute the gradient directly and then update its parameters; instead, we have to use the chain rule of derivatives to propagate the gradient back from the output layer and then update the parameters. This is the legendary BP (backpropagation) neural network. My learning material was Professor Hagan's book neural_network_design, which has a chapter devoted to BP networks. The derivation is rather long and I did not fully get it after one reading, so I plan to find time to read it again. It is quite interesting, haha. A toy sketch follows.
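A minimal sketch of backpropagation on a 2-2-1 network learning XOR (the architecture, initialization, and learning rate are my own choices, not the book's derivation):

```python
import math, random

random.seed(0)

def sigmoid(n):
    return 1.0 / (1.0 + math.exp(-n))

# XOR cannot be learned by a single layer, which is exactly why a hidden layer (and BP) is needed.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

# 2 inputs -> 2 hidden units -> 1 output, with small random initial weights.
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
b1 = [0.0, 0.0]
W2 = [random.uniform(-1, 1) for _ in range(2)]
b2 = 0.0
lr = 0.5

def forward(x):
    h = [sigmoid(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(2)]
    o = sigmoid(W2[0] * h[0] + W2[1] * h[1] + b2)
    return h, o

for epoch in range(20000):
    for x, y in data:
        h, o = forward(x)
        # Backward pass: chain rule from the output layer back through the hidden layer.
        do = (o - y) * o * (1 - o)                                # output sensitivity
        dh = [do * W2[j] * h[j] * (1 - h[j]) for j in range(2)]   # hidden sensitivities
        # Gradient descent updates.
        for j in range(2):
            W2[j] -= lr * do * h[j]
            for i in range(2):
                W1[j][i] -= lr * dh[j] * x[i]
            b1[j] -= lr * dh[j]
        b2 -= lr * do

# Outputs should approach [0, 1, 1, 0]; with an unlucky initialization this setup can
# land in a local minimum, in which case a different random seed helps.
print([round(forward(x)[1], 2) for x, _ in data])
```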
After that, you can also refer to this article: Backpropagation, to deepen your understanding.
IV. A Neural Probabilistic Language Model
With the background above, this paper is much easier to read. It is a highly influential paper. Its main highlights are: first, each word is represented by a word vector, so we can measure the similarity between different words; second, a neural network is built for the computation (a BP network trained with gradient descent); third, parallel computation is used to handle large-scale data; and the brightest part of all is that the experimental results of this method are very satisfying!
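As I currently understand the model (a hedged sketch with made-up dimensions; only the forward pass is shown, and the paper's optional direct input-to-output connections are omitted): each of the previous n-1 words is mapped to its word vector through a shared lookup table C, the vectors are concatenated to form the network input, a tanh hidden layer follows, and a softmax output layer assigns a probability to every word in the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)

V, m, n_context, h = 10, 4, 2, 8     # vocab size, word-vector dim, context length, hidden units

C = rng.normal(0, 0.1, (V, m))               # word-vector lookup table, one row per word
H = rng.normal(0, 0.1, (h, n_context * m))   # input -> hidden weights
d = np.zeros(h)                              # hidden biases
U = rng.normal(0, 0.1, (V, h))               # hidden -> output weights
b = np.zeros(V)                              # output biases

def predict_next(context_word_ids):
    """P(next word | previous words): concatenated word vectors -> tanh -> softmax."""
    x = np.concatenate([C[i] for i in context_word_ids])   # the network's input
    a = np.tanh(H @ x + d)
    y = U @ a + b
    e = np.exp(y - y.max())
    return e / e.sum()                                       # probabilities over the vocabulary

p = predict_next([3, 7])   # ids of the two previous words (arbitrary)
print(p.round(3), p.sum())
```

All of these parameters, including the word vectors in C themselves, are then trained jointly by gradient descent, which is where the earlier sections come back in.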
This blog post also discusses it: Deep Learning in NLP (1) Word Vectors and Language Models.
I feel I have not yet understood all the details of the paper, for example, how exactly is the network's input constructed?