Summary
This section develops an intuitive, visual understanding of backpropagation. Backpropagation is a way of computing the gradients of expressions through recursive application of the chain rule. Understanding this process and its subtleties is critical for understanding, implementing, designing, and debugging neural networks. The core problem studied here is the following: we are given some function $f(x)$, where $x$ is a vector of inputs, and we want to compute the gradient of $f$ with respect to $x$, that is, $\nabla f(x)$.
The reason we care about this problem is that in a neural network $f$ corresponds to the loss function $L$, and the inputs consist of the training data and the network weights. For example, the loss could be the hinge loss, whose inputs are the training data $(x_i, y_i),\ i = 1 \ldots N$, the weights $W$, and the biases $b$. Note that the training data are given and fixed (as is usually the case in machine learning), while the weights are the variables we control. Therefore, even though backpropagation can also compute gradients on the input data $x_i$, in practice we usually only compute the gradients on the parameters, since those are what we update. The gradients on $x_i$ can still be useful on occasion, for example for visualizing and interpreting what the neural network is doing.
Understanding Gradients
Let us start with simple expressions, so that we can develop the notation and conventions used for more complex ones. Consider a simple multiplication function of two numbers, $f(x, y) = xy$. It is a matter of simple calculus to derive the partial derivative with respect to either input:
\[f(x, y) = xy \rightarrow \frac{df}{dx} = y, \quad \frac{df}{dy} = x\]
Keep in mind what these derivatives mean: they indicate the rate of change of the function with respect to a variable in the infinitesimally small neighborhood of a particular point $x$:
\[\frac{df(x)}{dx} = \lim_{h \rightarrow 0} \frac{f(x + h) - f(x)}{h}\]
A technical note on notation: the division sign on the left-hand side is, unlike the division sign on the right-hand side, not a division. Instead, the notation indicates that the operator $\frac{d}{dx}$ is being applied to the function $f$ and returns a different function (the derivative). A good way to think about the expression above is that when $h$ is very small, the function is well approximated by a straight line, and the derivative is the slope of that line. In other words, the derivative on each variable tells you how sensitive the whole expression is to its value. For example, if $x = 4, y = -3$, then $f(x, y) = -12$ and the derivative on $x$ is $\frac{df}{dx} = -3$. This tells us that if we were to increase the value of $x$ by a tiny amount, the value of the whole expression would decrease (because of the negative sign), and by three times that amount. This can be seen by rearranging the formula above:
\[f(x + h) = f(x) + h \frac{df}{dx}\]
Analogously, since $\frac{df}{dy} = 4$, we know that increasing the value of $y$ by some very small amount $h$ would increase the output of the function (because of the positive sign), and by $4h$. The derivative on each variable tells you the sensitivity of the whole expression to its value.
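As a quick sanity check, the following small sketch (an illustration added here, not part of the original notes) approximates the derivatives of $f(x, y) = xy$ at $x = 4, y = -3$ with a finite difference and compares them to the analytic values $y$ and $x$:

# finite-difference check of df/dx = y and df/dy = x for f(x, y) = x * y
f = lambda x, y: x * y
x, y = 4.0, -3.0
h = 1e-5

dfdx_num = (f(x + h, y) - f(x, y)) / h
dfdy_num = (f(x, y + h) - f(x, y)) / h
print(dfdx_num)  # approximately -3.0, which is y
print(dfdy_num)  # approximately  4.0, which is x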
As mentioned above, the gradient $\nabla f$ is the vector of partial derivatives, so we have
\[\nabla f = \left [\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right] =[y,x]\]
Even though the gradient is technically a vector, for simplicity we will often use terms such as "the gradient on x" instead of the technically correct phrase "the partial derivative on x". We can also derive the derivatives for the addition operation:
\[f(x, y) = x + y \rightarrow \frac{df}{dx} = 1, \quad \frac{df}{dy} = 1\]
That is, the derivatives on both $x$ and $y$ are 1 regardless of their values. This makes sense, since increasing either $x$ or $y$ would increase the output of $f$, and the rate of that increase is independent of the actual values of $x$ and $y$ (unlike the multiplication case above). The last function we will use frequently is the max operation:
\[f(x, y) = \max(x, y) \rightarrow \frac{df}{dx} = \mathbb{1}(x \ge y), \quad \frac{df}{dy} = \mathbb{1}(y \ge x)\]
That is, the gradient is 1 on the input that is larger, and 0 on the other input. For example, if $x = 4, y = 2$, then the max is 4 and the function is not sensitive to the setting of $y$: if we added a tiny amount $h$ to $y$, the output would still be 4, so the gradient is zero, there is no effect on the output. Of course, if we changed $y$ by a large amount (e.g. larger than 2), the value of $f$ would change, but derivatives tell us nothing about the effect of such large changes on the inputs of a function; they are only informative for tiny, infinitesimally small changes, as indicated by $h \rightarrow 0$ in the definition.
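To make the point about tiny changes concrete, here is a small sketch (illustrative, not from the original notes) showing that nudging the smaller input of $\max(x, y)$ by a tiny $h$ does not change the output, while nudging the larger input changes it one-for-one:

x, y, h = 4.0, 2.0, 1e-5

# y is not the max, so a tiny change to y has no effect: gradient 0
print(max(x, y + h) - max(x, y))        # 0.0
# x is the max, so the output tracks x exactly: gradient 1
print((max(x + h, y) - max(x, y)) / h)  # ~1.0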
Computing compound expressions with the chain rule
Now let us consider more complicated expressions that involve multiple composed functions, such as $f(x, y, z) = (x + y)z$. This expression is still simple enough to differentiate directly, but we will take a particular approach to it that helps build intuition for backpropagation. We can break the expression into two parts: $q = x + y$ and $f = qz$. We have already seen how to differentiate both parts separately: $f$ is the product of $q$ and $z$, and $q$ is the sum of $x$ and $y$, so:
\[\frac{\partial f}{\partial q} = z, \quad \frac{\partial f}{\partial z} = q, \quad \frac{\partial q}{\partial x} = 1, \quad \frac{\partial q}{\partial y} = 1\]
However, we do not necessarily care about the gradient on the intermediate value $q$; the value of $\frac{\partial f}{\partial q}$ is not useful by itself. Instead, what we care about is the gradient of $f$ with respect to $x, y, z$. The chain rule tells us that the correct way to "chain" these gradient expressions together is through multiplication, for example:
\[\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x} \]
In practice this is simply a multiplication of the two numbers that hold the gradients. Sample code is shown below:
# set some input values
x = -2; y = 5; z = -4

# perform the forward pass
q = x + y # q becomes 3
f = q * z # f becomes -12

# perform the backward pass (backpropagation):
# first backprop through f = q * z
dfdz = q # df/dz = q, so the gradient on z becomes 3
dfdq = z # df/dq = z, so the gradient on q becomes -4
# now backprop through q = x + y
dfdx = 1.0 * dfdq # dq/dx = 1. The multiplication here is the chain rule
dfdy = 1.0 * dfdq # dq/dy = 1
We are left with the gradients in the variables [dfdx, dfdy, dfdz], which tell us the sensitivity of $f$ to the variables x, y, z. This is the simplest example of backpropagation. Going forward we will use a more concise notation that omits the df prefix: that is, we write dq instead of dfdq, and always assume that the gradient is computed with respect to the final output.
This computation can also be nicely visualized as a circuit diagram:
The real-valued "circuit" shows a visual representation of the computation. The forward pass computes values from the inputs to the output (shown in green). The backward pass then starts at the end and, applying the chain rule, recursively computes the gradients (shown in red) all the way back to the inputs of the circuit. The gradients can be thought of as flowing backwards through the circuit.
Intuitive understanding of backpropagation
Backpropagation is a beautifully local process. Every gate in the circuit diagram gets some inputs and can right away compute two things: 1. its output value, and 2. the local gradient of its output with respect to its inputs. The gates can do this completely independently, without being aware of any of the details of the full circuit they are embedded in. However, once the forward pass is over, during backpropagation the gate will eventually learn the gradient of the final output of the entire circuit with respect to its own output value. The chain rule says that the gate should take that gradient and multiply it into the local gradients it computed for each of its inputs, yielding the gradient of the circuit's output with respect to each input of the gate.
This extra multiplication (for each input) due to the chain rule is what turns a single, relatively independent gate into a cog in a complex computational circuit, such as an entire neural network.
Let's get an intuition for how this works by referring again to the example. The add gate received inputs [-2, 5] and computed output 3. Since the gate is computing the addition operation, its local gradient on both of its inputs is +1. The rest of the circuit computed the final value, which is -12. During the backward pass, in which the chain rule is applied recursively backwards through the circuit, the add gate (which is an input to the multiply gate) learns that the gradient on its output was -4. If we anthropomorphize the circuit as wanting to output a higher value, we can think of it as "wanting" the output of the add gate to be lower (because of the negative sign), and with a force of 4. To continue the recurrence and chain the gradient, the add gate takes that gradient and multiplies it into the local gradients on each of its inputs (making the gradient on both x and y equal to 1 × -4 = -4). Notice that this has the desired effect: if x and y were to decrease (responding to their negative gradient), the add gate's output would decrease, which in turn makes the multiply gate's output increase.
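To make this modular view concrete, here is a minimal sketch of a gate object with a forward and a backward method (a hypothetical class written for illustration; it is not code from the original notes). The gate computes its output and, when handed the gradient flowing back from the rest of the circuit, applies the chain rule locally:

class MultiplyGate:
    def forward(self, a, b):
        # compute the output and cache the inputs for the backward pass
        self.a, self.b = a, b
        return a * b

    def backward(self, dout):
        # local gradients are the swapped inputs; the chain rule multiplies
        # them by the gradient dout arriving from the rest of the circuit
        da = self.b * dout
        db = self.a * dout
        return da, db

# usage: the gate needs no knowledge of the circuit around it
gate = MultiplyGate()
f = gate.forward(3.0, -4.0)   # -12.0, e.g. q = 3 and z = -4 from above
dq, dz = gate.backward(1.0)   # (-4.0, 3.0): gradients on q and z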
Backpropagation can thus be thought of as gates communicating with each other (through the gradient signal) whether they want their outputs to increase or decrease (and how strongly), so as to make the final output value higher.
Modularity: sigmoid example
The gates we introduced above are relatively arbitrary. Any kind of differentiable function can act as a gate, and we can group multiple gates into a single gate, or decompose a function into multiple gates, whenever convenient. Let's look at an expression that illustrates this point:
\[f(w, x) = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}\]
As we will see later in the class, this expression describes a 2-dimensional neuron (with inputs x and weights w) that uses the sigmoid activation function. For now, think of it simply as a function from the inputs w, x to a single number. The function is made up of several gates. In addition to the add, multiply, and max gates introduced above, there are four more:
\[f(x) = \frac{1}{x} \rightarrow \frac{df}{dx} = -\frac{1}{x^2}, \qquad f_c(x) = c + x \rightarrow \frac{df}{dx} = 1\]
\[f(x) = e^x \rightarrow \frac{df}{dx} = e^x, \qquad f_a(x) = ax \rightarrow \frac{df}{dx} = a\]
Here, the functions $f_c$ and $f_a$ translate the input by a constant $c$ and scale the input by a constant $a$, respectively. They are technically special cases of addition and multiplication, but we introduce them as (new) unary gates because we do not need the gradients on the constants. The full circuit then looks as follows:
———————————————————————————————————————
Example circuit for a 2-dimensional neuron with a sigmoid activation function. The inputs are [x0, x1] and the (learnable) weights of the neuron are [w0, w1, w2]. The neuron computes a dot product with the input, and then its activation is softly squashed by the sigmoid function into the range 0 to 1.
————————————————————————————————————————
In the example above we see a long chain of function applications that operates on the result of the dot product between w and x. The function these gates implement is called the sigmoid function $\sigma(x)$. Its derivative with respect to its input simplifies nicely if you carry out the derivation (using the trick of first adding and then subtracting 1 in the numerator):
\[\sigma(x) = \frac{1}{1 + e^{-x}} \rightarrow \frac{d\sigma(x)}{dx} = \left(1 - \sigma(x)\right)\sigma(x)\]
As you can see, the gradient computation becomes much simpler. For example, the sigmoid expression receives the input 1.0 and computes the output 0.73 during the forward pass. According to the formula above, the local gradient is simply (1 - 0.73) * 0.73 ≈ 0.2, whereas previously it would have taken several steps through the circuit; now it is a single, simple expression. Therefore, in a real application it is useful to group such operations into a single gate. The backpropagation code for this neuron looks as follows:
import math

w = [2, -3, -3] # assume some random data and weights
x = [-1, -2]

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]
f = 1.0 / (1 + math.exp(-dot)) # sigmoid function

# backward pass through the neuron
ddot = (1 - f) * f # gradient on the dot variable, using the sigmoid derivative
dx = [w[0] * ddot, w[1] * ddot] # backprop into x
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot] # backprop into w
# done! we have the gradients on the inputs
Implementation tip: staged backpropagation. As the code above shows, in practice it is helpful to break the forward pass into stages that are easy to backprop through. For example, we created an intermediate variable dot that holds the result of the dot product between w and x. During the backward pass we then successively compute (in reverse order) the corresponding gradient variables (ddot, and finally dx and dw) that hold the gradients on those variables.
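As a quick way to verify a backward pass like the one above, the analytic gradients can be compared against numerical ones. A minimal sketch (written here as an illustration, reusing the same example values):

import math

def neuron(w, x):
    # forward pass of the 2-dimensional sigmoid neuron from the code above
    return 1.0 / (1 + math.exp(-(w[0]*x[0] + w[1]*x[1] + w[2])))

w = [2, -3, -3]
x = [-1, -2]
h = 1e-5

# numerical gradient on w[0], to compare against dw[0] = x[0] * ddot
w_nudged = [w[0] + h, w[1], w[2]]
dw0_num = (neuron(w_nudged, x) - neuron(w, x)) / h
print(dw0_num)  # roughly -0.197, matching x[0] * (1 - f) * f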
The point of this section is that the details of how backpropagation is performed, and which parts of the forward function we treat as gates, are a matter of convenience. It helps to be aware of which parts of the expression have simple local gradients, so that they can be "chained" together with the least amount of code and effort.
Backpropagation in practice: staged computation
Let's look at another example. Suppose we have a function of the form:
\[f(x, y) = \frac{x + \sigma(y)}{\sigma(x) + (x + y)^2}\]
To be clear, this function is completely useless in practice; you would not want to compute its gradient for any real purpose, and it is only given here as an exercise in backpropagation. It is important to stress that if you were to launch into differentiating it with respect to x or y directly, you would end up with very large and complex expressions. However, doing so is completely unnecessary, because we do not need an explicit formula that evaluates the gradient; we only need to know how to compute it with backpropagation. Here is how we would structure the forward pass:
import math

x = 3 # example values
y = -4

# forward pass
sigy = 1.0 / (1 + math.exp(-y)) # sigmoid in the numerator   #(1)
num = x + sigy # numerator                                   #(2)
sigx = 1.0 / (1 + math.exp(-x)) # sigmoid in the denominator #(3)
xpy = x + y                                                  #(4)
xpysqr = xpy**2                                              #(5)
den = sigx + xpysqr # denominator                            #(6)
invden = 1.0 / den                                           #(7)
f = num * invden # done!                                     #(8)
Phew, we made it to the end of the expression, and with that the forward pass is complete. Notice that the code was structured to contain multiple intermediate variables, each of which is a simple expression for which we already know the local gradient. Computing the backward pass is therefore easy: we go backwards through every variable of the forward pass (sigy, num, sigx, xpy, xpysqr, den, invden), and for each we will have a corresponding variable that begins with d and holds the gradient of the output of the circuit with respect to that variable. Additionally, note that every piece of the backward pass computes the local gradient of its expression and chains it with the gradient flowing in from above using a multiplication. For each line we also indicate which part of the forward pass it corresponds to.
# backprop f = num * invden
dnum = invden # gradient on the numerator                         #(8)
dinvden = num                                                     #(8)
# backprop invden = 1.0 / den
dden = (-1.0 / (den**2)) * dinvden                                #(7)
# backprop den = sigx + xpysqr
dsigx = (1) * dden                                                #(6)
dxpysqr = (1) * dden                                              #(6)
# backprop xpysqr = xpy**2
dxpy = (2 * xpy) * dxpysqr                                        #(5)
# backprop xpy = x + y
dx = (1) * dxpy                                                   #(4)
dy = (1) * dxpy                                                   #(4)
# backprop sigx = 1.0 / (1 + math.exp(-x))
dx += ((1 - sigx) * sigx) * dsigx # notice += !! see notes below  #(3)
# backprop num = x + sigy
dx += (1) * dnum                                                  #(2)
dsigy = (1) * dnum                                                #(2)
# backprop sigy = 1.0 / (1 + math.exp(-y))
dy += ((1 - sigy) * sigy) * dsigy                                 #(1)
# done!
Some things to be aware of:
Cache forward pass variables: some of the intermediate variables computed during the forward pass are very useful when computing the backward pass. In practice you want to structure your code so that these variables are cached and available during backpropagation. If this is too difficult, it is possible (but wasteful) to recompute them.
Gradients add up at forks: if a variable such as x or y appears multiple times in the forward expression, then during backpropagation we must be careful to use += instead of = to accumulate the gradient on that variable (otherwise we would overwrite it). This follows the multivariable chain rule in calculus, which states that if a variable branches out to different parts of the circuit, the gradients flowing back to it should add up.
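For instance, here is a tiny sketch (illustrative values, not from the original notes) of a function in which x feeds into two branches, so the two contributions to dx must be accumulated with +=:

x, y = 3.0, -4.0

# forward pass: x is used in two places
q = x + y          # branch 1
f = q * x          # branch 2 uses x again

# backward pass: gradients from both branches add up on x
dq = x * 1.0       # from f = q * x
dx = q * 1.0       # from f = q * x (the other factor)
dx += 1.0 * dq     # from q = x + y; using = here would overwrite the first part
dy = 1.0 * dq
print(dx)          # 2.0, matching d/dx of (x + y) * x = 2x + y
print(dy)          # 3.0, matching d/dy = x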
Patterns in backward flow
It is interesting to note that in many cases the backward-flowing gradient can be interpreted intuitively. For example, the three most commonly used gates in neural networks (add, multiply, max) all have very simple interpretations of how they behave during backpropagation. Consider this example circuit:
——————————————————————————————————————————
An example circuit demonstrating the intuition behind backpropagation. The add operation distributes the gradient equally to all of its inputs. The max operation routes the gradient to the higher input. The multiply gate takes the input activations, swaps them, and multiplies by its gradient.
——————————————————————————————————————————
From the above example:
The add gate always distributes the gradient on its output equally to all of its inputs, regardless of what their values were during the forward pass. This follows from the fact that the local gradient of the add operation is simply +1.0, so the gradients on all inputs exactly equal the gradient on the output (multiplying by 1.0 leaves them unchanged). In the example circuit, the add gate routed the gradient of 2.00 to both of its inputs, equally and unchanged.
The max gate routes the gradient. Unlike the add gate, which distributes the gradient unchanged to all its inputs, the max gate passes the gradient (unchanged) to exactly one of its inputs: the input that had the highest value during the forward pass. This is because the local gradient of the max gate is 1.0 for the highest value and 0.0 for all other values. In the example circuit, the max gate routed the gradient of 2.00 to the variable z, which had a higher value than w, and the gradient on w remains zero.
The multiply gate is a little harder to interpret. Its local gradients are the input values, but swapped with each other, and then multiplied by the gradient on its output according to the chain rule. In the example, the gradient on x is -4.00 × 2.00 = -8.00. (See the small sketch below for all three patterns in code.)
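The three patterns can be seen together in the small two-gate sketch below (the values here are chosen for illustration and are not necessarily those of the figure): an add gate and a max gate both feed a multiply gate.

x, y, z, w = -2.0, 5.0, 2.0, -1.0

# forward pass
q = x + y      # add gate: 3.0
m = max(z, w)  # max gate: 2.0
f = q * m      # multiply gate: 6.0

# backward pass, starting with df/df = 1.0
dq = m * 1.0   # multiply gate swaps its inputs: 2.0
dm = q * 1.0   # 3.0
dx = 1.0 * dq  # add gate distributes dq unchanged to both inputs: 2.0
dy = 1.0 * dq  # 2.0
dz = (1.0 if z >= w else 0.0) * dm  # max gate routes dm to the larger input: 3.0
dw = (1.0 if w > z else 0.0) * dm   # the smaller input receives 0.0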
Unintuitive effects and their consequences. Notice that if one of the inputs to the multiply gate is very small and the other is very large, the multiply gate will do something slightly unintuitive: it will assign a relatively large gradient to the small input and a tiny gradient to the large input. In linear classifiers, where the weights are dot-producted with the inputs, this implies that the scale of the data has an effect on the magnitude of the gradient on the weights. For example, if you multiplied all input data examples by 1000 during preprocessing, the gradient on the weights would become 1000 times larger, and you would have to lower the learning rate by that factor to compensate. This is one reason why data preprocessing matters a great deal, sometimes in subtle ways. Having an intuitive understanding of how the gradients flow through the circuit can help you debug such cases.
Gradients for vectorized operations
The sections above were concerned with single variables, but all the concepts extend in a straightforward manner to matrix and vector operations. However, one must pay closer attention to dimensions and transpose operations.
Matrix-matrix multiply gradient: possibly the trickiest operation is the matrix-matrix multiplication (which also generalizes matrix-vector and vector-vector products):
import numpy as np

# forward pass
W = np.random.randn(5, 10)
X = np.random.randn(10, 3)
D = W.dot(X)

# now suppose we had the gradient on D from above in the circuit
dD = np.random.randn(*D.shape) # same shape as D
dW = dD.dot(X.T) # .T is the matrix transpose
dX = W.T.dot(dD)
Tip: use dimension analysis! Note that you do not need to memorize the expressions for dW and dX, because they are easy to re-derive from the dimensions. For instance, the gradient on the weights dW must have the same shape as W, and it must be expressed as a matrix multiplication of X and dD (as is the case when both X and W are single numbers rather than matrices). There is always exactly one way of arranging the factors so that the dimensions work out. For example, X has shape [10 x 3] and dD has shape [5 x 3], so if we want dW to have the shape of W, which is [5 x 10], the only way to achieve that is dD.dot(X.T).
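To convince yourself that dW = dD.dot(X.T) is correct, you can check it numerically on one entry. A small sketch (assuming, for illustration, a scalar "loss" whose gradient with respect to D is exactly dD):

import numpy as np

np.random.seed(0)
W = np.random.randn(5, 10)
X = np.random.randn(10, 3)
dD = np.random.randn(5, 3)   # pretend this gradient arrived from above

def loss(W):
    # a made-up scalar whose gradient with respect to D = W.dot(X) is dD
    return np.sum(W.dot(X) * dD)

dW = dD.dot(X.T)             # analytic gradient from the formula above

h = 1e-5
W_nudged = W.copy()
W_nudged[2, 7] += h
dW_num = (loss(W_nudged) - loss(W)) / h
print(dW_num, dW[2, 7])      # the two numbers should match closely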
Work with small, explicit examples: some people may find it difficult at first to derive the gradient for a vectorized expression. Our recommendation is to explicitly write out a minimal vectorized example, derive the gradient on paper, and then generalize the pattern to its efficient, vectorized form.
Summary
We developed intuition for what the gradients mean, how they flow backwards through the circuit, and how they communicate which parts of the circuit should increase or decrease (and how strongly) so as to make the final output higher.
We discussed the importance of staged computation for practical implementations of backpropagation. You want to break your function into modules for which you can easily derive local gradients, and then chain them together with the chain rule. Crucially, you almost never want to write out the full expression on paper and differentiate it symbolically, because you never need an explicit mathematical formula for the gradient on the input variables. You only need to decompose the expression into stages for which you can compute gradients (each stage could be a matrix-vector multiplication, a max operation, an addition, and so on), and then backpropagate through them step by step.
In the next section we will start to define neural networks, and backpropagation will allow us to efficiently compute the gradient of the loss function with respect to every parameter of the network. In other words, we are now ready to train neural networks, and the hardest conceptual part of the course is behind us! ConvNets will then be a small step away.
Links: https://zhuanlan.zhihu.com/p/21407711
CS231n (3): Error backpropagation