Introduction to machine learning -- talking about neural networks

This article is transferred from: http://tieba.baidu.com/p/3013551686?pid=49703036815&see_lz=1#
I personally found it very comprehensive, and especially suitable for newcomers to neural networks.

Let's start with regression (Regression). I have seen many people say that to achieve strong AI, you have to let the machine learn to observe and summarize rules. Specifically, let the machine observe what is round and what is square, distinguish various colors and shapes, and then classify or predict things based on these features. This is, in fact, the regression problem.

How do we solve the regression problem? We see something with our eyes and immediately pick out some of its basic features. But a computer? What it sees is just a pile of numbers, so getting the machine to find rules in the features of things is really a question of how to find rules in numbers.

Example: given a sequence whose first six numbers are 1, 3, 5, 7, 9, 11, what is the seventh?
You can see it at a glance: it's 13. Yes, there is an obvious mathematical rule between these numbers: they are all odd and arranged in order.
Well, what about this one? The first six numbers are 0.14, 0.57, 1.29, 2.29, 3.57, 5.14. What is the seventh?
It's not so easy to see. If we mark these numbers on a coordinate axis, we get the following graph:

By connecting these points with a curve, the trend of the curve lets us infer the seventh number -- about 7 (a quick fitting sketch follows below).
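If you want to try this on a machine, a minimal sketch (plain MATLAB; the positions 1..6 and the choice of a quadratic fit are mine, for illustration only) could look like this:

n = 1:6;                              % positions of the known values
v = [0.14 0.57 1.29 2.29 3.57 5.14];  % the known values
c = polyfit(n, v, 2);                 % fit a second-order polynomial to the points
v7 = polyval(c, 7)                    % extrapolate to position 7 -- roughly 7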
Thus, the regression problem is really a curve fitting (Curve Fitting) problem. So how exactly should we fit? A machine cannot, like you, draw the fit by feel; it must go through an algorithm.
Suppose there is a pile of sample points distributed according to some pattern. Let me use fitting a straight line as an example to explain the principle of the algorithm.

In fact, it is very simple: draw a straight line, then keep rotating it. After each rotation, calculate the distance (error) between every sample point and the corresponding point on the line, and sum the errors of all points. Keep rotating, and when the sum of the errors is smallest, stop. More elaborately, during the rotation the line also needs to be translated constantly, adjusting until the error is minimized. This method is known as gradient descent (Gradient Descent). Why "gradient" descent? Because as the error gets smaller and smaller, the amount of rotation or movement also gets smaller, and when the error drops below a very small number, say 0.0001, we can call it a day (convergence, converge). If the line instead swings back and forth wildly, that is not gradient descent.


We know that the formula of a line is y = kx + b, where k is the slope and b is the offset (the intercept on the y-axis). That is, k controls the rotation angle of the line and b controls its translation. To emphasize: the essence of gradient descent is to keep modifying the two parameter values k and b so that the final error is minimized.
It is better to accumulate the squared differences, (line point - sample point)^2, as the error; this works better than accumulating the plain differences (line point - sample point). This method of solving regression problems by minimizing the sum of squared errors is called the least squares method (Least Squares).
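As a toy illustration of the idea (not the author's code: the sample points, learning rate and iteration count below are made up), the k/b update driven by the squared-error gradient can be sketched like this:

x = [1 2 3 4 5];                 % sample x values (made up)
y = [2.1 4.0 6.2 7.9 10.1];      % sample y values (made up, roughly y = 2x)
k = 0; b = 0; lr = 0.01;         % initial slope, intercept, learning rate
for i = 1:2000
    e = (k*x + b) - y;           % error of each line point against its sample point
    k = k - lr * mean(e .* x);   % rotate: move k against the gradient of the squared error
    b = b - lr * mean(e);        % translate: move b against the gradient
end
[k b]                            % converges to roughly [2 0]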

The problem seems to be solved, but we need an approach that adapts to all kinds of curve fitting, so we have to keep digging.
If we plot the fitting error as a function of the angle (slope) of the fitted line, we get a figure like this:

As can be seen from the figure, the error curve is a quadratic curve, a convex function (convex downward), shaped like a bowl, with the minimum at the bottom of the bowl. If you draw a tangent at the bottom of the curve, that tangent must be horizontal; in the figure, the horizontal axis can be regarded as that tangent. If we can find the tangent at each point of the curve, we can find the point where the tangent is horizontal, that is, the coordinate where the tangent slope equals 0. That coordinate gives the minimum error we are looking for and the final slope of the fitted line.
Thus, the gradient descent problem concentrates on rotating the tangent. When the tangent rotates to horizontal, the tangent slope = 0 and the error drops to its minimum.

The size of each rotation of the tangent is called the learning rate (learning rate). Increasing the learning rate speeds up fitting, but if it is set too large the tangent overshoots and fails to converge. [The learning rate is actually a preset parameter; it does not change on each step, but it affects the amplitude of each change.]

Note: For a bumpy error curve, gradient descent can fall into a local optimum. In the curve below there are two pits, and the tangent may level off at the bottom of the first pit.


Differentiation is the tool that specializes in curve tangents. The slope of a tangent is called the derivative (Derivative), written dy/dx or f'(x). Extended to multiple variables: if you need the tangents of several curves at the same time, the slope of one of those tangents is called a partial derivative (Partial Derivative), written ∂y/∂x (∂ is read "partial"). In practice we usually deal with multiple variables, so the derivatives I mention later are also partial derivatives.

The above is the basic content of linear regression (Linear Regression). On this basis, changing the linear formula to a curve formula extends the method to quadratic regression, cubic regression, polynomial regression, and many other kinds of curve regression. The figure below shows the regression analysis features of Excel.

In most cases curve regression is more accurate than straight-line regression, but it also increases the complexity of the fit.

When the linear equation y = kx + b is changed to the quadratic equation y = ax^2 + bx + c, the parameters (Parameters) go from 2 (k and b) to 3 (a, b, and c), and the features (Features) go from 1 (x) to 2 (x^2 and x). Cubic curves and more complex polynomial regressions add even more parameters and features.
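A small sketch (my own made-up data, not from the article) showing that quadratic regression is just linear regression on the two features x^2 and x:

x = (0:0.5:5)';                               % made-up inputs
y = 2*x.^2 - 3*x + 1 + 0.1*randn(size(x));    % noisy samples of a known quadratic
A = [x.^2, x, ones(size(x))];                 % feature matrix: x^2, x, and a constant column
abc = A \ y                                   % least-squares fit -> roughly [2; -3; 1]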

The above summarizes the rule behind a single sequence of numbers. In real life we often analyze something based on several features (several sequences of numbers), and each raw feature is treated as a dimension (Dimension). For example, a student's academic performance is judged comprehensively from the scores of Chinese, mathematics, English and other courses; each course is one dimension. When fitting with quadratic curves and multiple variables (multiple dimensions), the number of features grows rapidly. The formula "feature count ≈ dimensions^2 / 2" approximates the increase: for example, for 100-dimensional data and a quadratic polynomial fit, the number of features rises to 100*100/2 = 5000.

Here is a 50*50-pixel grayscale image. How many features would it have if fitted with a quadratic polynomial? About 3 million.

Its dimension is 50*50 = 2500, and the feature count is 2500*2500/2 = 3,125,000. If it were a color image, the dimension would triple, and the feature count would rise to nearly 30 million.

Such a small image already has such a huge number of features; you can imagine how many features a digital camera photo would have. And what we want to do is find patterns in 100,000 to hundreds of millions of such pictures. Is that even feasible?
Obviously, the regression methods described so far are not enough. We urgently need a mathematical model that can continually reduce the number of features, that is, reduce the dimensionality.

Thus, "artificial neural networks (Ann, Artificial Neural Network)" in such harsh conditions, the results of neuroscience research for the field of machine learning to open up a broad road.

Neurons

There is a hypothesis that "intelligence comes from a single algorithm (One Learning Algorithm)". If this hypothesis holds, it should be possible to use a single algorithm (the neural network) to handle the world's endlessly varied problems. We would not have to program everything by hand; we could simply meet all changes with one unchanging method. There is growing evidence for this hypothesis: for example, in the early stages of human brain development, the division of labor among brain regions is not yet fixed, meaning that the part of the brain that handles sound can actually learn to handle visual images.

The following figure shows the physiological structure of a single neuron (Neuron), i.e., a brain cell:


The following is the mathematical model of a single neuron, which can be seen as a simplified imitation of the physiological structure -- it looks the part, more or less:

Explanation: +1 represents the offset (Bias Unit); x1, x2, x3 represent the initial features; w0, w1, w2, w3 represent the weights (Weights), i.e., the scaling factors of the features. The features are scaled and offset, then summed, then put through an activation operation, and finally output. There are many kinds of activation functions; they will be explained in detail later.

To illustrate:

x1*w1 + x2*w2 + ... + xn*wn -- this way of computing is called a weighted sum (Weighted Sum), which is very common in linear algebra. The standard mathematical symbol for a weighted sum is the summation sign Σ, but to keep things simple this tutorial writes it out in a makeshift way, as just a combination of plus signs and multiplication signs.
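A minimal sketch of that single-neuron computation (the feature values, weights and bias below are arbitrary numbers of my own):

x = [0.5; 1.2; -0.3];       % features x1, x2, x3
w = [0.4; -0.7; 0.1];       % weights w1, w2, w3
b = 0.2;                    % offset value (the weight on the "+1" unit)
z = w' * x + b;             % weighted sum
a = 1 / (1 + exp(-z))       % activation (sigmoid here); this is the neuron's output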

What is the meaning of this mathematical model? Let me illustrate with the earlier y = kx + b line-fitting example.


Here we set the activation function to purelin (a 45-degree straight line); purelin is y = x, which means keeping the original value unchanged.
The output value is then y_line = b + x_line * k, that is, y = kx + b. See? It has only changed costumes -- you can still recognize it. Next, perform this operation for every point, use the y of the line point and the y of the sample point to compute the error, sum the errors, and keep updating the values of b and k, thereby continually moving and rotating the line until the error becomes very small and we stop (convergence). This process is exactly the gradient-descent linear regression described earlier.

In general, straight-line fitting is much less accurate than curve fitting, so how do we use a neural network to fit a curve? The answer is to use a nonlinear activation function. The most common activation function is the sigmoid (S-shaped curve); the sigmoid is sometimes also called logistic regression (Logistic Regression), abbreviated logsig. The formula of the logsig curve is: logsig(x) = 1 / (1 + e^(-x)).

There is another very common S-shaped curve, the hyperbolic tangent function (tanh), also called tansig, which can replace logsig: tansig(x) = (e^x - e^(-x)) / (e^x + e^(-x)).

Below are their function graphs. From them we can see that the range of logsig is 0~1, and the range of tansig is -1~1.
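If you want to reproduce that comparison yourself, a quick plotting sketch (plain MATLAB, no toolbox needed) is:

x  = -5:0.1:5;
ls = 1 ./ (1 + exp(-x));    % logsig: values squeezed into 0~1
ts = tanh(x);               % tansig: values squeezed into -1~1
plot(x, ls, x, ts); grid on; legend('logsig', 'tansig');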


Natural constant E

The e in the formula is called the natural constant, also known as Euler's number, e = 2.71828.... e is a rather mysterious number; it is the essence of "natural law", hiding within it the mystery of natural growth, and its graphical expression is a swirling spiral.


Looking into the spiral of e: as it keeps cycling and scaling, it preserves its original curvature completely, like a bottomless black hole that keeps its shape no matter how much more is sucked in. This is crucial: it allows our data to keep its original proportional relationships after undergoing many sigmoid transformations.

How does e come about? e = 1 + 1/1! + 1/2! + 1/3! + 1/4! + 1/5! + 1/6! + 1/7! + ... = 1 + 1 + 1/2 + 1/6 + 1/24 + 1/120 + ... ≈ 2.71828 (where ! denotes the factorial, e.g. 3! = 1*2*3 = 6)

A more down-to-earth example: once there was a rich man who was especially greedy and loved lending money. The annual interest rate on a loan was 100%, meaning you borrowed 1 dollar and paid him back 2 dollars a year later. One day he had a crafty idea: compute interest twice a year, 50% for each half. After the first half year the 1 dollar becomes 1.5; for the second half, 50% of 1.5 is 0.75; so a year later it is 1.5 + 0.75 = 2.25 dollars. Written as a formula: (1 + 50%)(1 + 50%) = (1 + 1/2)^2 = 2.25. Then he thought, wouldn't compounding quarterly, 4 times a year, be even more profitable? That gives (1 + 1/4)^4 = 2.44141 -- indeed more. Delighted, he thought, why not compound every day? Then a year gives (1 + 1/365)^365 = 2.71457. Then he wanted to compound every second, until his housekeeper dragged him away, saying anyone else would go crazy. But the rich man never gave up, and after many years he finally worked it out: as x tends to infinity, e = (1 + 1/x)^x ≈ 2.71828 -- and so he became a mathematician.
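You can replay the rich man's experiment in a couple of lines (the list of compounding frequencies is my own choice):

for n = [1 2 4 12 365 1000000]             % times interest is compounded per year
    fprintf('n = %7d  ->  %.5f\n', n, (1 + 1/n)^n);
end
exp(1)                                     % e itself, 2.71828...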

e is very important in the field of calculus: the derivative of e^x is still e^x. Its derivative happens to be exactly itself, a coincidence that is unique within the realm of real numbers.
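A quick numerical spot-check of that property (the point x = 1.3 and the step h are arbitrary choices of mine):

x = 1.3; h = 1e-6;
(exp(x + h) - exp(x)) / h     % finite-difference slope of e^x at x
exp(x)                        % e^x itself -- the two values agree to about 6 digits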

A number of different titles:

The graphs of e^x and e^-x are mirror images of each other, and ln(x) is the inverse function of e^x, whose graph is symmetric to it about the 45-degree line.

Neural network

Well, the preceding sections spent a lot of space introducing the secret of e hidden inside the activation function; now we can formally introduce how neurons are connected into networks.
The figure below shows a few of the more common network structures:
The blue circles on the left are called the "input layer"; the orange circles in the middle, however many layers there are, are all called "hidden layers"; the green circles on the right are the "output layer". Each circle represents a neuron, also called a node. The output layer can have multiple nodes, and multi-node output is often used for classification problems. Theory proves that any multilayer network can be approximately represented by a three-layer network. In practice, how many nodes the hidden layer should have is decided by experience, and the number of nodes can be adjusted repeatedly during testing to achieve the best result.

Calculation method:
Although not marked in the figure, note that every line an arrow points along carries a weight (scaling) value. Every node of the input layer is computed point-to-point against every node of the hidden layer; the computation is the weighted sum plus activation described earlier. (The red arrows in the figure indicate the computation involved in one node.) Each value computed at the hidden layer is then used, in the same way, to compute the output layer. The hidden layer uses sigmoid as the activation function, while the output layer uses purelin. This is because purelin can keep values in whatever range they previously had, making it easier to compare with the sample values, whereas sigmoid's output can only lie between 0 and 1. First the values of the input layer are propagated by the network computation to the hidden layer, then propagated in the same way to the output layer; finally the output value is compared with the sample value and the error is computed. This process is called forward propagation (Forward Propagation).
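To make the forward pass concrete, here is a hand-rolled sketch for a tiny 2-3-1 network (the input, weights and biases are arbitrary numbers I made up, not anything taken from the figure):

x  = [0.6; 0.9];                     % one input sample with 2 features
W1 = [0.1 0.4; -0.2 0.3; 0.5 -0.1];  % weights input -> hidden (3 nodes x 2 inputs)
b1 = [0.1; 0.1; 0.1];                % hidden-layer offsets
W2 = [0.3 -0.5 0.2];                 % weights hidden -> output (1 x 3)
b2 = 0.05;                           % output-layer offset
h  = 1 ./ (1 + exp(-(W1*x + b1)));   % hidden layer: weighted sum + sigmoid
y  = W2*h + b2                       % output layer: weighted sum + purelin (kept as-is)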

As mentioned earlier, gradient descent keeps modifying the two parameters k and b to minimize the final error. A neural network does not have just the two parameters k and b; in fact every connection line in the network has a weight parameter. How to modify these parameters effectively so that the error is minimized became a very hard problem. From the birth of the artificial neural network in the 1960s, people kept trying all kinds of ways to solve it. It was not until the 1980s, when the error back-propagation algorithm (BP algorithm) was proposed, that a truly effective solution appeared and neural network research took off again.

The BP algorithm is an effective method for computing partial derivatives. Its basic principle: use the final output of forward propagation to compute the partial derivative of the error, then use this partial derivative together with a weighted sum to pass back one layer at a time through the hidden layers until the input layer is reached (the input layer itself is not computed); finally, use the partial derivative obtained at each node to update the weights.

For ease of understanding, I will use the term "residual" (error term) below to denote the partial derivative of the error.

Output layer → hidden layer: residual = -(output value - sample value) * derivative of the activation function
Hidden layer → hidden layer: residual = (weighted sum of the residuals of each node in the layer to the right) * derivative of the activation function

If the output layer uses purelin as the activation function, the derivative of purelin is 1, so: output layer → hidden layer: residual = -(output value - sample value)

If sigmoid (logsig) is used as the activation function, then: sigmoid derivative = sigmoid * (1 - sigmoid)
Output layer → hidden layer: residual = -(sigmoid output value - sample value) * sigmoid * (1 - sigmoid) = -(output value - sample value) * output value * (1 - output value)
Hidden layer → hidden layer: residual = (weighted sum of the residuals of each node in the layer to the right) * sigmoid of the current node * (1 - sigmoid of the current node)

If tansig is used as the activation function, then: tansig derivative = 1 - tansig^2

After all the residuals have been calculated, the weights can be updated:
For weights leaving the hidden layer: weight increase = sigmoid output of the current node * residual of the corresponding node in the layer to the right * learning rate
For weights leaving the input layer: weight increase = input value * residual of the corresponding node in the layer to the right * learning rate
Weight increase of an offset (bias) value = residual of the corresponding node in the layer to the right * learning rate
The learning rate was described earlier: it is a preset parameter that controls the amplitude of each update.

After that, this calculation is repeated over all the data until the output error reaches a very small value.
What was described above is the most common kind of neural network, called a feedforward neural network (Feedforward Neural Network); because it propagates the error backward layer by layer, it is also known as a BP neural network (Back Propagation Neural Network).
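Putting the residual formulas and weight updates together, here is a from-scratch sketch of BP on the same tiny 2-3-1 network as above (one sample, made-up target and learning rate; a real run would loop over a whole data set):

x  = [0.6; 0.9];  t = 0.8;  lr = 0.1;                        % one sample, its target, learning rate
W1 = [0.1 0.4; -0.2 0.3; 0.5 -0.1];  b1 = [0.1; 0.1; 0.1];   % input -> hidden weights and offsets
W2 = [0.3 -0.5 0.2];                 b2 = 0.05;              % hidden -> output weights and offset

for step = 1:1000
    % forward propagation
    h = 1 ./ (1 + exp(-(W1*x + b1)));     % hidden layer: logsig
    y = W2*h + b2;                        % output layer: purelin

    % residuals (back propagation)
    d_out = -(y - t);                     % output-layer residual: -(output - sample)
    d_hid = (W2' * d_out) .* h .* (1-h);  % hidden residual: weighted residual * sigmoid*(1-sigmoid)

    % weight updates: value feeding the weight * residual of the node on its right * learning rate
    W2 = W2 + lr * d_out * h';    b2 = b2 + lr * d_out;
    W1 = W1 + lr * d_hid * x';    b1 = b1 + lr * d_hid;
end
y    % after training, the output is close to the target 0.8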

Characteristics and limitations of BP neural networks:
- A BP neural network can be used for classification, clustering, prediction and so on. It requires a certain amount of historical data; by training on that data, the network learns the knowledge hidden in it. For your own problem, you first need to find some features of the problem together with the corresponding evaluation data, and use this data to train the network.
- The BP neural network is mainly a system improved step by step through practice; it is not entirely based on bionics. From this point of view, practicality > physiological resemblance.
- Some choices in BP neural networks, such as how to pick the initial values, how to decide the number of hidden-layer nodes, and which activation function to use, have no conclusive theoretical basis; there are only effective methods and empirical formulas drawn from practice.
- The BP neural network is a very effective computational method, but it is also notorious for its weaknesses, such as heavy computation, slow training speed, and a tendency to fall into local optima; people have therefore proposed many effective improvements, and new forms of neural networks keep emerging.

The formulas in plain text look a bit convoluted, so below I post a detailed diagram of the calculation process.
Reference: http://www.myreaders.info/03_Back_Propagation_Network.pdf (I did the tidying up.)




The calculation here updates the weights immediately after each record: every time one record is computed, the weights are updated right away. In fact, batch updating works better: without updating the weights, compute every record in the record set once, sum up all the resulting weight increments and take their average, then use that average to update the weights, and use the updated weights for the next round of computation. This method is called batch gradient descent (Batch Gradient Descent).

Recommended entry-level learning resources:

Andrew Ng's "Machine Learning" open course: https://class.coursera.org/ml
Coursera open-course notes (representation of neural networks): http://52opencourse.com/139/coursera Public Course notes-Stanford University machine Learning eighth-representation of neural networks-neural-networks-representation
Coursera open-course video (neural network learning): http://52opencourse.com/289/coursera Public Lesson Video-Stanford University Nineth lesson on machine learning-neural network learning-neural-networks-learning
Stanford Deep Learning tutorial (Chinese version): http://deeplearning.stanford.edu/wiki/index.php/UFLDL tutorial

Thank you for your support.
Today I will first post a practical, hands-on programming tutorial introducing the MATLAB Neural Network Toolbox, and later add some more in-depth material.

For an introductory tutorial on MATLAB, see this post: http://tieba.baidu.com/p/2945924081

Example 1: We all know that area = length * width. Suppose we have measured a set of data as follows:

We use this set of data to train a neural network. (Enter the following code in MATLAB and press Enter to execute.)

p = [2 5; 3 6; 12 2; 1 6; 9 2; 8 12; 4 7; 7 9]';   % feature data x1, x2
t = [10 18 24 6 18 96 28 63];                       % sample values
net = newff(p, t, 20);     % create a BP neural network (ff = feedforward)
net = train(net, p, t);    % train the network with the p, t data

The following window appears; from the blue line you can see that it finally converges, with an error below 10^-20.

You may ask: has the computer really learned the rule of multiplication, without reciting the multiplication table? Just pick a few numbers and try:

s = [3 7; 6 9; 4 5; 5 7]';   % prepare a new set of data for testing
y = sim(net, s)              % simulate and see the result
% result: 25.1029 61.5882 29.5848 37.5879

See, there is still a gap between the predicted results and the true values (21, 54, 20, 35). But you can also see that the predictions are not blind guesses; they are at least somewhat reliable. With more data in the training set, the accuracy of the predictions improves greatly.

Your test result may differ from mine, because the initial weight parameters are random and may fall into a local optimum, so sometimes the predictions will be quite unsatisfactory.

Example 2: Next let's test fitting a sine curve; this time we randomly generate some points as samples.

p = rand(1,50)*7           % generate 1 row of 50 random numbers between 0 and 7
t = sin(p)                 % compute the sine of each point
s = [0:0.1:7];             % generate data from 0 to 7 in steps of 0.1, for simulation testing
plot(p, t, 'x')            % draw a scatter plot

net = newff(p, t, 20);     % create a neural network
net = train(net, p, t);    % start training

y = sim(net, s);           % simulate
plot(s, y, 'x')            % draw a scatter plot

As we can see from the graph, this prediction is clearly not ideal; we need to set and tune some parameters.

The following settings are a standard batch-gradient-descent configuration.

% create a 3-layer neural network [hidden layer: 10 nodes -> logsig, output layer: 1 node -> purelin]; 'traingd' means gradient descent
net = newff(p, t, 10, {'logsig' 'purelin'}, 'traingd');   % 10 cannot be written as [10 1]

% set the training parameters
net.trainParam.show = 50;      % display the training results every 50 iterations
net.trainParam.epochs = 500;   % total number of training iterations
net.trainParam.goal = 0.01;    % training target: error < 0.01
net.trainParam.lr = 0.01;      % learning rate

net = train(net, p, t);        % start training

Note: newff's third argument 10 cannot be written as [10 1]; otherwise you get a 4-layer network with two hidden layers of 10 and 1 nodes respectively. This mistake is easy to make. (The number of nodes in the output layer is determined automatically by the dimension of t, so it should not be specified.)

y = sim(net, s);    % simulate
plot(s, y, 'x')     % draw a scatter plot

The result was obviously worse.

Let's turn up the accuracy a bit and see. Increase the number of training iterations to 9999 and tighten the error target to < 0.001; raise the learning rate to 0.06 in the hope of speeding things up.

% create a 3-layer neural network [hidden layer: 10 nodes -> logsig, output layer: 1 node -> purelin]; 'traingd' means gradient descent
net = newff(p, t, 10, {'logsig' 'purelin'}, 'traingd');

% set the training parameters
net.trainParam.show = 50;       % display the training results every 50 iterations
net.trainParam.epochs = 9999;   % total number of training iterations
net.trainParam.goal = 0.001;    % training target: error < 0.001
net.trainParam.lr = 0.06;       % learning rate

net = train(net, p, t);         % start training

The standard batch gradient descent method really is slow; this run took more than a minute to compute.

y = sim(net, s);    % simulate
plot(s, y, 'x')     % draw a scatter plot

The result is a little better than last time. However, the curve looks very ugly: this is overfitting (Overfitting), the opposite of underfitting (Underfitting).


First let's solve the speed problem: just change 'traingd' to 'trainlm'. trainlm uses the LM (Levenberg-Marquardt) algorithm, a nonlinear optimization method that sits between Newton's method and gradient descent; it not only speeds up training but also reduces the chance of falling into a local minimum, and it is MATLAB's default.

net = newff(p, t, 10, {'logsig' 'purelin'}, 'trainlm');
% ... the rest of the code is unchanged


The speed is astonishing: it finished in about 1 second after only 6 rounds of computation, and the result is somewhat better. However, the LM algorithm also has a weakness: it uses a very large amount of memory, so it has not displaced the other algorithms.

Next, to solve the overfitting problem, just set the number of hidden-layer nodes a bit lower.

net = newff(p, t, 3, {'logsig' 'purelin'}, 'trainlm');
% ... the rest of the code is unchanged

This time we finally get a satisfactory result. (A local optimum can still occur sometimes; you may need to try a few more times.)

If the number of nodes is too small, underfitting occurs instead.

The number of hidden-layer nodes is generally tuned by feel. If the training set has many dimensions and tuning by hand is too time-consuming, you can adjust up and down around an empirical formula.
Here are a few empirical formulas for reference:

What would happen if we changed the output layer to logsig activation?

net = newff(p, t, 3, {'logsig' 'logsig'});   % create a neural network
net = train(net, p, t);                      % start training
y = sim(net, s);                             % simulate
plot(s, y, 'x')                              % draw a scatter plot

As you can see, the points in the -1~0 range all become 0. To get output over the full range of values when using a logsig output layer, the data must be normalized before use.

Normalization (Normalization) means scaling a set of numbers proportionally into the range 0~1 or -1~1.
Although you can skip normalization when the output uses purelin, normalization speeds up convergence to some extent, so many tutorials make it a required step before training.

The formula is: normalized value = (current value - minimum) / (maximum - minimum)
If a specific output range is required, the formula is: y = (ymax - ymin) * (x - xmin) / (xmax - xmin) + ymin
For example, for the range 0.1~0.9: y = (0.9 - 0.1) * (x - min) / (max - min) + 0.1
Normalizing the four numbers 5, 2, 6, 3 to the 0~1 range gives 0.75, 0, 1, 0.25:

MATLAB's normalization command is mapminmax.
Note: many online tutorials use the premnmx command for normalization. Be aware that in MATLAB R2007b and R2008b, premnmx had a bug when processing single-column data (MATLAB issued a warning); it was fixed in R2009a. It is therefore recommended to use mapminmax. The input and output orientation of mapminmax is the reverse of premnmx, so pay attention to whether you need to add a transpose in your code.

a = [5, 2, 6, 3];

b = mapminmax(a, 0, 1)    % normalize to 0~1
% b = 0.7500 0 1.0000 0.2500

c = mapminmax(a)          % normalize to -1~1
% c = 0.5000 -1.0000 1.0000 -0.5000

Denormalization (reverse normalization) restores values according to the recorded normalization scale.

a = [5, 2, 6, 3];
[c, ps] = mapminmax(a);         % ps records the normalization scale
mapminmax('reverse', c, ps)     % denormalize using ps
% ans = 5 2 6 3

Code for a neural network with normalization (0~1 range):

p = rand(1,50)*7;    % feature data
t = sin(p);          % sample values
s = [0:0.1:7];       % test data

[pn, ps] = mapminmax(p, 0, 1);    % normalize the feature data
[tn, ts] = mapminmax(t, 0, 1);    % normalize the sample values
sn = mapminmax('apply', s, ps);   % scale the test data with the same ps

net = newff(pn, tn, [5 1], {'logsig' 'logsig'});   % create a neural network
net = train(net, pn, tn);                          % start training

yn = sim(net, sn);                  % simulate
y = mapminmax('reverse', yn, ts);   % restore using the ts scale
plot(s, y, 'x')                     % draw a scatter plot


The Neural Network Toolbox also has a graphical UI, which can be opened by running nntool. I don't find it as convenient as writing code, so I don't use it. Here is a link to a related tutorial for those who are interested: Creating a neural network with the MATLAB Neural Network Toolbox -- http://blog.sina.com.cn/s/blog_8684880b0100vxtv.html

The origin of the sigmoid is rarely mentioned on Chinese websites. Here is a brief account, which I hope will broaden your thinking.

PS: For the formulas below I have written out the derivation, but these days fewer and fewer people solve equations by hand; in general you can just solve them with software.
For example, to solve the sigmoid differential equation, you can use MATLAB:

dsolve('Dx = x*(1-x)')
% ans = 1/(1+exp(-t)*C1)

If you want the solution steps or more detail, I recommend Wolfram Alpha: http://www.wolframalpha.com
Enter x' = x(1-x) in the Wolfram search box.

Logsig

The sigmoid function (S-shaped function, logistic function) is an activation function inspired by statistical models.
The biologically based neuron activation function, by contrast, looks like this:

See: http://eprints.pascal-network.org/archive/00008596/01/glorot11a.pdf

Practice has shown that the statistics-based sigmoid activation works better than the biology-based one, and it is convenient to compute, so we cannot judge an AI algorithm by how closely the machine resembles a human.
The sigmoid function was originally a mathematical model describing population growth, proposed in 1838, where the derivative form (a probability density) was given first. The law of population growth: the initial stage is roughly exponential growth; then, approaching saturation, growth slows; on reaching maturity, growth almost stops. The whole process looks like an S-shaped curve.

The form of the derivative is known, so what does its original function (antiderivative) look like? Recovering the original function from its known derivative is, in statistical terms, computing the cumulative distribution function (CDF) from the probability density function (PDF); the indefinite integral (Indefinite Integral) is exactly the tool for this job.
From the theory of indefinite integrals, we know that because the constant term is arbitrary, there are infinitely many possible original functions. Let's first look at the graphical method: since the derivative is the slope of the function curve, we can draw, over some range of values, a short slanted segment showing the slope at each point, forming a slope field (Slope Field, Direction Field), and then sketch the integral curves by following the trend of these segments.
MATLAB can draw a slope field with the quiver command.
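For example, a rough quiver sketch of the slope field of y' = y(1-y) (the grid ranges below are my own choice):

[t, y] = meshgrid(0:0.5:10, -0.5:0.1:1.5);   % grid over t and y
dy = y .* (1 - y);           % slope y' at each grid point
dt = ones(size(dy));         % unit step in t for each segment
quiver(t, y, dt, dy, 0.5);   % short segments showing the direction field
xlabel('t'); ylabel('y');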

As can be seen from the figure above, there is a watershed between 0 and 1 on the y-axis, and at 0 and 1 the direction tends to horizontal. Below, let's zoom into the 0~1 range and see what it looks like.

See? The logistic sigmoid we are looking for is right here.

The following is the symbolic derivation process:


Tansig

The hyperbolic tangent function (bipolar S-shaped function, tanh, Hyperbolic Tangent), pronounced "tanch", appeared as early as the 18th century. It is defined as tanh(x) = sinh(x)/cosh(x) and can be derived from the famous Euler's formula.
Using tanh as the activation function, convergence is faster and the results are better than with the logistic function.
Euler's formula: e^(ix) = cos(x) + i*sin(x), where i is the imaginary unit, defined by i^2 = -1.
Digression: transforming the formula above yields what is called the most beautiful formula in mathematics, e^(iπ) + 1 = 0, which contains the five most mysterious symbols in mathematics: e, i, π, 1 and 0.

The derivative of tanh: d/dx tanh(x) = 1 - tanh(x)^2.

The relationship between logsig and tansig: tansig(x) = 2*logsig(2x) - 1.
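A quick numeric check of that relationship (the sample points are chosen arbitrarily):

x = -3:0.5:3;
max(abs(tanh(x) - (2 ./ (1 + exp(-2*x)) - 1)))   % ~0, up to floating-point noise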
