Neural Network Algorithms in R


An artificial neural network (ANN), or simply a neural network, is a mathematical or computational model that imitates the structure and function of biological neural networks. It computes by connecting a large number of artificial neurons. In most cases an artificial neural network can change its internal structure in response to external information, so it is an adaptive system. Modern neural networks are nonlinear statistical data-modeling tools, often used to model complex relationships between inputs and outputs or to explore patterns in data.

Artificial neural networks imitate human intelligent behavior in four respects:

Physical structure: artificial neurons imitate the function of biological neurons.

Computation: neurons in the human brain have local computing and storage capabilities and are connected into a system. An artificial neural network likewise contains a large number of neurons with local processing capability, and it can process information massively in parallel.

Storage and operation: both the human brain and an artificial neural network store memory in the connection strengths between neurons, which provides strong support for abstraction, analogy, and generalization.

Training: like the human brain, an artificial neural network acquires knowledge automatically through training and learning processes suited to its own structure.

A neural network is a computational model made up of a large number of nodes (also called "neurons" or "units") and the connections between them. Each node represents a specific output function, called the activation function. Each connection between two nodes carries a weighting applied to the signal that passes through it, called the weight, which serves as the memory of the artificial neural network. The output of the network depends on how the network is connected, on the weight values, and on the activation functions. The network itself usually approximates some algorithm or function found in nature, or it may express a logical strategy.
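As a small illustration of the node described above, the sketch below computes the output of one artificial neuron as an activation function applied to a weighted sum of its inputs. The function name `neuron`, the sigmoid default, and the numeric values are assumptions made for this example only.

```r
# One artificial neuron: weighted sum of the inputs plus a bias,
# passed through an activation function (sigmoid by default here).
neuron <- function(x, w, b, activation = function(v) 1 / (1 + exp(-v))) {
  activation(sum(w * x) + b)
}

x <- c(1, 0.5)      # illustrative inputs
w <- c(0.4, -0.2)   # illustrative connection weights
out <- neuron(x, w, b = 0.1)
print(out)
```

Swapping the `activation` argument for another function (for example a threshold) changes the behavior of the node without touching the weights.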



I. The perceptron

The perceptron is equivalent to a single layer of a neural network. It consists of a linear combiner followed by a binary threshold unit:



A single-layer perceptron forming an ANN system:

The perceptron takes a vector of real-valued inputs, computes a linear combination of them, and outputs 1 if the result is greater than a certain threshold and -1 otherwise.

The perceptron function can be written as sign(w*x). A bias b is sometimes added, giving sign(w*x+b).

Learning a perceptron means choosing values for the weights w0, ..., wn. The hypothesis space H that perceptron learning searches is therefore the set of all possible real-valued weight vectors.

Training steps of the algorithm:

1. Define the variables and parameters: x (input vector), w (weight vector), b (bias), y (actual output), d (desired output), a (learning rate).

2. Initialize: n=0, w=0.

3. Input the training samples and specify the desired output for each: class A is recorded as 1, class B as -1.

4. Compute the actual output y=sign(w*x+b).

5. Update the weight vector: w(n+1)=w(n)+a[d-y(n)]*x(n), where a>0 is the learning rate.

6. Check for convergence: if the convergence condition is satisfied, the algorithm ends; otherwise return to step 3.

Note that the learning rate a should not be too large, so that the weights stay stable, and should not be too small, so that the error still produces a noticeable weight correction. In the end, choosing it is an empirical matter.
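The steps above can be sketched on a toy problem. The example below is only an illustration, not the code used for the iris data: it uses the 1/-1 labels from step 3, folds the bias into the weight vector, and updates on the whole batch at once. The function name `perceptron_train` and the logical-AND data set are assumptions made for this sketch.

```r
# Batch perceptron with +1/-1 labels; the bias is folded into w as its first entry.
perceptron_train <- function(X, d, a = 0.5, max_epochs = 100) {
  Xb <- cbind(1, X)                    # constant column for the bias
  w <- rep(0, ncol(Xb))                # step 2: w = 0
  for (epoch in seq_len(max_epochs)) {
    y <- ifelse(Xb %*% w >= 0, 1, -1)  # step 4: actual output
    if (all(y == d)) break             # step 6: every sample is classified correctly
    w <- w + a * as.vector(t(Xb) %*% (d - y))  # step 5, summed over all samples
  }
  w
}

# Logical AND, a linearly separable problem
X <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1))
d <- c(-1, -1, -1, 1)
w <- perceptron_train(X, d)
pred <- as.vector(ifelse(cbind(1, X) %*% w >= 0, 1, -1))
print(w)
print(pred)
```

Because the data are linearly separable, the loop stops as soon as all four points are on the correct side of the line.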

As described above, the perceptron is limited to linearly separable problems and cannot classify non-separable problems correctly. The idea is very similar to that of the support vector machine discussed earlier, but the way the separating line is determined differs. For a linearly separable problem, one can say that the support vector machine finds the "best" separating line, while the single-layer perceptron finds merely a feasible one.

We take the iris data set as an example. Because the single-layer perceptron is a two-class classifier, we divide the iris data into two classes, "setosa" versus "versicolor" and "virginica" (the latter two species are treated as the second class), and classify on two features: petal length and petal width.

Run the following code:

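# Perceptron training
a <- 0.2                                   # learning rate
w <- rep(0, 3)                             # weights: bias, Petal.Length, Petal.Width
iris1 <- t(as.matrix(iris[, 3:4]))
d <- c(rep(0, 50), rep(1, 100))            # setosa = 0, the other two species = 1
e <- rep(0, 150)
p <- rbind(rep(1, 150), iris1)             # first row of ones carries the bias
max <- 100000
eps <- rep(0, 100000)
i <- 0
repeat {
  v <- w %*% p
  y <- ifelse(sign(v) >= 0, 1, 0)
  e <- d - y
  eps[i + 1] <- sum(abs(e)) / length(e)    # mean absolute error of this pass
  if (eps[i + 1] < 0.01) {
    print("finish:")
    print(w)
    break
  }
  w <- w + a * (d - y) %*% t(p)
  i <- i + 1
  if (i > max) {
    print("max time loop")
    print(eps[i])
    print(y)
    break
  }
}
# Plotting code
plot(Petal.Length ~ Petal.Width, xlim = c(0, 3), ylim = c(0, 8),
     data = iris[iris$Species == "virginica", ])
data1 <- iris[iris$Species == "versicolor", ]
points(data1$Petal.Width, data1$Petal.Length, col = 2)
data2 <- iris[iris$Species == "setosa", ]
points(data2$Petal.Width, data2$Petal.Length, col = 3)
x <- seq(0, 3, 0.01)
# boundary w[1] + w[2]*Petal.Length + w[3]*Petal.Width = 0, solved for
# Petal.Length, which is the y-axis of this plot
y <- x * (-w[3] / w[2]) - w[1] / w[2]
lines(x, y, col = 4)
# Plot the mean absolute error of each iteration
plot(1:i, eps[1:i], type = "o")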

Classification results



This is the result of running 7 times. Compared with the support vector machine discussed earlier, the single-layer perceptron is clearly a less reliable and rather weak classifier.

Cross-validation can also be tried; its results turn out to be unsatisfactory as well.

II. Linear neural networks

When the training samples are linearly separable, the perceptron rule successfully finds a weight vector, but it fails to converge when they are not. Another training rule, called the delta rule, was devised to overcome this deficiency.

If the training samples are not linearly separable, the delta rule converges to a best-fit approximation of the target concept.

The key idea of the delta rule is to use gradient descent to search the hypothesis space of possible weight vectors for the vector that best fits the training samples.

The algorithm is as follows:

1. Define the variables and parameters: x (input vector), w (weight vector), b (bias), y (actual output), d (desired output), a (learning rate). (For convenience, the bias can be folded into the weight vector.)

2. Initialize w=0.

3. Input a sample and compute the actual output and the error: e(n)=d-x*w(n).

4. Adjust the weight vector: w(n+1)=w(n)+a*x*e(n).

5. Check for convergence: if converged, stop; otherwise return to step 3.

Haykin proved that the delta rule converges in the mean-square sense as long as the learning rate satisfies a<2/maxeigen, where maxeigen is the largest eigenvalue of x'x. We therefore use a=1/maxeigen here.

We again use the iris data from above as the example. Run the following code:

p <- rbind(rep(1, 150), iris1)             # first row of ones carries the bias
d <- c(rep(0, 50), rep(1, 100))
w <- rep(0, 3)
a <- 1 / max(eigen(t(p) %*% p)$values)     # learning rate 1/maxeigen
max <- 1000
e <- rep(0, 150)
eps <- rep(0, 1000)
i <- 0
for (i in 1:max) {
  v <- w %*% p
  y <- v                                   # linear transfer function
  e <- d - y
  eps[i + 1] <- sum(e^2) / length(e)       # mean squared error of this pass
  w <- w + a * (d - y) %*% t(p)
  if (i == max)
    print(w)
}

The resulting separating line:



The result is much better than the perceptron's. The reason is that the transfer function has changed from a binary threshold function to a linear function, which is exactly the property mentioned earlier: the delta rule converges to a best-fit approximation of the target concept. The delta rule converges only asymptotically to the minimum-error hypothesis, which may take unbounded time, but it converges whether or not the training samples are linearly separable.
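To make the phrase "best-fit approximation" concrete, the following sketch runs the batch delta rule on a small synthetic data set and compares the weights with the ordinary least-squares solution. The data, the seed, and the iteration count are assumptions chosen for illustration; `qr.solve` is used only as an independent reference.

```r
set.seed(42)
X <- cbind(1, matrix(rnorm(40), 20, 2))    # 20 samples, constant column first
d <- as.numeric(X[, 2] + X[, 3] > 0)       # 0/1 targets

a <- 1 / max(eigen(crossprod(X))$values)   # learning rate 1/maxeigen
w <- rep(0, 3)
for (n in 1:5000) {
  e <- d - as.vector(X %*% w)              # error of the linear output
  w <- w + a * as.vector(t(X) %*% e)       # batch delta-rule update
}

w_ls <- qr.solve(X, d)                     # least-squares reference solution
print(rbind(delta_rule = w, least_squares = w_ls))
```

The two rows agree to many decimal places, which is what "converging to the minimum-error hypothesis" means for a linear unit with squared error.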

To see this, consider separating the last iris species from the other two (here the first two species are treated as one class). Using the perceptron:



Using the linear classifier:



Note, however, that convergence does not mean the classification is better. Solving a linearly inseparable problem requires adding nonlinear inputs or adding neurons. We illustrate this with the example from Minsky & Papert (1969).



We use a linear neural network; the code is exactly the same as above and is omitted.

Output of the first neuron:

Weights:
     [,1] [,2] [,3]
[1,] 0.75  0.5 -0.5

Test:
     [,1] [,2] [,3] [,4]
[1,]    1    0    1    1

Output of the second neuron:

Weights:
     [,1] [,2] [,3]
[1,] 0.75 -0.5  0.5

Test:
     [,1] [,2] [,3] [,4]
[1,]    1    1    0    1

Applying XOR (0 when the two values agree, 1 when they differ) with xor(c(1,0,1,1), c(1,1,0,1)) gives:

[1] FALSE  TRUE  TRUE FALSE

that is, 0,1,1,0, so the classification is correct.
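The two-neuron construction can be reproduced with a short sketch. Two points here are assumptions rather than something the text states: each linear neuron is fitted by least squares (the solution the delta rule converges to), and its output is thresholded at 0.5. The targets t1 and t2 are the test outputs reported above.

```r
# Inputs in the order (0,0), (0,1), (1,0), (1,1); first column is the bias
X <- cbind(1, c(0, 0, 1, 1), c(0, 1, 0, 1))
t1 <- c(1, 0, 1, 1)                  # target of the first neuron
t2 <- c(1, 1, 0, 1)                  # target of the second neuron

w1 <- qr.solve(X, t1)                # least-squares weights of neuron 1
w2 <- qr.solve(X, t2)                # least-squares weights of neuron 2

o1 <- as.numeric(X %*% w1 >= 0.5)    # thresholded outputs (assumed cut-off 0.5)
o2 <- as.numeric(X %*% w2 >= 0.5)
result <- xor(o1, o2)
print(round(rbind(w1, w2), 2))
print(result)
```

Neither neuron alone separates the XOR classes, but combining their outputs does, which is the sense in which "adding neurons" solves the problem.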

Finally, the delta rule can train only a single-layer network, but for networks of linear units this is no real restriction: in theory a multilayer network of linear units is no more powerful than a single-layer one, because a composition of linear maps is itself linear.

III. BP neural networks

1. Classification with the sigmoid function

Recall that the perceptron discussed earlier classifies with an indicator (step) function. As a classifier, however, the step function has a jump point that is hard to handle. Fortunately, the sigmoid function y=1/(1+e^-x) behaves similarly and is smooth. The graph of the sigmoid function is shown below:



The sigmoid function is cheap to compute and easy to understand and implement, but it also tends to underfit and may classify less accurately; the support vector machine chapter shows a case where the sigmoid function classifies poorly.
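The smoothness mentioned above is what makes gradient-based training possible. The sketch below defines the sigmoid, checks the standard identity s'(x) = s(x)(1 - s(x)) against a numerical derivative, and draws the curve when run interactively. The helper names are chosen for this example.

```r
sigmoid <- function(x) 1 / (1 + exp(-x))
sigmoid_grad <- function(x) sigmoid(x) * (1 - sigmoid(x))   # analytic derivative

x <- seq(-6, 6, by = 0.1)
h <- 1e-5
numeric_grad <- (sigmoid(x + h) - sigmoid(x - h)) / (2 * h) # central difference
max_diff <- max(abs(numeric_grad - sigmoid_grad(x)))
print(max_diff)

# Draw the S-shaped curve only in an interactive session
if (interactive()) plot(x, sigmoid(x), type = "l", ylab = "sigmoid(x)")
```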

2. Structure of a BP neural network

A BP (back propagation) neural network is trained by the error backpropagation algorithm, whose learning process consists of two phases: forward propagation of information and backward propagation of error. A typical BP neural network has three layers:



Input layer: each neuron in the input layer receives input from outside and passes it to the neurons of the middle layer.

Hidden layer: the middle layer is the internal information-processing layer, responsible for transforming the information. Depending on the transformation capacity required, it can be designed with one hidden layer or several. The last hidden layer passes its information to the neurons of the output layer; after further processing there, one forward pass of learning is complete.

Output layer: as the name implies, the output layer delivers the processing result to the outside world.

When the actual output does not match the expected output, the error backpropagation phase begins. The error passes back through the output layer, and the weights of each layer are corrected by gradient descent on the error as it propagates backward, layer by layer, to the hidden layer and the input layer. This repeated cycle of forward information propagation and backward error propagation is the process by which the weights of each layer are continually adjusted, that is, the process of neural network training. It continues until the output error of the network falls to an acceptable level or a preset number of training iterations is reached.

3. The backpropagation algorithm

The backpropagation algorithm extends the analysis of the delta rule described earlier to neural networks with hidden nodes. To understand the problem, suppose Bob tells Alice a story, Alice passes it on to Ted, and Ted checks the facts and finds the story wrong. Ted now needs to work out which errors are due to Bob and which to Alice. When output nodes take their inputs from hidden nodes and the network finds that it is in error, adjusting the weights requires an algorithm that determines how much of the total error each node is responsible for. The network needs to ask, "Who led me astray?" What should the network do at this point?

The rule is again derived from gradient descent; the only change from the delta-rule analysis is the term that involves the difference between t(p,n) and y(p,n). In general, the change in the weight w(i) is:

alpha * s'(a(p,n)) * d(n) * X(p,i,n)

where d(n), for a hidden node n, is a function of two things:

how much n influences a given output node;

how much that output node itself influences the overall error of the network.

On the one hand, the more n influences an output node, the more n affects the overall error of the network. On the other hand, if the output node has little influence on the overall error, the influence of n through it is correspondingly reduced. With d(j) the contribution of output node j to the overall error of the network and W(n,j) the influence of n on j, d(j) * W(n,j) is the combination of these two influences. However, n almost always influences more than one output node, and may influence all of them, so d(n) can be written as: SUM(d(j) * W(n,j))

where j ranges over the output nodes that receive input from n. This gives the training rule.

Part 1: the weight change between hidden node n and output node j is:

alpha * s'(a(p,n)) * (t(p,n) - y(p,n)) * X(p,n,j)

Part 2: the weight change between input node i and hidden node n is:

alpha * s'(a(p,n)) * SUM(d(j) * W(n,j)) * X(p,i,n)

where j ranges over every output node that receives input from n. That is roughly the basic outline of the backpropagation algorithm.

Part 1 is usually referred to as forward propagation and Part 2 as backward propagation, which is where the name backpropagation comes from.
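The two-part rule can be checked numerically on a tiny network. The sketch below is an illustration under stated assumptions: a 2-2-1 network with sigmoid units, no biases, squared error E = 0.5 * (target - y)^2, and the learning rate alpha left out so that the weight changes equal the negative gradient. All variable names are chosen for this example.

```r
sigmoid <- function(x) 1 / (1 + exp(-x))
set.seed(1)
x <- c(0.5, -1)                        # one input pattern
target <- 1                            # its desired output
W1 <- matrix(runif(4, -1, 1), 2, 2)    # input -> hidden weights, W1[n, i]
W2 <- runif(2, -1, 1)                  # hidden -> output weights, W2[n]

forward <- function(W1, W2) {
  h <- sigmoid(as.vector(W1 %*% x))    # hidden outputs
  y <- sigmoid(sum(W2 * h))            # network output
  list(h = h, y = y)
}
out <- forward(W1, W2)

# Part 1: s'(a) * (t - y) at the output node, times the hidden output
d_out <- out$y * (1 - out$y) * (target - out$y)
dW2 <- d_out * out$h
# Part 2: s'(a) * SUM(d(j) * W(n, j)) at each hidden node, times the input
d_hid <- out$h * (1 - out$h) * (d_out * W2)
dW1 <- outer(d_hid, x)

# Numerical check of one entry: the weight change should equal -dE/dW
E <- function(W1, W2) 0.5 * (target - forward(W1, W2)$y)^2
step <- 1e-6
W1p <- W1; W1p[1, 1] <- W1p[1, 1] + step
W1m <- W1; W1m[1, 1] <- W1m[1, 1] - step
numeric_dW <- -(E(W1p, W2) - E(W1m, W2)) / (2 * step)
print(c(backprop = dW1[1, 1], numeric = numeric_dW))
```

With a single output node the sum over j has only one term, which is why `d_out * W2` appears without an explicit `sum`.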

4. The steepest descent method and its improvements

The basic idea of the steepest descent method is that, to find the minimum of a function, the best approach is to search along the negative gradient direction of the function. If the gradient is d, the iteration can be written as w=w-alpha*d, where alpha can be understood as the learning rate mentioned earlier.

The steepest descent method has drawbacks: it converges slowly (each search direction is orthogonal to the previous one, so the path zigzags) and it easily gets trapped in local minima. Many improvements have therefore been proposed; the most common are adding a momentum term and varying the learning rate.

Adding a momentum term

Modify the weight-update rule so that the update at iteration n depends partly on the update made at iteration n-1:

delta_w(n) = -alpha*(1-mc)*grad_E(n) + mc*delta_w(n-1)

where grad_E(n) is the gradient of the error with respect to the weights at iteration n and mc is the momentum factor. The first term on the right is the ordinary gradient-descent update, and the second is called the momentum term.

A gradient-descent search trajectory is like a ball rolling down the error surface; momentum keeps the ball rolling in the same direction from one iteration to the next.

Momentum sometimes carries the ball through a local minimum or across a flat region of the error surface.

Momentum also gradually increases the step size in regions where the gradient does not change, which speeds up convergence.
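A minimal sketch of the momentum update on a simple quadratic error surface follows. The error function E(w) = w1^2 + 10*w2^2, the starting point, and the constants are assumptions chosen for illustration; the update line follows the rule that the new step is a blend of the current gradient step and the previous step, weighted by the momentum factor mc.

```r
# Gradient of the illustrative error surface E(w) = w1^2 + 10 * w2^2
grad <- function(w) c(2 * w[1], 20 * w[2])

descend <- function(mc, alpha = 0.05, steps = 200) {
  w <- c(5, 5)
  dw <- c(0, 0)                                  # previous update, initially zero
  for (n in seq_len(steps)) {
    dw <- -alpha * (1 - mc) * grad(w) + mc * dw  # gradient term + momentum term
    w <- w + dw
  }
  w
}

w_plain <- descend(mc = 0)      # ordinary steepest descent
w_mom   <- descend(mc = 0.8)    # steepest descent with momentum
print(rbind(plain = w_plain, momentum = w_mom))
```

Setting mc = 0 recovers plain steepest descent, so the same function can be used to compare the two on other surfaces.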

Varying the learning rate

When the error decreases toward the target, the correction direction is right and the learning rate can be increased; when the error increases beyond a certain range, the correction was wrong and the learning rate should be reduced.
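One simple way to realize a variable learning rate is sketched below. It is a simplified variant, not a specific published schedule: a step is accepted and the rate is multiplied by `inc` whenever the error falls, and the step is rejected and the rate multiplied by `dec` whenever the error does not fall (that is, the tolerated range of increase is taken to be zero). The error function and all constants are assumptions for illustration.

```r
E    <- function(w) w[1]^2 + 10 * w[2]^2     # illustrative error surface
grad <- function(w) c(2 * w[1], 20 * w[2])

w <- c(5, 5)
E_start <- E(w)
alpha <- 0.01
inc <- 1.05      # growth factor after a successful step (assumed value)
dec <- 0.7       # shrink factor after a failed step (assumed value)

for (n in 1:200) {
  w_new <- w - alpha * grad(w)
  if (E(w_new) < E(w)) {
    w <- w_new                # error fell: keep the step and speed up
    alpha <- alpha * inc
  } else {
    alpha <- alpha * dec      # error did not fall: discard the step and slow down
  }
}
E_end <- E(w)
print(c(final_error = E_end, final_rate = alpha))
```

Because failed steps are discarded, the error never increases from one accepted step to the next; the rate settles near the largest value the surface tolerates.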

5. Implementing a BP neural network

(1) Read the data. We again use R's built-in iris data. Because the network built here is a two-class classifier, we divide the iris data into two classes (the first two species are treated as one class) and classify on two features: petal length and petal width.

(2) Split the data into training data and test data.

(3) Initialize the BP network. We use a neural network with one hidden layer, train it by steepest descent with momentum, and use the sigmoid function as the transfer function.

(4) Input the samples, normalize them, compute the error, and compute the sum of squared errors.

(5) Check for convergence.

(6) Adjust the weights according to the error, using the formula:

delta_w = alpha * s'(a(p,n)) * (t(p,n) - y(p,n)) * X(p,n,j)

where alpha is the learning rate and s'(a(p,n)) * (t(p,n) - y(p,n)) is the local gradient. Because steepest descent with a momentum factor is used, every update after the first becomes:

delta_w(n) = -alpha*(1-mc)*grad_E(n) + mc*delta_w(n-1)

where grad_E(n) is the gradient of the error with respect to the weights at iteration n and mc is the momentum factor.

(7) Test and report the classification accuracy.

Complete R code:

iris1 <- as.matrix(iris[, 3:4])
iris1 <- cbind(iris1, c(rep(1, 100), rep(0, 50)))  # label: first two species = 1, virginica = 0
set.seed(5)
n <- length(iris1[, 1])
samp <- sample(1:n, n / 5)               # hold out one fifth as test data
traind <- iris1[-samp, 1:2]
train1 <- iris1[-samp, 3]
testd <- iris1[samp, 1:2]
test1 <- iris1[samp, 3]
set.seed(1)
ntrainnum <- 120
nsampdim <- 2
net.nin <- 2
net.nhidden <- 3
net.nout <- 1
# random initial weights in [-1, 1]; the bias is stored as the last column
w <- 2 * matrix(runif(net.nhidden * net.nin) - 0.5, net.nhidden, net.nin)
b <- 2 * (runif(net.nhidden) - 0.5)
net.w1 <- cbind(w, b)
w <- 2 * matrix(runif(net.nhidden * net.nout) - 0.5, net.nout, net.nhidden)
b <- 2 * (runif(net.nout) - 0.5)
net.w2 <- cbind(w, b)
# standardize the training inputs
traind_s <- traind
traind_s[, 1] <- traind[, 1] - mean(traind[, 1])
traind_s[, 2] <- traind[, 2] - mean(traind[, 2])
traind_s[, 1] <- traind_s[, 1] / sd(traind_s[, 1])
traind_s[, 2] <- traind_s[, 2] / sd(traind_s[, 2])
sampinex <- rbind(t(traind_s), rep(1, ntrainnum))  # last row of ones feeds the bias
expectedout <- train1
eps <- 0.01
a <- 0.3
mc <- 0.8
maxiter <- 2000
iter <- 0
errrec <- rep(0, maxiter)
outrec <- matrix(rep(0, ntrainnum * maxiter), ntrainnum, maxiter)
sigmoid <- function(x) {
  y <- 1 / (1 + exp(-x))
  return(y)
}
for (i in 1:maxiter) {
  # forward propagation
  hid_input <- net.w1 %*% sampinex
  hid_out <- sigmoid(hid_input)
  out_input1 <- rbind(hid_out, rep(1, ntrainnum))
  out_input2 <- net.w2 %*% out_input1
  out_out <- sigmoid(out_input2)
  outrec[, i] <- t(out_out)
  err <- expectedout - out_out
  sse <- sum(err^2)
  errrec[i] <- sse
  iter <- iter + 1
  if (sse <= eps)
    break
  # backward propagation of the error. As in the original listing, sigmoid() is
  # applied to the unit outputs again here; the exact sigmoid derivative written
  # in terms of an output o would be o * (1 - o).
  delta1 <- err * sigmoid(out_out) * (1 - sigmoid(out_out))
  delta2 <- (matrix(net.w2[, 1:(length(net.w2[1, ]) - 1)])) %*% delta1 * sigmoid(hid_out) * (1 - sigmoid(hid_out))
  dwex1 <- delta1 %*% t(out_input1)
  dwex2 <- delta2 %*% t(sampinex)
  if (i == 1) {
    net.w2 <- net.w2 + a * dwex1
    net.w1 <- net.w1 + a * dwex2
  }
  else {
    net.w2 <- net.w2 + (1 - mc) * a * dwex1 + mc * dwexold1
    net.w1 <- net.w1 + (1 - mc) * a * dwex2 + mc * dwexold2
  }
  dwexold1 <- dwex1
  dwexold2 <- dwex2
}
# standardize the test inputs and classify them
testd_s <- testd
testd_s[, 1] <- testd[, 1] - mean(testd[, 1])
testd_s[, 2] <- testd[, 2] - mean(testd[, 2])
testd_s[, 1] <- testd_s[, 1] / sd(testd_s[, 1])
testd_s[, 2] <- testd_s[, 2] / sd(testd_s[, 2])
inex <- rbind(t(testd_s), rep(1, 150 - ntrainnum))
hid_input <- net.w1 %*% inex
hid_out <- sigmoid(hid_input)
out_input1 <- rbind(hid_out, rep(1, 150 - ntrainnum))
out_input2 <- net.w2 %*% out_input1
out_out <- sigmoid(out_input2)
out_out1 <- out_out
out_out1[out_out < 0.5] <- 0
out_out1[out_out >= 0.5] <- 1
rate <- sum(out_out1 == test1) / length(test1)   # classification accuracy

The classification accuracy is 0.9333333, so this is a good learner. Pay attention to the choice of the momentum factor mc: it must not be too small, otherwise the search easily falls into a local minimum and cannot escape. In this example, with mc=0.5 the classification accuracy is only 0.5333333, and the learning result is unsatisfactory.

Neural network functions in R

The nnet function in package nnet fits a feed-forward neural network with a single hidden layer. It is called as:

nnet(formula, data, weights, size, Wts, linout = FALSE, entropy = FALSE,
     softmax = FALSE, skip = FALSE, rang = 0.7, decay = 0, maxit = 100,
     trace = TRUE)

Parameter description:

size: the number of nodes in the hidden layer;

decay: the weight decay parameter (helps prevent overfitting);

linout: switch for linear output units;

skip: whether to allow skip-layer connections that bypass the hidden layer;

maxit: the maximum number of iterations;

Hess: whether to return the Hessian.

The methods available for a fitted network are predict, print, and summary. The nnetHess function computes the Hessian matrix at a given set of weights and can be used to check whether a fit is a local minimum.
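Before turning to the Vehicle data, here is a minimal sketch of the formula interface on R's built-in iris data. The split size, seed, and parameter values are assumptions chosen for illustration; only arguments listed in the call above are used.

```r
library(nnet)

set.seed(1)
train <- sample(nrow(iris), 100)                  # 100 rows for training, 50 for testing

fit <- nnet(Species ~ ., data = iris[train, ], size = 3,
            decay = 5e-4, maxit = 200, trace = FALSE)

pred <- predict(fit, iris[-train, ], type = "class")   # predicted class labels
conf <- table(predicted = pred, actual = iris$Species[-train])
print(conf)
accuracy <- mean(pred == iris$Species[-train])
print(accuracy)
```

`summary(fit)` lists the fitted weights, which is a quick way to see how many parameters a given `size` implies.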

We use the nnet function to analyze the Vehicle data. We randomly select half of the observations as the training set and use the rest as the test set, then build a neural network with a single hidden layer of 3 nodes. Enter the following program:

library(nnet)                    # load the nnet package
library(mlbench)                 # load the mlbench package
data(Vehicle)                    # load the data
n <- length(Vehicle[, 1])        # sample size
set.seed(1)                      # set the random seed
samp <- sample(1:n, n / 2)       # randomly choose half of the observations as the training set
b <- class.ind(Vehicle$Class)    # indicator matrix of the classes
test.cl <- function(true, pred) {
  true <- max.col(true)
  cres <- max.col(pred)
  table(true, cres)
}
# Use the first 18 variables of the training set as inputs; the hidden layer has
# 3 nodes, the initial random weights lie in [-0.1, 0.1], and weight decay is applied.
a <- nnet(Vehicle[samp, -19], b[samp, ], size = 3, rang = 0.1, decay = 5e-4, maxit = 200)
test.cl(b[samp, ], predict(a, Vehicle[samp, -19]))     # classification result on the training set
test.cl(b[-samp, ], predict(a, Vehicle[-samp, -19]))   # classification result on the test set
# Build a network whose hidden layer has 15 nodes. Following the statements above, enter:
a <- nnet(Vehicle[samp, -19], b[samp, ], size = 15, rang = 0.1, decay = 5e-4, maxit = 10000)
test.cl(b[samp, ], predict(a, Vehicle[samp, -19]))
test.cl(b[-samp, ], predict(a, Vehicle[-samp, -19]))

Revisiting the handwritten digits case

Finally, we return to the handwritten digits case from the beginning of this series and try to redo it with a neural network, reporting support vector machine results for comparison. (For the description and data of this case, see "R language and machine learning notes (classification algorithms) (1)".)

The nnet package limits the size of the input it accepts (I do not know why; there may be a bug in the weight calculation; in any case, transplanting the code from the support vector machine section unchanged raises an error). We therefore handle this case the way handwritten digit recognition is commonly done: by computing features of each digit. There are many ways to choose digit features, and almost any paper found through a quick Baidu search describes some. We use a combination of structural features and statistical features to characterize each image.
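One plausible explanation for the limit, offered here as an assumption rather than something the original states, is the `MaxNWts` argument of nnet, which caps the number of weights and defaults to 1000. The sketch below counts the weights of a single-hidden-layer network without skip-layer connections, for the raw 32*32 pixel input and for a 25-feature input; the helper `n_weights` is hypothetical.

```r
# Number of weights in a network with one hidden layer and no skip connections:
# (inputs + bias) * hidden units + (hidden units + bias) * outputs
n_weights <- function(n_in, n_hidden, n_out) {
  (n_in + 1) * n_hidden + (n_hidden + 1) * n_out
}

raw_pixels <- n_weights(32 * 32, 25, 10)   # every pixel as an input
features   <- n_weights(25, 25, 10)        # 25 computed features as inputs
print(c(raw_pixels = raw_pixels, features = features))
```

Under this assumption, the raw-pixel network has far more than 1000 weights while the 25-feature network stays just below that figure, so reducing the input to features (or raising `MaxNWts`) would avoid the error.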



The statistical features we use here are slightly different (the structural features are the same): we divide the picture into 16 (4*4) small squares and count the points in each, which gives a 25-dimensional feature vector. To keep the results comparable, we also report the classification results of the support vector machine.

Run the following code:

setwd("D:/r/data/digits/trainingdigits")
names <- list.files("D:/r/data/digits/trainingdigits")
data <- paste("train", 1:1934, sep = "")
for (i in 1:length(names))
  assign(data[i], as.matrix(read.fwf(names[i], widths = rep(1, 32))))
library(nnet)
label <- factor(rep(0:9, c(189, 198, 195, 199, 186, 187, 195, 201, 180, 204)))
# The 25 features of one 32x32 image, in the order of the original listing
digit_features <- function(m) {
  blocks <- list(1:8, 9:16, 17:24, 25:32)
  # features 10-25: point counts of the 4*4 grid of 8x8 squares,
  # row block varying fastest within each column block
  grid <- unlist(lapply(blocks, function(cols)
    sapply(blocks, function(rows) sum(m[rows, cols]))))
  c(sum(m[, 16]), sum(m[, 8]), sum(m[, 24]),    # features 1-3: columns 16, 8, 24
    sum(m[16, ]), sum(m[11, ]), sum(m[21, ]),   # features 4-6: rows 16, 11, 21
    sum(diag(m)), sum(diag(m[, 32:1])),         # features 7-8: the two diagonals
    sum(m[17:32, 17:32]),                       # feature 9: lower-right quadrant
    grid)
}
feature <- matrix(rep(0, length(names) * 25), length(names), 25)
for (i in 1:length(names))
  feature[i, ] <- digit_features(get(data[i]))
data1 <- data.frame(feature, label)
# neural network on the training set
m1 <- nnet(label ~ ., data = data1, size = 25, maxit = 2000, decay = 5e-6, rang = 0.1)
pred <- predict(m1, data1, type = "class")
table(pred, label)
sum(diag(table(pred, label))) / length(names)
# support vector machine on the training set
library("e1071")
m <- svm(feature, label, cross = 10, type = "C-classification")
m
summary(m)
pred <- fitted(m)
table(pred, label)
# test set; separate object names are used so the training objects are not overwritten
setwd("D:/r/data/digits/testdigits")
name <- list.files("D:/r/data/digits/testdigits")
testdata <- paste("test", 1:length(name), sep = "")
for (i in 1:length(name))
  assign(testdata[i], as.matrix(read.fwf(name[i], widths = rep(1, 32))))
feature <- matrix(rep(0, length(name) * 25), length(name), 25)
for (i in 1:length(name))
  feature[i, ] <- digit_features(get(testdata[i]))
labeltest <- factor(rep(0:9, c(87, 97, 92, 85, 114, 108, 87, 96, 91, 89)))
data2 <- data.frame(feature, labeltest)
pred1 <- predict(m1, data2, type = "class")
table(pred1, labeltest)
sum(diag(table(pred1, labeltest))) / length(name)
pred <- predict(m, feature)
table(pred, labeltest)
sum(diag(table(pred, labeltest))) / length(name)

After tidying up, we obtain the following output:



The neural network and the support vector machine are broadly comparable here, but the support vector machine gives the better result.

With 25 nodes in the hidden layer, the neural network appears to overfit (although not severely); reducing the number of nodes should give better predictions.

Choosing the number of nodes is a matter of experience; there is no fixed rule. One can try several values and weigh training-set accuracy against test-set accuracy, but building a neural network is costly, so it is reasonable to stop once a result is acceptable. (The same holds for the other parameters, although they matter less than size.)

Feature selection is very important for recognition problems. Principal components might serve better as features than our choice, but they cost more to compute, and which principal components to use, and how to choose them, would still have to be decided.

V. Neural network or support vector machine?

As the discussion above shows, neural networks have much in common with the support vector machines covered earlier, so which should we choose? Here is a brief comparison of the two methods:

– The theoretical foundation of SVM is more solid than that of NN; it is more like a rigorous "science" (three elements: problem representation, problem solving, proof).

– SVM: rigorous mathematical reasoning.

– ANN: relies heavily on engineering skill.

– Generalization ability depends on the empirical risk and the confidence interval, and an ANN cannot control either of the two.

– ANN designers make up for the mathematical shortcomings with excellent engineering technique: designing special structures, using heuristic algorithms, and sometimes obtaining surprisingly good results.

As Feynman points out, "we must make it clear from the outset that if something is not a science, it is not necessarily bad. Love, for example, is not a science. So if we say that something is not a science, it does not mean that there is something wrong with it, only that it is not a science." Compared with SVM, ANN is not a science but more of an engineering skill, yet that does not mean it must be bad.

Further reading:

Neural network overview: "Introducing neural networks in plain language"

Steepest descent method: Xiazdong, "Introduction to machine learning: linear regression and gradient descent"

Neural network structure: Pe. Panyi, "BP neural network theory" (a very good article; the BP network structure figures in this post come from that blog)

Advanced neural networks: Broadsword, "Machine learning for beginners (1): a minimal introduction to RBF neural networks and an R implementation of the algorithm"

Article reprinted from: http://www.52analysis.com/R/1627.html
