A New Training Method: Using Iterative Projection Algorithms to Train Neural Networks


About the author: Jesse Clark, a physicist and data scientist working on phase retrieval, with experience building websites and designing mobile applications and a passion for entrepreneurship.

GitHub: https://github.com/jn2clark

LinkedIn: http://www.linkedin.com/in/j3ss3cl4rk

Phase retrieval (PR) is concerned with finding the phase of a complex-valued function (usually in Fourier space) given its amplitude, subject to constraints in real space [1].

PR is a non-convex optimization problem that has been the subject of a great deal of work [1,2,3,4,5,6,9] and forms the backbone of crystallography and, by extension, structural biology.

Shown below is an example of a PR reconstruction, in which 3D diffraction data (Fourier amplitudes) are used to reconstruct the real-space 3D density of a nanocrystal.

The most successful algorithms for PR problems are based on projection methods, which are inspired by projections onto convex sets in convex optimization [10]. Given the success of projection-based methods in PR, we explore whether similar methods can be used to train neural networks.

Alternating projections

Projection onto convex sets (POCS) is a useful method for finding a point in the intersection of convex sets. The figure above shows a simple example with two convex constraint sets, C1 (red) and C2 (blue). The intersection is found by iterating a simple map that projects onto each set in turn:

x_{n+1} = P_1(P_2(x_n)),

where P_1 and P_2 are the projections onto the respective sets. A projection is idempotent, P(P(x)) = P(x), and it minimizes distance:

P(x) = y   such that   ||x - y|| is minimal over the set.

A solution is found when a fixed point satisfying the following is reached:

x* = P_1(x*) = P_2(x*).
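To make the iteration concrete, here is a minimal numpy sketch of alternating projections; the two sets used (a disk and a line, both convex) are illustrative choices, not from the original article.

    import numpy as np

    def project_disk(x, center=np.zeros(2), radius=1.0):
        # Euclidean projection onto the disk ||x - center|| <= radius.
        v = x - center
        n = np.linalg.norm(v)
        return x if n <= radius else center + radius * v / n

    def project_line(x, a=np.array([1.0, 1.0]), b=1.2):
        # Euclidean projection onto the line a . x = b.
        return x - (a @ x - b) / (a @ a) * a

    x = np.array([3.0, -2.0])                    # initial guess
    for _ in range(100):
        x_next = project_disk(project_line(x))   # x_{n+1} = P_1(P_2(x_n))
        if np.linalg.norm(x_next - x) < 1e-9:    # fixed point: x lies in both sets
            break
        x = x_next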

When the constraint sets are non-convex, very little can be said in general, and simple alternating projections tend to stagnate in local minima. An example is shown below, where one of the sets is now non-convex; the ability to find the intersection (the global minimum) depends strongly on the initial guess.

Although the sets are no longer guaranteed to be convex, projection methods have proven to be effective for finding solutions to non-convex optimization problems. Examples include Sudoku, the n-queens problem, graph coloring, and phase retrieval [4,10].

The difference map

One of the most successful non-convex projection algorithms is the difference map (DM) [4,8], which can be written as

x_{n+1} = x_n + β [ P_1( f_2(x_n) ) - P_2( f_1(x_n) ) ],

where

f_i(x) = (1 + γ_i) P_i(x) - γ_i x,   with γ_1 = -1/β and γ_2 = 1/β.

The quantities y_1 = P_1(f_2(x)) and y_2 = P_2(f_1(x)) are called estimates. Once a fixed point is reached,

x_{n+1} = x_n,

the two estimates are equal and yield a solution:

x_sol = y_1 = y_2.

The difference map appears in connection with many different algorithms in the PR literature [1,3,6], either as a generalization of them or as equivalent to them for specific hyper-parameter values. Rather than the form above, a simpler version of the difference map is often used:

x_{n+1} = x_n + P_1( 2 P_2(x_n) - x_n ) - P_2(x_n).

This simpler version usually behaves well and requires fewer projections per iteration (the order of the two projections can also be swapped). The 2P_2 - I term in the formula is known as a reflection operation and appears in many projection algorithms [9].
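Below is a minimal sketch of this simpler update, reusing the two projections from the sketch above. Note that the solution is read out from the estimate P_1(2 P_2(x) - x), not from the iterate x itself.

    def difference_map_step(x, P1, P2):
        # One update of the simple difference map:
        # x' = x + P1(2*P2(x) - x) - P2(x).
        p2 = P2(x)
        estimate = P1(2.0 * p2 - x)  # 2*P2 - I reflects x through set 2
        return x + estimate - p2, estimate

    x = np.array([3.0, -2.0])
    for _ in range(200):
        x, estimate = difference_map_step(x, project_disk, project_line)
    # 'estimate' converges to a point in the intersection of the two sets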

The same non-convex problem is shown, but now using the difference-map algorithm: rather than becoming trapped in the local minimum, it escapes, searches more of the solution space, and finally converges to a solution.

Divide and concur

The difference map above was defined for two projections, so what happens when there are more than two? In this case, a new iterate X is defined as the concatenation of n replicas of x [10]:

X = x_1 ⊕ x_2 ⊕ ... ⊕ x_n.

The average and direct-product projections are then defined as

P_C(X) = x̄ ⊕ x̄ ⊕ ... ⊕ x̄,   with x̄ = Σ_l w_l x_l,
P_D(X) = P_1(x_1) ⊕ P_2(x_2) ⊕ ... ⊕ P_n(x_n),

where P_l is the l-th projection and x̄ is the weighted sum of the replicas. The difference map over the many projections is then used to update X:

X_{n+1} = X_n + P_D( 2 P_C(X_n) - X_n ) - P_C(X_n).
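A minimal numpy sketch of the two projections and the resulting difference-map step on the stacked iterate X (one replica per constraint, stored as rows of an array); the list of projection functions is assumed to be supplied by the caller.

    import numpy as np

    def project_divide(X, projections):
        # "Divide": apply the l-th projection to the l-th replica.
        return np.stack([P(x) for P, x in zip(projections, X)])

    def project_concur(X, weights=None):
        # "Concur": replace every replica with the (weighted) average.
        mean = np.average(X, axis=0, weights=weights)
        return np.broadcast_to(mean, X.shape).copy()

    def divide_and_concur_step(X, projections):
        # Simple difference map with P_1 = divide and P_2 = concur.
        pc = project_concur(X)
        estimate = project_divide(2.0 * pc - X, projections)
        return X + estimate - pc, estimate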

This approach is called "divide and concur". Below is an example of a Sudoku puzzle being solved iteratively, converging via the difference map with the divide-and-concur construction.

Sudoku is subject to 4 constraints: each row contains the digits 1 through 9, each column contains the digits 1 through 9, each 3x3 sub-square contains the digits 1 through 9, and the digits must agree with the partially filled template. Code implementing this example is available.
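As one concrete piece of such a solver, here is a sketch of the projection for the row constraint. It assumes the puzzle is relaxed to a 9x9x9 array s[r, c, d] encoding "cell (r, c) holds digit d+1" (this representation is an assumption; the linked code may differ). Projecting a row onto its constraint means finding the nearest permutation matrix, which is a linear assignment problem.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def project_row_constraint(s):
        # For each row r, s[r] is a 9x9 (column x digit) block; the closest
        # permutation matrix maximizes <s[r], P>, a linear assignment problem.
        out = np.zeros_like(s)
        for r in range(9):
            cols, digits = linear_sum_assignment(s[r], maximize=True)
            out[r, cols, digits] = 1.0
        return out

The column, sub-square, and template constraints have analogous projections, which the divide-and-concur step above can combine.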

Projection for training a neural network

Having covered the difference map, projections, and their use in non-convex optimization, the next step is to apply them to training neural networks. The examples below consider only a classification task. The basic idea is to divide the training data into k subsets and to look for a weight vector w that correctly classifies every subset, i.e., one lying in the intersection of the constraint sets

w ∈ C_1 ∩ C_2 ∩ ... ∩ C_k,

where C_i is the set of weights that correctly classify subset i.

A projection is then defined that "projects" the current weights onto C_i, i.e., onto weights for which all the training data in that subset are correctly classified (or have zero loss). In practice the projection is realized by gradient descent on the subset (essentially overfitting it). The goal is to obtain weights that correctly classify every subset of the data by finding the intersection of these sets.
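A minimal Keras-style sketch of such a projection (assuming a model compiled with an accuracy metric, so train_on_batch returns [loss, accuracy]; the helper name and thresholds are illustrative, not the article's exact code):

    def project_onto_subset(model, weights, x, y, target_acc=0.99, max_steps=1000):
        # Approximate projection: starting from `weights`, run gradient
        # descent on one data subset until it is (almost) perfectly
        # classified. Starting from the current weights keeps the result
        # near them, approximating the distance-minimizing property.
        model.set_weights(weights)
        for _ in range(max_steps):
            loss, acc = model.train_on_batch(x, y)
            if acc >= target_acc:  # constraint (approximately) satisfied
                break
        return model.get_weights()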

Results

To test the training scheme (code available), a small network was trained using a standard method [13] and compared against the projection-based approach. The network is very simple, with approximately 22,000 parameters: one convolutional layer with 8 3x3 filters, 2x2 sub-sampling, one fully connected layer of 16 nodes with ReLU activation, and finally a softmax over 10 outputs (the 10 MNIST classes). The weights are initialized with the Glorot uniform scheme [1].

The average training and test loss curves are shown below.

[Figure: training loss curves]

[Figure: test loss curves]

The results look good. The training data were divided into 3 equally sized groups, which serve as the projection constraints. Each projection must find a new set of weights that satisfies its constraint while minimizing the distance from the previous weights; here it is implemented with gradient descent on the group's data, terminated once the training accuracy on that group reaches 99%. The current weights are projected onto the 3 groups, producing 3 new weight sets that are concatenated to form the divide projection. The average (concur) projection is obtained by averaging the weight sets and then copying and concatenating the average to form a new vector:

P_C(W) = w̄ ⊕ w̄ ⊕ w̄,   with w̄ = (w_1 + w_2 + w_3)/3.

The two projection steps are combined to give the difference-map update scheme for the weights:

W_{n+1} = W_n + P_D( 2 P_C(W_n) - W_n ) - P_C(W_n).

In addition to the usual metrics, the difference-map error can be monitored to check for convergence. The difference-map error is defined as

ε_n = || W_{n+1} - W_n ||.

A lower value indicates a better solution, and a plateau in the difference-map error indicates that an approximate solution has been found. The error usually drops sharply before stabilizing [4], signaling that a suitable solution has been reached.
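Putting the pieces together, here is a minimal sketch of one difference-map training update on flattened weight vectors. W is an (n_groups, n_params) array of weight replicas, and project_fit(w, subset) is a hypothetical helper wrapping the gradient-descent projection sketched earlier, returning flattened weights.

    import numpy as np

    def dm_training_step(W, subsets, project_fit):
        # Concur projection: replace each replica with the average weights.
        concur = np.broadcast_to(W.mean(axis=0), W.shape)
        # Divide projection of the reflected iterate 2*P_C(W) - W:
        # overfit each reflected replica to its own data subset.
        divide = np.stack([project_fit(w, s)
                           for w, s in zip(2.0 * concur - W, subsets)])
        W_next = W + divide - concur            # difference-map update
        dm_error = np.linalg.norm(W_next - W)   # error ||W_{n+1} - W_n||
        return W_next, dm_error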

In the example above, the projection was defined by repeated gradient steps on a subset of the training data, essentially overfitting it. In the following example, the projection is instead terminated after a single pass through the subset's training data.

The average cross-validation test and training errors are shown below (compared with the same conventional training as above).

Evidently the method still works. If the projection operation terminates prematurely, one way to think about it is simply as a relaxed or non-optimal projection. Results from convex optimization and from PR [4,5,7,14] suggest that relaxed or non-optimal projections can still lead to good solutions. Furthermore, in the single-pass projection limit, the conventional gradient-descent training scheme is recovered by alternating projections (with 3 groups as an example):

w_{n+1} = P_3( P_2( P_1(w_n) ) ).
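In code, this single-pass limit is just ordinary epoch-wise training, cycling through the subsets with the fitting helper sketched earlier (here project_fit is assumed to make one gradient-descent pass per call):

    for epoch in range(n_epochs):
        for subset in subsets:          # w <- P_3(P_2(P_1(w)))
            w = project_fit(w, subset)  # one gradient-descent pass per subset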

Finally, the hyper-parameter settings used in training have a large effect on the final network; the specific parameters can be found in the original article. Training such a network with early stopping, conventional training reached a final loss of 0.0724 and an accuracy of 97.5%, while the difference-map method reached 0.0628 and 97.9%.

Extending the projection method

One of the benefits of the projection method is that additional constraints can be incorporated easily. For L1 regularization, a shrinkage or soft-threshold operation can be defined, such as

S_λ(w) = sign(w) max(|w| - λ, 0).
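A minimal numpy sketch of this shrinkage operation (lam is the threshold; applying it to the weights alongside the other projections is one way to impose the constraint):

    import numpy as np

    def soft_threshold(w, lam):
        # S_lam(w) = sign(w) * max(|w| - lam, 0), applied element-wise.
        return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)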

Other possible projections include symmetry constraints on the convolution kernels or histogram constraints on the weights.

Read the full text: http://click.aliyun.com/m/14997/
