Deep Learning (Part 2): Convolutional Neural Networks

Last Update:2018-08-20 Source: Internet

Author: User

Convolutional neural networks

ConvNets are designed to process data that come as multiple arrays, such as a color image made up of three two-dimensional arrays holding the pixel intensities of the three color channels. Many kinds of data take this form: one-dimensional arrays for signals and sequences, including language; two-dimensional arrays for images or audio spectrograms; and three-dimensional arrays for video or volumetric images. Four key ideas let ConvNets exploit the properties of natural signals: local connections, shared weights, pooling, and the use of many layers.

A typical ConvNet architecture (Figure 2) is built as a series of stages. The first few stages consist of two types of layers: convolutional layers and pooling layers. The units of a convolutional layer are organized into feature maps, and each unit is connected to local patches of the feature maps in the previous layer through a set of weights called a filter bank. The resulting local weighted sum is then passed through a nonlinearity such as a ReLU. All units in a feature map share the same filter bank, while different feature maps in the same layer use different filter banks. The rationale for this design is twofold. First, in array data such as images, local groups of values are usually highly correlated and form distinctive local patterns (motifs) that are easy to detect. Second, the local statistics of images and other signals are invariant to position. In other words, if a motif can appear in one part of an image, it can appear anywhere, which leads to the idea that units at different locations share the same weights and detect the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.
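
The weight sharing described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the 2x2 kernel, and the two toy images are hypothetical, and the kernel is applied without flipping (strictly a cross-correlation, which equals a convolution with the flipped kernel). The same kernel is slid over every position, so the same edge is detected wherever it appears.

```python
def relu(x):
    """Rectified linear unit: pass positive values, clamp the rest to zero."""
    return x if x > 0.0 else 0.0


def feature_map(image, kernel):
    """Slide one shared kernel over a 2D image (valid positions, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Local weighted sum over the patch whose top-left corner is (r, c).
            s = sum(kernel[i][j] * image[r + i][c + j]
                    for i in range(kh) for j in range(kw))
            row.append(relu(s))
        out.append(row)
    return out


# Hypothetical kernel that responds to a dark-to-bright vertical edge.
edge_kernel = [[-1.0, 1.0],
               [-1.0, 1.0]]

# The same edge placed at two different columns of a 4x5 image.
image_a = [[0, 1, 1, 1, 1]] * 4   # edge between columns 0 and 1
image_b = [[0, 0, 0, 1, 1]] * 4   # edge between columns 2 and 3

map_a = feature_map(image_a, edge_kernel)
map_b = feature_map(image_b, edge_kernel)
```

In both feature maps the only nonzero responses sit at the column where the edge is, which is the position invariance of local statistics that motivates sharing weights.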

While the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one. Because the relative positions of the features that form a motif can vary somewhat, the motif can be detected reliably by coarse-graining the position of each feature. A typical pooling unit computes the maximum over a local patch of units in one feature map (or in a few feature maps). Neighboring pooling units take their input from patches shifted by more than one row or column, which reduces the dimension of the representation and creates invariance to small shifts and distortions. Two or three stages of convolution, nonlinearity, and pooling are stacked, followed by further convolutional and fully connected layers. Backpropagating gradients through a ConvNet is as simple as through an ordinary deep network, so all the weights in all the filter banks can be trained.
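
As a rough sketch of the pooling step, the following plain-Python function takes the maximum over non-overlapping 2x2 patches. The function name, the window size, and the toy feature maps are assumptions chosen for illustration.

```python
def max_pool(fmap, size=2, stride=2):
    """Maximum over local patches; neighboring patches are `stride` apart."""
    out = []
    for r in range(0, len(fmap) - size + 1, stride):
        row = []
        for c in range(0, len(fmap[0]) - size + 1, stride):
            row.append(max(fmap[r + i][c + j]
                           for i in range(size) for j in range(size)))
        out.append(row)
    return out


# A hypothetical feature map with two active units ...
fmap = [[0, 0, 0, 0],
        [0, 9, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 3]]

# ... and the same activations, each moved by one position.
shifted = [[9, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 3, 0]]

pooled = max_pool(fmap)
pooled_shifted = max_pool(shifted)
```

Both inputs pool to the same 2x2 output, so the representation is smaller and unchanged by the shift. The invariance holds here only because each activation stays inside its original pooling window; a larger shift would move it to a neighboring pooling unit.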

Deep neural networks exploit the fact that many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones. In an image, local combinations of edges form motifs, motifs assemble into parts, and parts form objects. Similar hierarchies exist in speech and text, from sounds to phones, phonemes, syllables, words, and sentences. Pooling lets the representation change very little when elements in the previous layer vary in position and appearance.

The convolutional and pooling layers of ConvNets are directly inspired by the classical notions of simple cells and complex cells in visual neuroscience, and the overall architecture is reminiscent of the LGN-V1-V2-V4-IT hierarchy in the ventral pathway of the visual cortex. When a ConvNet model and monkeys are shown the same picture, the activations of high-level units in the ConvNet explain half of the variance of random sets of 160 neurons in the monkey's inferotemporal cortex. ConvNets have their roots in the neocognitron, whose architecture was somewhat similar but which lacked an end-to-end supervised learning algorithm such as backpropagation. An early one-dimensional ConvNet, called a time-delay neural network, was used to recognize phonemes and simple words.

Applications of convolutional networks go back to the early 1990s, starting with time-delay neural networks for speech recognition and document reading. The document reading system used a ConvNet trained jointly with a probabilistic model that implemented language constraints. By the late 1990s this system was reading more than 10% of all the cheques in the United States. A number of ConvNet-based optical character recognition and handwriting recognition systems were later deployed by Microsoft. In the early 1990s ConvNets were also tried for object detection in natural images, including faces and hands, and for face recognition.

Image understanding with deep convolutional networks

Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation, and recognition of objects and regions in images. These are all tasks for which labeled data are relatively abundant, such as traffic sign recognition, the segmentation of biological images (particularly for connectomics), and the detection of faces, text, and pedestrians in natural images. A major recent practical success of ConvNets is face recognition.

Importantly, images can be labeled at the pixel level, which will have applications in technology, including autonomous mobile robots and self-driving cars. Companies such as Mobileye and NVIDIA use ConvNet-based methods in their upcoming automotive vision systems. Other applications of growing importance involve natural language understanding and speech recognition.

Despite these successes, ConvNets were largely abandoned by the mainstream computer vision and machine learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a dataset of about a million images spanning 1,000 different classes, they achieved remarkable results, almost halving the error rate of the best competing methods. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout, and techniques that generate more training examples by deforming existing ones. It has brought about a revolution in computer vision: ConvNets are now the dominant approach for almost all recognition and detection tasks and approach human performance on some of them. A recent striking demonstration combines ConvNets and recurrent network modules to generate image captions (Figure 3).

Figure 3 | From image to text. Captions are generated by a recurrent neural network (RNN) that takes as extra input the representation extracted from a test image by a deep convolutional neural network (CNN); the RNN is trained to "translate" high-level representations of images into captions (top). When the RNN is given the ability to focus its attention on a different location in the input image (middle and bottom; the lighter patches received more attention) as it generates each word (bold), it exploits this to "translate" images into captions more accurately.

Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could take weeks only two years ago, progress in hardware, software, and algorithm parallelization has reduced training times to a few hours.

The performance of ConvNet-based vision systems has led most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo, Twitter, and Adobe, as well as a rapidly growing number of start-ups, to initiate research and development projects and to deploy ConvNet-based image understanding products and services.

ConvNets lend themselves to efficient hardware implementations in chips or field-programmable gate arrays. Companies such as NVIDIA, Mobileye, Intel, Qualcomm, and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots, and self-driving cars.

Distributed representations and language processing

Deep learning theory shows that deep networks have two different exponential advantages over classical learning algorithms that do not use distributed representations. Both advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate compositional structure. First, learning distributed representations makes it possible to generalize to new combinations of learned feature values beyond those seen during training (for example, n binary features allow 2^n combinations). Second, composing layers of representation in a deep network brings the potential for another exponential advantage (exponential in the depth).

The hidden layers of a multilayer neural network learn to represent the network's inputs in a way that makes the target outputs easy to predict. This is nicely illustrated by training a multilayer neural network to predict the next word in a sequence from a local context of earlier words. Each word in the context is presented to the network as a one-of-N vector, that is, one component is 1 and the rest are 0. In the first layer, each word produces a different pattern of activations, or word vector (Figure 4). In a language model, the other layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability of any word in the vocabulary appearing next. The network learns word vectors with many active components, each of which can be interpreted as a separate feature of the word, as was first shown in the context of learning distributed representations for symbols. These semantic features are not explicitly present in the input; the learning procedure discovers them as a good way of factorizing the structured relationships between input and output symbols into multiple "micro-rules". Learning word vectors also works very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable. When trained to predict the next word in a news story, for example, the learned word vectors for Tuesday and Wednesday are very similar, as are those for Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive and their many configurations correspond to the variations seen in the observed data. The features that make up these word vectors were not determined ahead of time by experts but were discovered automatically by the neural network. Vector representations of words learned from text are now very widely used in natural language applications.
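
The relationship between a one-of-N input and its word vector can be sketched as follows. Everything here is an illustrative assumption: the four-word vocabulary, the hand-written (untrained) matrix E, and the single tied-weight scoring step that stands in for the network's other layers.

```python
import math

vocab = ["the", "cat", "sat", "mat"]   # hypothetical toy vocabulary

# Hypothetical, untrained embedding matrix: one row of 3 features per word.
E = [[0.1, 0.0, 0.2],
     [0.7, 0.3, 0.0],
     [0.0, 0.9, 0.4],
     [0.6, 0.2, 0.1]]


def one_hot(word):
    """One-of-N vector: 1.0 at the word's index, 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]


def embed(word):
    """Multiply the one-hot vector by E; this simply selects one row of E."""
    x = one_hot(word)
    return [sum(x[i] * E[i][k] for i in range(len(vocab)))
            for k in range(len(E[0]))]


def softmax(scores):
    """Turn arbitrary scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_probs(word):
    """Score every vocabulary word against the input word's vector.

    Assumption: output vectors are tied to the rows of E and there are no
    hidden layers; a real language model would learn both.
    """
    h = embed(word)
    scores = [sum(h[k] * E[j][k] for k in range(len(h)))
              for j in range(len(vocab))]
    return softmax(scores)


probs = next_word_probs("cat")
```

The point of the sketch is that the first layer acts as a table lookup, so training the weights of that layer is the same as learning one feature vector per word.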

The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol has only one property: it is either identical or not identical to other symbol instances. It has no internal structure relevant to its use, and to reason with symbols they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks use large activity vectors, large weight matrices, and scalar nonlinearities to perform the kind of fast "intuitive" inference that underpins effortless commonsense reasoning.

Before neural language models were introduced, the standard approach to statistical language modeling did not exploit distributed representations: it was based on counting how often short symbol sequences of length up to N (called N-grams) occur. The number of possible N-grams is on the order of V^N, where V is the vocabulary size, so taking into account a context of more than a handful of words would require a very large training corpus. N-grams treat each word as an atomic unit, so they cannot generalize across semantically related sequences of words, whereas neural language models can, because they associate each word with a vector of real-valued features, and semantically related words end up close to each other in that vector space (Figure 4).
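
A count-based model of this kind can be sketched with the standard library. The nine-word corpus and the helper names are hypothetical; the sketch only shows how quickly V^N outgrows the sequences actually observed and why an unseen sequence gets no probability.

```python
from collections import Counter

# Hypothetical toy corpus; a real model would use a large text collection.
corpus = "the cat sat on the mat the cat ran".split()


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


bigrams = ngram_counts(corpus, 2)

V = len(set(corpus))          # vocabulary size
possible_bigrams = V ** 2     # V^N with N = 2
seen_bigrams = len(bigrams)   # distinct bigrams actually observed


def bigram_prob(prev, word):
    """Relative-frequency estimate of P(word | prev), with no smoothing."""
    context_total = sum(c for (p, _), c in bigrams.items() if p == prev)
    return bigrams[(prev, word)] / context_total if context_total else 0.0
```

Here only 7 of the 36 possible bigrams are observed, and `bigram_prob("the", "sat")` is zero even though "sat" is a known word: because each word is an atomic symbol, nothing learned about one sequence transfers to a related one.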

Figure 4 | Visualizing the learned word vectors. On the left are word representations learned for language modeling, projected nonlinearly to 2D with the t-SNE algorithm for visualization. On the right is a 2D representation of phrases learned by an English-to-French encoder-decoder recurrent neural network. Semantically similar words or sequences of words are mapped to nearby representations. The distributed representations of words are obtained by using backpropagation to jointly learn a representation for each word and a function that predicts a target quantity, such as the next word in a sequence (for language modeling) or a whole sequence of translated words (for machine translation).

Recurrent neural networks

When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs). For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs (Figure 5). RNNs process an input sequence one element at a time, maintaining in their hidden units a "state vector" that implicitly contains information about the history of all past elements of the sequence. When we treat the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network (Figure 5, right), it becomes clear how backpropagation can be applied to train RNNs.

Figure 5 | A recurrent neural network and the unfolding in time of its forward computation. The artificial neurons (for example, the hidden units grouped under node s, with value s_t at time t) receive inputs from other neurons at previous time steps (shown on the left by the black square, which represents a delay of one time step). In this way, a recurrent neural network can map an input sequence with elements x_t to an output sequence with elements o_t, where each o_t depends on all the previous x_t' (for t' ≤ t). The same parameters (matrices U, V, W) are used at every time step. Many other architectures are possible, including a variant in which the network generates a sequence of outputs (for example, words), each of which is used as input for the next time step. The backpropagation algorithm (Figure 1) can be applied directly to the computational graph of the unfolded network on the right to compute the derivative of a total error (for example, the log-probability of generating the right output sequence) with respect to all the states s_t and all the parameters.
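
The unfolded forward computation can be sketched as a loop that reuses the same three matrices at every time step. The tanh nonlinearity, the linear output, the parameter values, and the function names are assumptions made for illustration; the text specifies only that U, V, and W are shared across time.

```python
import math


def matvec(M, v):
    """Matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rnn_forward(xs, U, W, V, s0):
    """Unfold the recurrence s_t = tanh(U x_t + W s_{t-1}), o_t = V s_t."""
    s = s0
    states, outputs = [], []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, s))]
        s = [math.tanh(p) for p in pre]   # new state depends on the old one
        states.append(s)
        outputs.append(matvec(V, s))
    return states, outputs


# Hypothetical parameters: 2 inputs, 2 hidden units, 1 output.
U = [[0.5, -0.3], [0.1, 0.8]]
W = [[0.2, 0.0], [-0.4, 0.3]]
V = [[1.0, -1.0]]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states, outputs = rnn_forward(xs, U, W, V, s0=[0.0, 0.0])

# Changing only the last input leaves the earlier outputs untouched.
xs_changed = [xs[0], xs[1], [5.0, -2.0]]
_, outputs_changed = rnn_forward(xs_changed, U, W, V, s0=[0.0, 0.0])
```

Because each output is computed from the state at that step, o_t depends on the inputs up to time t and not on later ones, which the comparison between `outputs` and `outputs_changed` makes visible.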

RNNs are very powerful dynamic systems, but training them has proved problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish.
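
The effect is easiest to see in a deliberately simplified case, assumed here for illustration: a scalar linear recurrence s_t = w * s_{t-1}, where the chain rule contributes one factor of w per time step.

```python
def gradient_through_time(w, steps):
    """d s_T / d s_0 for the scalar linear recurrence s_t = w * s_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w   # chain rule: one factor of w per time step
    return grad


shrinking = gradient_through_time(0.9, 100)   # factor slightly below one
growing = gradient_through_time(1.1, 100)     # factor slightly above one
steady = gradient_through_time(1.0, 100)      # factor exactly one
```

After 100 steps the first gradient has shrunk to roughly 3e-5 and the second has grown beyond 1e4, although the per-step factors differ from one by only 0.1. Only a factor of exactly one preserves the gradient.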

Thanks to advances in their architecture and in ways of training them, RNNs have been found to be very good at predicting the next character in a text or the next word in a sequence, but they can also be used for more complex tasks. For example, after reading an English sentence one word at a time, an English "encoder" network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This thought vector can then be used as the initial hidden state of a French "decoder" network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and fed to the decoder network as input, it then outputs a probability distribution for the second word of the translation, and so on until the translation ends. Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence. This rather naive way of doing machine translation has quickly become competitive with the state of the art, which raises serious doubts about whether understanding a sentence requires anything like internal symbolic expressions that are manipulated by inference rules. It is more compatible with the view that everyday reasoning involves many simultaneous analogies that each lend plausibility to a conclusion.

Instead of translating the meaning of a French sentence into an English sentence, one can learn to "translate" the meaning of an image into an English sentence (Figure 3). The encoder here is a deep ConvNet that converts the pixels into an activity vector in its last hidden layer. The decoder is an RNN similar to those used for machine translation and neural language modeling. There has been a surge of interest in such systems recently.

RNNs, once unfolded in time (Figure 5), can be viewed as very deep feedforward networks in which all the layers share the same weights. Although their main purpose is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult for them to learn to store information for very long.

To correct for this, one idea is to augment the network with an explicit memory. The first proposal of this kind is the long short-term memory (LSTM) network, which uses special hidden units whose natural behavior is to remember inputs for a long time. A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has a connection to itself at the next time step with a weight of one, so it copies its own real-valued state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory.
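
The accumulator behavior of the memory cell can be sketched as below. This is a simplification under stated assumptions: the gate values are set by hand rather than produced by a learned unit, and the input and output gates of a full LSTM are left out. The function name and the toy sequences are hypothetical.

```python
def run_memory_cell(inputs, keep_gates, c0=0.0):
    """Gated accumulator: c_t = keep_t * c_{t-1} + x_t."""
    c = c0
    trace = []
    for x, keep in zip(inputs, keep_gates):
        # keep == 1.0 copies the old state (self-connection of weight one);
        # keep == 0.0 clears the memory before the new input is added.
        c = keep * c + x
        trace.append(c)
    return trace


inputs = [1.0, 2.0, 0.0, 0.0, 3.0]
keep_gates = [1.0, 1.0, 1.0, 0.0, 1.0]   # hand-set; learned in a real LSTM
trace = run_memory_cell(inputs, keep_gates)
```

The cell holds the sum 3.0 across a step with no input, is cleared when the gate closes at the fourth step, and then starts accumulating again. In a trained network the gate would be the output of another unit, so the network itself decides when to forget.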

LSTM networks have since proved more effective than conventional RNNs, especially when they have several layers for each time step, enabling an entire speech recognition system that goes all the way from acoustics to the sequence of characters in the transcription. LSTM networks or related forms of gated units are also used in the encoder and decoder networks that perform so well at machine translation.

Over the past year, several authors have made different proposals for augmenting RNNs with a memory module. These include the Neural Turing Machine, in which the network is augmented by a "tape-like" memory that the RNN can choose to read from or write to, and memory networks, in which a regular network is augmented by a kind of associative memory. Memory networks have yielded excellent performance on standard question-answering benchmarks. The memory is used to remember the story about which the network is later asked to answer questions.

Beyond simple memorization, Neural Turing Machines and memory networks are being used for tasks that would normally require reasoning and symbol manipulation. Neural Turing Machines can be taught "algorithms". Among other things, they can learn to output a sorted list of symbols when their input is an unsorted sequence in which each symbol is accompanied by a real value indicating its priority in the list. Memory networks can be trained to keep track of the state of the world in a setting similar to a text adventure game, and after reading a story they can answer questions that require complex inference. In one test example, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as "Where is Frodo now?".

The future of deep learning

Unsupervised learning had a catalytic effect in reviving interest in deep learning, but it has since been overshadowed by the successes of purely supervised learning. Although we have not focused on it in this review, we expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.

Human vision is an active process that sequentially samples the optic array in an intelligent, task-specific way, using a small, high-resolution fovea surrounded by a large, low-resolution periphery. We expect much of the future progress in vision to come from systems that are trained end-to-end and combine ConvNets with RNNs that use reinforcement learning to decide where to look. Systems that combine deep learning with reinforcement learning are still in their infancy, but they already outperform passive vision systems at classification tasks and produce impressive results in learning to play many different video games.

Natural language understanding is another area in which deep learning is expected to have a large impact over the next few years. We expect systems that use RNNs to understand sentences or whole documents to become much better when they learn strategies for selectively attending to one part at a time.

Ultimately, major progress in artificial intelligence will come from systems that combine representation learning with complex reasoning. Although deep learning and simple reasoning have long been used for speech and handwriting recognition, new paradigms are needed to replace rule-based manipulation of symbolic expressions with operations on large vectors.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
