Deep Learning: Artificial Neural Networks Once Again Spark a Research Craze
Hu Xiaolin
The artificial neural network originated in the 1940s and is now about 70 years old. Like a human life, it has seen rises and falls: it has had its moments of glory and its dim stretches, its bustle and its desertion. Generally speaking, research on artificial neural networks was tepid for the past 20 years, until the last three to five years, when the concept of deep learning revived the field and even set off another research craze. This article briefly recounts the "past lives" of the artificial neural network and briefly looks ahead to its future.
The first neuron model was proposed by McCulloch and Pitts in 1943. Called threshold logic, it could implement some functions of logical operations. From then on, research on neural networks split into two directions: one focuses on the process of biological information processing, called biological neural networks; the other focuses on engineering applications, called artificial neural networks. This article mainly introduces the latter. In 1958, Rosenblatt proposed the perceptron, which is essentially a linear classifier. In 1969, Minsky and Papert published the book Perceptrons, in which they pointed out that (1) a single-layer perceptron cannot implement the XOR function, and (2) the computers of the time were too limited to handle the long training runs that large neural networks require①. Given Minsky's influence in the field of artificial intelligence (he was one of the founders of artificial intelligence and of the famous MIT CSAIL laboratory, and won the Turing Award in 1969), this book led to a more-than-ten-year "winter" in artificial neural network research. In fact, if single-layer perceptrons are stacked into multiple layers (a multilayer perceptron, or MLP, as shown in Figure 1), linearly inseparable problems can be solved, but at that time there was no effective algorithm for training such networks. Paul Werbos, then a PhD student at Harvard University, proposed the effective BP (back-propagation) algorithm in his 1974 dissertation [1], but it did not attract the attention of the academic community. It was not until 1986, when Geoff Hinton of the University of Toronto and his colleagues rediscovered the algorithm and published it in Nature [2], that artificial neural networks were once again valued.
Figure 1 A multilayer perceptron. Each neuron receives inputs from neurons in the layer below, multiplies them by the corresponding weights, adds a bias, and passes the result through a sigmoid function to the neurons in the layer above
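To make the computation described in Figure 1 concrete, here is a minimal sketch of a forward pass through a multilayer perceptron, in Python with NumPy. The layer sizes and random weights are purely illustrative, not taken from any particular model.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def mlp_forward(x, weights, biases):
        # Each layer computes sigmoid(W @ a + b) on the activations below it.
        a = x
        for W, b in zip(weights, biases):
            a = sigmoid(W @ a + b)
        return a

    # An illustrative 3-4-2 network with random weights.
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
    biases = [np.zeros(4), np.zeros(2)]
    print(mlp_forward(rng.standard_normal(3), weights, biases))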
At the same time, neural networks with feedback connections began to rise, with the work of Stephen Grossberg and John Hopfield the most iconic. Many complex cognitive phenomena, such as associative memory, can be simulated and explained with feedback neural networks. All of these factors contributed to the neural network research craze of the 1980s. A very senior scholar in the field once told me that in those days, as long as your paper had something to do with neural networks, it was easy to publish in almost any journal. Around 2008 I attended a talk at Harvard University by a Chinese scholar whose name I have forgotten; the topic was SVMs. In the middle of his talk, he suddenly said in English: "I really miss the happy days when I was doing research in neural networks." It left a deep impression on me.
However, as the number of layers in a neural network increases, the BP algorithm easily gets trapped in local optima and is prone to overfitting. In the 1990s, Vladimir Vapnik proposed the support vector machine (SVM). Although it is essentially a special two-layer neural network, its efficient learning algorithm and freedom from the local-optimum problem led many neural network researchers to turn to SVMs, and research on multilayer feedforward neural networks gradually fell quiet.
It was not until 2006, when the concepts of the deep network and deep learning were proposed, that neural networks began to take on new life. A deep network, literally understood, is a neural network with deep layers. As for why the earlier term "multilayer neural network" was not simply reused, my personal guess is that this was to distinguish the new work from earlier neural networks and to signal a new concept. The term was coined by Geoff Hinton's research group at the University of Toronto in 2006 [3]. In fact, the deep network they proposed is structurally no different from a traditional multilayer perceptron, and when it comes to supervised learning the algorithm is the same. The only difference is that the network first undergoes unsupervised learning, and the weights learned in that phase are then used as the initial values for supervised learning. This change corresponds to a reasonable hypothesis: denote by P(x) the representation of the data obtained by pre-training the network with unsupervised learning, and by P(y|x) what the network learns in the subsequent supervised phase (for example, with the BP algorithm), where y is the output (such as a class label); the hypothesis is that learning P(x) helps the learning of P(y|x). Compared with purely supervised learning, this training strategy helps reduce the risk of overfitting, because it learns not only the conditional probability distribution P(y|x) but also the joint probability distribution of x and y. There are other explanations of why pre-training helps deep learning. The most straightforward is that pre-training drives the network parameters to a suitable set of initial values, starting from which a lower value of the cost function can be reached; however, the experiments of Erhan et al. show that this is not necessarily the case [4]. In fact, they found that without pre-training, the network could converge to a lower error on the training set and yet perform worse on the test set, that is, overfitting occurred, as shown in Figure 2.
Figure 2 Negative log-likelihood (NLL) of a deep network on the training set and the test set [4]. From left to right, the network has 1, 2 and 3 layers respectively. It can be seen that without pre-training, the NLL on the training set can be lower, but the NLL on the test set is higher
As can be seen from the above, the deep network is not structurally new; its rise is mainly due to a change in the learning method. So what learning method did the Hinton group propose? To answer that, we must start from the restricted Boltzmann machine (RBM).
An RBM is a single-layer stochastic neural network (the input layer is usually not counted in the number of layers), as shown in Figure 3; it is essentially a probabilistic graphical model. The input layer and the hidden layer are fully connected between layers, but neurons within the same layer are not connected to each other. Each neuron is either activated (value 1) or not activated (value 0), and the probability of activation is given by a sigmoid function. The advantage of the RBM is that, given one layer, the neurons of the other layer are conditionally independent of each other, which makes random sampling convenient: one can fix one layer and sample the other, alternating back and forth. In theory, each weight update requires sampling all the neurons infinitely many times, that is, running the sampling chain to equilibrium, which is far too slow; Hinton and others therefore proposed an approximate method, contrastive divergence (CD), which updates the weights after only n rounds of sampling, the so-called CD-n algorithm [5].
Figure 3 Structure of an RBM
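As a sketch of the alternating sampling scheme just described, the following illustrative NumPy code performs one CD-1 weight update for a small binary RBM (CD-n would simply repeat the Gibbs step n times). The function name and learning rate are my own choices for illustration.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cd1_update(v0, W, b, c, lr=0.1, rng=np.random.default_rng(0)):
        # One contrastive divergence (CD-1) step for a binary RBM.
        # v0: visible data; W: weights; b: visible bias; c: hidden bias.
        ph0 = sigmoid(W @ v0 + c)                        # P(h=1 | v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(W.T @ h0 + b)                      # P(v=1 | h0)
        v1 = (rng.random(pv1.shape) < pv1).astype(float)
        ph1 = sigmoid(W @ v1 + c)
        # Approximate gradient: data statistics minus reconstruction statistics.
        W += lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
        b += lr * (v0 - v1)
        c += lr * (ph0 - ph1)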
After an RBM has been learned, its weights are fixed, and a new hidden layer is added on top, with the original hidden layer serving as the input layer of a new RBM, whose weights are then learned in the same way. Stacking multiple RBMs in this way forms a deep network (Figure 1). The weights learned by the RBMs are used as the initial weights of this deep network, which is then trained with the BP algorithm. This is the learning method of the deep belief network.
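This greedy layer-wise recipe can be sketched in a few lines, reusing the hypothetical sigmoid and cd1_update functions from the RBM sketch above: train one RBM, push the data through its hidden layer, train the next RBM on the result, and keep the learned weights as initial values for BP fine-tuning.

    import numpy as np

    def pretrain_stack(data, layer_sizes, epochs=10):
        # Greedy layer-wise pre-training: one RBM per pair of adjacent layers.
        rng = np.random.default_rng(0)
        stack, x = [], data
        for n_vis, n_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
            W = 0.01 * rng.standard_normal((n_hid, n_vis))
            b, c = np.zeros(n_vis), np.zeros(n_hid)
            for _ in range(epochs):
                for v in x:
                    cd1_update(v, W, b, c)
            stack.append((W, c))
            # Hidden activations become the "data" for the next RBM.
            x = np.array([sigmoid(W @ v + c) for v in x])
        return stack  # initial weights for supervised BP fine-tuning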
The left side of Figure 4 shows an example [6]. The network has 4 layers and compresses a high-dimensional image signal down to 30 dimensions; that is, the number of neurons in the top layer is 30. We can also unfold the network symmetrically, expanding the 30 dimensions back to the original high-dimensional signal, which yields an 8-layer network (middle of Figure 4). If the network is used for signal compression, the target output of the network is set equal to its input, and the weights are then fine-tuned using the BP algorithm (right of Figure 4).
Figure 4 An example of a deep belief network [6]
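The "unfolding" step can be sketched as follows (my own schematic illustration, not the exact procedure of reference [6]): the pre-trained encoder weights are mirrored, with the transposed weights serving as the decoder, and the whole network is then fine-tuned by BP with the input itself as the target. This reuses the sigmoid and pretrain_stack sketches above.

    import numpy as np

    def unroll_autoencoder(stack):
        # Mirror the pre-trained encoder (from pretrain_stack) into a decoder.
        enc = list(stack)
        dec = [(W.T, np.zeros(W.shape[1])) for W, c in reversed(stack)]
        return enc + dec

    def reconstruct(x, layers):
        a = x
        for W, b in layers:
            a = sigmoid(W @ a + b)
        return a  # BP fine-tuning minimizes ||reconstruct(x) - x||^2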
This work rekindled the academic community's enthusiasm for neural networks, and a large number of outstanding researchers joined the study of deep neural networks, most notably Bengio's group at the University of Montreal and Ng's group at Stanford University. In terms of proposed models, an important contribution of Bengio's group is a family of deep learning networks based on the auto-encoder. The auto-encoder uses the same sigmoid activation function as the RBM, and its learning principle is consistent with the RBM's (both can be viewed as maximizing the likelihood of the data), but the means of realization differs. An important contribution of Ng's group is a series of deep learning networks based on sparse coding. Their work extended the definition of the deep network: within the same network, different layers may use different learning schemes.
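As a hint of what "sparse" means here, the sketch below (my own illustration, not the Ng group's actual formulation, which uses somewhat different penalties) adds an L1 sparsity term on the hidden code to a plain auto-encoder's reconstruction error.

    import numpy as np

    def sparse_ae_loss(x, W, b_enc, b_dec, lam=0.1):
        # Reconstruction error plus an L1 penalty encouraging sparse codes.
        h = 1.0 / (1.0 + np.exp(-(W @ x + b_enc)))  # hidden code
        x_hat = W.T @ h + b_dec                     # linear decoder
        return np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(h))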
It is worth emphasizing that a deep network with very high learning efficiency already existed before 2006 (from a historical perspective, it might be more appropriate to call it a multilayer neural network): the convolutional neural network [7]. This network was proposed by Yann LeCun of New York University in 1998 and has been widely applied to image classification (including handwriting recognition, traffic sign recognition, etc.). For example, in the IJCNN 2011 traffic sign recognition competition, a group of Swiss researchers won first place using a convolutional neural network. This network is essentially a multilayer perceptron (Figure 5), so why has it been so successful? The key, by most analyses, lies in its use of local connections and shared weights, which reduces the number of weights and thereby lowers the risk of overfitting. In recent years it has also been found that convolutional neural networks are more effective if they first undergo unsupervised learning and then supervised learning.
Figure 5 Schematic of a convolutional neural network [7]
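The essence of local connections and weight sharing can be seen in a few lines: every output location is computed from a small local patch using the same kernel, so the layer has only a kernel's worth of weights no matter how large the image is. A minimal NumPy sketch (following the deep learning convention of calling this operation convolution, although strictly it is cross-correlation):

    import numpy as np

    def conv2d_valid(image, kernel):
        # The same kernel (shared weights) slides over the image, and each
        # output value depends only on one local patch (local connections).
        H, W = image.shape
        kh, kw = kernel.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
        return out

    # A 3x3 kernel has 9 weights regardless of the image size.
    print(conv2d_valid(np.arange(25.0).reshape(5, 5), np.ones((3, 3))).shape)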
Like Hinton, LeCun is devoted to neural networks; he persisted with them even when almost everyone else had given up.
The academic community's renewed enthusiasm for neural networks quickly spread to industry, and some companies with a keen sense of smell followed up rapidly. In 2010, Dr. Li Deng of Microsoft Research Redmond, collaborating with Hinton, found that deep networks can significantly improve the accuracy of speech recognition [8]. This result was then pushed further by Microsoft Research Asia, which built several huge neural networks, one of which contains more than 66 million neural connections (Figure 6), the largest ever built in the history of speech recognition research. On the standard Switchboard dataset, the model's recognition error rate is 33% lower than the previous lowest error rate! One should know that in speech recognition, the lowest error rate on this dataset had not been improved for many years. For this work, Dr. Li Deng was interviewed by The New York Times. Microsoft Research Asia has posted a note on the Renren website② describing the story behind this work in detail.
Figure 6 A speech recognition model incorporating a deep network [8]
Google also quickly joined the study of deep networks. Working with Ng's group, Google researchers built a huge deep network [9], shown in Figure 7, with a total of 1 billion parameters to learn, arguably the largest neural network in history. They trained it for one week on 2,000 machines totaling 32,000 cores, and the classification accuracy obtained on the ImageNet dataset was 70% higher than the previous best result! The work was widely reported by news media such as The New York Times, the BBC and Time.
Figure 7 One subnetwork of the deep auto-encoder [9]. Multiple identical subnetworks are stacked together
In short, the concept of deep learning is now red hot, widely recognized by academia and industry alike. Large numbers of scholars from different fields are coming to join the fun, and famous conferences and journals such as ICML, NIPS and IEEE Trans. PAMI carry more and more related papers. Judging from the current situation, this grand feast will last for at least several more years.
Finally, a brief comment on possible trends in deep networks over the next few years. I believe at least the following two directions deserve attention.
First, how to add feedback connections to deep networks to improve performance. Existing deep networks contain only feedforward connections and no feedback connections, which differs from real biological neural networks. Because the dynamics of feedback neural networks are complicated, there are no general rules to follow: training algorithms are usually not universal, and different algorithms often have to be designed for different networks. Worse, compared with other machine learning methods that have arisen in recent years, these learning algorithms perform poorly, do not scale with data, and cannot meet the demands of large-scale data processing in today's Internet age. In recent years there has been important progress in this direction, such as reservoir networks and echo state networks [10]. The basic idea is to divide the weights into two parts: one part forms complex feedforward and feedback connections whose weights are fixed and need not be learned, while the other part consists of relatively simple connections (for example, only linear feedforward ones) whose weights are the only ones learned, as in the sketch below. How to use this idea in deep networks to improve performance is still being explored.
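A minimal sketch of that idea, loosely following the echo state network of [10]: the recurrent reservoir weights are random and fixed, and only a linear readout is learned, here by ridge regression. All sizes and scalings are illustrative.

    import numpy as np

    def esn_fit(inputs, targets, n_res=100, rho=0.9, ridge=1e-4, seed=0):
        # Fixed random reservoir; only the linear readout is trained.
        rng = np.random.default_rng(seed)
        W_in = rng.uniform(-0.5, 0.5, (n_res, inputs.shape[1]))
        W = rng.standard_normal((n_res, n_res))
        W *= rho / np.max(np.abs(np.linalg.eigvals(W)))  # set spectral radius
        states, x = [], np.zeros(n_res)
        for u in inputs:                  # run the reservoir over the sequence
            x = np.tanh(W_in @ u + W @ x)
            states.append(x)
        S = np.array(states)
        # Ridge regression for the readout: the only weights that are learned.
        return np.linalg.solve(S.T @ S + ridge * np.eye(n_res), S.T @ targets)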
Second, hardware and software cooperation. At present, most deep networks require a great deal of computation, and parallelization is essential. This is natural, because the brain, after all, processes information essentially in parallel. One approach is parallelization across machines, as Google did in its ICML 2012 work [9]; another is to use GPUs. The latter is clearly more economical for individual researchers. However, writing GPU code is still time-consuming and laborious for most researchers, and it will depend on hardware and software vendors working together to provide the community with increasingly user-friendly programming tools.
References:
[1] Werbos P J. Beyond regression: New tools for prediction and analysis in the behavioral sciences [D]. Boston: Harvard University, 1974.
[2] Rumelhart D, Hinton G, Williams R. Learning representations by back-propagating errors [J]. Nature, 1986, 323: 533-536.
[3] Hinton G E, Osindero S, Teh Y-W. A fast learning algorithm for deep belief nets [J]. Neural Computation, 2006, 18: 1527-1554.
[4] Erhan D, Bengio Y, Courville A, et al. Why does unsupervised pre-training help deep learning? [J]. Journal of Machine Learning Research, 2010, 11: 625-660.
[5] Hinton G E. Training products of experts by minimizing contrastive divergence [J]. Neural Computation, 2002, 14: 1771-1800.
[6] Hinton G E, Salakhutdinov R. Reducing the dimensionality of data with neural networks [J]. Science, 2006, 313(5786): 504-507.
[7] LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition [J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[8] Dahl G, Yu D, Deng L, et al. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(1): 30-42.
[9] Le Q, Ranzato M, Monga R, et al. Building high-level features using large scale unsupervised learning [C] // ICML 2012. Edinburgh: [s.n.], 2012: 81-88.
[10] Jaeger H, Haas H. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication [J]. Science, 2004, 304: 78-80.
Hu Xiaolin is a professor in the Department of Computer Science and Technology and the State Key Laboratory of Intelligent Technology and Systems, Tsinghua University. His research interests include artificial neural networks and neural and cognitive computing. E-mail: [email protected]
① http://en.wikipedia.org/wiki/neural_network
② http://page.renren.com/600674137/note/761726816