Deep learning and shallow learning
Deep learning is now in full swing, gradually claiming state-of-the-art status in one field after another. I got a taste of its effect in a course project last semester, and recently, while working on something else, I hit a small bottleneck with my model and finally decided to take a proper look at this magical field.
It is often said that the breakthrough of deep learning began with the algorithm Hinton introduced in 2006 for training Deep Belief Networks (DBN), which broke the deadlock that multi-layer neural networks had been stuck in for decades. After that, a variety of other algorithms and models were proposed and applied in many fields. More recently, Google hiring Hinton, New York Times cover stories and other mass-media coverage have made deep learning hotter still. I remember seeing this passage in a draft by my advisor:
I do not expect that this paper would ever be published in the usual journals. "Success" for a paper published in this way would consist, I believe, of making an impact–measured in terms of citations, for instance–and perhaps of being eventually "reviewed" in sites such as Wired or Slashdot or Facebook, or even in a News and Views-type article in traditional journals such as Science or Nature.
The same is happening in academia. In Automatic Speech Recognition (ASR), for instance, deep learning has not only surpassed the traditional state-of-the-art algorithms, but the margin of improvement amounts to a new breakthrough for the ASR field itself (Hinton et al., 2012). In Collaborative Filtering, deep learning played an important role in the algorithm that ultimately won the Netflix Prize. In Computer Vision (CV), it has beaten state-of-the-art results on a variety of large benchmark databases (e.g. Krizhevsky, Sutskever, & Hinton, 2012), and Google is said to be starting to use deep learning in its image search. I don't know much about NLP, but judging from the "Deep Learning for NLP (without Magic)" tutorial it has achieved considerable success there as well. Even COLT, the purely theoretical machine learning conference, has begun to join the excitement. There is a reading list on deeplearning.net that collects representative deep learning papers from various fields. Since 2013, deep learning even has its own dedicated conference: the International Conference on Learning Representations (ICLR).
As the conference name suggests, much of what matters in deep learning is obtaining a good representation. Many experiments show that even if you discard the classification/regression model sitting on top of a trained deep network and use the network purely as a feature extractor, feeding the extracted features into an ordinary classifier such as an SVM often still yields a performance improvement. From an information-theoretic point of view, the Data Processing Inequality says that feature extraction cannot add "information", but from a practical point of view a good representation is undoubtedly very important. I recently heard a vivid example of this: some people complain that multiplication is much harder than addition. To add 9480208 and 302842 you just align the digits, add them one by one and handle the carries; even with my poor mental arithmetic that is probably manageable, but multiplication is another matter. The difficulty here, however, stems from the fact that the decimal representation we normally use happens to favour addition. If we change the representation, each number can be expressed equivalently as the multiset of its prime factors, for example 72 = {2, 2, 2, 3, 3} and 75 = {3, 5, 5}.
Under this representation it is easy to multiply the two numbers: the product is simply the union of the two multisets, 72 × 75 = {2, 2, 2, 3, 3, 3, 5, 5} = 5400.
Conversely, addition becomes difficult under such a representation. For the same reason, the problem of representation has long been an important research topic in machine learning and related fields. Different problems, different data and different models call for quite different representations, and finding the right representation often lets you achieve more with less effort.
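As a toy illustration of how a representation shifts which operations are easy, here is a small sketch (my own illustrative Python, not from any particular library): multiplication reduces to merging the factor multisets, while addition forces us back to the decimal representation.

```python
from collections import Counter

def factorize(n):
    """Represent a positive integer as a multiset (Counter) of its prime factors."""
    factors, d = Counter(), 2
    while d * d <= n:
        while n % d == 0:
            factors[d] += 1
            n //= d
        d += 1
    if n > 1:
        factors[n] += 1
    return factors

def multiply(a, b):
    """Multiplication is trivial in this representation: merge the factor multisets."""
    return a + b  # Counter addition sums the multiplicities

def to_int(factors):
    """Converting back to decimal is needed before we can add two numbers."""
    result = 1
    for p, k in factors.items():
        result *= p ** k
    return result

a, b = factorize(72), factorize(75)
print(multiply(a, b))           # Counter({2: 3, 3: 3, 5: 2})
print(to_int(multiply(a, b)))   # 5400
# Addition has no simple rule on the factor multisets; we must convert back first:
print(to_int(a) + to_int(b))    # 147
```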
Figure 1: HMAX, a convolutional feedforward object recognition model inspired by the primate ventral visual pathway.
For a particular problem, the usual workflow after acquiring data is to apply some feature extraction step, such as SIFT + Bag of Words in vision, or MFCC features in speech. These feature extraction algorithms are usually hand-designed based on the characteristics of the problem and the data, and designing better features is itself an important research problem in every field. What the recent deep learning results prefer to do is to start from the "raw data" (e.g. the pixel bitmap in vision) and learn the representation automatically, often with better results than the previously hand-designed features. But I don't think this means deep learning is a panacea. On the one hand, effectively incorporating domain knowledge is still very important in a domain we already understand; on the other hand, it is not a black box into which you can just throw raw data and magically get decent features back. The difficulty of training deep models seems to be widely acknowledged, and models such as the Convolutional Neural Network (CNN) have a network structure that was itself designed by people around the invariance properties required of the underlying data. In speech, for example, the current best practice seems to be to do deep learning on Mel frequency filter bank features rather than on the raw sound waveform, i.e. still building on the classic processing steps for speech data.
Besides manually designed features, there are also, apart from deep learning, many so-called "shallow" data-driven feature extraction algorithms. The most classic is PCA dimensionality reduction, which can be interpreted in terms of noise removal and from many other angles. On data such as microarrays, each sample point has very high dimensionality while the high cost of sample collection keeps the number of samples very low, so all kinds of dimensionality reduction and feature selection methods have emerged to limit model complexity and avoid serious overfitting on small-sample data.
At the other end of the spectrum sits the "Big Data" world, where data is as cheap as cabbage. The abundance of samples (and the improvement in computing power) makes it possible to train more complex models, which in turn "raise the dimension" of the data to obtain a richer representation; the kernel method is the classic tool here. Its basic idea is to map the data into a Reproducing Kernel Hilbert Space (RKHS) via the feature map induced by a positive definite kernel, and then process the data with a linear model in the RKHS.
This amounts to a non-linear feature extraction step. On the one hand, the kernel function lets us compute (inner products of) points in the mapped feature space at a cost that depends on the dimensionality of the original data space rather than of the feature space; on the other hand, the feature space of a kernel such as the Gaussian kernel is actually an infinite-dimensional space, so it offers considerable degrees of freedom. In addition, the kernel method is non-parametric: we do not need to assume that the target hypothesis takes some particular functional form, but can look for the closest function in the whole RKHS. The Representer Theorem then turns this optimization over functions into an optimization problem on a finite-dimensional space, and the final approximation of the target becomes a function that the kernel "interpolates" on the training data: $f(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x)$, where the coefficients $\alpha_i$ are determined by the training procedure.
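As a concrete, minimal sketch of this interpolation view, here is kernel ridge regression with a Gaussian kernel in plain numpy; the bandwidth and regularization values are arbitrary illustrative choices, not recommendations.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=0.1):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-d2 / (2 * sigma**2))

# Toy 1-D regression data
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(50)

# Kernel ridge regression: solve (K + lambda I) alpha = y
lam = 1e-3
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# The learned function is the kernel "interpolant" f(x) = sum_i alpha_i k(x_i, x)
X_test = np.linspace(0, 1, 200)[:, None]
f_test = gaussian_kernel(X_test, X) @ alpha
```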
So, in a sense, the kernel method does not use the whole RKHS, but only the subspace spanned by the (mapped) training data. Since the Representer Theorem actually guarantees that the optimum found in this subspace coincides with the optimum over the whole space, the limitation here does not come from the kernel method itself, but from the fact that we only have finite training data and approximate Risk Minimization by Empirical Risk Minimization. There is also a large body of work on kernel methods in learning theory; although the theoretical bounds in machine learning theory have so far rarely been usable to directly guide concrete issues such as parameter selection, the theoretical work is still far from negligible.
There is a rough, intuitive picture of how the kernel method works. Consider the most commonly used Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$, where $\sigma$ is the kernel (bandwidth) parameter. For a particular point, depending on the size of $\sigma$, the kernel value of any data point beyond a certain radius becomes negligibly small, so the linear combination above effectively only involves a local neighborhood around that point.
Figure 2: The kernel method uses local neighbors for interpolation.
In other words, within each local neighborhood the method can be viewed approximately as local linear regression, while globally the coefficients of the overlapping local regressions are tied together, possibly with some global regularization on top. If the target function is smooth, or our data points are dense enough, and the kernel parameter is chosen so that the local neighborhood size is appropriate, the function can usually be approximated well by linear functions in the local neighborhood of each point.
However, this seemingly nice property has also been questioned (Bengio, Delalleau, & Roux, 2005): if the training data cover only a small region, the accuracy of this kind of interpolation is debatable, and such a method looks more like "memorizing" than "learning". Yoshua Bengio calls this a local representation: a model built on a local smoothness assumption whose required amount of data depends heavily (usually growing exponentially) on the dimensionality of the data (or the intrinsic dimensionality of the data manifold), hence the curse of dimensionality. In (Bengio, Delalleau, & Roux, 2005) he further points out that such methods are also ill-suited to learning functions with many local variations (such as a high-frequency sin function). Are the functions encountered in real applications locally smooth or highly varying? If a function varies irregularly at high frequency, the problem itself is hard from an information-theoretic point of view; but if it varies a lot locally while exhibiting global regularity, then even if a local algorithm cannot handle it, an algorithm that looks at the problem globally may still manage. The simplest example is the high-frequency sin function itself. Ideally, if we have prior knowledge of the global pattern of such a function, then with an appropriate representation the problem can usually be converted into a "simple" form; but can such a global pattern be learned automatically from data by an algorithm?
One attempt to explain why deep learning works better is that deep learning obtains a so-called "distributed representation" (Bengio, Courville, & Vincent, 2013).
Another problem with kernel methods is that in practice there are not really many kernels to choose from; for example, the kernel functions listed in LIBSVM's help message:
- -t kernel_type : set type of kernel function (default 2)
- 0 -- linear: u'*v
- 1 -- polynomial: (gamma*u'*v + coef0)^degree
- 2 -- radial basis function: exp(-gamma*|u-v|^2)
- 3 -- sigmoid: tanh(gamma*u'*v + coef0)
There may be domain-specific kernels in some special areas, such as text kernels, but constructing a kernel function is not a trivial matter, because you have to make sure it is positive definite. Because of this limitation, there is also work on extending a kernel-like framework to ordinary similarity functions that are not required to be positive definite (Balcan, Blum, & Srebro, 2008).
Another problem is computational complexity: the kernel matrix involved in kernel methods is of size $n \times n$, where $n$ is the number of training points. With large amounts of data the kernel matrix becomes extremely unwieldy, both to compute and to store. There is plenty of work on approximating the kernel matrix from a sampled subset (Williams & Seeger, 2000), but in many cases one still falls back to the linear kernel, where a different formulation allows the computational complexity to grow with the dimensionality of the data rather than with the number of data points. At that point, however, the non-linear feature mapping the kernel was supposed to provide is gone, since the so-called linear kernel is effectively equivalent to not using a kernel at all.
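For concreteness, here is a minimal numpy sketch of the Nyström idea behind (Williams & Seeger, 2000): approximate the full $n \times n$ kernel matrix from a random subset of $m$ landmark points. The sampling scheme and sizes are illustrative assumptions, not the exact recipe from the paper.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 10))   # n = 2000 points in 10 dimensions

# Nystrom: pick m << n landmark points and never form the full n x n matrix
m = 100
idx = rng.choice(len(X), size=m, replace=False)
C = gaussian_kernel(X, X[idx])        # n x m block of the kernel matrix
W = C[idx]                            # m x m block between the landmarks

# K is approximated as C W^+ C^T; equivalently each point gets an explicit
# m-dimensional feature vector, which can then be fed to a linear model.
U, s, _ = np.linalg.svd(W)
W_inv_sqrt = U @ np.diag(1.0 / np.sqrt(s + 1e-12)) @ U.T
features = C @ W_inv_sqrt             # n x m explicit features
K_approx = features @ features.T      # only to check; normally never materialized
```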
Computational considerations aside, there is another important difference between the usual kernel functions and the representation learning of deep learning: for kernels such as the Gaussian kernel, the corresponding representation is fixed in advance rather than derived from data. Of course there is also plenty of research on data-driven kernels. For example, it has been pointed out that classical manifold learning algorithms such as Isomap, LE and LLE are in fact equivalent to constructing a special data-dependent kernel and then doing Kernel PCA (Ham, Lee, Mika, & Scholkopf, 2004). Explicitly data-driven kernel learning treats the kernel matrix itself as a variable (a positive semi-definite matrix) to be optimized via Semi-Definite Programming (SDP) (Lanckriet, Cristianini, Bartlett, Ghaoui, & Jordan, 2004); although SDP is a convex optimization problem, it becomes painfully slow as soon as the data size gets slightly larger.
A related line of work is Multiple Kernel Learning (MKL), which combines several kernels. Since the combination coefficients are optimized on the training data, this is actually a special case of data-driven representation learning, and because it adds a layer of composition on top of the kernels it appears to have one more layer than the usual shallow architecture: the kernel combination coefficients play a role somewhat similar to the hidden layer of a multi-layer neural network. Neural network (and other) models with only one hidden layer (or none) are commonly called shallow models, while those with more than one hidden layer are called deep models.
Figure 3: The classic Apache "It works!" page. Image from the Internet.
Getting a good representation is a critical issue, and the local representations given by kernel methods have run into a bottleneck in front of complex AI-related problems. But why does it have to be deep? There are all sorts of reasons, but I think the most important one is something the open source web server Apache has been quietly emphasizing for years, in the sentence that greets you on the default page of a freshly installed Apache:
It works!
As mentioned at the very beginning, although deep models are said to be difficult and tricky to train, people have nevertheless succeeded in applying deep learning in one field after another and matching or even beating the various past state-of-the-art results. From a practical point of view this is convincing enough, but a curious person will of course want to know why. There is also a great deal of work that tries to interpret and explore this, some of which is listed below.
Figure 4: Vanship from "Last Exile".
One explanation comes from biology or neuroscience: current research on human intelligence, and on the visual system in particular, suggests that the brain's information processing in this area is a hierarchical architecture with layer-by-layer abstraction (Serre et al., 2007). Although this sounds persuasive, it does not really explain why a multi-layer structure is better; it merely says that humans happen to work this way, so imitating it at least sounds more reliable than fumbling around blindly. Yann LeCun cited a vivid analogy in one of his tutorials: humans did not build aircraft by simply following zoology and gluing a pair of wings onto our arms, but by understanding the essential reason why such a structure can fly, namely the aerodynamics and other theory behind it; only then did we truly master the "skill" of flying.
The other angle concerns the so-called highly variable functions (Bengio, Delalleau, & Roux, 2005) that, as just discussed, kernel methods do not handle well, whereas a deep architecture can express such mappings more efficiently. More generally, although a neural network with a single hidden layer already has a certain universal approximation property, it is not necessarily efficient: there are functions that can be computed concisely by a logic-gate network of depth $k$, but that require an exponential number of gates when the depth is restricted to $k-1$ (Bengio, 2009); the parity function is a classic example of a function that is cheap at sufficient depth but exponentially expensive at small depth. Of course, many questions remain open. What is the connection between such logic-gate circuits and the functions encountered in machine learning problems (Orponen, 1994)? Are the functions met in machine learning really so highly variable that a deep architecture is required to express them efficiently? Is such a function space even learnable? What are the difficulties in optimizing and solving it (Glorot & Bengio, 2010)? Can the generalization performance of what is learned be guaranteed, and how? And so on.
In fact, people have been trying to train brain-like multi-layer neural networks since the last century, but once the number of layers grows it usually proves impossible to train a good model, especially after it was shown that a single hidden layer already suffices for a neural network to express arbitrary Boolean functions (Mendelson, 2009), which took away much of the motivation for depth. So, apart from specially designed architectures such as convolutional networks, general deep architectures only began to show their power after Hinton and colleagues introduced greedy layer-wise pre-training in 2006 (Hinton, Osindero, & Teh, 2006). After the original pre-training based on the Restricted Boltzmann Machine (RBM) (Hinton, Osindero, & Teh, 2006), various Auto-Encoder (AE) variants followed (Vincent, Larochelle, Lajoie, Bengio, & Manzagol, 2010), (Rifai, Vincent, Muller, Glorot, & Bengio, 2011), and even supervised layer-wise pre-training (Bengio, Lamblin, Popovici, & Larochelle, 2006).
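To make the layer-wise idea concrete, here is a minimal sketch of greedy pre-training with simple tied-weight autoencoders (the AE variant, not the RBM-based procedure of Hinton, Osindero, & Teh, 2006); the architecture, learning rate and epoch counts are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, lr=0.1, epochs=100):
    """One-hidden-layer autoencoder with tied weights, sigmoid encoder,
    linear decoder and squared-error loss, trained by plain gradient descent."""
    n, n_in = X.shape
    W = rng.standard_normal((n_in, n_hidden)) * 0.01
    b, c = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W + b)          # encoder
        R = H @ W.T + c                 # decoder (tied weights)
        dR = (R - X) / n                # gradient of 0.5 * mean squared error
        dpre = (dR @ W) * H * (1 - H)   # back-prop through the encoder nonlinearity
        gW = X.T @ dpre + dR.T @ H      # W appears in both encoder and decoder
        W -= lr * gW
        b -= lr * dpre.sum(axis=0)
        c -= lr * dR.sum(axis=0)
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Train each layer as an autoencoder on the representation produced by the
    layers below, then use the stacked weights to initialize back-propagation."""
    weights, inp = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(inp, n_hidden)
        weights.append((W, b))
        inp = sigmoid(inp @ W + b)
    return weights

X = rng.standard_normal((500, 64))      # stand-in for real data
stack = greedy_pretrain(X, [32, 16])    # two hidden layers, pre-trained bottom-up
```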
So someone is bound to ask: why does pre-training work? Is pre-training necessary for it to work? And so on. In general, the objective function for training a neural network is very badly behaved for optimization, with, for example, a great many local optima. The common belief is that using the pre-training result as the initialization for back-propagation helps place the starting point of (stochastic) gradient descent in a better region, so that it converges to a better (local) optimum. In addition, pre-training is believed to act as a regularizer, improving generalization performance. For a detailed discussion see (Erhan, Courville, Bengio, & Vincent, 2010).
As for whether pre-training is necessary: experimental results have since shown that, with enough training data, a suitable (random) initialization and suitable non-linearities between neurons, direct supervised training without any pre-training can also give good results (Ciresan, Meier, Gambardella, & Schmidhuber, 2010), (Glorot, Bordes, & Bengio, 2011), (Sutskever, Martens, Dahl, & Hinton, 2013). However, these results usually rest on large amounts of data, combined with various tricks (Montavon, Orr, & Muller, 2012), plus high-performance GPUs and specially optimized parallel algorithms, and training "long enough". So it no longer seems hard to explain why deep neural networks were not successfully trained before the era of "Big Data" and GPU parallelism.
For more in-depth analysis and justification, one usually starts from the question of why training deep architectures is difficult (Glorot & Bengio, 2010). It is generally believed that when training a deep neural network the objective function has many local minima and plateaus, and first-order gradient descent easily gets stuck in a local optimum, so it is natural to want to try second-order methods. However, a neural network has so many parameters that the Hessian matrix is not only hard to compute but, even with various approximations, troublesome just to store. This is what makes the second-order optimization algorithm called Hessian-Free (HF) (Martens, 2010) particularly interesting: it uses the R-operator (Pearlmutter, 1994) to compute the product of the Hessian with a vector directly, instead of first computing the whole Hessian matrix and then multiplying it by the vector with an ordinary matrix operation. Experimental results show that HF second-order optimization can achieve very good results without any pre-training.
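To see why a Hessian-vector product is so much cheaper than the Hessian itself, here is a minimal sketch on a toy loss (my own illustration, not the HF algorithm): the product is approximated by a finite difference of two gradient evaluations, which never forms the Hessian; the exact R-operator computes the same quantity analytically.

```python
import numpy as np

# Toy loss: L(theta) = 0.25 * sum(theta^4) + 0.5 * theta' A theta, with known gradient.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20))
A = A @ A.T                                   # symmetric positive semi-definite

def grad(theta):
    return theta**3 + A @ theta

def hessian(theta):                           # explicit Hessian, only for checking
    return np.diag(3 * theta**2) + A

def hvp(theta, v, eps=1e-5):
    """Hessian-vector product via a finite difference of gradients:
    H v ~= (grad(theta + eps v) - grad(theta)) / eps.
    Costs two gradient evaluations; the full Hessian is never formed."""
    return (grad(theta + eps * v) - grad(theta)) / eps

theta = rng.standard_normal(20)
v = rng.standard_normal(20)
print(np.allclose(hvp(theta, v), hessian(theta) @ v, atol=1e-3))  # True
```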
A quick aside: there is a Python library called Theano that provides the various building blocks for deep learning and its optimization. For example, it offers symbolic differentiation so gradients are computed automatically and you don't have to derive them by hand and write back-propagation yourself, and it also integrates the R-operator needed for second-order optimization. The resulting computation is compiled automatically into native code for fast execution, and when a GPU device is present it can be compiled seamlessly into GPU-parallel code to accelerate computation (although at the moment it seems to support only CUDA). There is also a Deep Learning Tutorial that uses Theano to introduce and implement a number of mainstream deep learning algorithms.
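A tiny sketch of the symbolic-gradient part, based on Theano's documented `T.grad` interface (details may differ across versions):

```python
import theano
import theano.tensor as T

# A symbolic scalar loss of a weight vector w; sum(w**2) is just a stand-in.
w = T.dvector('w')
loss = T.sum(w ** 2)

grad = T.grad(loss, w)               # symbolic gradient, no hand-written backprop
f_grad = theano.function([w], grad)  # compiled to native (or GPU) code

print(f_grad([1.0, 2.0, 3.0]))       # [2. 4. 6.]
```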
Back to the earlier question: the success of HF optimization can be said to have opened a door, and attacking the problem directly from the side of general-purpose optimization algorithms is a direction well worth exploring. But besides local minima and plateaus, deep architectures have another problem: the top two layers of the network overfit very easily, so the value of the optimized objective sometimes does not say much. Once the top two layers have overfit, very little information flows back to the layers below, so the lower-layer weights get almost no training and stay close to their random initialization, and the resulting model has almost no generalization ability. More recently, in the work on rectifier non-linearities (Glorot, Bordes, & Bengio, 2011), (Krizhevsky, Sutskever, & Hinton, 2012), a related technique called maxout (Goodfellow, Warde-Farley, Mirza, Courville, & Bengio, 2013) was found to let the lower-layer weights receive more training. In addition, noise-injection techniques such as dropout (Hinton, Srivastava, Krizhevsky, Sutskever, & Salakhutdinov, 2012), (Wang & Manning, 2013) are used in practice as a powerful regularizer to avoid overfitting.
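Mechanically, the dropout idea is very simple. Here is a minimal sketch of the common "inverted dropout" formulation (an illustration, not the exact procedure of any particular paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, training=True):
    """Inverted dropout: randomly zero units during training and rescale the
    survivors by 1/(1 - p_drop), so no rescaling is needed at test time."""
    if not training or p_drop == 0.0:
        return activations
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

h = rng.standard_normal((4, 8))       # a batch of hidden-layer activations
print(dropout(h, p_drop=0.5))         # roughly half the units zeroed out
```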
Although the first thing one worries about with neural networks is certainly overfitting, and almost everyone's attention has gone into fighting it, some recent experiments (Dauphin & Bengio, 2013) suggest that once the data and the network reach a certain scale, an underfitting problem appears instead, caused by the difficulty of the optimization problem. For difficulties in other respects, Yoshua Bengio's recent article (Bengio, 2013) summarizes the various problems and challenges currently encountered in deep learning, along with possible directions for addressing them.
Finally, I have not made a systematic survey of applications, but the currently flourishing deep learning applications seem to concentrate mostly on classic AI-related problems (object recognition, speech recognition, NLP and the like), or, more generally, much of the work focuses on classification. So I think an interesting question is whether this kind of deep model enjoys some special structural advantage on AI-related problems (analogous to the hierarchical abstraction mechanism of human intelligence), or whether such models can also achieve results far beyond other common shallow models in non-traditional-AI domains. Another point is that hierarchical abstraction, or the layer-by-layer build-up of invariance as in convolutional networks, seems natural for classification problems, but what about regression? One seems to see far fewer examples of deep neural networks solving concrete multi-output regression problems.
Figure 5: Neural Networks: Tricks of the Trade (2nd Edition).
As for the specific deep learning models and the details of their training algorithms, I originally wanted to write about them in detail when I had time, but the summer vacation is almost over and I still have plenty of other half-dug pits to fill, so for the time being I probably won't manage to write anything more detailed. How will deep learning develop? Is it the holy grail of AI? We shall wait and see. :)
References
- Balcan, M.-F., Blum, A., & Srebro, N. (2008). A theory of learning with similarity functions. Machine Learning, (1–2), 89–112.
- Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127.
- Bengio, Y. (2013). Deep learning of representations: Looking forward. In SLSP (pp. 1–37).
- Bengio, Y., Courville, A. C., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., (8), 1798–1828.
- Bengio, Y., Delalleau, O., & Roux, N. L. (2005). The curse of highly variable functions for local kernel machines. In NIPS.
- Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2006). Greedy layer-wise training of deep networks. In NIPS (pp. 153–160).
- Ciresan, D. C., Meier, U., Gambardella, L. M., & Schmidhuber, J. (2010). Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, (12), 3207–3220.
- Dauphin, Y., & Bengio, Y. (2013). Big neural networks waste capacity. CoRR, abs/1301.3583.
- Erhan, D., Courville, A. C., Bengio, Y., & Vincent, P. (2010). Why does unsupervised pre-training help deep learning? AISTATS, 9, 201–208.
- Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS, 9, 249–256.
- Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. AISTATS, 315–323.
- Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A. C., & Bengio, Y. (2013). Maxout networks. In ICML.
- Ham, J., Lee, D., Mika, S., & Scholkopf, B. (2004). A kernel view of the dimensionality reduction of manifolds. In ICML.
- Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, (7), 1527–1554.
- Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.
- Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., ... Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, (6), 82–97.
- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In NIPS (pp. 1106–1114).
- Lanckriet, G. R. G., Cristianini, N., Bartlett, P. L., Ghaoui, L. E., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. JMLR, 5, 27–72.
- Martens, J. (2010). Deep learning via Hessian-free optimization. In ICML (pp. 735–742).
- Mendelson, E. (2009). Introduction to Mathematical Logic (5th ed.). Chapman and Hall/CRC.
- Montavon, G., Orr, G., & Muller, K.-R. (2012). Neural Networks: Tricks of the Trade (2nd ed.). Springer.
- Orponen, P. (1994). Computational complexity of neural networks: A survey. Nordic Journal of Computing.
- Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Computation, 6(1), 147–160.
- Rifai, S., Vincent, P., Muller, X., Glorot, X., & Bengio, Y. (2011). Contractive auto-encoders: Explicit invariance during feature extraction. In ICML (pp. 833–840).
- Serre, T., Kreiman, G., Kouh, M., Cadieu, C., Knoblich, U., & Poggio, T. (2007). A quantitative theory of immediate visual recognition. Progress in Brain Research, 165, 33–56.
- Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In ICML.
- Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11, 3371–3408.
- Wang, S., & Manning, C. (2013). Fast dropout training. In ICML.
- Williams, C. K. I., & Seeger, M. (2000). Using the Nyström method to speed up kernel machines. In NIPS (pp. 682–688).