Author profile: Jie, director of data science at Etsy and former senior manager at Yahoo Research. He has worked for many years on recommender systems, machine learning, and artificial intelligence, has published more than 20 papers at top international conferences, and has long served as a program committee member and reviewer for several international conferences and journals.
Editor: He Yongcan. Contributions in the field of artificial intelligence technology are welcome. To submit a manuscript, or to report an error in this article, please email heyc@csdn.net.
This article is an original article from "Programmer" and may not be reprinted without permission. For more articles, please subscribe to the 2017 issues of "Programmer".
Academic papers in artificial intelligence and machine learning are voluminous. Each year's top conferences and workshops accept thousands of papers, and even attending in person it is difficult to keep up with everything at the cutting edge. With limited time and energy, choosing which papers to study and which hot techniques to learn has become a headache for AI scholars and practitioners alike. The purpose of this column is to help readers screen out interesting papers, interpret their core ideas, and provide guidance for close reading.
NIPS (the Conference on Neural Information Processing Systems) is a top conference in AI and machine learning, hosted by the NIPS Foundation every December, and it attracts many international experts in machine learning, artificial intelligence, and statistics. The author has selected ten interesting papers from NIPS 2016 to interpret for readers.
Using Fast Weights to Attend to the Recent Past
Highlights: Beyond short-term memory, long-term memory, and the LSTM, is there a better attention mechanism?
The author lineup is star-studded: Jimmy Ba from the University of Toronto; Volodymyr Mnih, Joel Leibo, and Catalin Ionescu of Google DeepMind; plus Geoffrey Hinton. The article begins by laying out the current problem clearly: in traditional recurrent neural networks (RNNs) there are two forms of memory, and the two have different structures, purposes, and capacities. Short-term memory stores information directly in the hidden vector, with capacity O(H), where H is the number of hidden units. Long-term memory, on the other hand, uses the current input and the hidden vector to produce the next output and a new hidden vector; its total capacity is O(H^2) + O(IH) + O(HO), where I and O are the numbers of input and output units. The comparatively traditional long short-term memory network (LSTM) still has only O(H) capacity for short-term memory. The core of the article is a mechanism that provides memory more effectively. The article does devote a section to the physiological inspiration for the idea, but that part seems mainly there to elevate the story and is not directly tied to the main model that follows. Simply put, the model proposed in this article makes the following improvements on the traditional RNN:
The next hidden vector is determined by two things: as usual, the current hidden vector and the current input; and, in addition, an attention-like mechanism, here called the fast weights matrix, that acts on the previous hidden vectors.
These fast weights decay over time.
How should one understand fast weights? Intuitively, fast weights implement an attention mechanism that compares the current hidden vector with every past hidden vector and determines the strength of attention through their inner products. With such a mechanism, the whole model can recall similar memories from the past and produce a response that synthesizes recent information. To stabilize the fast weights, the article also uses layer normalization. Some of the experimental results are striking: on a synthetic dataset, for example, the model easily reaches a 0% error rate, and on visual attention over MNIST the proposed model also performs very well. In short, this article is worth a broad read, and for readers interested in attention mechanisms it is material for close reading.
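To make the two-point description above concrete, here is a minimal numpy sketch of one recurrent step following the update rule the paper describes; the hyperparameter values (lam, eta, inner_steps) and all variable names are ours, purely illustrative, not the authors' code.

```python
import numpy as np

def layer_norm(v, eps=1e-5):
    # Normalize a vector to zero mean and unit variance.
    return (v - v.mean()) / (v.std() + eps)

def fast_weights_step(x, h, A, W, C, lam=0.95, eta=0.5, inner_steps=3):
    """One step of the fast-weights RNN (sketch).

    A is the fast weight matrix: it decays by `lam` each step and stores an
    outer product of the latest hidden state, so that A @ hs amounts to
    attending over recent hidden states, weighted by inner products.
    """
    A = lam * A + eta * np.outer(h, h)        # decay, then write new memory
    drive = W @ h + C @ x                     # slow-weight drive, fixed below
    hs = np.tanh(drive)
    for _ in range(inner_steps):              # inner loop settles the state
        hs = np.tanh(layer_norm(drive + A @ hs))
    return hs, A
```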
Learning Structured Sparsity in Deep Neural Networks
Highlights: How can structured sparsity, popular in previous years, be combined with DNNs? This article gives one idea.
This article comes from researchers at the University of Pittsburgh. Its core idea is very clear: introduce structured sparsity into DNNs, so that the final network has a relatively compact representation, which speeds up computation and also yields a hardware-friendly representation that hardware can execute quickly. Although there has been prior work on compressing DNNs, the authors argue that those compression methods (such as direct L1 regularization) can leave the network with random connections whose memory access is irregular; in that case the new model, despite a large apparent sparsity, does not actually compute faster and is sometimes even slower. Another recent idea is the low-rank approximation approach: train the DNN first, then factorize each layer's tensor and substitute smaller factors as an approximation. Its advantage is real acceleration; its disadvantage is the need to re-fine-tune the model's accuracy afterwards. This article sets out to fix these shortcomings. The authors combine structured sparsity learning (SSL), popular in previous years, with DNNs. Specifically, the group lasso method is used to drive groups of DNN parameters to zero in a structured way. The authors use it in three ways: penalizing unimportant filters and channels, setting some filters and channels entirely to zero; learning arbitrary filter shapes, by zeroing weights in 2D space; and shortening the number of layers in the DNN, removing entire layers outright, with shortcuts added so the network is not broken.
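The structured-sparsity idea above boils down to adding a group lasso term to the training loss. Below is a minimal numpy sketch of such a regularizer over a convolutional weight tensor; the grouping shown (per-filter and per-channel) follows the first of the three uses above, and all names are illustrative rather than the paper's code.

```python
import numpy as np

def group_lasso_penalty(conv_w):
    """Structured-sparsity (group lasso) penalty on a conv weight tensor.

    conv_w has shape (num_filters, num_channels, kh, kw). Penalizing the
    l2 norm of each filter (and each channel) pushes whole groups of
    weights to zero together, so entire filters/channels can be removed
    and memory access stays regular.
    """
    num_filters, num_channels = conv_w.shape[:2]
    # Filter-wise groups: one l2 norm per output filter.
    filter_term = sum(np.linalg.norm(conv_w[n]) for n in range(num_filters))
    # Channel-wise groups: one l2 norm per input channel.
    channel_term = sum(np.linalg.norm(conv_w[:, c]) for c in range(num_channels))
    return filter_term + channel_term

# Total loss (sketch): data_loss + lambda_g * group_lasso_penalty(conv_w)
```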
The article does not spell out the learning algorithm for the combined SSL-and-DNN case. The experimental section is very detailed, covering LeNet on MNIST, a ConvNet and ResNet on CIFAR-10, and AlexNet on ImageNet. The overall impression is that, in many cases, the sparser DNN actually brings improved accuracy.
Operator Variational Inference
Highlights: Want to know what is wrong with the KL divergence in variational inference? This article can give you some inspiration.
This article is from David Blei's lab. The main thrust is relatively straightforward, though the details are quite technical; the core idea is how to improve on the KL divergence used in variational inference (VI). As is well known, the essence of VI is to transform a Bayesian inference problem into an optimization problem: in the classic setting, solving for the posterior distribution becomes the process of finding, under a KL-divergence objective, the variational distribution closest to the true posterior. This process has two problems. First, the posterior's variance is typically underestimated, yielding erroneous solutions that automatically exclude some latent variable configurations.
Second, under the KL divergence the objective may become infinite when the variational distribution's support is larger than that of the actual posterior. To address these issues, this article presents a new framework called operator variational objectives, which has three components: an operator that depends on the posterior and the variational distribution; a family of test functions; and a distance function.
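For contrast, the classic KL-based objective that this framework generalizes can be written as follows; this is standard VI material, not notation from the paper itself:

```latex
% Classic KL-based VI: choose q from a variational family Q to minimize
\[
q^{*} \;=\; \arg\min_{q \in \mathcal{Q}}
\mathrm{KL}\bigl(q(z)\,\|\,p(z \mid x)\bigr)
\;=\; \arg\min_{q \in \mathcal{Q}}
\mathbb{E}_{q}\bigl[\log q(z) - \log p(z \mid x)\bigr].
\]
% If q puts mass where the posterior has none, the integrand (and hence
% the objective) blows up -- the support problem noted above.
```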
Traditional KL-divergence-based VI can be written as a special case of this new framework. Under the new framework, however, the optimization process is quite difficult. The article has no actual experiments; if you have a deep interest in VI, you may want to read it. Given that most models today choose the KL divergence, this article does not offer much immediate practical help, but its discussion of the problems with the KL divergence is a useful reference.
Exponential Family Embeddings
Highlights: Don't be dazzled by the various embedding models. This article unifies many similar models under one simple framework.
This article is also from David Blei's lab. Its core contribution is generalizing the word2vec idea to other application scenarios, providing a more general modeling framework under which many other similar models can be expressed as special cases. The new framework is the exponential family embedding (EF-EMB), which contains three elements: a context function; a conditional exponential family; and an embedding structure.
First, the context function defines how the current data point is linked to the other data points in its context. This is a modeling choice: for language data the context can be the surrounding words; for neural data, the surrounding neurons; for shopping data, the other items in the shopping cart. Second, the conditional exponential family defines a suitable distribution for the data-generating process: for language modeling a categorical distribution, and for real-valued data a Gaussian. Also, under this conditional exponential family, each data point has two embeddings: an embedding vector and a context vector. In plain words, each data point decomposes into the product of its embedding vector and a set of context vectors, as determined by the context function above. Third, the embedding structure defines how embeddings are shared across the model. For language data, for example, each word has a unique embedding vector and a unique context vector; other settings are possible, and the two can also be tied. With these structures defined, the objective function is the sum of the log conditional probabilities plus log-regularizer terms. The paper discusses several example models; in short, a number of existing embedding models are easily reproduced in this framework. Inference is based on SGD, and the article also discusses how to obtain results similar to negative sampling in the SGD setting. All in all, this article deserves a careful read: on one hand it offers a rich discussion of embedding models, and on the other it suggests how one might design and implement a whole family of models from a software engineering standpoint.
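As a concrete illustration of the three elements, here is a minimal sketch of a Gaussian instance of EF-EMB of the kind one might use for real-valued data; the function and variable names are ours, and the actual paper also adds regularizers and fits by SGD.

```python
import numpy as np

def efemb_log_likelihood(x, rho, alpha, contexts, sigma=1.0):
    """Sketch of a Gaussian exponential-family-embedding objective.

    x[i]        : observed real value at position i
    rho[i]      : embedding vector of point i
    alpha[j]    : context vector of point j
    contexts[i] : indices of point i's context (the context function)
    Each conditional is Gaussian with mean <rho_i, sum of context terms>.
    """
    total = 0.0
    for i, ctx in enumerate(contexts):
        # Natural parameter / mean induced by the embedding structure.
        mean = rho[i] @ sum(alpha[j] * x[j] for j in ctx)
        total += -0.5 * ((x[i] - mean) / sigma) ** 2  # log N up to constants
    return total
```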
The Generalized Reparameterization Gradient
Highlights: Reparameterization gradients are one of the most important recent developments in variational inference. How can they be extended?
Variational inference (VI) has become a popular solution for posterior inference in probabilistic models. Its core idea is to turn the problem of approximating the posterior distribution into a KL-divergence optimization problem, so that many existing optimization methods can be applied. When the probabilistic model conforms to exponential family distributions, VI can be solved easily by coordinate ascent. In practical applications, however, many models do not satisfy these conditions, so a number of methods have been developed for the non-conjugate case. The general idea is to use Monte Carlo methods to estimate the gradient of the variational objective, then use that estimate for stochastic optimization to fit the variational parameters. Within this line of work there are two main directions: black-box VI and reparameterization gradients (RG). The idea of RG is to transform the latent variable through a set of auxiliary parameters so that the new randomness does not depend on the variational parameters; this makes the new variational objective much more convenient for gradient computation and simplifies optimization. The main problem with RG, however, is that it is not general: it works directly only for simple Gaussian variational distributions, while distributions such as the gamma and beta require further approximation. This article aims to provide RG in a general sense. Without repeating the details, the core of the algorithm is to split the gradient of the variational parameters into three parts: the first is the usual RG term; the second is a correction term, which vanishes when the transformed variational distribution does not depend on the original variational parameters; the third is a standard entropy gradient. The article shows how the gamma, log-normal, and beta distributions can be handled so that they admit RG transformations, and it gives an optimization algorithm in this general setting. Overall, this article is a practical step forward for RG.
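For readers new to RG, the following minimal sketch shows the standard Gaussian case that this paper generalizes; gamma and beta distributions admit no such simple transformation, which is exactly the gap the paper fills. The scalar setting and all names are ours.

```python
import numpy as np

def gaussian_reparam_grad(grad_f, mu, log_sigma, num_samples=100):
    """Reparameterization gradient of E_q[f(z)] for q = N(mu, sigma^2).

    Writing z = mu + sigma * eps with eps ~ N(0, 1) moves the randomness
    into eps, which does not depend on the variational parameters, so
    gradients flow through the sample z by the chain rule.
    """
    sigma = np.exp(log_sigma)
    g_mu = g_log_sigma = 0.0
    for _ in range(num_samples):
        eps = np.random.randn()
        g = grad_f(mu + sigma * eps)      # df/dz at the sampled z
        g_mu += g                         # dz/dmu = 1
        g_log_sigma += g * eps * sigma    # dz/dlog_sigma = sigma * eps
    return g_mu / num_samples, g_log_sigma / num_samples
```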
Can Active Memory Replace Attention?
Summary: Can active memory replace attention? This article explores exactly that question. Judging from the results, however, the answer is no.
This article is by Lukasz Kaiser and Samy Bengio of Google Brain. Its main thrust is to replace the attention mechanism with a mechanism called active memory. By extending the Neural GPU model the first author presented at ICLR 2016, the article makes it capable of active memory, calling the result the Extended Neural GPU, and demonstrates on machine translation that it can rival attention. Readers should note, however, that the active memory mechanism presented here is mainly based on convolution operators; whether it can be extended to other models needs further study. The most valuable part of the article is its discussion of the attention and active memory mechanisms. From the perspective of model development, the paper points out that attention was introduced to solve a problem with using RNNs for machine translation: a fixed-dimension hidden vector causes translation quality to drop, deteriorating further on longer sentences. Essentially, the attention mechanism combines the intermediate results, keeping not just a fixed-length hidden state but a so-called memory tensor; at each decoding step, a distribution over the past memory is computed, and the decoder's input is a weighted average of those past memories. Under such a mechanism, the decoder can focus on different details of the past and thereby generate the required tokens. This attention machinery has come to be considered superior to earlier approaches to machine translation. The article argues that the limitation of attention lies in its definition, namely the softmax: the softmax still wants to focus on a single unit of past memory, and this limitation, the article argues, makes attention completely unable to learn what is needed in some tasks. Can the limitation be broken? The article believes the active memory mechanism can break it: in short, active memory relies on and accesses all of memory at each decoding step, with the memory differing from step to step. This mechanism was already proposed in the earlier Neural GPU work, where it performed well on algorithmic tasks, but on traditional machine translation such models did not work well. This article improves on machine translation by making small modifications to the model. We do not repeat those modifications here, since they feel less like generally applicable techniques than hacks to boost performance; the article notes that the ideas are similar to the Grid LSTM, which interested readers may consult. After this series of tweaks, the newly proposed Extended Neural GPU achieves performance similar to GRU-plus-attention on machine translation. For readers interested in the attention mechanism, this article rewards close reading.
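To make the weighted-average description above concrete, here is a minimal sketch of the content-based attention read that the paper critiques; the names are ours, and real models operate on batched tensors.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attention_read(query, memory):
    """Content-based attention read (sketch).

    memory : (T, d) array of past encoder states
    query  : (d,) current decoder state
    Returns a weighted average of memory. The softmax concentrates the
    weight on a few entries -- the limitation active memory removes by
    letting the decoder use all of memory at once.
    """
    scores = memory @ query       # similarity to each past state
    weights = softmax(scores)     # the softmax the paper criticizes
    return weights @ memory       # weighted average fed to the decoder
```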
Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm
Abstract: The difficulty with variational inference is the lack of a universal algorithmic pattern. This article may offer some inspiration.
As is well known, the difficulty of Bayesian inference lies in computing the posterior distribution. Markov chain Monte Carlo (MCMC) has long been an effective tool for such problems, but its drawbacks are that it is slow and that it is hard to determine whether it has converged. This is a big part of why variational inference (VI) often looks more attractive: VI is usually a deterministic algorithm, and many tools from the optimization literature can be brought to bear. The problem with VI is that different models generally require separate derivations, and there is no unified, general algorithmic form. How to devise a generally applicable algorithm for VI is one of the hottest recent research topics in the field, and this article is an attempt to push it forward. The proposed algorithm itself is relatively simple, with the following features: initially, the algorithm draws a set of particles (which can also be viewed as samples) from a simple distribution; it then iterates, moving all particles at each iteration in a direction that decreases the KL divergence, a step the authors view as analogous to gradient descent; finally, the algorithm returns a set of particles that can represent the posterior distribution.
The key to the algorithm is the second step. In short, it involves two parts: moving the particles toward high-probability regions of the posterior distribution, which makes them representative; and, at the same time, keeping the particles from collapsing together, so that they retain enough diversity to represent all parts of the posterior distribution.
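Here is a minimal sketch of one SVGD update with an RBF kernel, illustrating the two parts just described; the kernel bandwidth and step size are illustrative (the paper chooses the bandwidth adaptively), and all names are ours.

```python
import numpy as np

def svgd_step(particles, grad_log_p, step=0.1, h=1.0):
    """One Stein variational gradient descent update (sketch).

    particles  : (n, d) float array of samples being evolved
    grad_log_p : function returning grad log p(x) for one particle
    The first kernel term pulls particles toward high-probability
    regions; the second (repulsive) term keeps them from collapsing.
    """
    n = len(particles)
    new = particles.copy()
    for i in range(n):
        phi = np.zeros_like(particles[i])
        for j in range(n):
            diff = particles[j] - particles[i]
            k = np.exp(-np.dot(diff, diff) / h)   # RBF kernel value
            phi += k * grad_log_p(particles[j])   # driving term
            phi += -(2.0 / h) * diff * k          # repulsive term
        new[i] = particles[i] + step * phi / n
    return new
```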
The truly difficult and deep part of the article is explaining why this procedure is a correct algorithm, which involves the so-called Stein identity and the kernelized Stein discrepancy. We will not go into that here; interested readers can consult the original text. The experimental section is relatively simple: first a validation on a one-dimensional Gaussian, to make sure the method runs, followed by experiments with Bayesian logistic regression and Bayesian neural networks, comparing a series of methods and datasets. Overall, the proposed algorithm has two big advantages: first, its accuracy is significantly higher than the other algorithms'; second, it is significantly faster. For a new algorithm like this, one would like to see it applied to more complex models and larger data.
Coresets for Scalable Bayesian Logistic Regression
Abstract: In the wave of large-scale machine learning, the main line of work improves the algorithm itself to cope with growing data. This article proposes a novel alternative: build a representative subset of the data, thereby letting the algorithm scale.
This article is from the lab of Professor Tamara Broderick at MIT. Tamara is a former student of Michael Jordan who mainly studies Bayesian nonparametric models. The idea of the article is relatively novel. When extending classic single-machine Bayesian inference algorithms to big data, the usual approach is to modify the algorithm itself; for example, streaming variational inference and distributed MCMC, both mentioned in the article, adapt classical algorithms to big-data application scenarios. The article argues that such modified algorithms often lack rigorous theoretical proofs and come with no guarantees on quality. Its own observation rests on the assumption that, at big-data scale, the data itself is often redundant; for example, after a news event, many reports of the event are similar. The fundamental idea of this article, then, is to change the dataset rather than the algorithm in order to scale. The article adopts a concept called a coreset: a weighted subset of the data used to approximate the complete data. Coresets have previously been studied for algorithms such as k-means and PCA, but not for Bayesian inference, so this article uses Bayesian logistic regression as its example. How is the coreset built? The proposed algorithm is as follows: first, compute a k-clustering (the experiments use k-means); then compute for each data point a value called its sensitivity, which measures how redundant it is (the larger the value, the less redundant); normalize all the sensitivities, and sample a subset of the data according to the normalized weights; finally, keep the data points with nonzero weight.
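The sampling step of the recipe above can be sketched as follows. The sensitivity scores are taken as given here, since their exact formula is derived in the paper from the k-clustering; all names are ours.

```python
import numpy as np

def build_coreset(X, sensitivity, m):
    """Importance-sample a weighted coreset (sketch of the recipe above).

    X           : (n, d) data matrix
    sensitivity : positive per-point scores (from the paper's analysis)
    m           : coreset size
    Returns indices and importance weights; unsampled points get weight
    zero and are dropped, leaving a small weighted stand-in for X.
    """
    n = len(X)
    p = sensitivity / sensitivity.sum()        # normalized sampling probs
    idx = np.random.choice(n, size=m, p=p)     # sample by sensitivity
    counts = np.bincount(idx, minlength=n)
    keep = np.flatnonzero(counts)              # points actually sampled
    weights = counts[keep] / (m * p[keep])     # unbiased importance weights
    return keep, weights
```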
The article gives a rigorous analysis of this coreset, which we will not discuss here. The experiments compare the generated datasets with the real ones: on several datasets, the coreset-based algorithm quickly reaches, with only thousands to tens of thousands of points, the performance of the ordinary algorithm on the complete data. The article does leave a few fundamental questions open: the coreset construction appears tied to the special structure of logistic regression, and it is unclear how to construct coresets for other algorithms; moreover, the algorithm itself needs a k-clustering of the data, which may be hard to obtain at big-data scale, so the overall efficiency of the pipeline remains to be tested. None of this, however, dims the novelty of the idea.
Data Programming: Creating Large Training Sets, Quickly
Abstract: In many machine learning tasks, building a labeled dataset is often the most labor-intensive step. This article proposes a framework called data programming to attack this problem.
This article is from a group of scholars at Stanford University. The problem they want to solve is that, in many machine learning tasks, building a labeled dataset can be the most labor-intensive step; the topic of this article is how to effectively reduce the time and effort that step takes. The paper proposes a concept called data programming. In short, under this framework the user provides a set of heuristic labeling functions, which may conflict with one another, overlap, or depend on external knowledge bases. The framework then learns the correlations among the labeling functions, so that many labeling functions together can achieve the effect of supervised learning. The article uses logistic regression on a binary classification problem as its example. Each heuristic labeling function has two parameters: one controls how likely it is to label an object at all, the other the accuracy of the labels it produces. Learning these two parameters thus becomes the main part of the objective. In the case where all labeling functions are independent, maximum likelihood estimation is used to estimate the two parameters; with those estimates in hand, the authors then train an ordinary logistic regression classifier. In other words, the whole framework operates in two stages. Of course, independent labeling functions are still limiting, so the paper also presents a Markov-random-field-style method to handle the interrelationships among labeling functions. In the experiments, the data programming approach brings significant gains with both hand-crafted features and features learned automatically by an LSTM. This article is well suited to scholars who study crowdsourcing.
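As a toy illustration of what labeling functions look like, here is a minimal sketch with hypothetical functions for a spam task; note that it combines votes naively, whereas data programming learns each function's accuracy and coverage parameters by maximum likelihood before combining them.

```python
import numpy as np

def apply_labeling_functions(lfs, xs):
    """Combine noisy heuristic labeling functions by majority vote (sketch).

    Each labeling function returns +1, -1, or 0 (abstain). Data programming
    goes further than this naive vote: it weights each function by learned
    accuracy/coverage parameters. np.sign returns 0 on ties (unlabeled).
    """
    votes = np.array([[lf(x) for lf in lfs] for x in xs])  # (n, num_lfs)
    return np.sign(votes.sum(axis=1))  # noisy labels for downstream training

# Hypothetical labeling functions for a spam classifier:
lfs = [
    lambda x: 1 if "free money" in x else 0,
    lambda x: -1 if "meeting agenda" in x else 0,
]
labels = apply_labeling_functions(lfs, ["free money now", "meeting agenda attached"])
```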
Residual Networks Behave Like Ensembles of Relatively Shallow Networks
Abstract: Why can residual networks train very deep models? Starting from ensemble learning, this article gives a new explanation of residual networks.
This article is from scholars at Cornell University, who mainly want to explain the success of residual networks from the new perspective of ensemble learning. The contributions are threefold. First, the article shows that a residual network can actually be seen as a collection of paths, not just a single deep network. Second, it finds that these paths do not depend tightly on one another, and that they exhibit an ensemble-like effect. Third, the authors study the gradients of residual networks and find that only the short paths contribute gradients during training, while the deeper paths are not necessary for training the model.
The core of the article is to unravel all the paths between the levels of the residual network, revealing that a residual network is really a collection of paths of varying length. With this insight it is easy to see that even if some nodes of a residual network are removed, only a fraction of the many paths is affected, with no particular impact on the overall collection of paths; in this respect, residual networks differ sharply from traditional feed-forward networks. The authors run several experiments to show the effect of this collection of variable-length paths. First, they delete individual residual modules from the network and compare with the same deletion in a VGG network: the residual network's performance is not fundamentally changed, while VGG's is badly compromised. Going further, they delete multiple modules, observe the error rising gradually, find a correlation between the number of deleted modules and performance, and conclude that residual networks exhibit an ensemble-learning effect. In another experiment, the authors arbitrarily permute the order of the modules, with a surprising result: the residual network is robust to swapping a portion of its modules. Finally, the paper verifies its hypothesis about gradients with some small simulation experiments, showing that the effective paths in residual networks are relatively shallow. This article should open up many future research topics. For example, if residual networks do not really solve the problem of training very deep networks, and it is instead the diversity of paths that brings the performance gains, then deep networks may not need very deep structures. Could one train many small networks of different structures, or dynamically generate such small networks, and then rely on ensemble learning to match the performance of residual networks? These are topics that can be explored in the future.
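A minimal sketch of the lesioning experiment's logic: because each block contributes y = x + f(x), deleting a block only prunes the paths that pass through it. The names are ours, and the blocks stand in for arbitrary residual modules.

```python
def residual_forward(x, blocks, drop=None):
    """Forward pass through residual blocks, optionally lesioning some.

    blocks : list of callables, each computing the residual branch f(h)
    drop   : set of block indices to delete at test time
    Deleting block i removes only the paths through f_i; the identity
    skip still carries the signal, so most paths survive -- unlike in a
    plain feed-forward network, where deleting a layer severs all paths.
    """
    drop = drop or set()
    h = x
    for i, f in enumerate(blocks):
        if i in drop:
            continue          # identity skip connection remains
        h = h + f(h)          # residual connection: identity + module
    return h
```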