Frontier Algorithm Ideas in Deep Reinforcement Learning

Author: Flood Sung (CSDN), a graduate student in artificial intelligence whose research focuses on deep learning, reinforcement learning, and robotics.

In 2016, the AlphaGo computer Go system defeated top professional player Lee Sedol, drawing worldwide attention and pushing artificial intelligence into the spotlight. Deep reinforcement learning is the core algorithm behind AlphaGo and a key step toward general artificial intelligence. This article walks through the ideas behind the current frontier of deep reinforcement learning algorithms and offers a glimpse into the core of modern AI.

Preface

Deep reinforcement learning (DRL) is a branch of deep learning that has emerged over the past two years. It aims to solve the full problem from perception to decision and control, and thereby to move toward general artificial intelligence. Led by Google DeepMind, algorithms based on deep reinforcement learning have achieved breakthroughs in video games, Go, robotics, and other fields. In 2016, Google DeepMind's AlphaGo system, which combines Monte Carlo tree search with deep learning, reached and even surpassed the level of top professional Go players, causing a worldwide sensation. At AlphaGo's core is a deep reinforcement learning algorithm that lets the computer keep improving its playing strength through self-play. Deep reinforcement learning, built on deep neural networks, enables end-to-end self-learning from perception to decision and control. It has very broad application prospects, and its development will further drive the next wave of artificial intelligence.

Deep reinforcement learning and general artificial intelligence

Deep learning has already achieved breakthroughs in computer vision, speech recognition, natural language understanding, and other fields, and the related technologies have gradually matured and entered daily life. However, the problems in these fields are only about making computers perceive and understand the world. Decision and control, by contrast, is the core problem that artificial intelligence still needs to solve. Perception problems such as computer vision only require the computer to understand the sensory input it receives; decision and control problems require the computer to judge and reason over that sensory information and then output the correct behavior. For a computer to make good decisions, it needs a certain "thinking" ability, that is, the ability to learn to solve a wide variety of problems. This is the research goal of general artificial intelligence (Artificial General Intelligence, AGI), also known as strong AI. General artificial intelligence seeks to build an agent that learns to solve all kinds of problems by itself, without task-specific human programming, with the ultimate goal of reaching human-level or even superhuman intelligence.

The basic framework of general artificial intelligence is the reinforcement learning (RL) framework, as shown in Figure 1.


Figure 1: Basic framework of general artificial intelligence

An agent's behavior can be described as interaction with the world: the agent observes the world, then outputs an action based on the observation and its own state, the world changes as a result, and feedback is returned to the agent. The core question is therefore how to build an agent that can interact with the world in this way. Deep reinforcement learning combines deep learning with reinforcement learning: deep learning provides the learning mechanism, while reinforcement learning provides the learning objective. This combination gives deep reinforcement learning the potential to build complex agents, and AlphaGo's first author, David Silver, has argued that deep reinforcement learning is equivalent to general artificial intelligence: DRL = DL + RL = universal AI.
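
The loop below is a minimal sketch of this interaction cycle from Figure 1. ToyEnv and RandomAgent are illustrative placeholders rather than parts of any library: the agent observes a state, outputs an action, and the world returns the next observation together with the feedback r.

    import random

    class ToyEnv:
        def reset(self):
            self.pos = 0
            return self.pos                       # initial observation

        def step(self, action):
            self.pos += 1 if action == 1 else -1  # the world changes with the action
            reward = 1.0 if self.pos >= 3 else 0.0
            done = self.pos >= 3 or self.pos <= -3
            return self.pos, reward, done         # observation, feedback r, end flag

    class RandomAgent:
        def act(self, state):
            return random.choice([0, 1])          # decision based on the observation

    env, agent = ToyEnv(), RandomAgent()
    state, done = env.reset(), False
    while not done:
        action = agent.act(state)
        state, reward, done = env.step(action)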

The Actor-Critic framework for deep reinforcement learning

Current deep reinforcement learning algorithms can all be viewed within the Actor-Critic framework, as shown in Figure 2.


Figure 2: The Actor-Critic framework

Think of the deep reinforcement learning algorithm as the brain of an agent. The brain contains two parts: an Actor (action module) and a Critic (evaluation module). The Actor is the brain's executor: it takes the external state s as input and outputs an action a. The Critic can be seen as the brain's value system: it adjusts itself based on historical information and the feedback r, and in turn shapes the Actor. This Actor-Critic structure is very similar to how humans behave: we also act under the guidance of our values and instincts, and our values are continually reshaped by experience. Within the Actor-Critic framework, Google DeepMind has proposed the DQN, A3C, and UNREAL deep reinforcement learning algorithms, of which UNREAL is currently the strongest. The basic ideas of these three algorithms are introduced below.
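
As a rough illustration of this two-module "brain", the sketch below (PyTorch) puts an actor head and a critic head on a shared network. The layer sizes and the shared trunk are illustrative assumptions, not a specific published architecture.

    import torch
    import torch.nn as nn

    class ActorCriticBrain(nn.Module):
        def __init__(self, state_dim=4, n_actions=2):
            super().__init__()
            self.shared = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
            self.actor = nn.Linear(64, n_actions)   # action module: action preferences
            self.critic = nn.Linear(64, 1)          # evaluation module: a value estimate

        def forward(self, state):
            h = self.shared(state)
            return self.actor(h), self.critic(h)

    brain = ActorCriticBrain()
    action_logits, state_value = brain(torch.randn(1, 4))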

DQN (Deep Q-Network) algorithm

DQN was the first deep reinforcement learning algorithm. Google DeepMind introduced it in 2013, refined it further, and published the improved version in Nature in 2015. DeepMind applied DQN to playing Atari games and, unlike earlier approaches, used only the screen images as input, the same information a human player sees. Under this setting, DQN-based programs achieved performance beyond human level on a range of Atari games. This work was the first to propose the concept of deep reinforcement learning, which has developed rapidly ever since.

The DQN algorithm targets the relatively simple case of discrete outputs, i.e., only a small, finite number of possible actions. In this setting, DQN uses only the Critic (evaluation) module of the Actor-Critic framework and has no separate Actor (action) module, because the Critic alone is enough to select and execute the best action, as shown in Figure 3.
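
A small sketch of what this means in code: with a handful of discrete actions, the critic's Q values alone determine the action by taking the maximum; in practice an epsilon-greedy rule is often layered on top for exploration. The numbers below are made up for illustration.

    import numpy as np

    q_values = np.array([0.2, 1.5, -0.3, 0.7])   # hypothetical Q(s, a) for 4 discrete actions
    greedy_action = int(np.argmax(q_values))      # the critic alone picks the best action

    # epsilon-greedy exploration, commonly paired with DQN during training
    epsilon = 0.1
    if np.random.rand() < epsilon:
        action = np.random.randint(len(q_values))
    else:
        action = greedy_action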


Figure 3: Basic structure of DQN

In DQN, a value network is used as the Critic. The value network outputs Q(s, a), the value of taking action a in state s. Given the value network, we can evaluate every possible action in a state s and output the one with the highest value. The main question is then how to update the value network with stochastic gradient descent, the workhorse of deep learning. To use gradient descent we must construct a loss function for the value network. Since the network outputs Q values, if we can construct a target Q value, we can use the squared error (MSE) between them as the loss. But the only information available to the value network is the state s, the action a, and the feedback r, so computing the target Q value is the key problem of the DQN algorithm, and it is exactly the problem reinforcement learning solves. Using the Bellman equation of reinforcement learning, the target can be built from the input information, in particular the feedback r, roughly as Q_target = r + γ·max over a' of Q(s', a'), which gives us the loss function and lets us update the value network.
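
The sketch below (PyTorch) shows this update in code: a target Q value is assembled from the feedback r and the Bellman equation, and the value network is regressed onto it with a squared error. The network sizes and the dummy minibatch are illustrative assumptions; the separate target network follows the 2015 Nature version of DQN.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    q_net      = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
    target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
    target_net.load_state_dict(q_net.state_dict())
    gamma = 0.99

    def dqn_loss(s, a, r, s_next, done):
        # Q(s, a) predicted by the value network for the actions actually taken
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # target Q = r + gamma * max_a' Q_target(s', a'), cut off at terminal states
            q_next = target_net(s_next).max(dim=1).values
            target = r + gamma * (1.0 - done) * q_next
        return F.mse_loss(q_sa, target)

    # dummy minibatch of 8 transitions with 4-dimensional states
    s      = torch.randn(8, 4)
    a      = torch.randint(0, 2, (8,))
    r      = torch.randn(8)
    s_next = torch.randn(8, 4)
    done   = torch.zeros(8)
    loss = dqn_loss(s, a, r, s_next, done)
    loss.backward()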


In practice, the value network can take different forms depending on the problem. For Atari games, where the input is image data, a convolutional neural network (CNN) can be used as the value network. To give the network memory of past observations, an LSTM (long short-term memory) layer can be added after the CNN. During DQN training, past inputs and outputs are collected as samples in an experience replay memory, and random minibatches are drawn from it for stochastic gradient descent.
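
Below is a minimal sketch of such an experience replay memory and its minibatch sampling. The capacity and batch size are illustrative choices; in a real Atari setup the sampled transitions would feed the CNN (or CNN+LSTM) value network described above.

    import random
    from collections import deque

    class ReplayMemory:
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)   # old samples are discarded when full

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            batch = random.sample(self.buffer, batch_size)
            states, actions, rewards, next_states, dones = zip(*batch)
            return states, actions, rewards, next_states, dones

        def __len__(self):
            return len(self.buffer)

    memory = ReplayMemory()
    memory.push([0, 0], 1, 0.0, [0, 1], False)
    if len(memory) >= 1:
        batch = memory.sample(batch_size=1)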

As the first deep reinforcement learning algorithm, DQN uses only a value network, its training efficiency is low and requires a great deal of time, and it can only handle low-dimensional discrete control problems, so its generality is limited. Nevertheless, because DQN was the first to successfully combine deep learning with reinforcement learning, handle high-dimensional input data, and achieve a breakthrough on Atari games, it was a pioneering piece of work.

A3C (Asynchronous Advantage Actor-Critic) algorithm

The A3C algorithm is a deep reinforcement learning algorithm proposed by DeepMind in 2016 that performs better and is more general than DQN. A3C uses the full Actor-Critic framework and introduces asynchronous training, which greatly speeds up training while also improving performance. The basic idea of A3C, which is the basic idea of Actor-Critic itself, is to evaluate the output action: if the action is judged to be good, the action network (Actor) is adjusted to make that action more likely; conversely, if the action is judged to be bad, it is made less likely. Through repeated training, the action network is continually adjusted until the best actions are found. AlphaGo's self-play learning is also based on this idea.

Building on this basic Actor-Critic idea, the Critic's value network can be updated in the same way as in DQN, so the key question for A3C is how to construct the loss function of the action network and train it. An action network can produce output in two ways: probabilistically, outputting a probability for each action, or deterministically, outputting one specific action. A3C uses probabilistic output. The evaluation of an action is obtained from the Critic, i.e., the value network, and the loss for the action network is formed by multiplying the log-likelihood of the output action by that evaluation. The action network's goal is to maximize this objective: if the evaluation of an action is positive, its probability is increased, and if it is negative, its probability is decreased, exactly as the Actor-Critic idea prescribes. With this loss function, the action network's parameters can be updated by stochastic gradient descent.
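
The sketch below (PyTorch) shows this loss: the log-probability of the chosen action weighted by its evaluation, with the sign flipped so that gradient descent maximizes the objective. The network shape is an illustrative assumption, and the entropy bonus used in the full A3C objective is omitted.

    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    policy_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

    def actor_loss(states, actions, advantages):
        logits = policy_net(states)              # unnormalized action preferences
        dist = Categorical(logits=logits)        # probabilistic action output
        log_probs = dist.log_prob(actions)       # log pi(a | s)
        # the advantages are treated as fixed evaluations from the critic
        return -(log_probs * advantages.detach()).mean()

    states = torch.randn(8, 4)
    actions = torch.randint(0, 2, (8,))
    advantages = torch.randn(8)
    loss = actor_loss(states, actions, advantages)
    loss.backward()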

To get good results, the key is to evaluate actions accurately. Instead of the raw action value Q, A3C uses the advantage A as the evaluation of an action. The advantage of action a in state s measures how much better that action is than the alternatives: if the value of state s is V, then A = Q − V. Here the action value Q is the value of taking action a in state s, which differs in meaning from V. Intuitively, the advantage evaluates actions more precisely. For example, suppose that in state s the Q value of action 1 is 3, the Q value of action 2 is 1, and the value V of the state is 2. If Q itself were used as the evaluation, the probabilities of both action 1 and action 2 would be increased, even though action 1 is the only one whose probability should grow. Using the advantage instead, action 1 has advantage 3 − 2 = 1 and action 2 has advantage 1 − 2 = −1; updating the network with these values increases the probability of action 1 and decreases that of action 2, which matches what we want. Accordingly, A3C changes the Critic's value network to output the state value V, uses multi-step historical information to estimate the Q value of each taken action, computes the advantage A, and from it the loss used to update the action network.
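
A small sketch of this computation, with illustrative numbers: the Q value of an action is estimated from several steps of observed feedback plus the critic's V estimate of the state reached, and the advantage is then A = Q − V.

    gamma = 0.99

    def n_step_advantage(rewards, v_bootstrap, v_s):
        # rewards: feedback r collected over n steps after taking the action in s
        # v_bootstrap: critic's V estimate of the state reached after n steps
        # v_s: critic's V estimate of the starting state s
        q = v_bootstrap
        for r in reversed(rewards):
            q = r + gamma * q            # discounted multi-step return
        return q - v_s                   # advantage A = Q - V

    adv = n_step_advantage([0.0, 0.0, 1.0], v_bootstrap=0.5, v_s=2.0)

    # Toy numbers matching the example above: Q = 3 or 1, V(s) = 2
    # advantage of action 1: 3 - 2 = +1, advantage of action 2: 1 - 2 = -1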

To speed up training, A3C also uses asynchronous training: several training environments are run at the same time, samples are collected in parallel, and the collected samples are used directly for training. Compared with DQN, A3C does not need an experience replay pool to store historical samples, which saves memory, and the asynchronous setup multiplies the rate at which data is gathered, greatly increasing training speed. At the same time, because samples are collected from several different environments, their distribution is more uniform, which is better for training the neural network.
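
The sketch below illustrates only the sampling side of this idea, using Python threads: several workers step their own copy of a toy environment in parallel and push transitions into a shared queue. Real A3C workers also compute gradients and update shared network parameters asynchronously, which is omitted here.

    import threading
    import queue
    import random

    sample_queue = queue.Queue()

    def worker(worker_id, n_steps=100):
        state = 0
        for _ in range(n_steps):
            action = random.choice([0, 1])             # placeholder policy
            next_state = state + (1 if action else -1)
            reward = 1.0 if next_state > state else 0.0
            sample_queue.put((worker_id, state, action, reward, next_state))
            state = next_state

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(sample_queue.qsize())   # 4 workers x 100 steps = 400 transitions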

With these improvements, A3C reaches an average score on Atari games about four times that of DQN, and its training is many times faster. A3C therefore replaced DQN as the better deep reinforcement learning algorithm.

UNREAL (UNsupervised REinforcement and Auxiliary Learning) algorithm

The UNREAL algorithm is a new deep reinforcement learning algorithm proposed by DeepMind in November 2016 that improves both performance and training speed on top of A3C. It reaches 8.8 times human-level performance on Atari games and 87% of human level in Labyrinth, a first-person 3D maze environment, making it currently the best deep reinforcement learning algorithm.

The A3C algorithm already makes full use of the Actor-Critic framework and is quite complete in itself, so it is hard to improve it further by changing the framework. The UNREAL algorithm instead improves on A3C by training several auxiliary tasks at the same time as the main A3C task. The basic idea comes from the way humans learn: to accomplish one goal we often work through a variety of related sub-tasks. For example, to obtain a particular stamp we might buy it ourselves, ask friends to find it for us, or trade for it with other collectors. By attaching multiple auxiliary tasks to the same A3C network, UNREAL both speeds up learning and further improves performance.

Figure 4: Block diagram of the UNREAL algorithm

UNREAL uses two kinds of auxiliary tasks. The first kind is control tasks, including pixel control and hidden-layer activation control. Pixel control means learning to control changes in the input image, aiming for actions that change the image as much as possible: large image changes often mean the agent is performing an important step, so learning to control image change improves action selection. Hidden-layer activation control means controlling how many hidden-layer neurons are activated, encouraging more of them to be active; this is loosely analogous to the human brain, where neurons that are used more may contribute to better choices. The second kind is reward prediction: because in many scenarios the feedback r is sparse (for example, in a maze a reward of 1 is only received when an apple is reached), training the network to predict the upcoming reward from a history of recent image frames gives it stronger representations. In addition to these auxiliary tasks, UNREAL also reuses historical data for extra value-function training, updated in the same way as in DQN, which further increases training speed.
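
The sketch below shows, in schematic form, how such auxiliary losses can be added to the base A3C loss with fixed weights and trained on the same network. The weight values and the names of the loss terms are illustrative assumptions rather than UNREAL's published settings.

    def unreal_loss(a3c_loss, pixel_control_loss, reward_prediction_loss,
                    value_replay_loss,
                    lambda_pc=0.05, lambda_rp=1.0, lambda_vr=1.0):
        # one scalar objective: the base A3C loss plus weighted auxiliary losses
        return (a3c_loss
                + lambda_pc * pixel_control_loss
                + lambda_rp * reward_prediction_loss
                + lambda_vr * value_replay_loss)

    # example with made-up scalar loss values
    print(unreal_loss(1.2, 0.8, 0.3, 0.5))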

In essence, UNREAL improves the representational power and performance of the action network by training multiple tasks that all serve the same final goal, which matches the way humans learn. Notably, although UNREAL adds training tasks, it does not gather any additional samples by other means; the improvement is achieved while keeping the original sample data unchanged, which is why UNREAL is described as an unsupervised learning method. Following the idea of UNREAL, auxiliary tasks tailored to the characteristics of a given problem can be designed to improve the algorithm further.

Summary

After nearly two years of development, deep reinforcement learning has achieved better and better results at the algorithmic level. From DQN and A3C to UNREAL, ingenious algorithm design reflects the best of human ingenuity. Looking ahead, beyond improvements to the algorithms themselves, deep reinforcement learning, as a general learning algorithm that spans the whole path from perception to decision and control, will be widely applied across many fields of real life. AlphaGo's success was only the eve of the explosion of general artificial intelligence.


Original address: http://geek.csdn.net/news/detail/138103
