Murrisen Learning (I): Reinforcement Learning

Source: Internet
Author: User
Tags: dnn

Today I am honored to have the opportunity to share with you the topic of reinforcement learning (RL). With this talk, I hope to achieve three goals:

First, I hope that students with no relevant background can gain a basic understanding of RL, so I will introduce some basic concepts.

Second, I hope that students with a machine learning background who are interested in reinforcement learning can learn about recent progress in RL.

Third, preparing this talk is also a way for me to organize my own knowledge.

This talk consists of the following three parts:

Some basic concepts of reinforcement learning;

Deep Q-Learning. The DQN work successfully applied deep learning in the RL field, so it gets a separate introduction;

Some recent advances made after Deep Q-Learning.

Basic Concepts

Machine Learning

First, what is the difference between reinforcement learning and supervised learning?

Both contain the word "learning", so what is machine learning? According to Tom M. Mitchell, a computer program that accomplishes a task is said to learn if, as some form of experience increases, it performs better and better as judged by some performance measure.

So machine learning has three characteristic elements: T (task), E (experience), and P (performance measure). Let me first give a counter-example.

Suppose the task is to predict the time spent on the road home, based on the time you leave work. We know that during the evening rush hour there are traffic jams and the trip home takes a little longer, while if you work overtime into the night the roads are clearer. To complete this task, I write a program: F(t) = 1.0 - 0.5 * (t - 18) / 6

Here t is the departure time, with values in [18, 24], and the return value is the number of hours spent on the road. Obviously, once the input is fixed, the output of this program (algorithm) is fully determined, so it is not capable of learning: no part of the algorithm can change. So I write another program: F(t) = a - b * (t - 18) / 6

Here a and b are two variables. If, after making multiple predictions, our algorithm can adjust the values of a and b so that its results get closer and closer to the true values, then the algorithm is capable of learning. Machine learning algorithms therefore usually need a part that is adjusted automatically as experience accumulates. These changeable parts are usually the parameters of the algorithm's model.
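
To make this concrete, here is a minimal sketch of how a and b could be adjusted from experience by gradient descent; the travel-time observations and the learning rate are made up for illustration only.

    # A toy sketch: learning a and b in F(t) = a - b * (t - 18) / 6 from experience.
    # The observations (departure time, actual hours on the road) are hypothetical.
    observations = [(18.0, 1.05), (19.5, 0.92), (21.0, 0.78), (23.0, 0.62)]

    a, b = 0.0, 0.0          # the changeable parts of the model
    learning_rate = 0.1

    for _ in range(1000):    # each pass over the data adds a bit more experience E
        for t, actual in observations:
            x = (t - 18) / 6
            predicted = a - b * x
            error = predicted - actual        # performance measure P: squared error
            a -= learning_rate * error        # gradient step on a
            b -= learning_rate * error * (-x) # gradient step on b

    print(a, b)              # a and b now make F(t) predict better than before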

This introduces the concept of a model. For more complex tasks, the true function that would complete the task perfectly is usually unknown to us: we know neither its parameters nor even its structure. So we make a hypothesis about the structure of that function, called the model; for example, we may assume the problem can be solved with a neural network, and then use this model to approximate the real function we want. The model usually has many parameters, that is, degrees of freedom that can be changed. These degrees of freedom form a space, and what we need to do is search this space to find the combination of parameters that makes the model closest to the real function we want.

In many machine learning algorithms, the process of learning is to find the optimal parameters of the model.

Supervised Learning

Let's look at the first example task: image classification. We know that if you have a labeled data set, you can use supervised learning to accomplish this task. Here, the three elements of machine learning are:

T (task): mark each image with a suitable label;

E (experience): a data set of many images that have already been labeled;

P (performance measure): the accuracy of the labeling, e.g. precision and recall, the standard metrics in the computer vision field.

How does supervised learning solve this problem? First, build a model whose input is an image and whose output is the image's classification label. Then, for each annotated image, use another function to measure the gap between the model's output and the correct output; this function is usually called the loss function. The problem of improving the model's performance is thus transformed into an optimization problem: change the model parameters so that the value of the loss function gets closer and closer to its minimum. Optimization is a fairly large topic in its own right and can be tackled with many numerical methods. From the optimization point of view, the characteristic of supervised learning is that the loss function is constructed directly from the difference between the model output and the correct output.
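
The following toy sketch shows that structure: a model, a loss measuring the gap to the correct labels, and gradient steps on the parameters. The data and the linear model are stand-ins for illustration, not a real image classifier.

    # A toy supervised-learning loop: model, loss, and parameter updates.
    labeled_data = [([1.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
                    ([1.0, 1.0], 1.0), ([0.5, 1.0], 1.0)]   # hypothetical (input, label) pairs

    weights = [0.0, 0.0]      # the model parameters to be learned
    learning_rate = 0.1

    def model(x):
        # the model: maps an input to a predicted label
        return sum(w * xi for w, xi in zip(weights, x))

    def loss(prediction, correct):
        # the loss function: gap between model output and correct output
        return (prediction - correct) ** 2

    for _ in range(500):
        for x, y in labeled_data:
            grad = 2 * (model(x) - y)                       # d(loss)/d(prediction)
            for i in range(len(weights)):
                weights[i] -= learning_rate * grad * x[i]   # move toward smaller loss

    print([round(model(x), 2) for x, _ in labeled_data])    # close to the correct labels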

Reinforcement Learning

Let's look at the second example task: playing a game. The goal of this task is to get as high a score as possible. In this task, there is usually no "labeled data" of the form "in situation one you should go left, in situation four you should go right", and so on. In this case, we can use reinforcement learning to complete the task. The three elements of machine learning are:

T (task): play the game to the end and get as high a score as possible;

E (experience): repeatedly trying to play the game;

P (performance measure): the final score obtained.

You can see the difference between the two examples above. Many problems have a similar structure:

For example, in Go and many other board games, the win or loss is only decided at the very last moment: I know whether I won (reward +1) or lost (reward -1), but during the game there may not be a direct signal at every moment telling me where to play;

For example, in driving, the task is completed successfully (+1) when the destination is reached without an accident or traffic violation, but there may not be an expert at every moment along the way telling me how to proceed;

For example, in robot control, the task is completed (+1) when the robotic arm successfully grasps the object, but we do not know in advance, at each point in time and in each situation, what voltage should be applied to each of the arm's motors.

Based on the examples we have just seen, we can distill the general process of reinforcement learning and the modules involved:

The Agent, which contains the model we need to learn, receives a state from the Environment as input and outputs an action; the action is used to interact with the Environment, which returns a reward. The goal of learning is to maximize the cumulative reward. (A minimal interaction loop is sketched after the list below.)

As you can see, under the general definition:

T (task): obtain an optimal, or at least good, policy for a given environment.

E (experience): the interactions the agent has with the environment. A single interaction is called a transition, and a continuous sequence of interactions is called a trajectory. A complete trajectory runs from the initial state to the terminal state (if one exists). For a task without a terminal state, the trajectory can be extended indefinitely; a task with a terminal state is also called an episodic task.

P (performance measure): in RL, the commonly used performance measure is called the discounted future reward.
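
The sketch below shows this loop on a made-up random-walk environment with a random placeholder policy; the environment, the reward, and the agent are all assumptions for illustration.

    # A toy agent-environment loop collecting one trajectory of transitions.
    import random

    def environment_step(state, action):
        # hypothetical transition model: walk left/right, episode ends at -3 or +3
        next_state = state + action
        reward = 1.0 if next_state == 3 else 0.0
        done = abs(next_state) == 3
        return next_state, reward, done

    def agent_act(state):
        # placeholder policy: choose an action at random
        return random.choice([-1, +1])

    trajectory = []                        # a sequence of transitions
    state, done, steps = 0, False, 0
    while not done and steps < 100:
        action = agent_act(state)
        next_state, reward, done = environment_step(state, action)
        trajectory.append((state, action, reward, next_state))   # one transition
        state = next_state
        steps += 1

    print(trajectory)                      # the experience E collected in one episode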

Discounted Future Reward

The discounted future reward is defined as follows:

R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... = sum over k >= 0 of gamma^k * r_{t+k}

where r_t is the reward received at step t and gamma is the discount factor. We hope that the agent can give a policy: for each state, output an action such that the discounted future reward received is maximized.

Normally gamma takes a value greater than 0 and less than 1. If gamma is 0, the policy only takes into account the reward collected at the current step and ignores long-term rewards; if gamma is 1, it becomes hard to handle continuing (infinitely long) decision sequences, because the sum may not converge.
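
As a small check of the definition, the sketch below computes the discounted return of a made-up reward sequence; the rewards and gamma values are assumptions for illustration.

    # Computing the discounted future reward R_t for a sequence of rewards.
    def discounted_return(rewards, gamma=0.9):
        # R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
        total = 0.0
        for k, r in enumerate(rewards):
            total += (gamma ** k) * r
        return total

    print(discounted_return([0.0, 0.0, 1.0]))            # 0.81: a delayed +1 is discounted twice
    print(discounted_return([0.0, 0.0, 1.0], gamma=0.0)) # 0.0: only the immediate reward counts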

Policy, Value, Transition Model

In reinforcement learning, the more important concepts are:

Policy is the goal our algorithm pursues. It can be seen as a function that, given an input state, returns the action that should be taken at that point, or a probability distribution over actions.

Value represents the expected discounted future reward obtained by executing a given action in a given state; its input is a (state, action) pair.

The transition model describes the structure and dynamics of the environment itself: when an action is executed in a state, which next state the system enters and what reward may be received.

Obviously, the three are interrelated. If we can obtain a good policy function, the goal of the algorithm is already achieved. If we can obtain a good value function, then selecting the action with the higher value in each state is naturally also a good policy. If we can obtain a good transition model, then on the one hand it may be possible to derive the best policy directly from the model, and on the other hand the model can be used to guide the learning of the policy function or the value function.
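
As a tiny illustration of the second point, the sketch below derives a greedy policy from a Q table with made-up values; the states, actions, and numbers are all hypothetical.

    # A value function induces a policy: act greedily with respect to Q.
    Q = {
        ("s0", "left"): 0.1, ("s0", "right"): 0.8,
        ("s1", "left"): 0.5, ("s1", "right"): 0.2,
    }
    actions = ["left", "right"]

    def greedy_policy(state):
        # select the action with the higher value in this state
        return max(actions, key=lambda a: Q[(state, a)])

    print(greedy_policy("s0"))   # right
    print(greedy_policy("s1"))   # left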

Therefore, reinforcement learning methods can be divided into three categories:

Value-based RL, the value method: explicitly construct a model to represent the value function Q; once the Q function corresponding to the optimal policy is found, the optimal policy is found with it.

Policy-based RL, the policy method: explicitly construct a model to represent the policy function, and then maximize the discounted future reward directly.

Model-based RL, methods based on a model of the environment: first learn a model of the environment's transitions, and then search for the best policy based on this model.

This is not a strict partition; many RL algorithms have more than one of these features.

Bellman Equation

From the definition of the Q value and of R (the discounted future reward) above, we can derive the Bellman equation:

Q(s, a) = r + gamma * max_{a'} Q(s', a')

where s' is the state reached after executing action a in state s, and r is the reward received for that step.

The Bellman equation is a very basic formula in RL. For value-based methods, it gives an iterative way to improve the Q function:

Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_{a'} Q(s', a') - Q(s, a))

For policy-based methods, the Q value is usually estimated by sampling, and the Bellman equation plays an important role in the various estimation methods.

Value-Based Method

Here is an example algorithm of the value-based approach: Q-learning.
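
The sketch below is a minimal tabular Q-learning loop on a made-up 5-cell corridor environment (the goal is the rightmost cell); the environment, the epsilon-greedy exploration, and all the constants are assumptions for illustration.

    # Tabular Q-learning on a toy corridor environment.
    import random

    n_states, actions = 5, [-1, +1]            # actions: move left / move right
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    alpha, gamma, epsilon = 0.1, 0.9, 0.2

    def step(s, a):
        s2 = min(max(s + a, 0), n_states - 1)
        reward = 1.0 if s2 == n_states - 1 else 0.0
        return s2, reward, s2 == n_states - 1  # reaching the right end terminates the episode

    for episode in range(500):
        s, done = 0, False
        while not done:
            # epsilon-greedy: mostly exploit the current Q table, sometimes explore
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = step(s, a)
            # Bellman-style update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            target = r + (0.0 if done else gamma * max(Q[(s2, a2)] for a2 in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2

    print(max(actions, key=lambda a: Q[(0, a)]))   # the learned greedy action at the start: +1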

Policy-Based Method

Here is an example of the policy-based approach: the policy gradient.

The policy gradient refers to the following idea: when the policy function is differentiable, by choosing a suitable loss function, the model of the policy function can be improved iteratively with gradient-based optimization. The loss function can be constructed in many forms, which will not be expanded here, but its basic effect is to make the policy function tend to output the "better" actions: for a stochastic policy, that means increasing the probability of actions that receive a larger expected Q value.
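
The sketch below shows one simple instance of this idea (a REINFORCE-style update on a two-action softmax policy); the action preferences, the sampled returns, and the learning rate are all made up for illustration.

    # A toy policy-gradient update: raise the probability of actions with larger returns.
    import math

    theta = [0.0, 0.0]                       # preferences for two actions

    def policy():
        # softmax turns the preferences into a probability distribution over actions
        exps = [math.exp(t) for t in theta]
        z = sum(exps)
        return [e / z for e in exps]

    samples = [(0, 1.0), (1, 0.2), (0, 0.9), (1, 0.1)]   # hypothetical (action, estimated return)
    learning_rate = 0.5

    for action, ret in samples:
        probs = policy()
        for a in range(len(theta)):
            # gradient of log pi(action) for a softmax policy: 1[a == action] - pi(a)
            grad_log = (1.0 if a == action else 0.0) - probs[a]
            theta[a] += learning_rate * ret * grad_log    # gradient ascent weighted by the return

    print(policy())   # action 0, which received larger returns, is now more probable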

Deep Q Learning

Recall the Q-learning example cited above. The earliest Q-learning used a table to represent the Q function: for each (state, action) pair, a corresponding Q value is stored, and the Q-learning process is the process of iteratively modifying this table. Clearly, the table no longer works when the state space is very large, no longer enumerable, or very high-dimensional; a more economical model is needed to approximate the table.

Deep neural networks (DNNs) have proven to be a very powerful function-fitting tool in the field of computer vision: they can express very complex visual-processing functions and handle high-dimensional image input, and DNN-based methods have become the best solutions for many computer vision problems. So, is it possible to introduce this more powerful tool into the RL field to solve harder problems?

Deep Q-Learning (DQN) does exactly this: it introduces a DNN/CNN into RL as the model of the Q function.

The loss function of the Q network is defined as follows:

L = ( r + gamma * max_{a'} Q(s', a') - Q(s, a) )^2

This loss function is also derived from the Bellman equation; the term inside the parentheses,

r + gamma * max_{a'} Q(s', a') - Q(s, a),

is also known as the temporal difference (TD) error. As you can see, this is again an iterative process: a new target Q value is obtained from the output of the current Q function and the reward r collected from the environment.
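
The sketch below computes this loss for a single transition; q_network is a hypothetical placeholder standing in for the real neural network, and the numbers are made up.

    # The squared-TD-error loss of DQN for one transition (s, a, r, s2, done).
    GAMMA = 0.99

    def q_network(state):
        # stand-in for the DNN: returns one Q value per action for the given state
        return [0.1, 0.5, 0.2]

    def dqn_loss(transition):
        s, a, r, s2, done = transition
        q_sa = q_network(s)[a]                                   # Q(s, a) from the current network
        target = r if done else r + GAMMA * max(q_network(s2))   # r + gamma * max_a' Q(s', a')
        td_error = target - q_sa                                 # the temporal difference
        return td_error ** 2                                     # the loss to minimize

    print(dqn_loss(("s", 1, 1.0, "s2", False)))                  # (1.0 + 0.99 * 0.5 - 0.5) ** 2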

New challenges

However, the introduction of neural networks has also raised new problems.


The first problem is that neural networks are expensive to evaluate. So the form of the Q function is adjusted: instead of taking a (state, action) pair as input and outputting a single Q value, the network takes only the state as input and outputs the Q values of all actions at once, so that one forward pass gives the Q value of every action in that state.

The second problem is that, in RL, the transition data obtained from the agent's interaction with the environment is not independently and identically distributed, while training a neural network relies on training samples that are (approximately) independently distributed. Therefore a replay memory mechanism is introduced: transitions are saved into a replay memory of a certain size, and training minibatches are then resampled from it. In this way, the correlation between samples is much smaller.
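
Here is a minimal sketch of such a replay memory; the capacity, batch size, and stored transitions are assumptions for illustration.

    # A toy replay memory: store transitions, then sample decorrelated minibatches.
    import random
    from collections import deque

    replay_memory = deque(maxlen=10000)     # the oldest transitions are dropped when full

    def store(transition):
        replay_memory.append(transition)    # transition = (s, a, r, s2, done)

    def sample(batch_size=32):
        # random sampling breaks the temporal correlation between consecutive transitions
        return random.sample(list(replay_memory), min(batch_size, len(replay_memory)))

    for t in range(100):                    # pretend these come from interaction with the environment
        store((t, 0, 0.0, t + 1, False))
    print(len(sample()))                    # 32: a minibatch for one training step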

The third problem is that the new target Q value is computed from the current Q value, and the updated network is then used to compute the next round of target Q values; for the neural network this is equivalent to continually fitting a moving target, which can create a positive-feedback effect and make training diverge. The solution is to add a target network used only to compute the target Q values; the target network is synchronized to the current network's parameters only once in a while, so between two synchronizations the target network stays fixed, which alleviates the problem of the fitting target moving.
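
The sketch below shows the bookkeeping of that idea; the parameter vectors, the fake training update, and the synchronization interval are all made up for illustration.

    # A toy target-network schedule: the target parameters lag behind the current ones.
    current_params = [0.0, 0.0, 0.0]        # updated at every training step
    target_params = list(current_params)    # used only to compute target Q values

    SYNC_EVERY = 1000
    for step in range(1, 4501):
        # ... a real training step would update current_params from the TD loss here ...
        current_params = [p + 0.001 for p in current_params]
        if step % SYNC_EVERY == 0:
            target_params = list(current_params)   # periodically copy current -> target

    print(current_params[0], target_params[0])     # about 4.5 vs 4.0: the target stays fixed between syncs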

In the DQN work, a convolutional neural network (CNN) is used to process the image input, and its final output is the game's action commands; in this way the agent successfully learns to play the game based only on the screen output and the reward.

Here, 4 adjacent frames are stacked together as one state, so that the neural network has the opportunity to extract information that spans multiple frames, such as the speed of objects on the screen.
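
A minimal sketch of that frame stacking is shown below; the frames are hypothetical placeholders standing in for preprocessed screen images.

    # Stacking the 4 most recent frames into the state the network actually sees.
    from collections import deque

    frame_stack = deque(maxlen=4)

    def observe(frame):
        frame_stack.append(frame)
        while len(frame_stack) < 4:        # at the start of an episode, pad with the first frame
            frame_stack.append(frame)
        return tuple(frame_stack)          # the stacked state

    state = observe("frame_0")
    state = observe("frame_1")
    print(state)                           # ('frame_0', 'frame_0', 'frame_0', 'frame_1')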

It can be seen that in many games, DQN reaches the level of human players.


To be continued

Original address:  https://www.leiphone.com/news/201707/kkcminb5hnacxwjw.html
