"Reprint please indicate the source" Chenrudan.github.io
This is a summary note on the first lesson of David Silver's reinforcement learning open course. The first lesson explains how reinforcement learning shows up in many fields, what problem it mainly solves, how it differs from supervised learning, what parts a complete algorithm pipeline consists of, and what an agent contains, and it introduces some of the basic concepts involved in reinforcement learning.
Course video: RL Course by David Silver - Lecture 1: Introduction to Reinforcement Learning.
Course slides: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/intro_RL.pdf
This article is a summary of and commentary on the course, organized according to my own understanding. Given the limits of my knowledge and my English listening skills, some points may be misunderstood; discussion is welcome.

1. What is reinforcement learning?
Reinforcement learning is the product of many intersecting disciplines, and its essence is solving the problem of decision making, that is, learning to make decisions automatically. In computer science it appears as machine learning algorithms. In engineering it appears as deciding on a sequence of actions to obtain the best result. In neuroscience it relates to understanding how the human brain makes decisions, mainly through the study of the reward system. In psychology it studies how animals make decisions and what causes animal behavior. In economics it appears in the study of game theory. All of this comes down to the same question: why can people make optimal decisions, and how do they do it?
Reinforcement learning is a sequential decision making problem: it requires continuously selecting behaviors so that, once those behaviors are completed, the best result is obtained. There are no labels telling the algorithm what to do. Instead, the algorithm first tries some behavior and gets a result, judges whether that result is right or wrong to produce feedback on the previous behavior, and then adjusts the behavior based on that feedback. Through constant adjustment, as sketched below, the algorithm learns which behavior to choose in which circumstances to obtain the best result.
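As a minimal sketch of this trial-and-error idea (not from the course; the reward probabilities and the epsilon-greedy rule here are assumptions chosen purely for illustration), consider an algorithm that receives only a scalar reward after each action and gradually shifts toward the action that pays off best:

```python
import random

# Hidden reward probabilities of three possible actions (assumed for illustration).
true_means = [0.2, 0.5, 0.8]
value_estimates = [0.0, 0.0, 0.0]  # the learner's running estimate per action
counts = [0, 0, 0]

for step in range(1000):
    # Occasionally explore a random action, otherwise exploit the best estimate.
    if random.random() < 0.1:
        action = random.randrange(3)
    else:
        action = max(range(3), key=lambda a: value_estimates[a])

    # The only feedback is a scalar reward, never a "correct answer" label.
    reward = 1.0 if random.random() < true_means[action] else 0.0

    # Adjust behavior based on the feedback (incremental averaging).
    counts[action] += 1
    value_estimates[action] += (reward - value_estimates[action]) / counts[action]

print(value_estimates)  # the estimates should approach the hidden means
```

After enough trials the learner prefers the third action, even though no one ever told it which action was correct; it was only told, after the fact, how good each attempt turned out to be.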
There are many differences between reinforcement learning and supervised learning. First, supervised learning has labels that tell the algorithm which input corresponds to which output, while reinforcement learning has no label telling it which behavior to take in a given situation; there is only a reward signal, ultimately returned after a series of actions, from which to judge whether the current choice was good or bad. Second, the feedback in reinforcement learning is delayed: it may take many steps before we know whether an earlier choice was good or bad, whereas in supervised learning a bad choice is immediately fed back to the algorithm. Moreover, the inputs in reinforcement learning are always changing and are not independent and identically distributed as they are in supervised learning; every time the algorithm takes an action, it affects the input to the next decision.

2. Reinforcement Learning Composition
Figure 1: Reinforcement learning components (image source [1])
The decision process of reinforcement learning is shown above. You need to construct an agent (the brain in the diagram) that can perform actions, such as deciding which direction a robot should walk or where to place a chess piece. The agent receives an observation of the current environment, such as the scene currently captured by the robot's camera, and also receives a reward when it performs an action. That is, the workflow of the agent at step $t$ is to perform an action $a_t$, obtain the observation $o_t$ of the environment after the action, and obtain the reward $r_t$ as feedback for this action. The environment is the object the agent interacts with; it is not controllable by the agent, and at the beginning the agent does not know how the environment will react to different actions. The environment tells the agent the current state of the environment through observations, and at the same time can give the agent feedback based on the possible final results. For example, a chess board is an environment: from the current board position it can estimate the odds of each side winning or losing and give a reward accordingly. Thus, at step $t$, the environment's workflow is to receive an action $a_t$ and respond by sending the resulting observation and the evaluated reward back to the agent. The reward $r_t$ is a scalar feedback value indicating how good or bad the decision made at step $t$ is, and the optimization goal of the whole of reinforcement learning is to maximize the cumulative reward. For example, in a shooting game, hitting an enemy plane increases the final score, so the reward for that step is positive.
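The interaction loop just described can be written down directly. Below is a minimal sketch with hypothetical `Agent` and `Environment` classes (these names and their placeholder dynamics are my own, not from the course): at each step $t$ the agent picks $a_t$, the environment returns $o_t$ and $r_t$, and everything is appended to the history.

```python
import random

class Environment:
    def step(self, action):
        # Placeholder dynamics: this toy environment rewards action 1.
        observation = random.random()
        reward = 1.0 if action == 1 else 0.0
        return observation, reward

class Agent:
    def act(self, observation):
        # Placeholder policy: choose randomly between two actions.
        return random.choice([0, 1])

env, agent = Environment(), Agent()
history, total_reward = [], 0.0
observation = 0.0  # initial observation

for t in range(100):
    a_t = agent.act(observation)        # agent performs action a_t
    o_t, r_t = env.step(a_t)            # environment returns o_t and r_t
    history.append((a_t, o_t, r_t))     # H_t = a_1, o_1, r_1, ..., a_t, o_t, r_t
    total_reward += r_t                 # the quantity RL tries to maximize
    observation = o_t
```

Everything a reinforcement learning algorithm does happens inside this loop; the `history` list accumulated here is exactly the quantity $H_t$ defined in the next section.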
3. Some variables

History is the sequence of all actions, observations, and rewards up to step $t$: $H_t = a_1, o_1, r_1, \ldots, a_t, o_t, r_t$.
Environment state, $s^e_t$: the current state of the environment, reflecting how the environment has changed. What needs to be understood here is that the environment's own state and what the environment feeds back to the agent are not necessarily the same. For example, when a robot is walking, the current environment state is a definite position, but its camera can only capture the surrounding scene and cannot tell the agent the specific location; the captured photo can be regarded as an observation of the environment. In other words, the agent does not always know exactly how the environment changed; it only sees one result of that change.
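To make this distinction concrete, here is a small sketch of the robot example above (the `RobotWorld` class and its noise model are invented for illustration): the environment's internal state $s^e_t$ is the robot's exact position, but the agent only ever receives a noisy camera-like observation $o_t$ that does not reveal that position directly.

```python
import random

class RobotWorld:
    def __init__(self):
        self.position = (0, 0)  # environment state s^e_t: the exact location, hidden from the agent

    def step(self, action):
        dx, dy = {"north": (0, 1), "south": (0, -1),
                  "east": (1, 0), "west": (-1, 0)}[action]
        x, y = self.position
        self.position = (x + dx, y + dy)
        # Observation o_t: only a noisy local reading, not the true position.
        return (x + dx + random.gauss(0, 0.5), y + dy + random.gauss(0, 0.5))

world = RobotWorld()
o_t = world.step("north")
print(o_t)              # what the agent sees: a blurred estimate
print(world.position)   # the true environment state, which the agent never sees
```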
Agent state, $s^a_t$