Introduction to Reinforcement Learning (1): The Markov Decision Process


The theory of reinforcement learning algorithms can be traced back to the 1970s and 1980s, and for decades the field progressed quietly; it only really took off in the last few years. A representative milestone came in December 2013, when the DeepMind team first demonstrated a machine that used reinforcement learning to beat professional human players at Atari games; the results were published in 2015 in the top journal Nature. In 2014 Google acquired DeepMind. In March 2016, AlphaGo, a program developed by DeepMind using reinforcement learning, defeated the world Go champion Lee Sedol (Li Shishi) 4:1, and reinforcement learning attracted even more attention from scholars. Today, reinforcement learning algorithms are flourishing in areas such as games, robotics, and beyond. Major technology companies such as Google, Facebook, Baidu, and Microsoft all treat reinforcement learning as one of their key technologies. It is fair to say that reinforcement learning is changing and influencing the world, and whoever masters it holds a tool for changing and influencing the world.

There are already some reinforcement learning courses online, all from the world's top universities, such as David Silver's classic 2015 course, the Spring 2017 UC Berkeley course CS 294 Deep Reinforcement Learning taught by Levine, Finn, and Schulman, and Carnegie Mellon University's Spring 2017 course Deep RL and Control. As far as I know, there is no corresponding course in Chinese, so I decided to write this set of Chinese lecture notes to fill the gap. Because of my limited personal level, there will inevitably be misunderstandings and oversights; please forgive any mistakes, and you are welcome to discuss, criticize, and correct them. Such exchange is another purpose of this writing. A QQ discussion group has been set up (202570720); if you have questions, please post them in the group.

Figure 1.1 Reinforcement Learning principles Explained

Figure 1.1 illustrates the basic principle of reinforcement learning. To complete a task, the agent first interacts with its environment by taking an action A; as a result of the action and the environment, the agent enters a new state and the environment gives an immediate reward. Cycling in this way, the agent's interaction with the environment generates a large amount of data. The reinforcement learning algorithm uses this data to modify the agent's action policy, then interacts with the environment again, generates new data, and uses the new data to further improve its behavior. After many iterations of learning, the agent finally learns the optimal actions for completing the task, i.e., the optimal policy.

From this basic principle we can see some fundamental differences between reinforcement learning and other machine learning paradigms such as supervised and unsupervised learning. In supervised and unsupervised learning the data is static and no interaction with an environment is needed; in image recognition, for example, as long as there are enough samples we can simply feed the data into a deep network for training. The learning process of reinforcement learning, by contrast, is dynamic and interactive, and the required data is generated by continual interaction with the environment. Compared with supervised and unsupervised learning, reinforcement learning therefore involves more objects, such as actions, the environment, state transition probabilities, and the reward function. Reinforcement learning is also more like the way humans learn: by interacting with the surrounding environment, humans learn to walk, run, and work; by interacting with nature and the universe, humans created modern civilization. In addition, deep learning, as in image and speech recognition, solves the problem of perception, while reinforcement learning solves the problem of decision making. The ultimate goal of artificial intelligence is to make intelligent decisions based on perception, so combining deep learning with reinforcement learning, as has happened in recent years, is a promising path toward that goal.

Through decades of continuous effort and exploration, countless scholars have developed a framework that can formulate most reinforcement learning problems: the Markov decision process, or MDP for short. Below we introduce it step by step: first the Markov property, then the Markov process, and finally the Markov decision process.

The first concept is the Markov property: a system has the Markov property if its next state depends only on the current state and is unrelated to earlier states.

Definition: a state S_t is Markov if and only if P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t].

As the definition shows, the current state contains all the relevant information from the history; once the current state is known, the history can be discarded.
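To make this concrete, here is a minimal Python sketch that checks the Markov property empirically for a small chain: conditioning on the extra history S_{t-1} does not change the estimated distribution of S_{t+1}. The transition probabilities are arbitrary assumptions chosen only for illustration.

```python
import numpy as np
from collections import Counter

# A two-state Markov chain; P[i][j] = P(next = j | current = i).
# The numbers are illustrative assumptions.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
rng = np.random.default_rng(0)

# Simulate a long trajectory.
traj = [0]
for _ in range(100_000):
    traj.append(rng.choice(2, p=P[traj[-1]]))

# Estimate P(S_{t+1} = 1 | S_t = 0) separately for the two possible
# histories S_{t-1} = 0 and S_{t-1} = 1; the Markov property says
# the two estimates should agree.
counts = {0: Counter(), 1: Counter()}
for prev, cur, nxt in zip(traj, traj[1:], traj[2:]):
    if cur == 0:
        counts[prev][nxt] += 1

for prev in (0, 1):
    total = sum(counts[prev].values())
    print(f"P(S_t+1=1 | S_t=0, S_t-1={prev}) ~ {counts[prev][1] / total:.3f}")
# Both estimates are close to P[0][1] = 0.1, regardless of the earlier state.
```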

The Markov property describes an individual state, but what is really useful is a way to describe a sequence of states. The branch of mathematics used to describe sequences of random variables is the theory of stochastic processes: a stochastic process is a sequence of random variables. If every state in the sequence has the Markov property, the process is called a Markov process.

The second concept is the Markov process.

Definition of a Markov process: a Markov process is a tuple (S, P), where S is a finite set of states and P is the state transition probability matrix, with entries

P_{ss'} = P[S_{t+1} = s' | S_t = s].

Let us illustrate this with an example.

Figure 1.2 Example diagram of a Markov process

As shown in Figure 1.2, a student has 7 states {Entertainment, Class 1, Class 2, Class 3, Exam, Sleep, Thesis}, and the transition probabilities between the states are known. Possible state sequences for a day starting from Class 1 include:

Class 1 - Class 2 - Class 3 - Exam - Sleep

Class 1 - Class 2 - Sleep

Each of the state sequences above is called a Markov chain. Once the state transition probabilities are given, many different Markov chains can start from the same state. For games or robots, however, a Markov process is not enough to describe their characteristics: both games and robots interact with the environment and receive rewards from it, while a Markov process contains neither actions nor rewards. A Markov process that also takes actions (a policy) and rewards into account is called a Markov decision process.
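The following Python sketch samples Markov chains from a transition probability matrix. The state names follow Figure 1.2, but the probabilities in the matrix are illustrative assumptions, since the numbers from the figure are not reproduced here.

```python
import numpy as np

# States of the student Markov process (following Figure 1.2).
states = ["Entertainment", "Class 1", "Class 2", "Class 3", "Exam", "Sleep", "Thesis"]

# Transition matrix P[i][j] = P(next = j | current = i).
# NOTE: these numbers are illustrative assumptions, not the values in Figure 1.2.
P = np.array([
    # Ent   C1    C2    C3    Exam  Sleep Thesis
    [0.5,  0.5,  0.0,  0.0,  0.0,  0.0,  0.0],   # Entertainment
    [0.2,  0.0,  0.8,  0.0,  0.0,  0.0,  0.0],   # Class 1
    [0.0,  0.0,  0.0,  0.6,  0.0,  0.4,  0.0],   # Class 2
    [0.0,  0.0,  0.0,  0.0,  0.7,  0.0,  0.3],   # Class 3
    [0.0,  0.0,  0.0,  0.0,  0.0,  1.0,  0.0],   # Exam -> Sleep
    [0.0,  0.0,  0.0,  0.0,  0.0,  1.0,  0.0],   # Sleep (absorbing)
    [0.0,  0.0,  0.0,  0.0,  0.0,  1.0,  0.0],   # Thesis -> Sleep
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution

def sample_chain(start="Class 1", max_steps=20, rng=np.random.default_rng(0)):
    """Sample one Markov chain starting from `start` until 'Sleep' or max_steps."""
    chain = [start]
    s = states.index(start)
    for _ in range(max_steps):
        s = rng.choice(len(states), p=P[s])
        chain.append(states[s])
        if states[s] == "Sleep":          # treat Sleep as the terminal state
            break
    return chain

print(sample_chain())   # e.g. ['Class 1', 'Class 2', 'Class 3', 'Exam', 'Sleep']
```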

The third concept is the Markov decision process.

A Markov decision process is described by the tuple (S, A, P, R, γ), where S is a finite set of states, A is a finite set of actions, P is the state transition probability, R is the reward function, and γ is the discount factor used when computing the cumulative return. Note that, unlike in a Markov process, the state transition probability of a Markov decision process depends on the action:

P_{ss'}^a = P[S_{t+1} = s' | S_t = s, A_t = a].

As an example:

Figure 1.3 Markov decision process example diagram

Figure 1.3 shows an example of a Markov decision process; it corresponds to Figure 1.2. In Figure 1.3 the student has five states, the action set is A = {play, quit, study, paper, sleep}, and the immediate rewards are marked in red.
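As a rough sketch of how the tuple (S, A, P, R, γ) might be represented in code, here is one possible dictionary-based encoding of a small student-style MDP. The state names, the actions available in each state, the transition probabilities, and the rewards are placeholders, not the actual values from Figure 1.3.

```python
# A minimal dictionary-based representation of an MDP (S, A, P, R, gamma).
# All numbers below are illustrative placeholders, not the values in Figure 1.3.

gamma = 0.9                      # discount factor

# P[(s, a)] is a list of (next_state, probability) pairs.
P = {
    ("s1", "play"):  [("s1", 1.0)],
    ("s1", "quit"):  [("s2", 1.0)],
    ("s2", "study"): [("s3", 1.0)],
    ("s2", "sleep"): [("s5", 1.0)],
    ("s3", "study"): [("s4", 1.0)],
    ("s4", "paper"): [("s2", 0.2), ("s3", 0.4), ("s4", 0.4)],
    ("s4", "sleep"): [("s5", 1.0)],
}

# R[(s, a)] is the immediate reward for taking action a in state s.
R = {
    ("s1", "play"): -1.0, ("s1", "quit"): 0.0,
    ("s2", "study"): -2.0, ("s2", "sleep"): 0.0,
    ("s3", "study"): -2.0,
    ("s4", "paper"): 1.0, ("s4", "sleep"): 10.0,
}

states  = sorted({s for s, _ in P})            # finite state set S
actions = sorted({a for _, a in P})            # finite action set A
```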

The goal of reinforcement learning is, given a Markov decision process, to find the optimal policy. A policy is a mapping from states to actions, usually denoted by π; it specifies a distribution over the action set for each given state, i.e.

π(a|s) = P[A_t = a | S_t = s]     (1.1)

What does this formula mean? It defines the policy as a conditional probability distribution. I suspect that, at the sight of a probability formula, many readers' hearts sink and a feeling of rejection arises. But to fully master the tools of reinforcement learning, probability formulas are essential; only by mastering them can we really grasp the essence of reinforcement learning.

Let us briefly explain the important role of probability in reinforcement learning. First, the policies used in reinforcement learning are often stochastic. The advantage of a stochastic policy is that exploration can be coupled into the sampling process; exploration means that the agent tries other actions in order to find a better policy. Second, in real applications there are all kinds of noise, usually normally distributed, and removing that noise also requires probabilistic knowledge.

Returning to formula (1.1): the policy specifies a probability for each action in each state. If the policy is deterministic, it specifies a single definite action in each state.

For example, one student's policy is to play with probability 0.8 in a given state and not play with probability 0.2; evidently this student prefers playing.

Another student's policy is to play with probability 0.3 in the same state; evidently this student does not like playing much. And so on: every student has his or her own policy. Reinforcement learning looks for the optimal policy, where "optimal" means the one with the largest total return.
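The following sketch writes these two policies as conditional distributions π(a|s), as in formula (1.1), and samples actions from them. Only the play probabilities (0.8 vs 0.3) come from the text; the state name "s1", the alternative action "study", and its probabilities are assumptions made for illustration.

```python
import random

# Two stochastic policies pi(a | s), written as nested dicts.
# The 0.8/0.2 and 0.3/0.7 splits over "play" follow the text; the state
# name "s1" and the alternative action "study" are assumptions.
policy_player  = {"s1": {"play": 0.8, "study": 0.2}}   # this student likes to play
policy_studier = {"s1": {"play": 0.3, "study": 0.7}}   # this student prefers to study

def sample_action(policy, state, rng=random.Random(0)):
    """Sample an action from the distribution pi(. | state) -- formula (1.1)."""
    dist = policy[state]
    actions, probs = zip(*dist.items())
    return rng.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy_player, "s1"))   # most often 'play'
print(sample_action(policy_studier, "s1"))  # most often 'study'
```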

Once a policy is given, we can compute the cumulative return. First define the cumulative return:

G_t = R_{t+1} + γR_{t+2} + γ^2 R_{t+3} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}     (1.2)

Given a policy, the student's state sequence might, for instance, be any one of the chains listed above.

Under that policy, the cumulative return along each chain can be computed with formula (1.2), and different chains give different values. Because the policy is stochastic, the cumulative return is also stochastic. To evaluate how good a state is, we need a deterministic quantity that describes its value; the natural idea is to use the cumulative return, but the cumulative return is a random variable rather than a deterministic value, so it cannot serve directly. Its expectation, however, is a deterministic value, and it is this expectation that we define as the state-value function.
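As a small illustration of formula (1.2), the following sketch computes the discounted cumulative return of one reward sequence; the reward values are made-up placeholders.

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative return G_t = R_{t+1} + gamma*R_{t+2} + ...  (formula 1.2)."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Rewards collected along one sampled episode (placeholder numbers).
episode_rewards = [-2.0, -2.0, -2.0, 10.0]
print(discounted_return(episode_rewards, gamma=0.9))
```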

The state-value function:

When the agent follows a policy π, the cumulative return obeys a distribution, and the expected value of the cumulative return from state s is defined as the state-value function:

v_π(s) = E_π[G_t | S_t = s] = E_π[ Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s ]     (1.3)

Note: the state-value function corresponds to a particular policy, because the policy determines the distribution of the cumulative return G.
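Because the state-value function is an expectation over a random return, it can be approximated by averaging the returns of many sampled episodes. The sketch below illustrates definition (1.3) by Monte Carlo averaging; the toy episode generator at the bottom is a made-up stand-in for sampling trajectories under a policy π, not the student example from the figures.

```python
import random

def mc_state_value(sample_episode, start_state, gamma=0.9, n_episodes=1000,
                   rng=random.Random(0)):
    """Monte Carlo estimate of v_pi(s) = E_pi[G_t | S_t = s]  (formula 1.3).

    `sample_episode(start_state, rng)` must return the list of rewards
    observed along one episode that starts in `start_state` and follows pi.
    """
    total = 0.0
    for _ in range(n_episodes):
        rewards = sample_episode(start_state, rng)
        g = sum((gamma ** k) * r for k, r in enumerate(rewards))  # G_t
        total += g
    return total / n_episodes   # the sample mean approximates the expectation

# Toy episode generator: a made-up two-step task used only to exercise the code.
def toy_episode(start_state, rng):
    first = -2.0 if rng.random() < 0.7 else -1.0   # stochastic first reward
    return [first, 10.0]                            # then a terminal reward

print(mc_state_value(toy_episode, start_state="s1"))
```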

Figure 1.4 Schematic diagram of the state value function


Figure 1.4 shows the state-value function corresponding to Figure 1.3. The number in each white circle is the value of that state under the given policy.


Accordingly, the state-action value function is defined as:

q_π(s, a) = E_π[G_t | S_t = s, A_t = a]     (1.4)

Formulas (1.3) and (1.4) give the definitions of the state-value function and the state-action value function, respectively, but in actual computation and programming we do not work directly from these definitions. Next we interpret the definitions from a different angle.

Bellman equations for the state-value function and the state-action value function

Starting from the definition (1.3) of the state-value function, we can derive:

v(s) = E[G_t | S_t = s]
     = E[R_{t+1} + γR_{t+2} + γ^2 R_{t+3} + ... | S_t = s]
     = E[R_{t+1} + γ(R_{t+2} + γR_{t+3} + ...) | S_t = s]
     = E[R_{t+1} + γG_{t+1} | S_t = s]
     = E[R_{t+1} + γv(S_{t+1}) | S_t = s]     (1.5)

A note on the last equality: it follows from the law of total expectation, E[γG_{t+1} | S_t = s] = γ E[ E[G_{t+1} | S_{t+1}] | S_t = s ] = γ E[v(S_{t+1}) | S_t = s].

It is important to be clear about which random variables each expectation is taken over.


Similarly, we can obtain the Bellman equation for the state-action value function:

q_π(s, a) = E_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]     (1.6)

We now derive the concrete computational forms of the state-value function and the state-action value function.

Figures 1.5 and 1.6 show the concrete computation of the state-value function and the state-action value function, respectively. A hollow circle represents a state, and a solid circle represents a state-action pair.

Figure 1.5 Schematic diagram of the state-value function computation

Figure 1.5 illustrates the computation of the state-value function. The formula corresponding to Figure 1.5(B) is:

v_π(s) = Σ_{a∈A} π(a|s) q_π(s, a)     (1.7)

Figure 1.5(B) thus shows the relationship between the state-value function and the state-action value function. Figure 1.5(C) computes the state-action value function:

q_π(s, a) = R_s^a + γ Σ_{s'} P_{ss'}^a v_π(s')     (1.8)

Substituting (1.8) into (1.7) gives:

v_π(s) = Σ_{a∈A} π(a|s) [ R_s^a + γ Σ_{s'} P_{ss'}^a v_π(s') ]     (1.9)


Figure 1.6 Computation of the state-action value function

In Figure 1.6(C),

v_π(s') = Σ_{a'∈A} π(a'|s') q_π(s', a')     (1.10)

Substituting (1.10) into (1.8) gives the Bellman equation for the state-action value function:

q_π(s, a) = R_s^a + γ Σ_{s'} P_{ss'}^a Σ_{a'∈A} π(a'|s') q_π(s', a')

Formula (1.9) can be verified against Figure 1.4: select a state, compute its value from the immediate rewards and the values of its successor states using (1.9), and check that the result matches the number shown in the figure.
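Equations (1.7) and (1.8) also suggest a simple way to compute v_π numerically: start from an arbitrary estimate and repeatedly apply the two equations in turn (iterative policy evaluation). The tiny MDP and the uniform random policy in the sketch below are made-up placeholders used only to exercise the equations; they are not the student example from the figures.

```python
# A tiny made-up MDP used only to exercise equations (1.7)-(1.9).
states, actions, gamma = ["s1", "s2"], ["a1", "a2"], 0.9
# P[(s, a)] -> list of (next_state, probability); R[(s, a)] -> immediate reward.
P = {("s1", "a1"): [("s1", 0.5), ("s2", 0.5)], ("s1", "a2"): [("s2", 1.0)],
     ("s2", "a1"): [("s2", 1.0)],              ("s2", "a2"): [("s1", 1.0)]}
R = {("s1", "a1"): 1.0, ("s1", "a2"): 0.0, ("s2", "a1"): 0.0, ("s2", "a2"): 2.0}
pi = {s: {"a1": 0.5, "a2": 0.5} for s in states}   # uniform random policy

v = {s: 0.0 for s in states}
for _ in range(200):                               # iterate until (near) convergence
    # Equation (1.8): q_pi(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) * v_pi(s')
    q = {(s, a): R[(s, a)] + gamma * sum(p * v[s2] for s2, p in P[(s, a)])
         for s in states for a in actions}
    # Equation (1.7): v_pi(s) = sum_a pi(a|s) * q_pi(s,a)
    v = {s: sum(pi[s][a] * q[(s, a)] for a in actions) for s in states}

print({s: round(val, 3) for s, val in v.items()})
```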
