"Reprinted" Enhancement Learning (reinforcement learning and Control)

Source: Internet
Author: User

Reinforcement Learning (Reinforcement Learning and Control) [PDF version: reinforcement learning.pdf]

In the previous discussions, we were always given samples x, with or without labels y, and then fitted, classified, clustered, or dimensionality-reduced those samples. For many sequential decision or control problems, however, it is hard to obtain such well-structured samples. Consider the problem of controlling a four-legged robot: at the start we do not know which leg it should move first, and during movement we do not know how to make the robot automatically find the right direction.

Similarly, if you want to design a chess AI, each move is a decision. For simple games there are heuristic methods such as A* search, but in complex positions the machine still needs to look several moves ahead to decide which move is better, so a better decision-making method is needed.

One way of thinking about this kind of control and decision problem is as follows. We design a reward function: if the learning agent (the four-legged robot or the chess program above) obtains a good result after a decision, we give the agent some reward (the reward function returns a positive value); if it obtains a poor result, the reward is negative. For example, for the quadruped robot, if it moves one step forward (closer to the target) the reward is positive, and if it moves backward the reward is negative. If we can evaluate every step and obtain the corresponding reward, the problem becomes tractable: we just need to find the path with the largest total reward (the sum of the per-step rewards), and that path is considered the best one.

Reinforcement learning has been applied successfully in many fields, such as autonomous helicopter flight, robot control, routing in mobile networks, market decision making, industrial control, and efficient web page indexing.

Next, let us introduce Markov decision processes (MDPs).

1. Markov decision processes

A Markov decision process consists of a five-tuple $(S, A, \{P_{sa}\}, \gamma, R)$, where:

* S is the set of states. (For example, in an autonomous helicopter system, the set of all possible position coordinates of the helicopter forms the state set.)

* A is the set of actions. (For example, the joystick directions used to steer the helicopter: forward, backward, and so on.)

* $\{P_{sa}\}$ are the state transition probabilities. Transitioning from one state in S to another requires an action a to be taken; $P_{sa}$ is the probability distribution over the states reached when action a is executed in state s (after executing a, the current state may jump to any of several next states).

* $\gamma \in [0, 1)$ is the discount factor.

* $R: S \times A \mapsto \mathbb{R}$ is the reward function. The reward is often written as a function of the state alone, in which case it becomes $R: S \mapsto \mathbb{R}$.

The dynamics of an MDP proceed as follows: the agent starts in some initial state $s_0$, then picks an action $a_0 \in A$ to execute; after executing it, the agent is randomly transferred to the next state $s_1 \sim P_{s_0 a_0}$. It then executes an action $a_1$, transitions to $s_2 \sim P_{s_1 a_1}$, executes $a_2$, and so on. The whole process can be represented as

$$s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \xrightarrow{a_3} \cdots$$

If you have a good understanding of HMMs (hidden Markov models), this process is easy to understand.
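To make the dynamics concrete, here is a minimal sketch in Python (not from the original notes): it assumes a toy MDP with 3 states and 2 actions, stores $P_{sa}$ as an array `P[s, a, s']` and the rewards as a vector `R[s]`, and rolls out a few transitions with randomly chosen actions.

```python
# A minimal sketch (toy MDP assumed, not from the original notes) of MDP dynamics.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 3, 2
P = np.array([                            # P[s, a] is the distribution over next states
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],   # from state 0
    [[0.0, 0.7, 0.3], [0.5, 0.5, 0.0]],   # from state 1
    [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],   # from state 2
])
R = np.array([0.0, -0.1, 1.0])            # reward depends only on the state
gamma = 0.9                               # discount factor

s = 0                                     # initial state s0
path = []
for t in range(5):
    a = rng.integers(n_actions)           # pick an action (here: at random)
    s_next = rng.choice(n_states, p=P[s, a])
    path.append((s, a))
    s = s_next
print(path, "-> final state", s)
```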

We define the total payoff along a path like the one above as the discounted sum of the rewards collected:

$$R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \cdots$$

If R depends only on the state, this can be written as

$$R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots$$

Our goal is to choose actions over time so as to maximize the expected value of this discounted sum of rewards:

$$\max \; \mathrm{E}\left[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots\right]$$

Notice that the reward at time t is discounted by a factor of $\gamma^t$, a gradual decay: the further into the future a state lies, the less it contributes to the total reward. To maximize the expected value, we therefore want to collect large rewards as early as possible and push small (or negative) ones as far back as possible.
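As a small illustration of the discounting (toy numbers assumed), the following snippet computes the discounted total reward of one visited state sequence; rewards collected later are scaled down by $\gamma^t$.

```python
# A small illustration of discounting (toy numbers assumed): the total payoff of
# one visited state sequence, R(s0) + gamma*R(s1) + gamma^2*R(s2) + ...
gamma = 0.9
R = {0: 0.0, 1: -0.1, 2: 1.0}             # toy reward per state
states_visited = [0, 1, 1, 2, 2]          # s0, s1, s2, s3, s4

total = sum(gamma ** t * R[s] for t, s in enumerate(states_visited))
print(total)                              # later rewards are discounted by gamma^t
```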

When the agent is in some state, it selects the next action to execute according to some rule and then transitions to another state s'. We call this action-selection rule a policy: each policy $\pi$ is a mapping from states to actions, $\pi: S \mapsto A$. Once $\pi$ is given, we know which action should be executed next in every state.

To tell good policies from bad ones, i.e. to measure how good or bad the result is when a policy $\pi$ is followed from the current state, we define the value function, also called the discounted cumulative reward:

$$V^{\pi}(s) = \mathrm{E}\left[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \;\middle|\; s_0 = s, \pi\right]$$

As you can see, given the current state s and a chosen policy $\pi$, the value function is the expected discounted sum of rewards. This is easy to understand: fixing $\pi$ fixes a plan of future actions; the plan passes through a sequence of states, each visited state yields some reward, and states closer to the current one carry higher weight in evaluating the plan. It is like chess: in the current position, different candidate move plans correspond to different policies $\pi_1, \pi_2, \ldots$, and we evaluate each plan by judging the future positions ($s_1, s_2, \ldots$) it leads to. We usually think a few moves ahead, but pay the most attention to the very next position.

Viewed recursively, the value function of the current state s can be regarded as the sum of the reward of the present state, R(s), and the value function of the next state. That is, we rewrite the formula as

$$V^{\pi}(s) = \mathrm{E}\left[R(s_0) + \gamma\big(R(s_1) + \gamma R(s_2) + \gamma^2 R(s_3) + \cdots\big) \;\middle|\; s_0 = s, \pi\right]$$

Note, however, that although $\pi$ is given, so that the action $a = \pi(s)$ in a given state s is unique, the transition to the next state is not deterministic: executing a in s may lead to several different states. For example, if the action a is to roll a die, there can be 6 possible next states. Applying the Bellman equation to the expression above, we obtain

$$V^{\pi}(s) = R(s) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s')\, V^{\pi}(s') \tag{1}$$

Here s' denotes the next state.

The first term, R(s), is called the immediate reward, the reward for the current state. The second term can also be written as $\gamma\, \mathrm{E}_{s' \sim P_{s\pi(s)}}\!\left[V^{\pi}(s')\right]$, the expectation of the value function of the next state, where the next state s' follows the distribution $P_{s\pi(s)}$.

When the number of states is finite, we can use the equation above to compute $V^{\pi}(s)$ for every s (a terminal state has no second term). Writing one such equation per state gives a system of $|S|$ linear equations in $|S|$ unknowns, which can be solved directly.
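A minimal sketch of this direct solution (the toy MDP arrays are assumptions): for a fixed policy $\pi$, the Bellman equations form the linear system $(I - \gamma P_{\pi})V = R$, which numpy can solve in one call.

```python
# A minimal sketch (toy MDP assumed) of solving the Bellman equations for a
# fixed policy pi directly: one equation and one unknown V(s) per state.
import numpy as np

P = np.array([                           # P[s, a, s']
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
    [[0.0, 0.7, 0.3], [0.5, 0.5, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],
])
R = np.array([0.0, -0.1, 1.0])
gamma = 0.9
pi = np.array([0, 1, 1])                 # an arbitrary fixed policy: pi(s) = action index

P_pi = P[np.arange(len(R)), pi]          # P_pi[s, s'] = P_{s, pi(s)}(s')
V_pi = np.linalg.solve(np.eye(len(R)) - gamma * P_pi, R)
print(V_pi)                              # V^pi(s) for every state s
```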

Of course, the purpose of computing V is to find, for the current state s, the optimal policy. The optimal value function $V^{*}$ is defined as

$$V^{*}(s) = \max_{\pi} V^{\pi}(s)$$

That is, among all candidate policies we pick the one that maximizes the discounted cumulative reward.

The Bellman equation form of this definition is

$$V^{*}(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s')\, V^{*}(s') \tag{2}$$

The first term does not depend on a, so it is left outside the max. The second term chooses, for each state s, the action a that maximizes the expected value of the next state, where s' is drawn from the distribution $P_{sa}$ induced by executing a.


Having defined the optimal $V^{*}$, we define the optimal policy $\pi^{*}: S \mapsto A$ as

$$\pi^{*}(s) = \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s')\, V^{*}(s') \tag{3}$$

That is, $\pi^{*}(s)$ selects the action that leads to the best next step from each state s.

From the definitions above, we have

$$V^{*}(s) = V^{\pi^{*}}(s) \geq V^{\pi}(s) \quad \text{for every policy } \pi.$$

In words: the optimal value of the current state, $V^{*}(s)$, is the value obtained by following the optimal policy $\pi^{*}$, and the return of the optimal policy is at least as good as that of any other policy.

It is important to note that if we can find the optimal action a for every state s, then globally we obtain a single mapping from states to actions, and that mapping is the optimal policy $\pi^{*}$. The same $\pi^{*}$ is optimal for all states: the action it prescribes for each state does not depend on which initial state $s_0$ we start from.
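As a sketch of formula (3) on toy data (all numbers are assumptions): given a value function V (pretend it is $V^{*}$), extracting the optimal policy is a single argmax per state.

```python
# Sketch of formula (3) on toy data (all numbers assumed): pick, in each state,
# the action maximizing the expected value of the next state.
import numpy as np

P = np.array([                       # P[s, a, s'], same toy transitions as before
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
    [[0.0, 0.7, 0.3], [0.5, 0.5, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],
])
V = np.array([3.0, 4.5, 9.8])        # stand-in for V*(s)

expected_next_value = P @ V          # shape (n_states, n_actions)
pi_star = np.argmax(expected_next_value, axis=1)
print(pi_star)                       # one action index per state
```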

2. Value iteration and policy iteration

The last section gave a recursive formula and an optimization objective; this section discusses two efficient algorithms for computing the optimal policy of a finite-state MDP. Here we consider only MDPs with finite state and action sets.

* Value iteration

1. Initialize V(s) := 0 for every state s.

2. Loop until convergence {

For each state s, update

$$V(s) := R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s')\, V(s')$$

}

Value iteration repeatedly applies formula (2) from the previous section as an update rule.

The inner loop (the update over all states) can be implemented in two ways:

1. Synchronous iteration

In the first iteration after initialization, every V(s) starts at 0, so the new value computed for each state is V(s) = R(s) + 0 = R(s). As each state is processed, its new V(s) is saved but not applied immediately; only after the new values have been computed for all states are they written back in one batch.

Thus, after the first iteration, V(s) = R(s) for every s.

2. Asynchronous iteration

In contrast to synchronous iteration, asynchronous iteration updates each V(s) in place as soon as its new value is computed, without waiting for the full sweep. As a result, after the first iteration most of the V(s) > R(s).

Whichever variant is used, V(s) eventually converges to V*(s). Once V* is known, formula (3) gives the corresponding optimal policy; of course, the policy can also be read off while V* is being computed.
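A minimal sketch of synchronous value iteration (the toy MDP arrays are assumptions, not from the original notes): all new values are computed from the old V and written back in one batch, and the policy is read off at the end via formula (3).

```python
# A minimal sketch of synchronous value iteration on a toy MDP (data assumed).
import numpy as np

P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
    [[0.0, 0.7, 0.3], [0.5, 0.5, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],
])
R = np.array([0.0, -0.1, 1.0])
gamma = 0.9

V = np.zeros(len(R))                              # 1. initialize V(s) = 0
for _ in range(10_000):                           # 2. loop until convergence
    V_new = R + gamma * np.max(P @ V, axis=1)     # compute every new value first,
    delta = np.max(np.abs(V_new - V))             # then write them back in one batch
    V = V_new                                     # (the synchronous variant)
    if delta < 1e-8:
        break

pi_star = np.argmax(P @ V, axis=1)                # formula (3): read off the policy
print(V, pi_star)
```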

* Policy iteration

Value iteration makes V converge to V*; policy iteration instead focuses on the policy itself, making $\pi$ converge to $\pi^{*}$.

1. Initialize $\pi$ randomly as some mapping from S to A.

2. Loop until convergence {

(a) Let $V := V^{\pi}$.

(b) For each state s, update

$$\pi(s) := \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s')\, V(s')$$

}

The V in step (a) can be obtained from the Bellman equation (1) given earlier, i.e. by solving the linear system

$$V^{\pi}(s) = R(s) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s')\, V^{\pi}(s'),$$

which yields $V^{\pi}(s)$ for every state s.

Step (b) then picks, for each state s, the action a that is optimal with respect to the V obtained in step (a), and updates $\pi$ accordingly.
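A minimal sketch of policy iteration on the same toy MDP (arrays assumed): step (a) evaluates the current policy by solving the linear Bellman system, step (b) improves it greedily, and the loop stops once the policy no longer changes.

```python
# A minimal sketch of policy iteration (toy MDP assumed).
import numpy as np

P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
    [[0.0, 0.7, 0.3], [0.5, 0.5, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],
])
R = np.array([0.0, -0.1, 1.0])
gamma = 0.9
n_states = len(R)

pi = np.zeros(n_states, dtype=int)                     # 1. arbitrary initial policy
while True:
    P_pi = P[np.arange(n_states), pi]                  # (a) V := V^pi via the linear system
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
    pi_new = np.argmax(P @ V, axis=1)                  # (b) greedy improvement
    if np.array_equal(pi_new, pi):                     # converged: policy is stable
        break
    pi = pi_new
print(pi, V)
```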

It is hard to say in general which of value iteration and policy iteration is better. For small MDPs, policy iteration usually converges very quickly. For large MDPs (many states), however, value iteration is easier, since it does not require solving a large system of linear equations at every step.

3. Parameter Estimation in MDP

In the MDPs discussed so far, the state transition probabilities $P_{sa}$ and the reward function R(s) were assumed known. In many practical problems, however, these parameters are not given explicitly and must be estimated from data (usually S, A, and $\gamma$ are known).

Suppose we have observed a number of state transition paths of the form

$$s_0^{(j)} \xrightarrow{a_0^{(j)}} s_1^{(j)} \xrightarrow{a_1^{(j)}} s_2^{(j)} \xrightarrow{a_2^{(j)}} s_3^{(j)} \xrightarrow{a_3^{(j)}} \cdots$$

Here $s_i^{(j)}$ is the state at time i of the j-th transition path, and $a_i^{(j)}$ is the action executed in that state. Each path contains a finite number of states; in practice, each chain is terminated either when it reaches a final state or after a specified number of steps.

If we obtain many such transition chains (which constitute our sample), we can estimate the state transition probabilities by maximum likelihood:

$$P_{sa}(s') = \frac{\#\ \text{times we took action } a \text{ in state } s \text{ and got to } s'}{\#\ \text{times we took action } a \text{ in state } s}$$

The numerator is the number of times action a was taken in state s and led to s', and the denominator is the number of times a was executed in state s. Their ratio is the estimated probability of transitioning to s' after executing a in state s.

To avoid a denominator of 0 we apply smoothing: if the denominator is 0, i.e. the sample never shows action a being executed in state s, we set $P_{sa}(s') = 1/|S|$, treating the transition probabilities as uniform.

The estimate above is computed from historical data, but the same formula supports online updates: when new transition paths arrive, we simply add the new counts to the numerator and denominator. After the correction the transition probabilities change, new paths are then collected under the updated probabilities, and the estimates become more and more accurate.

Similarly, if the reward function is unknown, we take R(s) to be the average of the rewards observed in state s.
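A sketch of these maximum-likelihood estimates (the trajectory format and the toy sizes are assumptions): transitions are counted per (s, a, s'), never-observed (s, a) pairs fall back to the uniform $1/|S|$ distribution, and R(s) is the average observed reward.

```python
# Sketch of the MLE estimates of P_sa and R from observed paths (format assumed).
import numpy as np

n_states, n_actions = 3, 2
# Each observed path is a list of (state, action, reward, next_state) tuples.
trajectories = [
    [(0, 1, 0.0, 1), (1, 0, -0.1, 2)],
    [(0, 0, 0.0, 0), (0, 1, 0.0, 1), (1, 1, -0.1, 1)],
]

counts = np.zeros((n_states, n_actions, n_states))
reward_sum = np.zeros(n_states)
reward_cnt = np.zeros(n_states)
for path in trajectories:
    for s, a, r, s_next in path:
        counts[s, a, s_next] += 1                  # numerator counts
        reward_sum[s] += r
        reward_cnt[s] += 1

totals = counts.sum(axis=2, keepdims=True)         # times a was executed in s (denominator)
P_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)
R_hat = reward_sum / np.maximum(reward_cnt, 1)     # average observed reward per state
print(P_hat)
print(R_hat)
```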

Once the transition probabilities and reward function have been estimated, we can use value iteration or policy iteration to solve the MDP. For example, a procedure that combines parameter estimation with value iteration (when the state transition probabilities are unknown) is as follows:

1. Initialize $\pi$ randomly.

2. Loop until convergence {

(a) Execute $\pi$ in the MDP and count the state transitions observed in the resulting sample; use these counts to update the estimates of $P_{sa}$ and R.

(b) Use the estimated parameters to update V (with the value iteration method of the previous section).

(c) Derive a new greedy policy $\pi$ from the updated V (using formula (3)).

}

Step (b) performs value iteration, which is itself a loop. In the previous section we computed V by initializing it to 0 and iterating; if, nested inside the procedure above, V were re-initialized to 0 on every outer iteration, the whole process would be slow. A common speed-up is to initialize V with the V obtained in the previous outer iteration, i.e. to warm-start V from the previous result.
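A self-contained sketch of the combined procedure (the toy environment and all numbers are assumptions, rewards are assumed known for brevity, and exploration issues are ignored): each outer iteration executes the current policy to collect transitions, re-estimates $P_{sa}$, runs value iteration warm-started from the previous V, and then switches to the greedy policy for the new V.

```python
# A self-contained sketch of the combined estimation + value-iteration loop
# (toy environment assumed; rewards known; exploration ignored).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

P_true = np.array([                        # ground truth, known only to the simulator
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
    [[0.0, 0.7, 0.3], [0.5, 0.5, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],
])
R = np.array([0.0, -0.1, 1.0])

counts = np.zeros((n_states, n_actions, n_states))
V = np.zeros(n_states)
pi = rng.integers(n_actions, size=n_states)            # 1. random initial policy

for _ in range(50):                                    # 2. outer loop
    s = 0
    for _ in range(200):                               # (a) execute pi, count transitions
        a = pi[s]
        s_next = rng.choice(n_states, p=P_true[s, a])
        counts[s, a, s_next] += 1
        s = s_next
    totals = counts.sum(axis=2, keepdims=True)
    P_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)

    while True:                                        # (b) value iteration, warm-started
        V_new = R + gamma * np.max(P_hat @ V, axis=1)  #     from the previous V
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < 1e-8:
            break

    pi = np.argmax(P_hat @ V, axis=1)                  # (c) greedy policy for the new V

print(pi, V)
```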

4. Summary

First, the MDP discussed here is a non-deterministic Markov decision process: the reward function and the transition dynamics are probabilistic, so taking action a in state s leads to the next state s' only with some probability. Second, an important concept in reinforcement learning is Q-learning, which essentially converts the state value V(s) into a value Q(s, a) associated with state-action pairs; the last chapter of Tom Mitchell's "Machine Learning" is highly recommended, as it introduces Q-learning and related material. Finally, regarding the Bellman equation: "Introduction to Algorithms" covers the Bellman-Ford dynamic programming algorithm, which can be used to find shortest paths in graphs with negative edge weights; the most interesting part is the proof of convergence, which is very instructive. Some scholars have analyzed in detail the relationship between reinforcement learning and dynamic programming.

This is the last article based on Ng's lecture notes. There is also a set of notes on learning theory, which I do not plan to write up for now, since my understanding of it is still shallow; after finishing graphical models and online learning I will come back to learning theory. In addition, Ng's handouts also cover mathematical foundations such as probability theory, linear algebra, convex optimization, Gaussian processes, and HMMs, all of which are worth reading.

"Reprinted" Enhancement Learning (reinforcement learning and Control)

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.