Reinforcement Learning and Control

Reinforcement Learning

In the previous discussions, we were always given samples x, with or without labels y, and then we fit, classified, clustered, or reduced the dimensionality of those samples. For many sequential decision-making or control problems, however, it is hard to obtain samples of that form. Consider the control problem for a four-legged robot: at the outset, we simply do not know how to make the robot find a proper way to walk forward.

In addition, when designing a chess AI, every move is a decision-making step. Although heuristic methods work for simple positions, when the position becomes complex the machine still needs to look several moves ahead to determine which move is better. A better decision-making method is therefore required.

One solution to this kind of control and decision problem is to design a reward function. If the learning agent (the four-legged robot or the chess program above) obtains a good outcome after taking an action, we give it some reward (for example, a positive value of the reward function); if it obtains a poor outcome, the reward is negative. For instance, if the four-legged robot takes a step forward (closer to the target), the reward is positive; if it moves away, the reward is negative. If we can evaluate every step and obtain the corresponding rewards, then the path with the highest total reward (the maximum sum of the rewards over all steps) is considered the best path.

Reinforcement learning has been successfully applied in many fields, such as autonomous helicopter flight, robot control, mobile-network routing, market decision-making, industrial control, and efficient web indexing.

Next, we first introduce the Markov decision process (MDP).

1. Markov Decision Process

A Markov decision process consists of a five-tuple $(S, A, \{P_{sa}\}, \gamma, R)$, where:

* $S$ is the set of states. (For example, in an autonomous helicopter system, the possible position coordinates of the helicopter form the state set.)

* $A$ is the set of actions. (For example, the directions in which the control stick can push the helicopter, such as forward or backward.)

* $P_{sa}$ are the state transition probabilities. For each state $s \in S$ and action $a \in A$, $P_{sa}$ is a distribution over the state space: it gives the probability of landing in each possible next state after taking action $a$ in state $s$ (after executing $a$, the current state may jump to many different states).

* $\gamma \in [0, 1)$ is the discount factor (sometimes called the damping factor).

* $R$ is the reward function, $R: S \times A \mapsto \mathbb{R}$. It is often written as a function of the state only (i.e., it depends only on $S$), in which case we write $R: S \mapsto \mathbb{R}$.

The dynamics of an MDP proceed as follows: the agent starts in some initial state $s_0$, chooses an action $a_0 \in A$ to execute, and after executing it is randomly transferred to the next state $s_1 \sim P_{s_0 a_0}$. It then executes another action $a_1$, transitions to $s_2 \sim P_{s_1 a_1}$, and so on. The whole process can be pictured as:

$$s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \xrightarrow{a_3} \cdots$$

If you are familiar with HMMs, this transition process is easy to picture by comparison.
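To make the dynamics concrete, here is a minimal Python/NumPy sketch of rolling out such a chain, assuming a toy MDP whose transition probabilities are stored as an array P[s, a, s'] and whose reward depends only on the state; all names and numbers below are illustrative, not taken from the notes.

import numpy as np

# Toy MDP (illustrative values only): P[s, a, s'] = probability of moving
# from state s to state s' when action a is executed.
num_states, num_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.random((num_states, num_actions, num_states))
P /= P.sum(axis=2, keepdims=True)      # normalize each P_sa into a distribution
gamma = 0.9

def rollout(policy, s0=0, horizon=10):
    """Sample s0 -a0-> s1 -a1-> s2 ... following a deterministic policy array."""
    states, actions = [s0], []
    s = s0
    for _ in range(horizon):
        a = policy[s]
        s = int(rng.choice(num_states, p=P[s, a]))   # next state drawn from P_sa
        actions.append(a)
        states.append(s)
    return states, actions

print(rollout(policy=np.zeros(num_states, dtype=int)))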

Once such a transition path is given, the total payoff (the discounted sum of rewards) is:

$$R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \cdots$$

If $R$ depends only on the state, this can be written as

$$R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots$$

Our goal is to choose an optimal sequence of actions so as to maximize the expected value of this discounted sum of rewards:

$$\max \; \mathrm{E}\left[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots\right]$$

From the formula above we can see that the reward at time $t$ is discounted by $\gamma^t$, a gradually decaying factor: the further in the future a reward arrives, the less it contributes. To maximize the expected value, we should therefore collect the large rewards as early as possible and leave the small ones for later.

When we are in some state $s$, we use some rule to choose the next action $a$ to execute, and then transition to another state $s'$. This action-selection rule is called a policy. Each policy is actually a mapping function $\pi: S \mapsto A$ from states to actions. Once $\pi$ is given, we know which action should be performed next in every state.

To distinguish good policies from bad ones, and to quantify how well things turn out when we execute a policy starting from the current state, we define a value function, also called the discounted cumulative reward:

$$V^{\pi}(s) = \mathrm{E}\left[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \mid s_0 = s, \pi\right]$$

As you can see, in the current state $s$, once the policy $\pi$ is chosen, the value function is the expected discounted sum of rewards. This is easy to understand: given a plan of future actions, the plan will pass through different states, each state yields some reward, and states closer to the current one carry larger weight in the plan's value. This is similar to playing chess: for the current board, we evaluate each candidate move based on the future positions it may lead to. We usually look several moves ahead in our heads, but we pay the most attention to the very next position.

Viewed recursively, the value $V^{\pi}$ of the current state $s$ can be seen as the sum of the reward $R(s)$ of the current state and the (discounted, expected) value $V^{\pi}$ of the next state; that is, the formula above becomes:

$$V^{\pi}(s) = R(s) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s') \, V^{\pi}(s')$$

Note, however, that although $\pi$ is given, so the action $a = \pi(s)$ taken in state $s$ is unique, the transition out of $s$ is not necessarily one-to-one. For example, if choosing $a$ means rolling a die to decide how far to move forward, there may be six possible next states. The equation above is the Bellman equation for $V^{\pi}$.

Here $s'$ denotes the next state.

The term $R(s)$ above is called the immediate reward, i.e., the reward of the current state. The second term can also be written as $\gamma \, \mathrm{E}_{s' \sim P_{s\pi(s)}}\left[V^{\pi}(s')\right]$, the expected value of the next state's value function, where the next state $s'$ is distributed according to $P_{s\pi(s)}$.

As you can imagine, when the number of states is finite, we can use the equation above to solve for $V^{\pi}(s)$ for every $s$ (a terminal state has no second term $V^{\pi}(s')$). Writing out the equation for every state gives a system of $|S|$ linear equations in $|S|$ unknowns, which can be solved directly.
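As a sketch of that direct solve (assuming the same P[s, a, s'] array layout as in the earlier example and a reward that depends only on the state), the $|S| \times |S|$ linear system can simply be handed to a standard solver:

import numpy as np

def evaluate_policy(P, R, gamma, policy):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = R for V = V^pi."""
    n = P.shape[0]
    P_pi = P[np.arange(n), policy]     # P_pi[s, s'] = P_{s, pi(s)}(s')
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)

# Tiny 2-state, 2-action example with made-up numbers.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([1.0, 0.0])
print(evaluate_policy(P, R, gamma=0.9, policy=np.array([0, 1])))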

Of course, the point of computing $V$ is to find the optimal action policy for the current state $s$. Define the optimal value function $V^*$ as:

$$V^*(s) = \max_{\pi} V^{\pi}(s)$$

That is, among all candidate policies, we select the one that achieves the maximum discounted reward.

The corresponding Bellman equation for $V^*$ is:

$$V^*(s) = R(s) + \max_{a \in A} \; \gamma \sum_{s' \in S} P_{sa}(s') \, V^*(s')$$

The first term does not depend on the action, so it stays unchanged. The second term picks, for each state $s$, the next action $a$ that maximizes the expected (probability-weighted) value of $V^*(s')$ under the transition distribution $P_{sa}$.

If the formula above is hard to parse, note that the second term can equivalently be written as an expectation: $\max_{a \in A} \gamma \, \mathrm{E}_{s' \sim P_{sa}}\left[V^*(s')\right]$.

With $V^*$ defined, the optimal policy $\pi^*$ is defined as follows:

$$\pi^*(s) = \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s') \, V^*(s')$$

That is, once the optimal values are known, the optimal next action $a$ for every state $s$ is determined.

From the formulas above, we know that

$$V^*(s) = V^{\pi^*}(s) \geq V^{\pi}(s) \quad \text{for every policy } \pi \text{ and every state } s.$$

In other words, the optimal value function $V^*$ of the current state is the one obtained by following the optimal policy, and the payoff of the optimal policy is at least as large as that of any other policy.

Note that if we can find the optimal action $a$ for every state $s$, the resulting mapping $\pi^*: S \mapsto A$ is globally optimal. For every state $s$, the optimal next action is fixed; it does not change depending on which initial state we start from.

2. Value Iteration and Policy Iteration

In the previous section, we gave the iteration formulas and the optimization objective. This section discusses two efficient algorithms for solving finite-state MDPs. Here we only consider MDPs in which both the state space $S$ and the action space $A$ are finite.

* Value Iteration Method

1. For every state $s$, initialize $V(s) := 0$.

2. loop until convergence {

For every state $s$, update

$$V(s) := R(s) + \max_{a \in A} \; \gamma \sum_{s' \in S} P_{sa}(s') \, V(s')$$

}

The value iteration update simply applies the Bellman equation for $V^*$ from the previous section as an update rule.

There are two strategies for implementing the inner loop:

1. Synchronous iteration

Take the first iteration after initialization as an example: initially $V(s) = 0$ for all states, so we compute the new value $V(s) = R(s) + 0 = R(s)$ for every $s$. While sweeping over the states, we store each new $V(s)$ but do not apply it immediately; only after the new values have been computed for all states do we update them all at once.

Thus after the first iteration, $V(s) = R(s)$ for every state.

2. Asynchronous iteration

In contrast to synchronous iteration, asynchronous iteration updates $V(s)$ in place as soon as the new value for a state is computed, without storing it separately. With this scheme, after the first iteration most states already satisfy $V(s) > R(s)$.

Either way, $V(s)$ eventually converges to $V^*(s)$. Once $V^*$ is known, we can use the definition of $\pi^*$ above to recover the optimal policy; of course, $\pi^*$ can also be extracted along the way while computing $V^*$.
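Below is a compact sketch of the synchronous variant, again assuming the P[s, a, s'] layout and state-only rewards used in the earlier examples; it returns both the converged values and the greedy policy read off from them.

import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Synchronous value iteration: V(s) := R(s) + gamma * max_a sum_s' P_sa(s') V(s')."""
    n = P.shape[0]
    V = np.zeros(n)                          # initialize V(s) = 0 for every state
    while True:
        Q = R[:, None] + gamma * (P @ V)     # Q[s, a], computed from the old V for all states
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:  # stop once a full sweep barely changes V
            return V_new, Q.argmax(axis=1)   # estimate of V* and a greedy policy
        V = V_new

It can be called directly on the P and R arrays defined in the earlier sketches.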

* Policy Iteration Method

The value iteration method drives the value $V$ to converge to $V^*$, whereas policy iteration focuses on making the policy $\pi$ converge to $\pi^*$.

1. Initialize $\pi$ by mapping every state $s$ to a random action $a \in A$.

2. loop until convergence {

(A) Let $V := V^{\pi}$.

(B) For every state $s$, update $\pi(s) := \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s') \, V(s')$.

}

The value $V$ in step (A) can be obtained by solving the Bellman equation for the current policy, as described in the previous section; this step yields $V^{\pi}(s)$ for all states under the current $\pi$.

Step (B) then selects, for each state $s$, the best action $a$ according to the $V$ obtained in step (A), and updates $\pi(s)$ accordingly.
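For comparison, here is a sketch of the whole policy iteration loop under the same assumptions (P[s, a, s'] layout, state-only reward), where step (A) is the exact linear solve and step (B) is the greedy update.

import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate exact evaluation of V^pi (step A) with greedy improvement (step B)."""
    n = P.shape[0]
    policy = np.zeros(n, dtype=int)          # start from an arbitrary policy
    while True:
        P_pi = P[np.arange(n), policy]       # (A) solve the Bellman linear system for V^pi
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, R)
        new_policy = (P @ V).argmax(axis=1)  # (B) pi(s) := argmax_a sum_s' P_sa(s') V(s')
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy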

It is hard to say in general which of value iteration and policy iteration is better. For a relatively small MDP, policy iteration usually converges faster. For MDPs with a large state space, however, value iteration is easier to run (it does not require solving a system of linear equations at every step).

3. Parameter Estimation in MDP

In the MDPs discussed so far, we assumed the state transition probabilities $P_{sa}$ and the reward function $R(s)$ were known. In many practical problems, however, these quantities are not given explicitly, and we must estimate them from data (usually $S$, $A$, and $\gamma$ are known).

Suppose we have observed many state-transition paths of the following form:

$$s_0^{(j)} \xrightarrow{a_0^{(j)}} s_1^{(j)} \xrightarrow{a_1^{(j)}} s_2^{(j)} \xrightarrow{a_2^{(j)}} s_3^{(j)} \xrightarrow{a_3^{(j)}} \cdots$$

Here $s_i^{(j)}$ is the state at time step $i$ on the $j$-th transition path, and $a_i^{(j)}$ is the action executed in that state. The number of states in each path is finite; in practice, each transition chain either reaches a terminal state or is cut off after a prescribed number of steps.

Given many such transition chains (which play the role of training samples), we can use maximum likelihood estimation for the state transition probabilities:

$$P_{sa}(s') = \frac{\#\{\text{times we took action } a \text{ in state } s \text{ and reached } s'\}}{\#\{\text{times we took action } a \text{ in state } s\}}$$

The numerator is the number of times we reached $s'$ after executing action $a$ in state $s$, and the denominator is the number of times we executed $a$ in state $s$. Their ratio is the estimated probability of transitioning to $s'$ after executing $a$ in state $s$.

To avoid the case where the denominator is 0, we apply a smoothing rule: if the denominator is 0, we take the transition distribution to be uniform, $P_{sa}(s') = 1/|S|$. This rule is used whenever no sample ever executed $a$ in state $s$.

The estimate above is computed from historical data, but the same formula also supports online updates: whenever we obtain new transition paths, we simply add the new counts to the numerator and denominator of the formula. After the correction, the transition probabilities change; acting according to the updated probabilities yields yet more transition paths, and the estimates become increasingly accurate.

Similarly, if the reward function is unknown, we take $R(s)$ to be the average of the rewards observed in state $s$.
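Here is a sketch of both estimates, assuming trajectories are available as lists of (s, a, r, s_next) tuples with integer-coded states and actions (a representation I am assuming here, not one fixed by the notes).

import numpy as np

def estimate_model(trajectories, num_states, num_actions):
    """Estimate P_sa(s') by relative frequency and R(s) by the average observed reward."""
    counts = np.zeros((num_states, num_actions, num_states))
    r_sum = np.zeros(num_states)
    r_cnt = np.zeros(num_states)
    for traj in trajectories:
        for s, a, r, s_next in traj:
            counts[s, a, s_next] += 1
            r_sum[s] += r
            r_cnt[s] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Smoothing as described above: a never-tried (s, a) pair gets a uniform distribution.
    P_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / num_states)
    R_hat = np.where(r_cnt > 0, r_sum / np.maximum(r_cnt, 1), 0.0)
    return P_hat, R_hat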

Once the transition probabilities and the reward function have been estimated, we can use value iteration or policy iteration to solve the MDP. For example, the procedure combining parameter estimation with value iteration (when the state transition probabilities are unknown) is as follows:

1. Randomly initialize the policy $\pi$.

2. loop until convergence {

(A) Execute $\pi$ in the MDP for some number of trials, and count the state transitions observed in the resulting samples; use these counts to update the estimates of $P_{sa}$ and $R$.

(B) Use the estimated parameters to update $V$ (with the value iteration method from the previous section).

(C) Re-derive the greedy policy $\pi$ from the updated $V$.

}

Step (B) requires updating the value function, which is itself an iterative loop. In the previous section we initialized $V$ to 0 and then iterated to convergence. Nested inside the outer loop above, re-initializing $V$ to 0 every time and iterating from scratch would be very slow. One way to speed this up is to initialize $V$ each time with the $V$ obtained in the previous outer iteration; that is, the inner iteration is warm-started from the previous result.
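A minimal sketch of that warm-started inner solve, assuming the same P[s, a, s'] array layout as before; the outer loop would pass in the V returned by its previous iteration instead of a zero vector.

import numpy as np

def value_iteration_warm(P_hat, R_hat, gamma, V_init, tol=1e-6):
    """Inner value-iteration solve seeded with the V from the previous outer iteration."""
    V = np.asarray(V_init, dtype=float).copy()   # warm start instead of V = 0
    while True:
        V_new = R_hat + gamma * (P_hat @ V).max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new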

4. Summary

The MDP described here is the non-deterministic Markov decision process: the reward function and the state transition function can be probabilistic, i.e., in state $s$, the next state $s'$ reached after taking action $a$ is random. Another important concept in reinforcement learning is Q-learning, which essentially replaces the value $V(s)$ associated with a state $s$ by a value $Q(s, a)$ associated with a state-action pair. The last chapter of Tom Mitchell's Machine Learning, which introduces Q-learning and more, is strongly recommended. Finally, regarding the Bellman equation: Introduction to Algorithms presents the Bellman-Ford dynamic programming algorithm, which solves shortest paths in graphs with negative edge weights and whose convergence proof is worth studying. Some researchers have carefully analyzed the relationship between reinforcement learning and dynamic programming.

This is the last article covering Ng's lecture notes; the notes on learning theory remain, but I do not plan to write them up for now, since I feel I do not yet understand learning theory deeply enough. After studying graphical models and online learning, I will come back to write up learning theory. In addition, Ng's handouts include some background material, such as probability theory, linear algebra, convex optimization, Gaussian processes, and HMMs.
