Monte Carlo and Temporal Difference (TD) Algorithms in Reinforcement Learning

Monte Carlo

Monte Carlo refers to a general family of algorithms whose core idea is to approximate a true quantity by random sampling; here we only cover its use in reinforcement learning.
The naive idea would be to run many episodes in a row, e.g. after the pair (s, a) has been visited twice, compute the corresponding return G_t for each visit and average them to obtain Q(s, a). In practice, however, we do not wait to collect that many samples before optimizing the policy or value function; instead the estimate is computed and updated iteratively after every sample (every episode).

Features: episodic updates

An entire episode must finish (reach the terminal state) before an update is made, at which point the values of all visited states are updated.
The estimate is therefore unbiased. An unbiased estimate is one whose expectation equals the true value; a biased estimate is one whose expectation differs from the true value, i.e. it keeps a systematic deviation no matter how many samples are drawn.
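The iterative update mentioned above can be written as a running mean over visits; in standard incremental form, averaging the returns one episode at a time is

$$N(S_t) \leftarrow N(S_t) + 1, \qquad V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\bigl(G_t - V(S_t)\bigr)$$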
First-visit:
Suppose the first episode passes through some state s, and a second episode also passes through s for the first time at some step. After the second episode ends, compute the return G for that first visit to s:

G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + ... + γ^{T−t−1}·R_T

(the reward of the current step, plus the discount factor times the next reward, plus the discount factor squared times the reward after that, and so on up to the reward of the last step of the episode).

Then update: the G from the first episode's visit to s and the G from the second episode's first visit to s are averaged to give the value of state s.

If the second episode passes through s again later on, that later visit is not used in the computation. In other words, for each episode the value of s is the average of the current episode's first-visit G and the first-visit G values of all previous episodes.
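A minimal sketch of first-visit Monte Carlo prediction for state values, assuming an episode is given as a list of (state, reward) pairs where the reward is the one received after leaving that state; the names episode, V, returns_count and gamma are illustrative, not from the original text:

```python
from collections import defaultdict

def first_visit_mc_update(episode, V, returns_count, gamma=0.9):
    """Update state values from one finished episode (first-visit MC).

    episode: list of (state, reward) pairs, where reward is the one
             received after leaving that state.
    """
    # Compute returns back to front: G_t = R_{t+1} + gamma * G_{t+1}
    G = 0.0
    step_returns = []
    for state, reward in reversed(episode):
        G = reward + gamma * G
        step_returns.append((state, G))
    step_returns.reverse()

    seen = set()
    for state, G in step_returns:
        if state in seen:            # first-visit: later visits are ignored
            continue
        seen.add(state)
        returns_count[state] += 1
        # Running mean over episodes: V(s) <- V(s) + (G - V(s)) / N(s)
        V[state] += (G - V[state]) / returns_count[state]

# Usage: call once per finished episode.
V = defaultdict(float)            # state -> estimated value
returns_count = defaultdict(int)  # state -> number of first visits counted
```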
Every-visit:
The difference from first-visit is that if an episode passes through state s more than once, the G value of every visit is included in the average. That is, the value of s is updated to the average of the G values of all visits to s in the current episode together with the G values from all previous episodes.
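The every-visit variant differs from the sketch above only in that the first-visit check is dropped, so every occurrence of a state in the episode contributes a return to the running mean; same illustrative assumptions as before:

```python
def every_visit_mc_update(episode, V, returns_count, gamma=0.9):
    """Like the first-visit sketch, but every occurrence of a state counts."""
    G = 0.0
    step_returns = []
    for state, reward in reversed(episode):
        G = reward + gamma * G
        step_returns.append((state, G))
    for state, G in reversed(step_returns):   # no first-visit filter here
        returns_count[state] += 1
        V[state] += (G - V[state]) / returns_count[state]
```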
Temporal difference (TD) features:
It can be used on non-episodic tasks that never terminate, and the learning material is past experience;
updates are made inside an episode, without waiting for the whole episode to end;
the learning target is built from previously estimated values,
so the estimate is biased.
The simplest temporal difference: TD(0)
The current episode is in state S_t, and we now want to update the value of S_t toward a learning target. Take one step according to the current episode's policy and land in state S_{t+1}.

Learning target = reward received at this step + discount factor × value of S_{t+1} (the value used here is not necessarily the result of the previous episode; it may already have been updated the last time S_{t+1} was visited).

Learning error = learning target − value of S_t.
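A minimal TD(0) update sketch; the step size alpha and the dict V of state values are assumptions for illustration, not from the original text:

```python
def td0_update(V, s, reward, s_next, done, alpha=0.1, gamma=0.9):
    """One TD(0) update after taking a single step from s to s_next."""
    target = reward + (0.0 if done else gamma * V[s_next])  # learning target
    error = target - V[s]                                   # TD error
    V[s] += alpha * error                                   # move V[s] toward the target
    return error
```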
n-step temporal difference: TD(n)
The current episode is in state S_t. Take n steps according to the current episode's policy, arriving at state S_{t+n} and collecting the n rewards along the way; then update the value of S_t toward a target built from those n rewards and the value of S_{t+n}.
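A sketch of the n-step target and update, under the same illustrative assumptions as above (rewards holds R_{t+1} through R_{t+n}):

```python
def n_step_td_update(V, s_t, rewards, s_t_plus_n, alpha=0.1, gamma=0.9):
    """Update V[s_t] toward the n-step target: the n discounted rewards
    plus the discounted value estimate of the state reached after n steps."""
    target = 0.0
    for k, r in enumerate(rewards):               # rewards = [R_{t+1}, ..., R_{t+n}]
        target += (gamma ** k) * r
    target += (gamma ** len(rewards)) * V[s_t_plus_n]
    V[s_t] += alpha * (target - V[s_t])
```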
Temporal difference TD(λ)
TD(λ) combines the TD(n) targets for every n, from one step up to the end of the episode, giving each n-step return its own weight: the smaller n is, the larger the weight, and the weights sum to 1.
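In standard notation this weighting is realized by the λ-return, where each n-step return is weighted by (1−λ)λ^(n−1):

$$G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{\,n-1} G_t^{(n)}, \qquad G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{\,n-1} R_{t+n} + \gamma^{\,n}\, V(S_{t+n})$$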
Forward view
The forward view is the idea described above: it requires going through the whole episode before updating, the same requirement as Monte Carlo.
Online forward view:
Because the off-line version requires going through the whole episode, which is impractical in many situations, the online forward view was proposed: truncate the number of steps instead of going to the end. Starting from time t, look ahead h steps and treat t+h as the end point when substituting into the forward-view formula.
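One common way to write this truncated λ-return (an assumption consistent with the description above, not a formula from the original text) is

$$G_{t:t+h}^{\lambda} = (1-\lambda)\sum_{n=1}^{h-1}\lambda^{\,n-1} G_{t:t+n} \;+\; \lambda^{\,h-1}\, G_{t:t+h}$$

where G_{t:t+n} is the n-step return truncated at time t+n.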
Backward view
The backward view approximates the off-line TD(λ). At the beginning of each episode the eligibility trace is initialized to 0; at each step the trace of the visited state is incremented by the value gradient (by 1 in the tabular case), and at every step all traces are decayed by the discount factor γ times λ.

In this way the eligibility trace acts as a correction to the value gradient, taking the place of the value gradient in the update formula.

True online TD(λ): the online forward view is theoretically exact, but it is very expensive to compute, so eligibility traces can be used instead; the resulting formulas are complicated and are not written out here.
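A minimal tabular sketch of the backward view with accumulating eligibility traces (not the true online version); the step size alpha, the env_step helper, and the accumulating-trace form are illustrative assumptions, not taken from the original text:

```python
from collections import defaultdict

def td_lambda_episode(env_step, start_state, V, alpha=0.1, gamma=0.9, lam=0.8):
    """Run one episode of backward-view TD(lambda) with tabular values.

    env_step(s) is assumed to return (reward, next_state, done).
    """
    E = defaultdict(float)              # eligibility traces, reset each episode
    s, done = start_state, False
    while not done:
        reward, s_next, done = env_step(s)
        target = reward + (0.0 if done else gamma * V[s_next])
        delta = target - V[s]           # TD error at this step
        E[s] += 1.0                     # bump the trace of the visited state
        for state in list(E):
            V[state] += alpha * delta * E[state]   # every traced state shares the error
            E[state] *= gamma * lam                # decay traces by gamma * lambda
        s = s_next
```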
