Monte Carlo and Temporal Difference (TD) Algorithms in Reinforcement Learning

Monte Carlo

Monte Carlo refers to a general family of algorithms whose core idea is to approximate a true quantity by random sampling; here we only cover its use in reinforcement learning.
The naive idea would be to run many episodes in a row, e.g. after the pair (s, a) has been visited twice, compute the corresponding return G_t for each visit and average them to obtain Q(s, a). In practice, however, we do not wait to collect that many samples before optimizing the policy or value function; instead the estimate is computed and updated iteratively after every sample (every episode).

Features: episodic updates

An entire episode must finish (reach the terminal state) before an update is made, at which point the values of all visited states are updated.
The estimate is therefore unbiased. An unbiased estimate is one whose expectation equals the true value; a biased estimate is one whose expectation differs from the true value, i.e. it keeps a systematic deviation no matter how many samples are drawn.
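The iterative update mentioned above can be written as a running mean over visits; in standard incremental form, averaging the returns one episode at a time is

$$N(S_t) \leftarrow N(S_t) + 1, \qquad V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\bigl(G_t - V(S_t)\bigr)$$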
First-visit:
Suppose the first episode passes through some state s, and a second episode also passes through s for the first time at some step. After the second episode ends, compute the return G for that first visit to s:

G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + ... + γ^{T−t−1}·R_T

(the reward of the current step, plus the discount factor times the next reward, plus the discount factor squared times the reward after that, and so on up to the reward of the last step of the episode).

Then update: the G from the first episode's visit to s and the G from the second episode's first visit to s are averaged to give the value of state s.

If the second episode passes through s again later on, that later visit is not used in the computation. In other words, for each episode the value of s is the average of the current episode's first-visit G and the first-visit G values of all previous episodes.
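A minimal sketch of first-visit Monte Carlo prediction for state values, assuming an episode is given as a list of (state, reward) pairs where the reward is the one received after leaving that state; the names episode, V, returns_count and gamma are illustrative, not from the original text:

```python
from collections import defaultdict

def first_visit_mc_update(episode, V, returns_count, gamma=0.9):
    """Update state values from one finished episode (first-visit MC).

    episode: list of (state, reward) pairs, where reward is the one
             received after leaving that state.
    """
    # Compute returns back to front: G_t = R_{t+1} + gamma * G_{t+1}
    G = 0.0
    step_returns = []
    for state, reward in reversed(episode):
        G = reward + gamma * G
        step_returns.append((state, G))
    step_returns.reverse()

    seen = set()
    for state, G in step_returns:
        if state in seen:            # first-visit: later visits are ignored
            continue
        seen.add(state)
        returns_count[state] += 1
        # Running mean over episodes: V(s) <- V(s) + (G - V(s)) / N(s)
        V[state] += (G - V[state]) / returns_count[state]

# Usage: call once per finished episode.
V = defaultdict(float)            # state -> estimated value
returns_count = defaultdict(int)  # state -> number of first visits counted
```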
Every-visit:
The difference from first-visit is that if an episode passes through state s more than once, the G value of every visit is included in the average. That is, the value of s is updated to the average of the G values of all visits to s in the current episode together with the G values from all previous episodes.
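The every-visit variant differs from the sketch above only in that the first-visit check is dropped, so every occurrence of a state in the episode contributes a return to the running mean; same illustrative assumptions as before:

```python
def every_visit_mc_update(episode, V, returns_count, gamma=0.9):
    """Like the first-visit sketch, but every occurrence of a state counts."""
    G = 0.0
    step_returns = []
    for state, reward in reversed(episode):
        G = reward + gamma * G
        step_returns.append((state, G))
    for state, G in reversed(step_returns):   # no first-visit filter here
        returns_count[state] += 1
        V[state] += (G - V[state]) / returns_count[state]
```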
Temporal difference (TD) features:
It can be used on non-episodic tasks that never terminate, and the learning material is past experience;
updates are made inside an episode, without waiting for the whole episode to end;
the learning target is built from previously estimated values,
so the estimate is biased.
The simplest temporal difference: TD(0)
The current episode is in state S_t, and we now want to update the value of S_t toward a learning target. Take one step according to the current episode's policy and land in state S_{t+1}.

Learning target = reward received at this step + discount factor × value of S_{t+1} (the value used here is not necessarily the result of the previous episode; it may already have been updated the last time S_{t+1} was visited).

Learning error = learning target − value of S_t.
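A minimal TD(0) update sketch; the step size alpha and the dict V of state values are assumptions for illustration, not from the original text:

```python
def td0_update(V, s, reward, s_next, done, alpha=0.1, gamma=0.9):
    """One TD(0) update after taking a single step from s to s_next."""
    target = reward + (0.0 if done else gamma * V[s_next])  # learning target
    error = target - V[s]                                   # TD error
    V[s] += alpha * error                                   # move V[s] toward the target
    return error
```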
n-step temporal difference: TD(n)
The current episode is in state S_t. Take n steps according to the current episode's policy, arriving at state S_{t+n} and collecting the n rewards along the way; then update the value of S_t toward a target built from those n rewards and the value of S_{t+n}.
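A sketch of the n-step target and update, under the same illustrative assumptions as above (rewards holds R_{t+1} through R_{t+n}):

```python
def n_step_td_update(V, s_t, rewards, s_t_plus_n, alpha=0.1, gamma=0.9):
    """Update V[s_t] toward the n-step target: the n discounted rewards
    plus the discounted value estimate of the state reached after n steps."""
    target = 0.0
    for k, r in enumerate(rewards):               # rewards = [R_{t+1}, ..., R_{t+n}]
        target += (gamma ** k) * r
    target += (gamma ** len(rewards)) * V[s_t_plus_n]
    V[s_t] += alpha * (target - V[s_t])
```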
Temporal difference TD(λ)
TD(λ) combines the TD(n) targets for every n, from one step up to the end of the episode, giving each n-step return its own weight: the smaller n is, the larger the weight, and the weights sum to 1.
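In standard notation this weighting is realized by the λ-return, where each n-step return is weighted by (1−λ)λ^(n−1):

$$G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{\,n-1} G_t^{(n)}, \qquad G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{\,n-1} R_{t+n} + \gamma^{\,n}\, V(S_{t+n})$$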
Forward view
The forward view is the idea described above: it requires going through the whole episode before updating, the same requirement as Monte Carlo.
Online forward view:
Because the off-line version requires going through the whole episode, which is impractical in many situations, the online forward view was proposed: truncate the number of steps instead of going to the end. Starting from time t, look ahead h steps and treat t+h as the end point when substituting into the forward-view formula.
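One common way to write this truncated λ-return (an assumption consistent with the description above, not a formula from the original text) is

$$G_{t:t+h}^{\lambda} = (1-\lambda)\sum_{n=1}^{h-1}\lambda^{\,n-1} G_{t:t+n} \;+\; \lambda^{\,h-1}\, G_{t:t+h}$$

where G_{t:t+n} is the n-step return truncated at time t+n.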
Backward view
The backward view approximates the off-line TD(λ). At the beginning of each episode the eligibility trace is initialized to 0; at each step the trace of the visited state is incremented by the value gradient (by 1 in the tabular case), and at every step all traces are decayed by the discount factor γ times λ.

In this way the eligibility trace acts as a correction to the value gradient, taking the place of the value gradient in the update formula.

True online TD(λ): the online forward view is theoretically exact, but it is very expensive to compute, so eligibility traces can be used instead; the resulting formulas are complicated and are not written out here.
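A minimal tabular sketch of the backward view with accumulating eligibility traces (not the true online version); the step size alpha, the env_step helper, and the accumulating-trace form are illustrative assumptions, not taken from the original text:

```python
from collections import defaultdict

def td_lambda_episode(env_step, start_state, V, alpha=0.1, gamma=0.9, lam=0.8):
    """Run one episode of backward-view TD(lambda) with tabular values.

    env_step(s) is assumed to return (reward, next_state, done).
    """
    E = defaultdict(float)              # eligibility traces, reset each episode
    s, done = start_state, False
    while not done:
        reward, s_next, done = env_step(s)
        target = reward + (0.0 if done else gamma * V[s_next])
        delta = target - V[s]           # TD error at this step
        E[s] += 1.0                     # bump the trace of the visited state
        for state in list(E):
            V[state] += alpha * delta * E[state]   # every traced state shares the error
            E[state] *= gamma * lam                # decay traces by gamma * lambda
        s = s_next
```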
