The Difference Between On-policy and Off-policy in Reinforcement Learning

On-policy: The policy (value function) that generates the samples is the same as the policy (value function) whose parameters are being updated. SARSA is the typical example: an action is selected directly according to the current policy, and the resulting sample is then used to update that same policy, so the sampling policy and the learning policy coincide and the algorithm is on-policy. This approach runs into the exploration-exploitation dilemma: always taking the currently best-known action can prevent the agent from learning the true optimum and lead to convergence to a local optimum, while adding exploration slows learning down. The epsilon-greedy strategy is a trade-off between these two demands: with a small probability epsilon a random action is taken, otherwise the greedy action is chosen. The advantage of on-policy methods is that they are direct and fast; the disadvantage is that they do not necessarily find the optimal policy.
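As a concrete illustration, here is a minimal tabular SARSA sketch. The env.reset()/env.step() interface, the Q table, and the helper names are illustrative assumptions rather than any specific library's API. The key on-policy detail is that the update target uses the action the epsilon-greedy policy will actually take next.

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, eps):
    # With probability eps explore a random action, otherwise act greedily.
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def sarsa_episode(env, Q, alpha=0.1, gamma=0.99, eps=0.1):
    # Assumed interface: env.reset() -> state, env.step(a) -> (state, reward, done).
    n_actions = Q.shape[1]
    s = env.reset()
    a = epsilon_greedy(Q, s, n_actions, eps)  # sampling policy == learning policy
    done = False
    while not done:
        s_next, r, done = env.step(a)
        # The next action is chosen by the same epsilon-greedy policy that
        # will actually be executed, which is what makes SARSA on-policy.
        a_next = epsilon_greedy(Q, s_next, n_actions, eps)
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] * (not done) - Q[s, a])
        s, a = s_next, a_next
    return Q
```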

Off-policy: The policy (value function) that generates the samples differs from the policy (value function) whose parameters are being updated. Q-learning is the typical example: the expected return of the next state is computed with a max operation, i.e., by directly selecting the optimal action, while the policy that actually generated the sample does not necessarily choose that optimal action. The sampling policy and the learning policy therefore differ, which makes the algorithm off-policy. First, a large amount of data is generated under a behavior policy with some probability distribution, whose purpose is exploration; the target policy is then learned from this data, which deviates ("off") from the policy being optimized. A mathematical coverage condition must hold: if π is the target policy and μ is the behavior policy, then learning π from μ requires that π(a|s) > 0 implies μ(a|s) > 0. The relationship between the two settings is that on-policy learning is the special case of off-policy learning in which the target policy and the behavior policy are one and the same. The disadvantage of off-policy learning is that it is more roundabout and converges more slowly; the advantage is that it is more powerful and general, because the behavior policy can cover all actions and thus guarantees comprehensive data.
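For comparison, here is a minimal tabular Q-learning sketch under the same assumed env.reset()/env.step() interface. The sample is generated by an epsilon-greedy behavior policy, but the update target takes the max over the next state's action values, i.e., the greedy target policy, which is what makes the method off-policy.

```python
import numpy as np

def q_learning_episode(env, Q, alpha=0.1, gamma=0.99, eps=0.1):
    # Assumed interface: env.reset() -> state, env.step(a) -> (state, reward, done).
    n_actions = Q.shape[1]
    s = env.reset()
    done = False
    while not done:
        # Behavior policy: epsilon-greedy, used only to generate samples.
        if np.random.rand() < eps:
            a = np.random.randint(n_actions)
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)
        # Target policy: greedy max over next-state action values, regardless
        # of which action the behavior policy will actually take next.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a])
        s = s_next
    return Q
```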
