Introduction to Reinforcement Learning----

Source: Internet
Author: User

PS: These are reading notes on Zhou Zhihua's "Machine Learning".

Introduction-------Tasks and rewards

Suppose we want to grow watermelons. Growing a melon takes many steps, and the outcome is uncertain: we may get a good melon, a very poor one, or the plant may simply die. If we abstract the melon-growing process and summarize the sequence of operations that works well into a melon-growing policy, that process is "reinforcement learning."

This is a simple illustration, where:

The machine is situated in an environment E whose state space is X. For example, the states could be healthy, short of water, dead, and so on. A lowercase x denotes a single state in the state space X.

An action the machine can take is denoted a, for example watering or not watering. All available actions together form the action space A.

When an action a is applied in state x, the underlying transition function P moves the environment from the current state to another state with some probability. For example, if the melon is short of water and we choose to water it, there is some probability that it moves to the healthy state.

On moving to the next state (which may be the same as the original state), the environment gives the machine a reward according to the underlying "reward" function R. For example, healthy is +1, short of water is -1, and dead is -100.

Together, a reinforcement learning task corresponds to a four-tuple E = <X, A, P, R>,

where P: X × A × X → ℝ specifies the state transition probabilities and R: X × A × X → ℝ specifies the rewards.
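As a rough sketch, the four-tuple can be written down as plain Python data for the watermelon example. The reward values come from the text above; the state and action names and all transition probabilities are assumptions made up for illustration, and the reward is assumed to depend only on the state that is reached.

```python
# Hypothetical watermelon task. Names and probabilities are illustrative
# assumptions, not values from the source.
X = ["healthy", "thirsty", "dead"]   # state space
A = ["water", "no_water"]            # action space

# Rewards from the text: healthy +1, short of water -1, dead -100.
STATE_REWARD = {"healthy": 1, "thirsty": -1, "dead": -100}

# P[(x, a)] maps each possible next state x' to its probability (assumed).
P = {
    ("healthy", "water"):    {"healthy": 0.8, "thirsty": 0.2},
    ("healthy", "no_water"): {"healthy": 0.4, "thirsty": 0.6},
    ("thirsty", "water"):    {"healthy": 0.6, "thirsty": 0.4},
    ("thirsty", "no_water"): {"thirsty": 0.5, "dead": 0.5},
    ("dead", "water"):       {"dead": 1.0},
    ("dead", "no_water"):    {"dead": 1.0},
}


def R(x, a, x_next):
    """Reward for the transition (x, a, x'); here it depends only on x'."""
    return STATE_REWARD[x_next]


def check_transitions(P, X, A):
    """Return True if every (x, a) pair has a valid distribution over X."""
    for x in X:
        for a in A:
            dist = P[(x, a)]
            if any(s not in X for s in dist):
                return False
            if abs(sum(dist.values()) - 1.0) > 1e-9:
                return False
    return True


E = (X, A, P, R)   # the four-tuple <X, A, P, R>
print(check_transitions(P, X, A))
```

For each fixed pair (x, a), the entries of P form a probability distribution over next states, which is what the check verifies.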

Thinking: what does the cross symbol (×) between X and A mean?

State transitions and rewards in the environment are not controlled by the machine. The machine can only influence the environment by choosing which action to perform, and can only perceive the environment by observing the resulting state and the returned reward.
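This division of roles can be sketched in a few lines of Python. The `step` function below stands in for the environment: the machine supplies only the action, and gets back only the next state and the reward. The function name, the state names, and the transition probabilities are hypothetical choices for illustration.

```python
import random

# Assumed transition probabilities for one state, for illustration only.
P = {
    ("thirsty", "water"):    {"healthy": 0.6, "thirsty": 0.4},
    ("thirsty", "no_water"): {"thirsty": 0.5, "dead": 0.5},
}
# Rewards from the text: healthy +1, short of water -1, dead -100.
REWARD = {"healthy": 1, "thirsty": -1, "dead": -100}


def step(x, a, rng):
    """Environment side: sample x' from P(. | x, a), return (x', reward)."""
    dist = P[(x, a)]
    next_states = list(dist.keys())
    weights = list(dist.values())
    x_next = rng.choices(next_states, weights=weights, k=1)[0]
    return x_next, REWARD[x_next]


rng = random.Random(0)          # seeded only to make runs repeatable
x = "thirsty"                   # the state the machine observes
a = "water"                     # the only thing the machine controls
x_next, r = step(x, a, rng)     # the environment decides the outcome
print(x_next, r)
```

The machine never reads `P` or `REWARD` directly; it sees them only through the sampled outcomes.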

For example: look closely at each state, and at the transition probability p and the reward r obtained after taking action a in that state.

What the machine has to do is learn a "policy" π by repeatedly trying things out in the environment. Given this policy, the machine knows which action a = π(x) to perform in state x. For example, on seeing the water-shortage state, it knows to choose the watering action.

There are two ways to represent a policy. One is as a function π: X → A, which is commonly used for deterministic policies.

The other is as a probability, π: X × A → ℝ, which is commonly used for stochastic policies.

Thinking: what is a deterministic policy, and what is a stochastic policy?

So π(x, a) is the probability of selecting action a in state x, and the probabilities over all actions in a given state must sum to 1. For example, in the water-shortage state, the probability of watering plus the probability of not watering is 1.
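The two representations can be put side by side in a short Python sketch. The state names, action names, and probabilities are assumptions chosen for illustration; only the two forms of π come from the text.

```python
import random


def deterministic_policy(x):
    """pi: X -> A. Always returns the same action for a given state."""
    table = {"healthy": "no_water", "thirsty": "water", "dead": "no_water"}
    return table[x]


# pi: X x A -> R. stochastic_policy[x][a] is the probability of a in x.
stochastic_policy = {
    "healthy": {"water": 0.3, "no_water": 0.7},
    "thirsty": {"water": 0.9, "no_water": 0.1},
    "dead":    {"water": 0.5, "no_water": 0.5},
}


def sample_action(pi, x, rng):
    """Draw an action in state x according to the probabilities pi[x]."""
    actions = list(pi[x].keys())
    probs = list(pi[x].values())
    return rng.choices(actions, weights=probs, k=1)[0]


def is_normalized(pi, tol=1e-9):
    """Check that the action probabilities sum to 1 in every state."""
    return all(abs(sum(dist.values()) - 1.0) <= tol for dist in pi.values())


rng = random.Random(0)
print(deterministic_policy("thirsty"))
print(sample_action(stochastic_policy, "thirsty", rng))
print(is_normalized(stochastic_policy))
```

A deterministic policy is the special case of a stochastic one in which a single action has probability 1 in each state.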

Thinking: P is the state transition probability. Why do the probabilities of selecting each action in state x also sum to 1? Is that a coincidence, or are the two related?

The goal of learning is to find a policy that maximizes the long-term cumulative reward. There are several ways to compute it; the two most common are the "T-step cumulative reward", E[(1/T) Σ_{t=1..T} r_t], and the "γ-discounted cumulative reward", E[Σ_{t=0..∞} γ^t r_{t+1}]. Here r_t is the reward obtained at step t, and E denotes the expectation over all random variables.
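A small Python sketch may help make the two measures concrete. It uses their usual definitions, the average (1/T) Σ r_t over T steps and the sum Σ γ^t r_{t+1} with 0 ≤ γ < 1, and evaluates them on one made-up reward sequence. The function names and the sequence are hypothetical. A single sequence gives only one sample; the expectation E would be estimated by averaging over many such sequences.

```python
def t_step_return(rewards):
    """Average reward over T steps: (1/T) * sum of r_1 .. r_T."""
    T = len(rewards)
    return sum(rewards) / T


def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_{t+1} for t = 0, 1, ..., truncated at len(rewards)."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# One made-up trajectory of rewards r_1, r_2, r_3, r_4.
rewards = [1, -1, 1, 1]

print(t_step_return(rewards))            # (1 - 1 + 1 + 1) / 4 = 0.5
print(discounted_return(rewards, 0.5))   # 1 - 0.5 + 0.25 + 0.125 = 0.875
```

With a smaller γ, later rewards count for less, so the discounted measure favors policies that collect reward early.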

Differences from supervised learning:

"State" corresponds to an "example" in supervised learning, that is, the features of a sample with its label removed.
"Action" corresponds to "label".
"Policy" corresponds to "classifier".

In this sense, reinforcement learning can be regarded as a supervised learning problem with "delayed label information".
