PS: This article is my reading notes on Zhou Zhihua's "Machine Learning".
Introduction-------Tasks and Rewards
Suppose we want to grow watermelons. The process involves many steps, and we may end up with a good melon, a poor one, or the plant may simply die. If we abstract this melon-growing process and summarize a sequence of good operations into a melon-growing strategy, then the process of learning that strategy is "reinforcement learning."
This is a simple illustration, where:
The machine exists in an environment whose state space is X; for example, the states can be healthy, water shortage, withered, and so on. A lowercase x denotes a single state in the state space X.
The actions the machine can take, for example watering or not watering, together constitute the action set A; a single action is denoted a.
When an action a is taken in a state x, the underlying transition function P moves the environment from the current state to another state with some probability. For example: in the water-shortage state, choosing to water gives some probability of transitioning to the healthy state.
When the environment moves to another state (which may be the same as the original state), it gives the machine a reward according to the underlying "reward" function R. For example: healthy is +1, water shortage is -1, withered is -100.
Together, a reinforcement learning task corresponds to a four-tuple E = <X, A, P, R>,
where P: X × A × X → R specifies the state transition probability, and R: X × A × X → R specifies the reward.
Thinking: what relationship does the × symbol between X and A express?
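The four-tuple can be sketched in code. The concrete states, transition probabilities, and reward values below are illustrative assumptions for the watermelon example, not figures from the book:

```python
import random

# A sketch of the four-tuple E = <X, A, P, R> for the watermelon
# example. States, actions, probabilities, and reward values are
# all illustrative assumptions.
X = ["healthy", "water_shortage", "withered"]   # state space X
A = ["water", "do_not_water"]                   # action set A

# P: for each (state, action), a list of (next_state, probability).
# Each list sums to 1, since some transition must occur.
P = {
    ("water_shortage", "water"):        [("healthy", 0.9), ("water_shortage", 0.1)],
    ("water_shortage", "do_not_water"): [("withered", 0.7), ("water_shortage", 0.3)],
    ("healthy", "water"):               [("healthy", 1.0)],
    ("healthy", "do_not_water"):        [("water_shortage", 0.8), ("healthy", 0.2)],
    ("withered", "water"):              [("withered", 1.0)],
    ("withered", "do_not_water"):       [("withered", 1.0)],
}

# R: X x A x X -> reward; here it depends only on the state reached.
def R(x, a, x_next):
    return {"healthy": 1, "water_shortage": -1, "withered": -100}[x_next]

def step(x, a):
    """Sample a next state from P and return (x_next, reward)."""
    next_states, probs = zip(*P[(x, a)])
    x_next = random.choices(next_states, weights=probs)[0]
    return x_next, R(x, a, x_next)
```

For instance, `step("water_shortage", "water")` moves to the healthy state with probability 0.9 and then returns reward +1.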
The state transitions in the environment and the rewards it returns are not under the machine's control. The machine can influence the environment only by choosing which action to perform, and can perceive the environment only by observing the resulting state and the returned reward.
As an example: examine each state closely, along with the transition probability p and the reward r obtained after taking action a in it.
What the machine must do is learn a "policy" π by repeatedly trying actions in the environment. With this policy, in state x it knows to perform the action a = π(x); for example, on seeing the water-shortage state, it knows to choose the watering action.
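This interaction loop can be sketched as follows, with a hand-written policy standing in for a learned one; all state names, transition probabilities, and rewards are hypothetical:

```python
import random

# A sketch of the machine's interaction loop: it controls only the
# choice of action, then perceives the resulting state and reward.
# States, transitions, and reward values are hypothetical.
P = {
    ("water_shortage", "water"):        [("healthy", 0.9), ("water_shortage", 0.1)],
    ("water_shortage", "do_not_water"): [("withered", 0.7), ("water_shortage", 0.3)],
    ("healthy", "water"):               [("healthy", 1.0)],
    ("healthy", "do_not_water"):        [("water_shortage", 0.8), ("healthy", 0.2)],
    ("withered", "water"):              [("withered", 1.0)],
    ("withered", "do_not_water"):       [("withered", 1.0)],
}
REWARD = {"healthy": 1, "water_shortage": -1, "withered": -100}

def pi(x):
    """A hand-written policy a = pi(x): water when short of water."""
    return "water" if x == "water_shortage" else "do_not_water"

x = "water_shortage"
total_reward = 0
for t in range(5):
    a = pi(x)                                   # the machine picks an action
    nxt, probs = zip(*P[(x, a)])
    x = random.choices(nxt, weights=probs)[0]   # the environment transitions
    total_reward += REWARD[x]                   # the environment returns a reward
```

The machine never sets `x` or the reward directly; it only observes them after choosing `a`.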
There are two ways to represent a policy. One represents the policy as a function π: X → A; deterministic policies are usually expressed this way.
The other is probabilistic: π: X × A → R maps a state-action pair to a probability; stochastic policies are usually expressed this way.
Thinking: what is a deterministic policy, and what is a stochastic policy?
So π(x, a) is the probability of selecting action a in state x. This means that in the water-shortage state, the probabilities of all candidate actions (watering, not watering) sum to 1; each individual π(x, a) is the probability of selecting that particular action.
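A minimal sketch of the two policy representations, using the hypothetical watermelon state and action names:

```python
import random

# Sketch of the two policy representations; state and action names
# are the hypothetical watermelon ones.

# (1) Deterministic policy: a function pi: X -> A.
def pi_det(x):
    return "water" if x == "water_shortage" else "do_not_water"

# (2) Stochastic policy: pi(x, a) is a probability, and for each
# state x the probabilities over all actions sum to 1.
pi_stoch = {
    "water_shortage": {"water": 0.9, "do_not_water": 0.1},
    "healthy":        {"water": 0.2, "do_not_water": 0.8},
}

def sample_action(x):
    """Draw an action a with probability pi_stoch[x][a]."""
    actions = list(pi_stoch[x])
    weights = [pi_stoch[x][a] for a in actions]
    return random.choices(actions, weights=weights)[0]
```

The deterministic form always returns the same action for a given state, while the stochastic form draws an action from the distribution π(x, ·).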
Thinking: P gives the state transition probabilities; why do the transition probabilities after choosing an action in state x also sum to 1? Coincidence, or related?
The goal of learning is to find a policy that maximizes the long-term cumulative reward. There are several ways to compute the long-term cumulative reward; the two commonly used are the "T-step cumulative reward" E[(1/T) Σ_{t=1}^{T} r_t] and the "γ-discounted cumulative reward" E[Σ_{t=0}^{+∞} γ^t r_{t+1}], where r_t denotes the reward obtained at step t, and E denotes the expectation over all random variables.
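The two criteria can be computed on a single sample trajectory of rewards (the expectation E is over many such trajectories); the reward sequence and the helper names `t_step` and `discounted` below are made up for illustration:

```python
# The two cumulative-reward criteria, computed on one sample
# trajectory of rewards r_1, r_2, ... (a made-up sequence; the
# definitions take an expectation E over such trajectories).
rewards = [-1, -1, 1, 1, 1]   # r_1 .. r_5, hypothetical

# T-step cumulative reward: (1/T) * sum_{t=1}^{T} r_t.
def t_step(rewards, T):
    return sum(rewards[:T]) / T

# gamma-discounted cumulative reward: sum_{t>=0} gamma^t * r_{t+1},
# truncated here to the finite sample.
def discounted(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

t_step(rewards, 5)        # -> 0.2
discounted(rewards, 0.9)  # about 0.2951
```

Note how the discount factor γ < 1 weights early rewards more heavily than later ones.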
Differences from supervised learning:
"State" corresponds to the "example" in supervised learning, which is to remove the sample of the marked feature.
"Action" corresponds to "Mark"
"Policy" corresponds to "classifier"
In this sense, reinforcement learning can be regarded as a supervised learning problem with "delayed marking information".
Introduction to Reinforcement Learning----