Exploration and exploitation
In a reinforcement learning task, the final reward is usually observed only after a multi-step sequence of actions, so let's start with the simplest scenario: maximizing a single-step reward, that is, considering only one action. Even in this case, reinforcement learning differs significantly from supervised learning, because the machine has to try actions to discover their outcomes, and no training data tells it which action to take. In short: there are no labels.
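To maximize the single-step reward, two things are needed: knowing the reward each action brings, and performing the action with the highest reward. If each action's reward were a fixed value, trying every action once would be enough. In general, though, the reward is drawn from a probability distribution, so a single try does not reveal an action's expected reward.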
In fact, the single-step reinforcement learning task corresponds to a theoretical model, the "K-armed bandit". A K-armed bandit is a slot machine with K arms: the gambler inserts a coin and pulls one of the arms, and each arm pays out coins with a certain probability that the gambler does not know. The gambler's goal is to find a strategy that yields the largest total payout for a fixed cost, that is, for a fixed number of coins.
Suppose the gambler has 100 coins to spend and the machine has 5 arms. There are two extreme options. One is "exploration only": spread the 100 coins evenly over the 5 arms and use the coins each arm pays out to estimate which arm is best. The other is "exploitation only": put every coin into the arm with the best average reward so far (if several arms are tied for best, pick one of them at random). Obviously, both are flawed: exploration only estimates every arm well but spends many coins on poor arms, while exploitation only may never discover the best arm. Because the number of tries is limited, the two goals conflict, and getting the best cumulative reward means striking a balance between them.
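The two extreme strategies above can be compared with a small simulation. This is only a minimal sketch under stated assumptions: each arm pays exactly one coin with a fixed probability, the payout probabilities in `probs` are made-up values for illustration, and the function names (`pull`, `explore_only`, `exploit_only`) are hypothetical and do not come from any library.

```python
import random


def pull(prob, rng):
    """Pull one arm: pays 1 coin with probability `prob`, otherwise 0."""
    return 1 if rng.random() < prob else 0


def explore_only(probs, budget, rng):
    """Spread the budget evenly over all arms (round-robin) and return the total payout."""
    total = 0
    for t in range(budget):
        total += pull(probs[t % len(probs)], rng)
    return total


def exploit_only(probs, budget, rng):
    """Always pull the arm with the best average reward so far; ties are broken at random."""
    k = len(probs)
    counts = [0] * k   # number of pulls per arm
    sums = [0.0] * k   # coins received per arm
    total = 0
    for _ in range(budget):
        # Arms that were never pulled are treated as having average reward 0.
        avgs = [sums[i] / counts[i] if counts[i] else 0.0 for i in range(k)]
        best = max(avgs)
        arm = rng.choice([i for i in range(k) if avgs[i] == best])
        reward = pull(probs[arm], rng)
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total


# Assumed payout probabilities for 5 arms (illustrative values only).
probs = [0.1, 0.2, 0.3, 0.4, 0.8]
rng = random.Random(0)
print("exploration only:", explore_only(probs, 100, rng))
print("exploitation only:", exploit_only(probs, 100, rng))
```

With exploration only, the expected payout is the average of all arms' probabilities times the budget, no matter how good the best arm is. With exploitation only, the result depends heavily on which arm happens to pay out first, since the gambler then sticks with that arm.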
Two algorithms that strike this balance are introduced next: ε-greedy and softmax.
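As a rough preview, the two selection rules could be sketched as follows. The text gives no details, so everything here is an assumption: "greedy" is taken to mean the common ε-greedy rule (explore a random arm with probability `epsilon`, otherwise exploit the best arm so far), softmax is taken to mean Boltzmann selection with a temperature `tau`, and the function names and payout probabilities are hypothetical.

```python
import math
import random


def epsilon_greedy_choice(avgs, epsilon, rng):
    """With probability epsilon pick a random arm, otherwise the arm with the best average."""
    if rng.random() < epsilon:
        return rng.randrange(len(avgs))
    best = max(avgs)
    return rng.choice([i for i, q in enumerate(avgs) if q == best])


def softmax_choice(avgs, tau, rng):
    """Pick arm i with probability proportional to exp(avgs[i] / tau)."""
    m = max(avgs)  # subtract the maximum for numerical stability
    weights = [math.exp((q - m) / tau) for q in avgs]
    return rng.choices(range(len(avgs)), weights=weights)[0]


def run(probs, budget, choose, rng):
    """Play `budget` coins, choosing each arm via choose(avgs, rng); return the total payout."""
    k = len(probs)
    counts = [0] * k
    avgs = [0.0] * k
    total = 0
    for _ in range(budget):
        arm = choose(avgs, rng)
        reward = 1 if rng.random() < probs[arm] else 0
        counts[arm] += 1
        # Incremental update of the running average reward of this arm.
        avgs[arm] += (reward - avgs[arm]) / counts[arm]
        total += reward
    return total


# Assumed payout probabilities for 5 arms (illustrative values only).
probs = [0.1, 0.2, 0.3, 0.4, 0.8]
rng = random.Random(0)
print("epsilon-greedy:", run(probs, 100, lambda q, r: epsilon_greedy_choice(q, 0.1, r), rng))
print("softmax:", run(probs, 100, lambda q, r: softmax_choice(q, 0.1, r), rng))
```

In this sketch, a larger `epsilon` or `tau` means more exploration, while values close to 0 make both rules behave almost like exploitation only.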
Reinforcement Learning: the K-armed bandit