Selected from "Reinforcement Learning:an Introduction", version 2, Chapter2
https://webdocs.cs.ualberta.ca/~sutton/book/bookdraft2016sep.pdf
The book's introduction leads into Chapter 2 as follows:
One of the challenges that arise in reinforcement learning, and not in other kinds of learning, is the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear to be best. On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward. The exploration-exploitation dilemma has been intensively studied by mathematicians for many decades (see Chapter 2). For now, we simply note that this entire issue of balancing exploration and exploitation does not even arise in supervised and unsupervised learning, at least in their purest forms.
In other words: one of the challenges of reinforcement learning, which does not arise in other kinds of learning problems, is the trade-off between exploration and exploitation. To gain a lot of reward, a reinforcement learning agent tends to choose actions that it has tried in the past and found profitable. But to find such actions, it must try actions it has not chosen before. On the one hand the agent wants to exploit what it already knows to obtain reward; on the other hand it must explore so that it can make better choices in the future. The dilemma is that pursuing either exploration or exploitation exclusively leads to failure at the task. The agent should therefore try a variety of actions while progressively favoring the ones that currently look best. In a stochastic task, each action must be tried many times to obtain a reliable estimate of its expected reward. The exploration-exploitation dilemma has been studied extensively by mathematicians for decades (see Chapter 2). For now, we simply note that the problem of balancing exploration and exploitation does not appear in supervised or unsupervised learning, at least in their purest forms.
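As a rough, non-authoritative illustration of this balance (the book's own action-selection methods come later in the chapter), here is a minimal ε-greedy rule in Python; the function name and the assumption that `Q` holds current action-value estimates are ours, not the book's:

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    """Choose an action index given value estimates Q (a list of floats).

    With probability epsilon we explore (pick a random action); otherwise
    we exploit (pick an action whose current estimate is highest, breaking
    ties at random).
    """
    if random.random() < epsilon:
        return random.randrange(len(Q))                                # explore
    best = max(Q)
    return random.choice([a for a, q in enumerate(Q) if q == best])   # exploit
```

For example, `epsilon_greedy([0.1, 0.9, 0.4])` usually returns 1 but occasionally returns a random arm, which is what lets the estimates for the other arms keep improving.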
Chapter 2 itself begins as follows:
The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions. This is what creates the need for active exploration, for an explicit trial-and-error search for good behavior. Purely evaluative feedback indicates how good the action taken is, but not whether it is the best or the worst action possible. Purely instructive feedback, on the other hand, indicates the correct action to take, independently of the action actually taken. This kind of feedback is the basis of supervised learning, which includes large parts of pattern classification, artificial neural networks, and system identification. In their pure forms, these two kinds of feedback are quite distinct: evaluative feedback depends entirely on the action taken, whereas instructive feedback is independent of the action taken. There are also interesting intermediate cases in which evaluation and instruction blend together.
In other words, reinforcement learning differs from other classes of learning in that its training information evaluates the actions taken (the reward obtained by taking an action) rather than instructing the agent with the correct action. This creates the need for active exploration, an explicit search for good behavior. Purely evaluative feedback indicates how much reward is produced by the action taken, but not whether that action is the best or the worst possible. Purely instructive feedback, by contrast, indicates only the correct action that should have been taken, regardless of the action actually taken; this kind of feedback is the basis of supervised learning. In their pure forms the two kinds of feedback are completely different: evaluative feedback depends entirely on the action taken, while instructive feedback is independent of it. There are also interesting cases in between.
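A toy contrast may make the distinction concrete (this example and its reward values are hypothetical, not from the book): evaluative feedback scores whichever action was taken, while instructive feedback names the correct action regardless of what was taken.

```python
def evaluative_feedback(action):
    """Evaluative: a reward for the action actually taken, saying nothing
    about whether some other action would have been better."""
    expected_reward = {0: 0.2, 1: 0.5, 2: 1.0}   # hypothetical values
    return expected_reward[action]

def instructive_feedback(action):
    """Instructive (supervised): the correct action, independent of the
    action actually taken."""
    return 2                                      # the "right answer" is given directly
```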
In this chapter we study the evaluative aspect of reinforcement learning in a simplified setting, one that does not involve learning to act in more than one situation. This nonassociative setting is the one in which most prior work involving evaluative feedback has been done, and it avoids much of the complexity of the full reinforcement learning problem. Studying this case will enable us to see more clearly how evaluative feedback differs from, and yet can be combined with, instructive feedback.
That is, this chapter studies the evaluative aspect of reinforcement learning in a simplified setting, where "simplified" means that learning to act in more than one situation is not involved. Most prior work on evaluative feedback has been done in this nonassociative setting, which avoids much of the complexity of the full reinforcement learning problem. Studying this case helps us understand how evaluative feedback differs from, and can be combined with, instructive feedback.
The particular nonassociative, evaluative feedback problem that we explore is a simple version of the k-armed bandit problem. We can use this problem to introduce a number of basic learning methods which we extend in later chapters to apply to the full reinforcement learning problem. At the end of this chapter, we take a step closer to the full reinforcement learning problem by discussing what happens when the bandit problem becomes associative, that is, when actions are taken in more than one situation.
In other words, the particular nonassociative evaluative-feedback problem we explore is a simplified version of the k-armed bandit problem. This problem is used to introduce the basic learning methods that later chapters extend to the full reinforcement learning problem. At the end of the chapter, the bandit problem is extended so that actions are taken in more than one situation, giving its associative version. A small simulation of the nonassociative setting is sketched below.
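As a rough illustration (not the book's own testbed, whose details come later in the chapter), here is a minimal Python sketch of a k-armed bandit run with ε-greedy action selection and sample-average value estimates; the arm means, noise model, and parameter values are assumptions made for the example:

```python
import random

def run_bandit(k=10, steps=1000, epsilon=0.1, seed=0):
    """Simulate one run of a k-armed bandit with epsilon-greedy selection.

    Each arm's true mean reward is drawn from a normal distribution (an
    assumed testbed choice). Q holds sample-average estimates of each arm's
    value, and N counts how often each arm has been pulled.
    """
    rng = random.Random(seed)
    true_values = [rng.gauss(0.0, 1.0) for _ in range(k)]  # assumed stationary arm means
    Q = [0.0] * k
    N = [0] * k
    total_reward = 0.0

    for _ in range(steps):
        # Explore with probability epsilon, otherwise exploit the current estimates.
        if rng.random() < epsilon:
            a = rng.randrange(k)
        else:
            best = max(Q)
            a = rng.choice([i for i, q in enumerate(Q) if q == best])

        reward = rng.gauss(true_values[a], 1.0)   # noisy reward for the chosen arm
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]            # incremental sample-average update
        total_reward += reward

    return Q, total_reward

if __name__ == "__main__":
    estimates, total = run_bandit()
    print("estimated values:", [round(q, 2) for q in estimates])
    print("total reward:", round(total, 2))
```

A larger ε explores more and estimates all arm values more accurately, while a smaller ε exploits more and may lock onto a suboptimal arm; this is exactly the trade-off discussed above.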
Summary:
The multi-armed bandit problem (also known as the k-armed bandit problem) is not the complete reinforcement learning problem, but a simplified version of it. The book therefore uses the bandit problem as a primer that leads into the reinforcement learning problem, and several reinforcement learning concepts are extensions of concepts introduced there.
The connection between the multi-armed bandit problem and reinforcement learning