The Basic Concepts and Code Implementation of Reinforcement Learning


Selected from Deeplearning4j

Compiled by Machine Heart

Contributors: Nurhachu Null, Li Zenan


From AlphaGo to self-driving cars, reinforcement learning can be found in many of the most advanced AI applications. It is the technology by which an agent learns a task from scratch and grows into an expert that surpasses human-level performance. This article is a brief introduction to it.


Neural networks have produced recent breakthroughs in areas such as computer vision, machine translation, and time series forecasting, and they can be combined with reinforcement learning algorithms to create astonishing results such as AlphaGo (see: AlphaGo Zero, DeepMind's next-generation Go program that required no human knowledge and was featured in Nature again).


Reinforcement learning refers to goal-oriented algorithms that learn how to attain a complex objective, or how to maximize a quantity over many steps; for example, maximizing the points won in a game over many moves. Starting from a blank slate, they can reach performance beyond human level under the right conditions. Like a child incentivized with candy and spankings, these algorithms are penalized when they make the wrong decisions and rewarded when they make the right ones: that is reinforcement.


Combining deep learning with reinforcement learning algorithms has defeated human champions at Go and at Atari games. Although that may not sound convincing enough, it is far superior to what these algorithms could do before, and the state of the art is now progressing rapidly.


Two reinforcement learning algorithms, Deep Q-learning and A3C, have already been implemented in the Deeplearning4j library, and they can already play Doom.


Reinforcement learning solves the difficult problem of correlating immediate actions with the delayed returns they produce. Like humans, reinforcement learning algorithms sometimes have to wait a while to see the results of their decisions. They operate in a delayed-return environment, where it can be hard to understand which action leads to which outcome over many steps.
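To make the idea of delayed returns concrete, here is a minimal sketch (ours, not from the original article) that assumes the standard trick of discounting: a reward received t steps in the future is multiplied by gamma^t, so the same reward is worth less the longer the agent has to wait for it.

```python
# Minimal sketch (not from the original article): delayed rewards are
# discounted by gamma**t, where t is how many steps in the future they arrive.
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each weighted by how long the agent had to wait for it."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1.0]))                 # immediate reward: worth 1.0
print(discounted_return([0.0] * 10 + [10.0]))   # 10 points, but 10 steps late: ~9.04
```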


We can expect reinforcement learning algorithms to perform better and better in more ambiguous, real-life environments, choosing from an arbitrary number of possible actions rather than from the limited options of a video game. That is to say, over time we expect them to be valuable for achieving goals in the real world.


Introduction to Reinforcement Learning (https://docs.skymind.ai/docs)


Reinforcement Learning Definition


Reinforcement learning can be understood through the concepts of agents, environments, states, actions, and rewards, all of which we explain below. Capital letters denote sets of things and lowercase letters denote instances of those things; for example, A is the set of all possible actions, while a is one action contained in that set.


Agent: An entity that takes actions; for example, a delivery drone, or Super Mario moving toward a goal in a video game. The reinforcement learning algorithm is the agent. In real life, the agent could be you.

Action (A): A is the set of all possible actions the agent can take. An individual action is almost self-explanatory, but note that the agent chooses from a list of possible actions. In video games, the list might include running right or running left, jumping high or low, crouching or standing still. In the stock market, the list might include buying, selling, or holding any of a range of securities and their derivatives. When handling aerial drones, the options include many different velocities and accelerations in three-dimensional space.

Environment: The world through which the agent moves. The environment takes the agent's current state and action as input, and outputs the agent's reward and next state. If you are the agent, the environment is the laws of physics and the rules of society that process your actions and determine the consequences of them.

State (s): A state is the concrete, immediate situation in which the agent finds itself; that is, a specific place and moment, an instantaneous configuration that relates the agent to other significant things such as tools, enemies, or rewards. It is the current situation returned by the environment. Have you ever been in the wrong place at the wrong time? That is certainly a state.

Reward (R): A reward is the feedback by which we measure the success or failure of an agent's actions. For example, in a video game, Mario wins points when he touches a coin. From any given state, the agent sends output to the environment in the form of an action, and the environment returns the agent's new state (which depends on acting from the previous state) as well as a reward, if there is one. Rewards can be immediate or delayed. They effectively evaluate the agent's actions.

Policy (π): The policy is the strategy the agent uses to determine its next action based on the current state.

Value (V): The expected long-term return with discounting, as opposed to the short-term reward R. We define Vπ(s) as the expected long-term return of the current state s when following policy π.

Q-value or action value (Q): The Q-value is similar to the value above, except that it takes an extra parameter, the current action a. Qπ(s, a) refers to the long-term return of the current state s, taking action a under policy π. (A short code sketch after this list shows how V, Q, and π relate.)
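As a concrete illustration of how the policy π, the value V, and the Q-value relate, the following is a minimal tabular sketch of our own; the greedy policy, the learning rate, and the discount factor are illustrative assumptions, not part of the original article.

```python
# Minimal tabular sketch (illustrative assumptions, not from the article).
from collections import defaultdict

GAMMA = 0.9   # discount factor: how much future rewards are worth today
ALPHA = 0.1   # learning rate for the Q-value update

Q = defaultdict(float)   # Q[(s, a)]: estimated long-term return of action a in state s

def policy(state, actions):
    """A greedy policy pi: choose the action with the highest current Q-value."""
    return max(actions, key=lambda a: Q[(state, a)])

def value(state, actions):
    """V(s) under the greedy policy: the best Q-value available in state s."""
    return max(Q[(state, a)] for a in actions)

def q_update(state, action, reward, next_state, actions):
    """One Q-learning step: nudge Q(s, a) toward reward + gamma * max_a' Q(s', a')."""
    target = reward + GAMMA * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])
```

Deep Q-learning, mentioned earlier, replaces this lookup table with a neural network that approximates Q(s, a).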


Therefore, the environment is the function that transforms an action taken in the current state into the next state and a reward; the agent is the function that transforms the new state and reward into the next action. We can know the agent's function, but we cannot know the function of the environment: it is a black box of which we can only see the inputs and outputs. Reinforcement learning amounts to the agent trying to approximate the environment's function, so that it can send actions into the black-box environment that maximize the rewards it returns.



In the feedback loop above, the subscripts t and t+1 denote time steps and refer to different states: the state at time t and the state at time t+1. Unlike other forms of machine learning such as supervised and unsupervised learning, reinforcement learning can only be thought of as a sequence of state-action pairs that occur one after another.
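The loop just described can be written down directly. The following is a minimal sketch of ours (not Deeplearning4j code); the toy environment and the random agent are assumptions made only to show the state → action → (next state, reward) cycle.

```python
import random

# Minimal sketch (not Deeplearning4j code) of the agent-environment feedback loop.
class ToyEnvironment:
    """The black box: maps (state, action) to (next state, reward)."""
    def step(self, state, action):
        next_state = state + action                # toy transition rule
        reward = 1.0 if next_state == 5 else 0.0   # toy goal: reach position 5
        return next_state, reward

class RandomAgent:
    """Maps (state, reward) to the next action; here it simply acts at random."""
    def act(self, state, reward):
        return random.choice([-1, +1])

env, agent = ToyEnvironment(), RandomAgent()
state, reward = 0, 0.0
for t in range(10):
    action = agent.act(state, reward)              # state at time t -> action
    state, reward = env.step(state, action)        # -> state and reward at time t+1
    print(f"t={t}  action={action:+d}  state={state}  reward={reward}")
```

A learning agent would replace RandomAgent's act method with a policy that improves based on the rewards it receives.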


Reinforcement learning judges actions by the results they produce. It is goal-oriented, and its goal is to learn sequences of actions that lead the agent to achieve its objective. Here are some examples:


In a video game, the goal is to finish the game with the highest score, so each additional point gained during the game affects the agent's subsequent behavior; that is, the agent may learn that, in order to maximize its score, it should shoot battleships, touch coins, or dodge meteors.

In the real world, a robot's goal might be to travel from point A to point B, and every inch it moves from point A toward point B counts as points.


Reinforcement learning can be distinguished from supervised and unsupervised learning by how it interprets its inputs. We can illustrate the differences by describing the "thing" each of them learns.


Unsupervised learning: That thing looks like this other thing. (Unsupervised learning algorithms learn similarities between unlabeled things, and by extension they can spot the opposite and perform anomaly detection by recognizing unusual or dissimilar instances.)

Supervised learning: That thing is a "double cheeseburger". (Labels, matching names to faces...) Supervised learning algorithms learn the correlations between data instances and their labels; that is, they require a labeled dataset. Those labels are used to "supervise" and correct the algorithm, because it may make wrong guesses when predicting labels.

Reinforcement learning: Eat that thing because it tastes good and will keep you alive longer. (Actions driven by short-term and long-term rewards, the equivalents of the calories you ingest or the length of time you survive.) Reinforcement learning can be seen as supervised learning in an environment with sparse feedback.


Domain Selection for Reinforcement Learning


An autonomous reinforcement learning agent can be imagined as a blind person attempting to navigate the world relying only on their ears and a white cane. Agents have small windows through which to perceive their environment, and those windows may not even be the best way for them to perceive their surroundings.


In fact, deciding which types of input and feedback your agent should pay attention to is a hard problem to solve. It is known as the domain selection problem. Algorithms that learn to play video games can mostly ignore this problem, because their environment is man-made and strictly limited; video games thus provide a sterile laboratory environment where ideas about reinforcement learning can be tested. Domain selection requires human decisions, usually based on knowledge or theories about the problem to be solved; for example, selecting the input domain for a self-driving car algorithm might include information from radar sensors, cameras, and GPS data.
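To show what a domain-selection decision could look like in code, here is a hypothetical sketch of an observation type for a driving agent; the field names and sensor choices are our own assumptions, not taken from the article or from any real self-driving system.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical sketch only: one possible input domain for a driving agent.
# Every field below is an assumption chosen for illustration.
@dataclass
class DrivingObservation:
    radar_ranges: Tuple[float, ...]    # distances to nearby objects, in meters
    camera_frame: bytes                # raw pixels from a forward-facing camera
    gps_position: Tuple[float, float]  # latitude and longitude
    speed_mps: float                   # current speed, in meters per second

# Whatever the agent's policy learns, it can only ever react to the fields
# included here; deciding what to include is the domain-selection problem.
```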



State-Action Pairs & Complex Probability Distributions of Reward
