1. The bubble of deep reinforcement learning
In 2015, Volodymyr Mnih and colleagues at DeepMind published the paper "Human-level control through deep reinforcement learning" in Nature [1]. The paper introduced the deep Q-network (DQN), a model that combines deep learning (DL) with reinforcement learning (RL) and demonstrates above-human performance on the Atari game platform. Since then, the combination of DL and RL, known as deep reinforcement learning (DRL), has rapidly become a focal point of the AI community.
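To make the combination concrete, here is a minimal sketch of the core DQN idea: a neural network approximates Q(s, a) and is trained on the temporal-difference target. This is an illustrative simplification, not the published implementation; the network size and hyperparameters are assumptions, and the real DQN additionally uses an experience replay buffer, a separate target network, and convolutional inputs for Atari frames.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small fully connected network that outputs one Q-value per action."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions))

    def forward(self, s):
        return self.net(s)

def td_update(qnet, optimizer, batch, gamma=0.99):
    """One gradient step on the squared TD error for a batch of transitions."""
    s, a, r, s2, done = batch                   # a is an integer (long) tensor of actions
    q = qnet(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a) for the taken actions
    with torch.no_grad():
        target = r + gamma * (1 - done) * qnet(s2).max(dim=1).values  # TD target
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```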
Over the past three years, DRL algorithms have shown their prowess in a range of fields: playing video games [1], defeating top human players at board games [2,3], controlling complex machinery [4], allocating network resources [5], saving energy in data centers [6], and even automatically tuning the parameters of machine learning algorithms [7]. Universities and companies alike have joined in, proposing a dazzling array of DRL algorithms and applications. It is fair to say that the past three years have been DRL's golden age. The DeepMind researcher leading the AlphaGo project went so far as to proclaim "AI = RL + DL", arguing that DRL, combining DL's representational power with RL's reasoning ability, would be the ultimate answer to AI.
The number of RL papers grew rapidly [8]
1.1 The reproducibility crisis of DRL
However, over the past six months researchers have begun to rethink DRL. Many algorithms are difficult to reproduce, because important hyperparameter settings and engineering details are often missing from the published literature. In September 2017, the research group led by the well-known RL experts Doina Precup and Joelle Pineau published the paper "Deep Reinforcement Learning that Matters" [8], pointing out that the DRL field produces a large number of papers of limited substance and that the experiments are hard to reproduce. The article drew an enthusiastic response from academia and industry alike; many people agreed with it and voiced strong doubts about DRL's practical capabilities.
In fact, this was not the first time the Precup & Pineau group had taken aim at DRL. Two months earlier, the group had already investigated, with extensive experiments, the many factors that make DRL algorithms hard to reproduce, and written "Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control" [9]. In August of the same year, they gave a talk at ICML 2017 entitled "Reproducibility of Policy Gradient Methods for Continuous Control" [10], showing in detail how various sources of uncertainty make it difficult to reproduce several policy-gradient-based algorithms. In December, Joelle Pineau was invited to give a talk entitled "Reproducibility of DRL and Beyond" at the NIPS DRL Symposium [11]. In the talk, Pineau first described the current "reproducibility crisis" in science at large: in a Nature survey, 90% of respondents said that reproducibility is a crisis in their field, and 52% considered it a serious one; in another survey, a high proportion of researchers across different fields reported being unable to reproduce even their own past experiments. The reproducibility crisis is grim indeed. A survey of machine learning researchers cited by Pineau showed that 90% of them likewise acknowledge the crisis.
There is a serious "reproducibility crisis" in the machine learning field [11]
Turning to the DRL field, Pineau then presented her group's extensive reproducibility experiments on current DRL algorithms. The results show that different DRL algorithms behave very differently across tasks, hyperparameter settings, and even random seeds. In the second half of the talk, Pineau urged the community to take the reproducibility crisis seriously and, based on her findings, proposed 12 criteria for checking the reproducibility of an algorithm, announcing a Reproducibility Challenge to be run at ICLR 2018. In other areas of machine learning, ICML 2017 held a Reproducibility in Machine Learning Workshop, with a second edition to follow this year, to encourage researchers to do truly solid work and deflate the bubble in the field. This series of studies by the Pineau & Precup group has attracted widespread attention.
Pineau's checklist for assessing the reproducibility of an algorithm, based on extensive surveys [11]
1.2 How many pitfalls are there in DRL research?
Also in December, a lively discussion of questionable practices in machine learning took place on Reddit [12]. It was pointed out that some representative DRL algorithms achieved excellent but hard-to-reproduce results in simulators allegedly because the authors had modified the simulator's physics model in their experiments while avoiding any mention of it in the paper.
The wave of criticism of existing DRL algorithms kept swelling. On Valentine's Day 2018, Alex Irpan, who had worked in the Berkeley Artificial Intelligence Research lab (BAIR), delivered a bitter gift to the DRL community with the post "Deep Reinforcement Learning Doesn't Work Yet" [13]. Through a series of examples, he summarized, from an experimental point of view, several problems with current DRL algorithms:
Very low sample efficiency;
Final performance that is often not good enough, frequently losing to model-based methods;
Good reward functions are hard to design;
Difficulty balancing "exploration" and "exploitation", so that algorithms get stuck in local optima;
Overfitting to the environment;
Catastrophic instability ...
Although at the end of the post the author tries to lay out the problems DRL should tackle next, many readers still took the article as a piece persuading people to quit DRL. A few days later, Georgia Tech PhD student Himanshu Sahni echoed it with the post "Reinforcement Learning Never Worked, and 'Deep' Only Helped a Bit" [14]; while agreeing with Alex Irpan, he pointed out that the difficulty of designing good reward functions and the difficulty of balancing "exploration" and "exploitation", which traps algorithms in local optima, are inherent flaws of RL.
Matthew Rahtz, another DRL researcher, responded to Alex Irpan by recounting the ups and downs of his attempt to reproduce a DRL algorithm, giving us a vivid sense of just how hard reproducing DRL results can be [15]. Six months earlier, out of research interest, Rahtz had chosen to reproduce OpenAI's paper "Deep Reinforcement Learning from Human Preferences". In the process he stepped into almost every pit that Alex Irpan had catalogued. He argues that reproducing a DRL algorithm is more of an engineering problem than a mathematical one: "It's more like solving a puzzle with no regular pattern; the only way is to keep trying until inspiration strikes and you fully understand it ... A lot of seemingly insignificant details are the only clues ... Be prepared to get stuck for a couple of weeks at a time." Rahtz accumulated a great deal of valuable engineering experience along the way, but the difficulty of the process cost him dearly in money and time. He mobilized every computing resource he could, including university machine-room resources, Google Cloud compute engines, and FloydHub, spending a total of $850. Even so, a project originally planned for 3 months ended up taking 8, with much of that time spent debugging.
The actual time needed to reproduce a DRL algorithm is far longer than the estimated time [15]
Rahtz ultimately achieved his goal of reproducing the paper. Besides giving readers a detailed summary of valuable engineering lessons, his account shows through a concrete example how much froth there is in DRL research and how many pits lie in wait. As one commenter put it, "DRL's success may not be because it really works, but because people are hyping it hard."
Many well-known scholars joined the discussion. The prevailing view is that DRL may be the biggest bubble in AI. Machine learning researcher Jacob Andreas posted a pointed tweet:
Jacob Andreas's jab at DRL.
DRL's success, he suggested, can be attributed to its being the only method in the machine learning community for which training on the test set is acceptable.
More than a year has passed since Pineau & Precup fired the first shot, and DRL has been hammered full of holes, its reception shifting from broad enthusiasm to general skepticism. Just as I was preparing this article, Pineau was invited to give a talk at ICLR 2018 entitled "Reproducibility, Reusability, and Robustness in DRL" [16] and formally launched the Reproducibility Challenge. It seems the academic community will keep criticizing DRL, and the negative commentary will continue to ferment. So where does the root of DRL's problems lie? Is the outlook really so bleak? If it is not to be combined with deep learning, where is RL's way out?
While everyone was piling on DRL, the well-known optimization expert Ben Recht offered an analysis from a different angle.
2. The essential flaws of model-free reinforcement learning
RL algorithms can be divided into model-based and model-free methods. The former developed mainly out of the optimal control field: typically, a model of the specific problem is built with tools such as Gaussian processes (GP) or Bayesian networks (BN), and then solved with machine learning methods or with optimal control methods such as model predictive control (MPC), the linear quadratic regulator (LQR), the linear quadratic Gaussian (LQG), or iterative learning control (ILC). The latter developed more out of machine learning and is a data-driven approach: the algorithm uses large numbers of samples to estimate the value function of the agent's states and actions, or the return, and optimizes the action policy accordingly.
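To give a feel for the optimal-control side mentioned above, here is a minimal LQR sketch: given a known linear model x' = Ax + Bu, the optimal feedback gain follows from the discrete algebraic Riccati equation. The system matrices (a double integrator) and cost weights are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])        # known dynamics model (double integrator)
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)                     # state cost
R = np.array([[0.1]])             # control cost

P = solve_discrete_are(A, B, Q, R)                     # solve the Riccati equation
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)      # optimal gain: u = -K x

x = np.array([5.0, 0.0])
for _ in range(5):
    u = -K @ x                    # feedback computed from the model, no sampling needed
    x = A @ x + B @ u
    print(x)
```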
Starting at the beginning of this year, Ben Recht wrote a series of 13 blog posts examining RL, and in particular the model-free methods, from the perspective of control and optimization [18]. Recht points out that model-free methods suffer from several essential drawbacks:
Model-based vs. model-free [17]
1. Model-free methods cannot learn from samples that carry no feedback signal, and feedback itself is sparse, so their sample efficiency is low; as data-driven methods they require enormous amounts of sampling. For example, in the Atari games "Space Invaders" and "Seaquest", the score an agent achieves grows with the amount of training data, and a model-free DRL method may need some 200 million frames to learn reasonably good behavior. The earliest version of AlphaGo published in Nature likewise required 30 million board positions for training. For problems involving mechanical control, training data is far harder to obtain than video frames, so training can only be done in simulators, and the reality gap between the simulator and the real world directly limits the generalization of any algorithm trained there. In addition, the scarcity of data also hampers the combination of RL with DL techniques.
2. Model-free methods do not model the specific problem; they try to solve every problem with one universal algorithm. Model-based methods, by building a model of the specific problem, exploit the information inherent in it. In pursuing generality, model-free methods discard this valuable information.
3. Model-based methods build a dynamics model of the problem, which makes them interpretable. Model-free methods have no model, so they are hard to debug and lack interpretability.
4. Compared with model-based methods, especially those based on simple linear models, model-free methods are less stable and easily diverge during training.
To support these views, Recht compared a simple random-search method over linear policies, motivated by his analysis of LQR, with the best model-free methods in the MuJoCo experimental environment. At comparable sample efficiency, the random-search algorithm is at least 15 times more computationally efficient than the model-free methods [19].
The random-search method ARS handily beats model-free methods [19]
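For concreteness, here is a minimal sketch of basic random search over a linear policy in the spirit of ARS [19]. The gym-style environment (old 4-tuple step API) and all hyperparameters are assumptions, and the published ARS additionally normalizes states and keeps only the top-performing directions.

```python
import numpy as np

def rollout(env, M, horizon=1000):
    """Run one episode with the deterministic linear policy a = M @ s."""
    s, total = env.reset(), 0.0
    for _ in range(horizon):
        a = M @ s
        s, r, done, _ = env.step(a)
        total += r
        if done:
            break
    return total

def basic_random_search(env, state_dim, action_dim,
                        step_size=0.02, noise=0.03, n_dirs=8, n_iters=100):
    M = np.zeros((action_dim, state_dim))           # linear policy parameters
    for _ in range(n_iters):
        deltas = [np.random.randn(*M.shape) for _ in range(n_dirs)]
        update = np.zeros_like(M)
        for d in deltas:
            r_plus = rollout(env, M + noise * d)    # evaluate perturbed policies
            r_minus = rollout(env, M - noise * d)
            update += (r_plus - r_minus) * d        # reward-weighted direction
        M += step_size / (n_dirs * noise) * update  # move along the average direction
    return M
```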
Recht's analysis seems to reveal the root of DRL's problems: the DRL algorithms that have been so hot in machine learning over the past three years mostly combine model-free methods with DL, and the inherent defects of model-free algorithms correspond almost exactly to the major problems of DRL summarized by Alex Irpan (see above).
It appears, then, that most of DRL's troubles stem from its reliance on model-free methods. Why is most DRL work based on the model-free approach? I see several reasons. First, model-free methods are relatively simple and intuitive, with abundant open-source implementations that are easy to get started with, which attracts more researchers and makes groundbreaking work such as the DQN and AlphaGo series more likely. Second, RL is still at an early stage of development, and academic research still concentrates on problems where the environment is deterministic and static, the state is mostly discrete, static, and fully observable, and the feedback is deterministic (such as Atari games); for such relatively "simple", basic, general-purpose problems, model-free methods are themselves a good fit. Finally, under the banner of "AI = RL + DL", the community overestimated DRL's capabilities. DQN's exciting results led many people to build upon and extend DQN, producing a string of equally model-free works.
The vast majority of DRL methods are extensions of DQN and belong to the model-free family [20]
So, should DRL abandon the model-free approach and embrace model-based methods?
3. Model-based or model-free? The problem is not so simple
3.1 Model-based methods have great potential
Model-based methods typically first learn a model from data and then optimize the policy on top of the learned model. Learning the model resembles system identification in control theory. Because a model exists, model-based methods can make full use of every sample to refine the approximate model, so data efficiency improves greatly; on some control problems, model-based methods achieve sample-efficiency gains on the order of 10^2 over model-free methods. In addition, the learned model tends to be robust to the environment: when facing a new environment, the algorithm can rely on the model it has already learned to make inferences, giving good generalization performance.
Model-based methods have much higher sample efficiency [22]
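A minimal sketch of this learn-the-model-then-plan loop is given below: fit a dynamics model to logged transitions, then plan through it with random-shooting MPC. The data format, `reward_fn`, network size, and planning horizon are illustrative assumptions rather than any specific published algorithm.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_dynamics(states, actions, next_states):
    """Learn an approximate model s' = f(s, a) from logged transitions."""
    X = np.hstack([states, actions])
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
    model.fit(X, next_states)
    return model

def plan_action(model, state, reward_fn, action_dim,
                horizon=10, n_candidates=200):
    """Random-shooting MPC: sample action sequences, roll them out in the
    learned model, and return the first action of the best sequence."""
    best_action, best_return = None, -np.inf
    for _ in range(n_candidates):
        seq = np.random.uniform(-1, 1, size=(horizon, action_dim))
        s, ret = state.copy(), 0.0
        for a in seq:
            s = model.predict(np.hstack([s, a])[None])[0]   # predicted next state
            ret += reward_fn(s, a)
        if ret > best_return:
            best_return, best_action = ret, seq[0]
    return best_action
```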
Moreover, model-based methods are closely related to predictive learning, which holds great potential. Because a model exists, the agent can predict the future on its own, which is exactly what predictive learning requires. In fact, when Yann LeCun introduced predictive learning in his widely watched NIPS 2016 keynote, he used the model-based approach as his example [21]. I believe model-based RL may be one of the key techniques for realizing predictive learning.
The model-based approach thus looks more promising. But there is no free lunch, and the existence of a model brings problems of its own.
3.2 Model-free methods are still the first choice
Model-based DRL methods are comparatively less straightforward: combining RL with DL is more complex, and the algorithms are harder to design. Current model-based DRL methods typically build the model with Gaussian processes, Bayesian networks, or probabilistic neural networks (PNN); a representative example is the Predictron model proposed by David Silver and colleagues in 2016 [23]. Other work, such as Probabilistic Inference for Learning Control (PILCO) [24], is not itself based on neural networks, though extended versions combine it with BNs, while Guided Policy Search (GPS) uses neural networks to optimize the controller but does not rely on them for the model [25]. There is also work that combines neural networks with models in other ways [26]. None of this is as intuitive and natural as the model-free DRL approach, and the role DL plays varies from case to case.
In addition, model-based methods have several drawbacks of their own:
1. They cannot handle problems that are hard to model. In some areas, such as NLP, many tasks are difficult to abstract into a model. In such settings one can only resort to methods like the R-max algorithm, which first interacts with the environment and computes a model for later use, but the complexity of this approach is generally high. Recently, some work has used predictive learning to build the model, partially alleviating the difficulty of modeling, and this line of thought is gradually becoming a research hotspot.
2. Modeling introduces error, and the error often compounds as the algorithm iteratively interacts with the environment, making it hard to guarantee convergence to the optimal solution.
3. A model lacks generality: whenever the problem changes, the model must be rebuilt.
On each of these points, the model-free approach holds a comparative advantage: many real problems, including imitation learning problems, cannot be modeled, and for them model-free algorithms remain the best choice. Moreover, model-free methods enjoy asymptotic convergence in theory: given unlimited interaction with the environment, the optimal solution can be guaranteed, a property hard to obtain with model-based methods. Finally, the biggest advantage of model-free methods is their excellent generality. In fact, when tackling genuinely hard problems, model-free methods usually perform better. Recht also points out in his blog posts [18] that the MPC algorithm, so effective in control, is closely related to the model-free method Q-learning.
The difference between model-based and model-free methods can also be seen as the difference between knowledge-driven and statistics-driven approaches. In general the two have different strengths, and it is hard to say that either is superior. Within the broader RL field, model-free algorithms occupy only a small part; yet for historical reasons model-free DRL methods have developed rapidly, while model-based DRL work remains relatively scarce. I believe more work on model-based DRL would help overcome DRL's current problems. We can also study semi-model methods that combine model-based and model-free approaches and enjoy the advantages of both; classic work here includes the Dyna framework proposed by the RL authority Rich Sutton [27] and the Dyna-2 framework proposed by his student David Silver [28].
From the discussion above, we seem to have found a way out of DRL's current predicament. But in fact the causes of the predicament run deeper.
3.3 It is not just a question of model or no model
As mentioned above, Recht used a random-search-based approach to crush the model-free methods, seemingly sentencing them to death. But in fact the comparison is not entirely fair.
In March 2017, the research group of machine learning expert Sham Kakade published "Towards Generalization and Simplicity in Continuous Control", looking for a simple, general solution to continuous control problems [29]. They found that current simulators have a serious problem: a tuned linear policy already achieves very good results. With simulators this crude, it is no wonder that a random-search method can beat model-free methods on them.
This shows that the test platforms currently used in RL are very immature, and experimental results obtained in such environments are not convincing enough. Many findings may not be credible, since good performance may come simply from exploiting a simulator's quirks. In addition, some researchers have pointed out that the current criteria for evaluating RL algorithms are unscientific. Ben Recht and Sham Kakade have made a number of concrete recommendations for the development of RL, covering test environments, baseline algorithms, evaluation metrics, and more [18,29]. There is much in RL that needs to be improved and standardized.
So what is the next step for RL?
4. Re-examining reinforcement learning
The questioning and discussion around DRL and model-free RL gives us a chance to re-examine RL, which can only benefit its future development.
4.1 Re-examining DRL research and applications
The DQN and AlphaGo lines of work are impressive, but both tasks are in essence relatively "simple": their environments are deterministic and static, the state is mostly discrete, static, and fully observable, the feedback is deterministic, and there is a single agent. DRL has not yet made comparably stunning breakthroughs on partially observable tasks (such as StarCraft), continuous-state tasks such as mechanical control, tasks with dynamic feedback, or multi-agent tasks.
At present, a large amount of DRL research, especially in computer vision, takes some DL-based vision task and forcibly recasts it as an RL problem, with results often worse than traditional methods. This style of research inflates the number of DRL papers while diluting their substance. As DRL researchers, we should not go looking for a DL task onto which RL can be forced; rather, we should target tasks that are naturally suited to RL and try to introduce DL to strengthen the existing methods, for example in the recognition or function-approximation components.
The tasks on which DRL has succeeded are inherently relatively simple [30]
In computer vision tasks, it is natural to combine DL to obtain good feature representations or function approximation. In some other areas, however, DL may neither provide strong feature extraction nor be usable for function approximation. In robotics, for instance, DL has so far mostly served perception and cannot replace methods based on mechanical analysis. Although there are successful cases of applying DRL to real-world control tasks such as object grasping, for example QT-Opt [70], they typically require a great deal of tuning and training time. We should be clear about the application profile of DRL algorithms: because their outputs are stochastic, current DRL algorithms are used far more in simulators than in the real world. The tasks on which DRL is currently useful, and which run only in simulators, fall into three main categories: video games, board games, and automated machine learning (AutoML, such as Google's AutoML Vision). This is not to say that DRL's applications are forever trapped in the simulator: if, for a specific problem, the gap between simulator and real world can be closed, DRL's power can be unleashed. Google researchers recently tackled quadruped-robot locomotion by aggressively improving the simulator, so that a locomotion policy trained in simulation can be transferred seamlessly to the real world [71]. Still, given the instability of RL algorithms, practical applications should not blindly pursue end-to-end solutions; it is worth considering separating feature extraction (DL) from decision making (RL) for better interpretability and stability. In addition, modular RL (encapsulating the RL algorithm as a module) and combinations of RL with other models have broad prospects in practical applications. How to use DL to learn a representation suited as input to the RL module is also worth studying.
4.2 Re-examining RL research
Machine learning is an interdisciplinary field, and RL is the branch with perhaps the strongest interdisciplinary character. The development of RL theory has drawn inspiration from physiology, neuroscience, and optimal control, and RL is still actively studied in many related fields. In control theory, robotics, operations research, and economics, many scholars continue to devote themselves to RL research, and similar concepts or algorithms are often reinvented in different fields under different names.
The development of RL is influenced by many disciplines [31]
Warren Powell, the well-known operations research expert at Princeton University, once wrote an article entitled "AI, OR and Control Theory: A Rosetta Stone for Stochastic Optimization" [32], which lists the names that the same RL concepts and algorithms carry in AI, operations research (OR), and control theory, bridging the gaps between these fields. Because each discipline has its own character, RL research in different fields has its own flavor, which allows RL research to draw on the best ideas of many fields.
Here, based on my own understanding of RL, I try to summarize some interesting research directions:
1. Model-based methods. As discussed above, model-based methods not only greatly reduce sampling requirements but, by learning a dynamics model of the task, also provide a basis for predictive learning.
2. Improving the data efficiency and scalability of model-free methods. These are the two main weaknesses of model-free learning and long-standing research goals of Rich Sutton. This is difficult territory, but any meaningful breakthrough here will bring great value.
3. More efficient exploration strategies. Balancing "exploration" and "exploitation" is an intrinsic problem of RL, which calls for more efficient exploration strategies. Besides classical algorithms such as softmax, ε-greedy [1], UCB [72], and Thompson sampling [73], a number of new approaches have been proposed in recent years, such as intrinsic motivation [74], curiosity-driven exploration [75], and count-based exploration [76]. In fact, the ideas behind these "new" algorithms already appeared in the early 1980s [77]; it is their organic combination with DL that has brought them renewed attention. In addition, OpenAI and DeepMind have proposed improving exploration by injecting noise into the policy parameters [78] or the neural network weights [79], opening up a new direction. (A minimal sketch of the classical strategies appears after this list.)
4. Combining with imitation learning (IL). ALVINN [33], the first successful application of machine learning to autonomous driving, was based on IL; Pieter Abbeel, now one of the top scholars in RL, designed an IL-based algorithm for helicopter control during his PhD under Andrew Ng [34], which became a representative work of the IL field. The end-to-end autonomous driving system presented by Nvidia in 2016 was also trained through IL [68], and AlphaGo uses IL in its initial learning stage as well. IL sits between RL and supervised learning, combining advantages of both: faster feedback and faster convergence than RL, together with reasoning ability, which makes it well worth studying. For an introduction to IL, see the survey [35].
5. Reward shaping. The reward is the feedback, and its effect on the performance of an RL algorithm is enormous. Alex Irpan's post showed how badly an RL algorithm can behave without a well-designed feedback signal. Designing good feedback signals has always been a hot topic in RL. In recent years many "curiosity"-based RL algorithms and hierarchical RL algorithms have been proposed; both insert feedback signals during training and thereby partially overcome the problem of overly sparse feedback. Another line of thought is to learn the reward function, which is one of the main approaches of inverse reinforcement learning (IRL). GANs are also based on this idea for generative modeling, and GANs' inventor Ian Goodfellow likewise regards GANs as a form of RL [36]. GAIL [37], which combines GANs with traditional IRL, has attracted the attention of many scholars.
6. Transfer learning and multi-task learning in RL. Current RL has very low sample efficiency, and the knowledge it learns does not generalize. Transfer learning and multi-task learning can effectively address these problems: transferring a policy learned on a source task to a new task avoids learning from scratch, which greatly reduces data requirements and improves the algorithm's adaptability. One major difficulty in using RL in the real world is its instability; a natural idea is to use transfer learning to migrate a stable policy trained in a simulator to the real environment, where only a small amount of exploration is needed for the policy to meet requirements. The main obstacle here is the reality gap: the simulated environment differs greatly from the real one. A good simulator can not only bridge the reality gap but also satisfy RL's demand for massive sampling, so it can greatly advance RL research and development, as in the sim-to-real work mentioned above [71]. This is also where RL meets VR technology, and recently both academia and industry have been pushing hard in this area: the autonomous driving field alone features the Gazebo, Euro Truck Simulator, TORCS, Unity, Apollo, PreScan, PanoSim, and CarSim simulators, while the CARLA simulator developed by Intel Labs [38] is gradually becoming a research standard in the industry. Simulators in other fields are also flourishing: for household environments, MIT and the University of Toronto jointly developed the feature-rich VirtualHome simulator, and for UAV training MIT has developed the FlightGoggles simulator.
7. Improving the generalization ability of RL. Generalization is the central goal of machine learning, and most existing RL methods perform poorly on this measure [8]; no wonder Jacob Andreas quipped that RL's success comes from "training on the test set". The problem has attracted wide attention, and researchers are trying to improve generalization by learning the environment's dynamics model [80], reducing model complexity [29], or model-agnostic learning [81], which in turn is driving progress in model-based methods and meta-learning. The main goal of the well-known Dex-Net project from BAIR is to build a robotic grasping model with strong robustness and generalization [82], and in April 2018 OpenAI organized the OpenAI Retro Contest, encouraging participants to develop RL algorithms that generalize well [83].
8. Hierarchical RL (HRL). Professor Zhou Zhihua has proposed three conditions for the success of DL: layer-by-layer processing, internal feature transformation, and sufficient model complexity [39]. HRL not only satisfies these three conditions but also has stronger reasoning ability, making it a very promising research area. HRL has already demonstrated strong learning ability on tasks requiring complex reasoning, such as the Atari game Montezuma's Revenge [40].
9. Combining with sequence prediction. Sequence prediction solves problems similar to those addressed by RL and IL, and the three have many ideas to borrow from one another. Some methods based on RL and IL have already produced good results on sequence prediction tasks [41,42,43]. A breakthrough in this direction would have a broad impact on video prediction and on many NLP tasks.
10. Making the exploratory behavior of (model-free) methods safe (safe RL). Compared with model-based methods, the model-free approach lacks predictive ability, which makes its exploratory behavior more erratic. One research idea is to use Bayesian methods to model the uncertainty of the RL agent's behavior and thereby avoid overly dangerous exploration. Another, aimed at safely applying RL in real environments, is to delineate dangerous regions in the simulator with mixed-reality technology and restrict the agent's activity space so as to constrain its behavior.
11. Relational RL. "Relational learning", which studies the relations between objects in order to reason and predict, has recently received much attention. Relational learning often builds a chain of states during training, and the intermediate states are disconnected from the final feedback. RL can propagate the final feedback back to the intermediate states, enabling effective learning, and is therefore a natural way to realize relational learning; VIN [44] and the Predictron [23] are representative examples. In June 2018, DeepMind published a series of studies in this direction, covering relational inductive biases [45], relational RL [46], relational RNNs [47], graph networks [48], and the Generative Query Network (GQN) published in Science [49]. This series of compelling work is likely to set off a boom in relational RL.
12. RL and adversarial examples. RL is widely used in mechanical control and similar domains, which demand more robustness and security than tasks such as image recognition. Defending RL against adversarial attacks is therefore an important problem. Recent studies have shown that classical models such as DQN cannot withstand adversarial perturbations and can be manipulated with adversarial examples [50,51].
13. Handling inputs of other modalities. In NLP, RL has been used on many modalities of data, such as sentences, documents, and knowledge bases. In computer vision, however, RL algorithms mainly use neural networks to extract features from images and video and rarely touch other modalities. We could explore applying RL to other modalities, such as RGB-D data and LiDAR data. Once the cost of feature extraction for some modality drops dramatically, combining it organically with RL might yield an AlphaGo-level breakthrough. Intel Labs has done a good deal of work in this direction based on the CARLA simulator.
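As promised in item 3, here is a minimal sketch of two classical exploration strategies, ε-greedy and UCB, on a toy multi-armed bandit. The bandit, its reward distributions, and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.7]              # unknown to the agent

def pull(arm):
    return rng.normal(true_means[arm], 1.0)

def epsilon_greedy(n_steps=1000, eps=0.1):
    counts, values = np.zeros(3), np.zeros(3)
    for _ in range(n_steps):
        # With probability eps explore a random arm, otherwise exploit the best estimate.
        arm = rng.integers(3) if rng.random() < eps else int(np.argmax(values))
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]   # incremental mean
    return values

def ucb(n_steps=1000, c=2.0):
    counts, values = np.zeros(3), np.zeros(3)
    for t in range(1, n_steps + 1):
        # Optimism bonus shrinks as an arm is tried more often; untried arms come first.
        bonus = np.where(counts > 0,
                         c * np.sqrt(np.log(t) / np.maximum(counts, 1)), np.inf)
        arm = int(np.argmax(values + bonus))
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]
    return values
```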
4.3 Re-examining the applications of RL
A popular view holds that "RL can only play video games and board games; it doesn't work for anything else." I think we should not be so pessimistic about RL. That RL can surpass humans at video games and board games already demonstrates its powerful reasoning ability; with reasonable improvements it has the potential for wide application. The path from research to application is often not obvious. For example, IBM's Watson system is famous worldwide for its ability to understand and answer natural language questions, defeating human players to win the Jeopardy! championship in 2011. One of the supporting technologies behind it is the RL technique [53] that Gerald Tesauro had used years earlier when developing the TD-Gammon program [52]. A technique once used only to play backgammon ended up playing an indispensable role in the best question-answering system of its day. Today's RL is far more advanced than it was then; how can we not be confident?
Behind the powerful IBM Watson, RL plays a central role.
A survey of the landscape shows that RL algorithms are already in wide use across many fields:
1. Control. This is one of the origins of RL thinking and the most mature area of RL application. Control and machine learning have developed similar ideas, concepts, and techniques that can be borrowed back and forth; for example, the widely used MPC algorithm is a special case of RL. In robotics, whereas DL can only serve perception, RL has its own advantages over traditional methods: traditional methods such as LQR, which generally learn a trajectory-level policy based on graph search or probabilistic search, have high complexity and are ill-suited to replanning, whereas RL learns a policy over the state-action space and adapts better.
2. Autonomous driving. Driving is a sequential decision process, so it is naturally suited to RL. From ALVINN and TORCS in the early days to CARLA today, industry and academia have kept trying to use RL to solve both the autonomous driving of a single vehicle and the traffic scheduling of multiple vehicles. Similar ideas are widely applied to all kinds of aerial and underwater unmanned vehicles.
3. NLP. Many NLP tasks are multi-turn, requiring repeated interaction to find the optimal solution (e.g., dialogue systems), or their feedback signal is only available after a whole series of decisions (e.g., machine writing). Such problems are naturally suited to RL, so in recent years RL has been applied to many NLP tasks, such as text generation, text summarization, sequence labeling, dialogue bots (text/speech), machine translation, relation extraction, and knowledge-graph reasoning. There have been a number of successful applications, for example the MILABOT model from Yoshua Bengio's group [54] and Facebook's chatbot [55] in dialogue, and Microsoft Translator [56] in machine translation. RL techniques also show their strength in a series of tasks spanning NLP and computer vision, such as VQA, image/video captioning, image grounding, and video summarization.
4. Recommendation and search systems. The bandit family of RL algorithms has long been widely used in product recommendation, news recommendation, and online advertising. In recent years a series of works has also applied RL to information retrieval and ranking tasks [57].
5. Finance. RL's strong sequential decision-making ability has long drawn the attention of the financial world. Both the Wall Street giant JPMorgan Chase and startups such as Kensho have introduced RL techniques into their trading systems.
6. Data selection. When data is plentiful, choosing which data to learn from so as to learn fast, well, and cheaply has great practical value. A series of works has recently sprung up in this area, such as the Reinforced Co-Training proposed by Jiawei Wu of UCSB [58].
7. Operations areas such as communications, production scheduling, planning, and resource access control. Tasks in these areas often involve a process of "selecting" actions, and labeled data is hard to obtain, so RL is widely used to solve them.
For more comprehensive surveys of RL applications, see [59,60].
Although many successful applications are listed above, we must still recognize that RL is at an early stage of development and cannot yet fit all problems: there is no general RL solution mature enough to be plug-and-play the way DL is. Different RL algorithms lead in their respective domains. Until a universally applicable approach emerges, we should design specialized algorithms for specific problems; in robotics, for example, methods based on Bayesian RL and evolutionary algorithms (such as CMA-ES [61]) are often more suitable than DRL. Of course, different areas should learn from and reinforce one another. The output of an RL algorithm is stochastic, an essential consequence of its "exploration" philosophy, so we should neither blindly go all in on RL nor force RL into everything, but rather find the problems that RL is genuinely suited to solve.
Different problems call for different RL methods [22]
4.4 Re-examining the value of RL
At NIPS 2016, Yann LeCun argued that the most valuable problem is "predictive learning", a problem close to unsupervised learning; his talk represents a recent mainstream view of the academic community. Ben Recht, by contrast, argues that RL is more valuable than supervised learning (SL) and unsupervised learning (UL). He compares the three kinds of learning to the descriptive analytics (UL), predictive analytics (SL), and prescriptive analytics (RL) of business analysis [18].
Descriptive analytics summarizes existing data to produce a more robust and clearer representation of the problem; it is the easiest and the least valuable, because its value lies more in aesthetics than in practice. For example, "using a GAN to render a picture of a room in a certain style" matters far less than "predicting the price of the room from its picture". The latter is a predictive analytics problem: predicting unseen outcomes from historical data. In both descriptive and predictive analytics, however, the system is not affected by the algorithm, whereas prescriptive analytics models the interaction between the algorithm and the system and maximizes the value gained by actively influencing the system. Continuing the analogy, prescriptive analytics answers questions like "what sequence of renovations would maximize the price of the room". This is the hardest problem, because it involves complex interaction between the algorithm and the system, but also the most valuable, because the natural objective of prescriptive analytics (RL) is to maximize value and solve problems for people. Moreover, both descriptive and predictive analytics assume that the environment of the problem is static and unchanging, an assumption that does not hold for most practical problems, whereas prescriptive analytics handles dynamically changing environments, even taking into account cooperation or competition with other agents, which is much closer to most of the practical problems humanity faces.
Prescriptive analytics problems are the most difficult and the most valuable [18]
In the last section, I will try to discuss RL-style learning from feedback in a broader context and offer the reader a new perspective on RL.
5. Generalized RL: learning from feedback
This section uses the term "generalized RL" to refer to research on "learning from feedback" across many disciplines. It covers a broader range than the RL of machine learning, control theory, and economics described above: any system that learns from feedback falls under what I call generalized RL here.
5.1 Generalized RL is the ultimate goal of AI research
In 1950, Turing introduced the famous "Turing test" in his epoch-making paper "Computing Machinery and Intelligence" [62]: a person (call them C), using a language the test subjects understand, asks any series of questions of two subjects they cannot see, one a normally thinking person (B) and the other a machine (A). If after a number of questions C cannot tell any substantive difference between A and B, then machine A passes the Turing test.
Note that the concept of the Turing test already contains the notion of "feedback": the human judges by the feedback from the program, while the AI program deceives the human by learning from feedback. In the same paper, Turing also asked: instead of trying to produce a program that simulates the adult mind, why not rather try to produce one that simulates the child's, which with an appropriate course of education would acquire the adult brain? Is this not exactly RL's way of improving its ability by learning from feedback? From the moment the concept of artificial intelligence was proposed, its ultimate goal has been to build a sufficiently good feedback-learning system.
In 1959, the AI pioneer Arthur Samuel formally defined the concept of "machine learning" [63]. It was this same Samuel who in the 1950s developed a checkers program based on RL, one of the first success stories of artificial intelligence. Why did the AI pioneers so often work on RL-related tasks? The classic textbook "Artificial Intelligence: A Modern Approach" may answer the question in its discussion of RL: "Reinforcement learning might be considered to encompass all of AI: an agent is placed in an environment and must learn to behave successfully therein." [64]
Not only in artificial intelligence: philosophy also emphasizes the significance of action and feedback for the formation of intelligence. Enactivism [65] holds that action is the basis of cognition, and that action and perception stimulate each other: the agent receives feedback on its actions through perception, while action gives the agent real and meaningful experience of the environment.
Action and feedback are the cornerstones of the formation of intelligence [65]
It seems that learning from feedback is really a core element of achieving intelligence.
Back to the field of AI. After DL succeeded, it was combined with RL to become DRL. After research on knowledge bases matured, memory mechanisms were gradually added to RL algorithms. Variational inference, too, has found its point of contact with RL. Recently, as the community began to reflect on the DL craze and rekindled its interest in causal reasoning and symbolic learning, work on relational RL and symbolic RL appeared [66]. Looking back at these developments, we can see a pattern in the evolution of artificial intelligence: whenever a related direction makes a breakthrough, the field returns to the RL problem and seeks to combine the new advance with RL. Rather than viewing DRL as an extension of DL, it is better to see it as a return to RL. So we need not worry about the DRL bubble: RL is the ultimate goal of artificial intelligence, it has strong vitality, and it will see wave after wave of development in the future.
5.2 Generalized RL is the form of all future machine learning systems
In the final post of his series [67], Recht stresses that as long as a machine learning system improves by receiving external feedback, it is not merely a machine learning system but an RL system. A/B testing, widely used on the Internet today, is one of the simplest forms of RL. In the future, machine learning systems will have to handle dynamically distributed data and learn from feedback, so it is fair to say we are entering an era in which "all machine learning is RL", and academia and industry alike need to step up their research on RL. Recht discusses this point in depth from social and ethical perspectives [67], and he has distilled his series of reflections on RL from the viewpoint of control and optimization into a review article for readers to ponder [69].
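To make the A/B-testing remark concrete, here is a minimal sketch of an A/B test run as a simple bandit with Thompson sampling, which learns from user feedback while it serves. The two variants and their click-through rates are illustrative assumptions.

```python
import random

true_ctr = {"A": 0.04, "B": 0.05}          # unknown click-through rates
wins = {"A": 1, "B": 1}                    # Beta(1, 1) priors for each variant
losses = {"A": 1, "B": 1}

for _ in range(10000):
    # Sample a plausible CTR for each variant and serve the one that looks best.
    variant = max(true_ctr, key=lambda v: random.betavariate(wins[v], losses[v]))
    clicked = random.random() < true_ctr[variant]       # simulated user feedback
    if clicked:
        wins[variant] += 1
    else:
        losses[variant] += 1

# Posterior mean estimates; traffic concentrates on the better variant over time.
print({v: wins[v] / (wins[v] + losses[v]) for v in true_ctr})
```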
5.3 Generalized RL is a common research goal of many fields
Section 4.2 mentioned that RL has been independently invented and studied outside the machine learning field; in fact, the idea of learning from feedback has been studied in many other disciplines as well. A few examples:
In psychology, the contrast between classical conditioning and operant conditioning resembles the contrast between SL and RL; the "observational learning" theory proposed by the famous psychologist Albert Bandura is very similar to IL; and the "projective identification" proposed by the psychoanalyst Melanie Klein can be viewed as an RL process. Among the many schools of psychology, the one most closely related to RL is behaviorism. Its representative figure John Broadus Watson applied behaviorist psychology to advertising and greatly promoted the industry's development; it is hard not to notice that one of the most successful applications of RL algorithms today is Internet advertising. Cognitive behavioral therapy, developed under the influence of behaviorism, resembles policy transfer in RL. The ties between behaviorism and RL run deep; behaviorism can even be regarded as another source of RL thought. Space does not allow a detailed account here; interested readers may consult the psychology literature, for example [53].
In pedagogy, "active learning" and "passive learning" have long been compared and studied; the representative Cone of Experience study reached conclusions very similar to the comparison between RL and SL in machine learning. The "inquiry-based learning" advocated by the educator John Dewey likewise means actively exploring and seeking feedback while learning.
In organizational behavior, scholars explore the difference between "proactive personality" and "passive personality" and their influence on organizations.
In business management, the "exploratory behavior" and "exploitative behavior" of firms have always been a research hotspot.
......
It is fair to say that in almost every field that involves making choices and then learning from the resulting feedback, the ideas of RL exist in one form or another, which is why I call it generalized RL. These disciplines provide rich research material for RL and have accumulated a wealth of ideas and methods. In turn, the development of RL will not only affect artificial intelligence but also help the many disciplines that generalized RL encompasses to advance together.