This series is about AlphaGo's past and present. First we explore the sources of AlphaGo's core technology, then we use the two Nature papers by David Silver and colleagues as the basis for deconstructing AlphaGo and its upgraded version, AlphaGo Zero. My own understanding is limited, so corrections of any errors are welcome.

Go is a zero-sum perfect-information game. Zero-sum means the two sides are purely adversarial: there is no cooperation, and one side's win is the other's loss. Perfect information means that everything is visible on the board and nothing is hidden. Chess, checkers, and other board games share these characteristics, and after decades of development computers handle this class of games with ease: chess and checkers have both been "cracked," that is, computers have defeated the top human masters. Progress in Go, however, was slow, and a breakthrough long seemed out of reach. There are two reasons: first, the search space is enormous; second, it is hard to find a suitable position-evaluation function. These two obstacles puzzled scholars for decades, until Monte Carlo tree search appeared and researchers saw a glimmer of hope. Then deep convolutional neural networks emerged, whose powerful automatic feature extraction solved the feature problem. Supplemented by self-play reinforcement learning, AlphaGo was finally born, announcing that Go too had been "cracked."

AlphaGo is a Go program developed by DeepMind that combines deep learning, reinforcement learning, and Monte Carlo tree search. The application of deep learning is inspired by the deep Q network, which uses deep learning for efficient automatic feature extraction to estimate the value function. The application of reinforcement learning is inspired by TD-Gammon, which applies reinforcement learning to simulated games generated by self-play, iteratively improving the original policy. Monte Carlo tree search is a powerful search method that developed out of traditional minimax search; the strongest Go programs before AlphaGo all used it.

Today, we start with the deep Q network and explore the sources of AlphaGo's technology.

We are certainly no strangers to deep learning, but many people may have only heard of reinforcement learning without understanding it deeply, so I will first explain what reinforcement learning is. (If you want to learn more, I recommend two resources: one is the UCL reinforcement learning course taught by David Silver, the father of AlphaGo, with a subtitled version available on Bilibili; the other is Reinforcement Learning: An Introduction, 2nd Edition, written by Richard Sutton, the father of reinforcement learning.)

The Wikipedia introduction is concise and clear, so I will quote it directly.

Reinforcement learning is an area of machine learning concerned with how to act based on the environment so as to maximize expected benefit. It is inspired by behaviorist theory in psychology: stimulated by the rewards or punishments given by the environment, organisms gradually form expectations about the stimuli, producing habitual behavior that maximizes benefit.

In layman's terms, reinforcement learning is about an agent interacting with an environment and learning by trial and error in the process. The agent observes a state, chooses an action based on that state to affect the environment, and the environment returns a reward and a new state to the agent. Compared with supervised learning, reinforcement learning is distinctive in three ways: first, there is no supervisor, only a reward signal; second, the reward is not instantaneous; third, the data are not independent and identically distributed, since the agent's actions affect the data it receives next. Reinforcement learning is therefore very good at making sequential decisions and is well suited to the domain of games.

Deep Q Network

In the past, reinforcement learning achieved great success in many fields, but always with obvious limitations, reflected in two points: first, features had to be hand-engineered; second, the state space had to be low-dimensional and fully observable. Now, with the help of deep convolutional neural networks (CNNs), we can automatically extract features from raw high-dimensional image input for end-to-end reinforcement learning. More importantly, one and the same learning system can be applied to many different games and reach human-level performance, without any game-specific adjustments or hand-crafted features. This is the biggest breakthrough of the Nature paper Human-level control through deep reinforcement learning.

The algorithm used in this paper is Q-learning with nonlinear function approximation, where the nonlinear function is a CNN. However, nonlinear function approximation has a major flaw: it may fail to converge, for three reasons. The first is that the Q-learning update does not use true gradient information:

\[w \leftarrow w + \alpha\left[r + \gamma\max_{a'}Q(s', a', w) - Q(s, a, w)\right]\nabla_w Q(s, a, w)\]

Note that the gradient here does not include the target value \(r + \gamma\max_{a'}Q(s', a', w)\), yet as \(w\) is updated the target keeps changing too, so this gradient method is really a semi-gradient descent.
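As a concrete sketch, the semi-gradient update above can be written out for a linear approximator \(Q(s, a, w) = w^\top x(s, a)\); the feature function `x` below is a hypothetical stand-in for illustration, not anything from the paper:

```python
import numpy as np

def x(s, a, dim=4):
    """Hypothetical binary feature vector for a state-action pair."""
    v = np.zeros(dim)
    v[(s + a) % dim] = 1.0
    return v

def q(s, a, w):
    """Linear action-value approximation Q(s, a, w) = w . x(s, a)."""
    return w @ x(s, a, dim=len(w))

def semi_gradient_update(w, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One Q-learning step. The max target is treated as a constant
    (its gradient is ignored), hence 'semi-gradient'."""
    target = r + gamma * max(q(s_next, a2, w) for a2 in actions)
    td_error = target - q(s, a, w)
    # gradient of the linear Q w.r.t. w is just the feature vector x(s, a)
    return w + alpha * td_error * x(s, a, dim=len(w))

w = np.zeros(4)
w = semi_gradient_update(w, s=0, a=1, r=1.0, s_next=2, actions=[0, 1])
```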

The second reason is the sequential correlation between observations: the data are not independent and identically distributed, so traditional supervised learning methods cannot be applied directly. The third reason is that a small change in the action values can produce a completely different policy. The third point is an inherent flaw of value-based methods, but we can make improvements on the first two.

For the first point, we use a fixed Q target that is only updated periodically. This ensures that during gradient descent the target is independent of the parameters being updated, so true gradient information can be used.

For the second point, we use experience replay: experiences \((s, a, r, s')\) are stored in a replay pool and sampled at random, which breaks the sequential correlation. Note that an off-policy method is required when using experience replay, because the parameters that generated a stored sample differ from the current ones, so using Q-learning is a natural choice. This method does have a flaw, however: the sampling is uniform, giving every sample the same weight, which is clearly unreasonable. An improvement, by analogy with prioritized sweeping, is to sample according to the magnitude of the TD error.
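A minimal experience-replay pool might look like this (a sketch of the basic uniform-sampling scheme, without the prioritized variant):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience pool. Uniform random sampling breaks the
    temporal correlation between consecutive transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # old experiences are evicted

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(150):            # transitions beyond capacity evict the oldest
    buf.push(t, 0, 0.0, t + 1)
batch = buf.sample(32)
```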

The final loss function is (at iteration \(i\)):

\[L_i(w_i) = \mathbb{E}_{(s, a, r, s')}\left[\left(r + \gamma\max_{a'}Q(s', a', w_i^-) - Q(s, a, w_i)\right)^2\right]\]

The CNN's input is the game screen, of size 84×84×4, where 4 is the number of most recent frames. The output is a value for each action; the number of actions varies from 4 to 18 depending on the game. That is, the input is the state \(s\) and the output is the action value \(Q(s, a)\) for every action \(a\). The complete algorithm combines Q-learning with experience replay and the fixed Q target.
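To make the full loop concrete, here is a minimal sketch of DQN's ingredients under heavy simplifying assumptions: the CNN is replaced by a one-hot linear approximator, and the Atari emulator by a hypothetical 5-state chain. Since Q-learning is off-policy, the sketch even collects data with a purely random behavior policy:

```python
import random
from collections import deque
import numpy as np

# Hypothetical toy environment (a stand-in for Atari): a 1-D chain of 5
# states; moving right (action 1) from state 3 yields reward 1.
N_STATES, N_ACTIONS = 5, 2

def step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if (s == N_STATES - 2 and a == 1) else 0.0
    return s_next, r

def features(s):
    v = np.zeros(N_STATES)
    v[s] = 1.0                 # one-hot encoding stands in for CNN features
    return v

def q_values(s, w):
    return w @ features(s)     # w: (N_ACTIONS, N_STATES) linear "network"

w = np.zeros((N_ACTIONS, N_STATES))   # online parameters
w_target = w.copy()                   # fixed Q target, updated periodically
replay = deque(maxlen=1000)           # experience pool
alpha, gamma = 0.1, 0.9
rng = random.Random(0)

s = 0
for t in range(3000):
    a = rng.randrange(N_ACTIONS)      # random behavior policy (off-policy)
    s_next, r = step(s, a)
    replay.append((s, a, r, s_next))
    s = s_next
    if len(replay) >= 32:
        for bs, ba, br, bs2 in rng.sample(list(replay), 16):
            target = br + gamma * np.max(q_values(bs2, w_target))
            td = target - q_values(bs, w)[ba]
            w[ba] += alpha * td * features(bs)   # semi-gradient step
    if t % 100 == 0:
        w_target = w.copy()           # periodic fixed-target refresh

greedy = [int(np.argmax(q_values(st, w))) for st in range(N_STATES)]
```

After training, the greedy policy learned from random experience should move right toward the reward in states 0 through 3.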

Game Search Tree

The traditional approach to board games is the minimax algorithm with alpha-beta pruning. Its core idea is to use DFS to traverse all possible continuations of the current position, obtaining the next move by maximizing one's own outcome while assuming the opponent minimizes it. In simple games such as Tic-tac-toe this is feasible. In other games, however, the search tree grows exponentially and we cannot search to the end of the game, so we must compromise: we use a value function \(V(s, w)\) to approximate the true value \(v_*(s)\) and evaluate at the leaf nodes. The minimax algorithm then only needs to run to a fixed depth, which reduces the depth of the **search tree**. For the value function, a common choice is a binary linear value function: each feature is binary, the weights are set by hand, and the value function is \(V(s, w) = x(s) \cdot w\). The chess program Deep Blue used this method, with about \(8000\) features and hand-tuned weights; using parallel alpha-beta search looking \(16\) to \(40\) plies ahead, it successfully defeated the human world chess champion.
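A depth-limited alpha-beta search of the kind described can be sketched as follows; `moves` and `value` are hypothetical callbacks standing in for the game's move generator and its hand-crafted leaf evaluation:

```python
def alphabeta(state, depth, alpha, beta, maximizing, moves, value):
    """Depth-limited minimax with alpha-beta pruning. `moves(state)`
    enumerates successor states; `value(state)` evaluates a leaf."""
    children = moves(state)
    if depth == 0 or not children:
        return value(state)
    if maximizing:
        best = float("-inf")
        for child in children:
            best = max(best, alphabeta(child, depth - 1, alpha, beta,
                                       False, moves, value))
            alpha = max(alpha, best)
            if alpha >= beta:       # beta cut-off: opponent avoids this line
                break
        return best
    else:
        best = float("inf")
        for child in children:
            best = min(best, alphabeta(child, depth - 1, alpha, beta,
                                       True, moves, value))
            beta = min(beta, best)
            if alpha >= beta:       # alpha cut-off
                break
        return best

# Toy two-ply game tree for illustration (hypothetical positions/values).
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
leaves = {"a1": 3, "a2": 5, "b1": 2, "b2": 9}
moves = lambda s: tree.get(s, [])
value = lambda s: leaves.get(s, 0)

result = alphabeta("root", 2, float("-inf"), float("inf"), True, moves, value)
```

In this tiny tree the search returns 3, and leaf `b2` is never evaluated: once `b1 = 2` is seen, the whole `b` branch is cut off because the maximizer already has 3 guaranteed.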

Besides search-tree methods, there is another path: reinforcement learning through self-play. Self-play can take many forms: literally, one's white side playing against one's own black side, or one's white side playing against a past version of oneself drawn at random from a policy pool. AlphaGo uses the latter, but here we discuss the former.

In this framework, the iteration of the value function (the update of the weights) follows the usual reinforcement learning methods. The difference is that we must consider successor states, because the environment of a board game is deterministic (the rules are known), i.e. \(Q(s, a) = V(\mathrm{succ}(s, a))\). Actions are selected according to the minimax principle:

For White, maximize the successor state value: \(a_t = \arg\max_a V(\mathrm{succ}(s_t, a))\)

For Black, minimize the successor state value: \(a_t = \arg\min_a V(\mathrm{succ}(s_t, a))\)
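These two selection rules can be sketched as one function; `succ` and `v` below are hypothetical callbacks for the game's successor function and the current value estimate:

```python
def select_action(state, actions, succ, v, white_to_move):
    """Minimax-style move selection over successor values: White maximizes
    V(succ(s, a)), Black minimizes it."""
    key = lambda a: v(succ(state, a))
    return max(actions, key=key) if white_to_move else min(actions, key=key)

# Toy illustration (an assumption, not a real game): states are integers,
# moves add an offset, and the value estimate is the state itself.
succ = lambda s, a: s + a
v = lambda s: float(s)
best_white = select_action(0, [-1, 0, 2], succ, v, white_to_move=True)
best_black = select_action(0, [-1, 0, 2], succ, v, white_to_move=False)
```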

In the ideal case, the final algorithm converges to the minimax-optimal policy, reaching a Nash equilibrium. The backgammon program TD-Gammon used this method. Given that its branching factor can reach \(400\), traditional heuristic search cannot solve the problem as effectively as it does in chess, so TD-Gammon combined the TD(\(\lambda\)) algorithm with a nonlinear neural-network value function, whose input is the current position and whose output is an estimate of that position's value. It used self-play to generate a large number of game samples. At first a game might take hundreds of moves, because the network weights are initialized to small random values, but after a few dozen games performance improves considerably.

However, this kind of self-play reinforcement learning works poorly for chess, where a search tree is needed to obtain better estimates. This calls for combining search-tree methods with self-play reinforcement learning. Consider the simplest scheme: in plain TD, the update target is the value of the next state; in TD-leaf, the update target becomes the deep search value of the next state, i.e. \(V(s_t, w) \leftarrow V(l_+(s_{t+1}), w)\), where \(l_+(s)\) denotes the leaf node reached from node \(s\) by minimax search. This method, however, makes too little use of the search values; a better approach is to update toward each search value directly inside the search tree, i.e. \(V(s_t, w) \leftarrow V(l_+(s_t), w)\), though this method requires searching to the end of the game.

However, when the search tree's branching factor is too large or a value function is hard to construct, Monte Carlo search works better than alpha-beta search. Since we can generate a large number of games through self-play, there is no need to rely on traditional search methods; a search based on simulated experience is clearly more effective. The simplest version is the rollout algorithm. The initial policy can be completely random: for the current state, starting from each possible action, paths are sampled according to the given policy, and the rewards from many samples are averaged to estimate the action values at the current state. When the current estimates have essentially converged, the action with the maximum action value is selected, the game moves to the next state, and the process repeats. Rollouts use sampling to reduce the breadth of the **search tree**, and they are fast because the policy itself is simple; and since Monte Carlo simulations are independent of one another, the algorithm is inherently parallel. Monte Carlo tree search is an enhanced version of the rollout algorithm with two major improvements: first, it records the value estimates from the Monte Carlo simulations to guide subsequent simulations toward higher-reward paths; second, it splits the policy into two kinds, a tree policy and a rollout policy. For example, with a tree depth of 5, the tree policy is used for the first 5 levels, then the rollout policy plays randomly to the end of the game, and finally the Monte Carlo method updates the values and improves the tree policy.
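As an illustration of the basic rollout algorithm (not full MCTS), here is a sketch on a hypothetical toy game: a random walk that ends at +3 (a win, reward 1) or -3 (a loss, reward 0):

```python
import random

# Hypothetical toy game (an assumption for illustration, not Go):
# walk on a number line from 0; the game ends at +3 or -3.
ACTIONS = [-1, +1]

def rollout(state, rng, max_steps=100):
    """Play one playout under a uniformly random rollout policy."""
    for _ in range(max_steps):
        if state >= 3:
            return 1.0
        if state <= -3:
            return 0.0
        state += rng.choice(ACTIONS)
    return 0.5   # unfinished playout scored as a draw (very rare here)

def rollout_policy(state, n_sims=2000, seed=0):
    """Estimate each action's value by averaging playout outcomes from
    the resulting position, then act greedily."""
    rng = random.Random(seed)
    means = {a: sum(rollout(state + a, rng) for _ in range(n_sims)) / n_sims
             for a in ACTIONS}
    return max(means, key=means.get), means

best, estimates = rollout_policy(0)
```

From the starting state, moving right should come out ahead: by the gambler's-ruin argument, a random walk from +1 reaches +3 before -3 with probability 2/3, versus 1/3 from -1, and the Monte Carlo averages approximate exactly those values.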

At this point, we have combed through the evolution of algorithms in the traditional game domain. Next, we will talk about the AlphaGo framework.

AlphaGo's Past Life (1): Deep Q Network and Game Search Tree