Deng Jidong Column | On Machine Learning (IV): AlphaGo, a GPU-Based Machine Learning Case


Contents

1. Introduction

1.1 Overview

1.2 A brief history of machine learning

1.3 Machine learning changes the world: GPU-based machine learning examples

1.3.1 Visual recognition based on deep neural networks

1.3.2 AlphaGo

1.3.3 IBM Watson

1.4 Classification of machine learning methods and organization of the book


1.3.2 AlphaGo

In the past few years, Google's DeepMind team has drawn worldwide attention with a series of heavyweight results. Even before acquiring DeepMind, Google had already built up deep expertise in deep convolutional neural networks. DeepMind creatively transplanted deep convolutional networks into the reinforcement learning framework, using deep learning to transform reinforcement learning methods and thereby producing a string of remarkable breakthroughs.


DeepMind's first major task was to teach an AI to play games on the Atari 2600 console. The Atari 2600, released by Atari in 1977, was a home video game console offering a dozen or so classic games such as Breakout, Space Invaders, Pac-Man and Donkey Kong; as a representative console of the second generation of video games, it will be fondly remembered by many veteran gamers. The Atari 2600 screen resolution is 210x160 pixels, with 128 possible color values per pixel. The controls consist of a joystick (eight movement directions: up, down, left, right and the four diagonals) and a single button (which can be used alone or together with the joystick). Figure 1-13 shows the screens of three typical Atari 2600 games.



Figure 1-13. Typical Atari 2600 games: Donkey Kong, Space Invaders and Pitfall (from left to right)


Before DeepMind's work, deep convolutional neural networks had already shown astonishing power. In general, however, a conventional convolutional network only establishes a relationship from the current input to an output; it focuses on judging the "now". To achieve a high score in an Atari 2600 game, one must instead find a good continuous sequence of operations, and the effect of each individual operation on the overall game usually only becomes clear when the game ends. In other words, it is necessary to relate each single-step operation to the score of an entire game. The standard machine learning approach to the Atari 2600 problem is reinforcement learning, for example the Q-learning algorithm, as readers with some background will already have recognized. The idea of Q-learning is to let the machine repeatedly try actions that move it from one state to the next until a terminal state is reached (for example, the game ends or a time limit is hit), and then assess the contribution of each action to the overall goal (the return). The weakness of reinforcement learning is its poor stability; in particular, when the relationship between actions and the overall return is nonlinear, the learning process can easily fail to converge.
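To make the Q-learning idea above concrete, here is a minimal tabular sketch (not DeepMind's implementation). The environment interface (`env.reset`, `env.step`, `env.actions`) and the hyperparameter values are assumptions chosen purely for illustration.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Minimal tabular Q-learning sketch.

    Assumes a Gym-style environment: env.reset() -> state,
    env.step(action) -> (next_state, reward, done, info),
    and a small discrete action set env.actions.
    """
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy: mostly exploit the current estimate, sometimes explore.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done, _ = env.step(action)
            # One-step temporal-difference update toward reward + discounted best next value.
            best_next = max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```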


The core idea of DeepMind's work was to adapt the deep convolutional neural network so that it could perform reinforcement learning. Figure 1-14 is a schematic of the convolutional network used in this work. Compared with Figure 1-9 in Section 1.3.1, the network itself has not changed intrinsically; it still follows the basic pattern of convolutional layers followed by fully connected layers. The goal is to link a sequence of game states (which can be understood as several consecutive snapshots of the game scene) to the control action that should be taken. Here the state refers to the overall situation of the game, including the game scene, the distribution and actions of the computer-controlled objects, and the status of the object controlled by the player. After training, the deep network outputs the next game operation, i.e. the action of the player-controlled object, from the current state sequence (the current state together with several preceding states). Because convolutional neural networks are designed for image processing, the game must be modeled appropriately so that the convolutional network can process it.



Figure 1-14. The convolutional neural network used by DeepMind for the Atari 2600 game problem (redrawn from [6])


DeepMind's approach is to use the sequence of images output by the game as the input to the neural network. To reduce complexity, the frames are downsampled to 84x84 pixels, and every 4 consecutive frames form one input sequence. These sequences are processed by three successive convolutional layers, which extract the corresponding features. In general, the result of reinforcement learning should be a Q function describing the mapping from operations to overall return. In the convolutional network of Figure 1-14, DeepMind does not output a single Q value; instead, the network outputs the Q value of every possible operation, so that the values of all operations are obtained with a single forward pass.
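The following is a minimal sketch of such a network in PyTorch: stacked 84x84 frames go in, and one Q value per joystick action comes out of a single forward pass. The specific layer sizes follow the published DQN architecture and should be read as illustrative, not as this book's reference implementation.

```python
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Q-network in the style of Figure 1-14: stacked frames in, one Q value per action out.

    Input: a batch of 4 stacked 84x84 grayscale frames, shape [N, 4, 84, 84].
    The layer sizes follow the published DQN architecture; treat them as illustrative.
    """
    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9  -> 7x7
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # one Q value per joystick operation
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(frames / 255.0))

# A single forward pass yields the Q values of all actions at once.
q_values = AtariQNetwork(n_actions=18)(torch.zeros(1, 4, 84, 84))
```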


The training process is organized into a number of game episodes, each consisting of a series of discrete time steps. At every step, the training algorithm either chooses a random game operation for the current scene or selects an operation based on what the network has learned so far, observes how the game situation evolves after the operation is executed, evaluates the selected operation, and optimizes the neural network parameters according to that evaluation. Figure 1-15 shows an example of how the predicted value changes during training. In Breakout, the early operations make little essential difference, but once the ball breaks through the top row of bricks and bounces back and forth many times, the score rises sharply, so the deep network learns to prefer operations that can send the ball to the upper layer of bricks. DeepMind used a single network design to train on 49 Atari 2600 games and, without being given any game-specific knowledge, surpassed expert human players on more than half of them.
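A highly simplified sketch of one such training step is shown below, with epsilon-greedy action selection and learning from a buffer of stored transitions. The environment interface, buffer handling and hyperparameters are assumptions for illustration; the actual DQN procedure includes further components (such as a separate target network) that are omitted here.

```python
import random
import torch
import torch.nn.functional as F

def train_step(q_net, optimizer, replay_buffer, env, state, epsilon=0.1, gamma=0.99):
    """One simplified DQN-style step: act, store the transition, update the network."""
    # Choose an operation: random with probability epsilon, otherwise the current best estimate.
    if random.random() < epsilon:
        action = random.randrange(env.n_actions)
    else:
        with torch.no_grad():
            action = int(q_net(state.unsqueeze(0)).argmax())

    next_state, reward, done = env.step(action)          # assumed environment interface
    replay_buffer.append((state, action, reward, next_state, done))

    # Learn from a random mini-batch of stored transitions (experience replay).
    batch = random.sample(replay_buffer, k=min(32, len(replay_buffer)))
    states      = torch.stack([b[0] for b in batch])
    actions     = torch.tensor([b[1] for b in batch])
    rewards     = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    next_states = torch.stack([b[3] for b in batch])
    dones       = torch.tensor([b[4] for b in batch], dtype=torch.float32)

    # Target: observed reward plus the discounted value of the best next operation.
    with torch.no_grad():
        target = rewards + gamma * (1 - dones) * q_net(next_states).max(dim=1).values
    predicted = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(predicted, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return next_state, done
```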



Figure 1-15. Assessment of effective operation values (after [6])


DeepMind's next goal: conquer the game of Go (Weiqi). The complexity of Go lies in two aspects. First, the number of possibilities is staggering. Broadly speaking, a game of this kind has about b^d possible move sequences (complete sequences in which the two sides alternate moves until the game ends), where b is the search breadth of the game, i.e. roughly how many choices are available at each move, which for Go averages about 250 [1], and d is the search depth, i.e. the number of moves before the game ends, which game statistics put at roughly 150 for Go. The total complexity is therefore a staggering 250^150. By comparison, chess has a search breadth of about 35 and a depth of about 80. Second, the overall position is hard to evaluate. Go stones have no direct value comparison with one another (in chess, by contrast, the queen is obviously far more powerful than a pawn); winning depends on the size of the territory controlled by all of one's stones. Go masters therefore rely on talent and years of training to form a precise intuition for the position and to relate each move to the final outcome. In this sense, learning Go is indeed a complex neural network training problem, except that humans do it with spiking neural networks and learning processes based on a variety of electrochemical mechanisms.
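A quick back-of-the-envelope check of these magnitudes, using the rough figures quoted above (250 and 150 for Go, 35 and 80 for chess):

```python
import math

# Approximate branching factor (b) and game depth (d) quoted in the text.
go_b, go_d = 250, 150
chess_b, chess_d = 35, 80

# b**d has roughly d * log10(b) decimal digits.
print(go_d * math.log10(go_b))        # ~359.7 -> about 10^360 possible Go move sequences
print(chess_d * math.log10(chess_b))  # ~123.5 -> about 10^124 for chess
```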


On the long road to becoming a master, a Go player needs to solve several problems. First, considering only the current board position, what is the best response? Second, everyone knows that playing against masters quickly improves one's own strength, so how does one obtain high-quality sparring partners? Third, among the possible next moves, which one is most likely to lead to victory? AlphaGo offers a sophisticated solution: it integrates three key technologies, namely deep learning, reinforcement learning and value-network-guided Monte Carlo tree search, and solves the above problems through offline learning combined with online play.



Figure 1-16. Schematic of the AlphaGo training process (redrawn from [7])


Figure 1-16 sketches the AlphaGo training process as presented by the AlphaGo team in their paper in Nature [7]. First, AlphaGo uses the game records of professional players as input to train a policy network built from a deep neural network; it takes the current board position as input and outputs, for every intersection on the board, the probability that the next stone will be placed there. Representing the board position may sound complicated, but it can in fact be done very simply: AlphaGo uses an image of the board as input, turning the position problem into a pattern recognition problem. This step actually trains two policy networks, one fast but less accurate and one accurate but slower. Since this training is a supervised learning process, the resulting network is called the supervised learning (SL) policy network. After training on more than 30 million positions, the policy network can predict the next move from the board position with 57% accuracy. Of course, optimizing the current move is not the same as optimizing for final victory; masters will play moves that give up immediate profit in order to maximize the probability of ultimately winning, and achieving this ultimate optimization relies on reinforcement learning. Second, AlphaGo trains a reinforcement learning (RL) policy network: it plays games against randomly chosen versions of the previously trained policy network and adjusts its weights according to the outcomes of those games; like the SL network, it takes the state sequence as input and outputs a probability for every board intersection. The third training step tackles the hardest problem, namely judging the final outcome from the current position. AlphaGo first uses the SL policy network to generate game positions with a random number of moves, takes each such position as the input of a neural network, then lets the RL policy network play the game out against itself, and uses the outcome as the label for training a value network, which reflects the potential payoff of playing on from the position currently on the board.
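As an illustration of what such a policy network looks like, here is a minimal sketch in PyTorch: a stack of feature planes over the 19x19 board goes in, and a probability for each of the 361 intersections comes out. The numbers of input planes, filters and layers are placeholder assumptions, not AlphaGo's published configuration.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Sketch of a policy network: board features in, move probabilities per intersection out.

    The board is encoded as a stack of feature planes over the 19x19 grid; the numbers
    of planes, filters and layers here are illustrative, not AlphaGo's published values.
    """
    def __init__(self, in_planes: int = 17, filters: int = 64, n_layers: int = 6):
        super().__init__()
        blocks = [nn.Conv2d(in_planes, filters, kernel_size=3, padding=1), nn.ReLU()]
        for _ in range(n_layers - 1):
            blocks += [nn.Conv2d(filters, filters, kernel_size=3, padding=1), nn.ReLU()]
        blocks += [nn.Conv2d(filters, 1, kernel_size=1)]   # one logit per intersection
        self.net = nn.Sequential(*blocks)

    def forward(self, board: torch.Tensor) -> torch.Tensor:
        logits = self.net(board).flatten(1)                # shape [N, 19*19] = [N, 361]
        return torch.softmax(logits, dim=1)                # probability of the next move per point

# Example: probabilities for all 361 intersections from one (dummy) board position.
probs = PolicyNetwork()(torch.zeros(1, 17, 19, 19))
```

Supervised training of such a network would then maximize the probability it assigns to the move the professional actually played in each recorded position.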


After the training described above, we have two supervised learning policy networks and a value network. For online play, AlphaGo uses Monte Carlo tree search. The Monte Carlo search tree is essentially a decision tree: every move produces a number of possible branches. To avoid blindly traversing them all, Monte Carlo tree search uses targeted sampling to cut away unnecessary parts of the search space. For a specific board position, AlphaGo uses a weighted combination of the output of the value network and of fast rollouts played with the fast supervised learning policy network to decide which branches deserve deeper search (i.e. which possible next moves to explore further), and ultimately chooses the best move.
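The sketch below illustrates the idea in a much-simplified form: tree nodes are selected by balancing current value estimates against policy priors, and leaf positions are scored by a weighted mix of the value network and a fast-policy rollout. The node structure, the exploration constant and the mixing weight are illustrative assumptions, not AlphaGo's published formulas or parameters.

```python
import math

def select_child(node, c_explore=1.0):
    """Pick the child balancing its current value estimate against its policy prior."""
    def score(child):
        exploit = child.total_value / (child.visits + 1e-8)
        explore = c_explore * child.prior * math.sqrt(node.visits) / (1 + child.visits)
        return exploit + explore
    return max(node.children, key=score)

def evaluate_leaf(position, value_net, fast_policy_rollout, lam=0.5):
    """Score a leaf by mixing the value network with the outcome of a fast-policy rollout."""
    v = value_net(position)             # learned estimate of winning from this position
    z = fast_policy_rollout(position)   # +1/-1 outcome of quickly playing the game out
    return (1 - lam) * v + lam * z

def simulate(root, value_net, fast_policy_rollout):
    """One tree-search simulation: descend the tree, evaluate the leaf, back up the result."""
    path, node = [root], root
    while node.children:                # selection phase
        node = select_child(node)
        path.append(node)
    value = evaluate_leaf(node.position, value_net, fast_policy_rollout)
    for n in path:                      # backup: update statistics along the path
        n.visits += 1
        n.total_value += value
```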


The computational demands of the above training and play are extremely high. The AlphaGo team designed a sophisticated parallel computing engine: the single-machine version uses 48 CPUs and 8 GPUs, while the distributed version uses 1,202 CPUs and 176 GPUs.


AlphaGo's record is by now familiar: in 2015 AlphaGo defeated the professional Go player Fan Hui; in March 2016 it beat the world champion Lee Sedol 4:1 in a five-game match, for which it was awarded an honorary professional 9-dan rank by the Korea Baduk Association; and by July 2016 AlphaGo ranked first in the world Go ratings.


AlphaGo's achievements should be viewed from several angles. First, AlphaGo was not trained on Go knowledge itself; that is, it does not learn prior domain knowledge directly but judges the board position directly from images, which is the strongest evidence for the effectiveness of deep machine learning. Many commentators insist that AlphaGo is not "thinking"; that, I am afraid, overrates human thinking. Thinking may itself be nothing more than the combination and integration of a vast number of impulses in the brain's neural networks; the learning process of a human player is, after all, also a matter of shaping one's own spiking neural network into an effective Go intuition, and at this level there may be no essential difference from AlphaGo.

Second, humans need not belittle themselves. Our brain has far less raw computing power than AlphaGo's machines, runs on less than 20 watts, about one tenth of a GPU, and the board positions a top player sees in a lifetime number at most in the tens of millions. AlphaGo, in turn, should continue to learn from humans, in particular how to learn from small samples or even a single sample: a human player can draw endless inspiration from a master's famous games (such as Wu Qingyuan's 1933 game against Honinbo Shusai, the first game of the "new layout"), whereas a single game means very little to a deep learning engine.

Third, even considering Go alone, AlphaGo is not invulnerable. Although AlphaGo has learned from 30 million positions, this is a drop in the ocean compared with the 250^150 possibilities of Go. At the same time, the mechanism of convolutional neural networks makes them most effective at extracting local patterns [2], and the search time of Monte Carlo tree search depends heavily on the size of the search space, so moves whose influence extends across many groups of stones can seriously degrade the accuracy of AlphaGo's decisions. In Figure 1-17, the white stone marked with a triangle is the 78th move played by Lee Sedol in the 4th game against AlphaGo. Before this move, the position was generally judged to be favorable to AlphaGo or roughly even; the 78th move had profound latent effects on four groups of white stones, AlphaGo's value network failed to fully assess its value, and this ultimately led to AlphaGo's defeat. AlphaGo's success will therefore also push human players to devise new game ideas and to adopt bolder strategies with a larger view of the whole board, injecting fresh vitality into the ancient game of Go.



Figure 1-17. Lee Sedol's brilliant move in the 4th game against AlphaGo



[1] A Go board has 19 lines in each direction, giving 19x19 = 361 intersections. The first move therefore has 361 possibilities, the second 360, the third 359, and so on. Since a game normally ends well before the board is filled, a reasonable estimate is an average of about 250 possibilities per move.

[2] Consequently, in delicate local fights humans are unlikely to be able to contend effectively with AlphaGo.
