AlphaGo Zero Unveiled: An In-Depth Look at DeepMind's Nature Paper


Paper link: http://www.nature.com/nature/journal/v550/n7676/pdf/nature24270.pdf

WeChat post, October 19, 14:45 · Xin Zhi Yuan (AI Era)

With 20 days to go before the Xin Zhi Yuan AI World 2017 conference, DeepMind has released its latest AlphaGo paper in Nature, introducing AlphaGo Zero, the strongest version to date, trained with pure reinforcement learning.

Xin Zhi Yuan (AI Era) Report

Source: Nature; DeepMind

Compiled by: Shinfi, Liuxiao

"New wisdom Yuan Guidance" new wisdom Yuan AI World 2017 countdown to the global AI conference into 20 days, DeepMind released their latest version of Alphago papers, but also their latest nature paper, introduced to date the strongest version of the latest Alphago Zero, Using pure reinforcement learning, the value network and strategy network are integrated into one architecture, and after 3 days of training, the previous version of Alphago is defeated by 100:0. Alphago has retired, but technology endures. DeepMind has already completed the concept of Go, and the next step is to create value that changes the world with intensive learning.

At the Wuzhen Go Summit this May, DeepMind CEO Demis Hassabis said that the technical details of the version of AlphaGo that defeated Ke Jie would be announced later this year. Today, as promised, DeepMind describes the technical details of the most powerful version of AlphaGo, AlphaGo Zero, in a paper just published in Nature.

AlphaGo Zero is completely independent of human data, so the success of this system is also a major step toward a long-standing goal of AI research: creating algorithms that achieve superhuman performance in the most challenging domains without any human input.

The authors write in the paper that AlphaGo Zero shows that pure reinforcement learning is entirely feasible even in the most challenging domains: with no human examples or guidance, and no domain knowledge beyond the basic rules, reinforcement learning can reach a level beyond human ability. Moreover, compared with training on human data, pure reinforcement learning requires only a little extra training time yet achieves better asymptotic performance.

In many cases, human data, especially expert data, is too expensive or simply impossible to obtain. If similar techniques can be applied to other problems, these breakthroughs could have a positive impact on society.

Yes, you might say that AlphaGo announced its retirement this May, but AlphaGo's technology will endure and continue to evolve. DeepMind has closed the chapter on Go; next, it intends to use reinforcement learning to change the world.

That is why the paper we are introducing matters so much: it is not only the technical report many have been waiting for, it is also a new milestone for artificial intelligence. It will be cited widely and become a foundation for countless AI products and services.

The most powerful Go program to date: no human knowledge used

DeepMind's latest Nature paper has a simple title: "Mastering the game of Go without human knowledge".

Abstract

A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa (a concept from cognitive philosophy: the individual is born without innate mental content, and all knowledge comes from experience and perception), to superhuman proficiency in the most challenging domains. Previously, AlphaGo became the first program to defeat a world champion at Go. AlphaGo's neural networks were trained by supervised learning from the moves of human expert games, and further improved by reinforcement learning from self-play.

Here, we introduce an algorithm based solely on reinforcement learning, without human data, guidance, or domain knowledge beyond the game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and the winner of AlphaGo's games. This neural network improves the strength of the tree search, resulting in higher-quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, defeating the previously published AlphaGo 100:0.

doi:10.1038/nature24270

A new form of reinforcement learning: becoming its own teacher

DeepMind researchers introduce AlphaGo Zero. Video source: DeepMind; English subtitles by Nature's Shanghai office.

AlphaGo Zero achieved this result by using a novel form of reinforcement learning in which AlphaGo Zero becomes its own teacher. The system starts from a neural network that knows nothing at all about the game of Go. It then plays games against itself by combining this neural network with a powerful search algorithm. As it plays, the neural network is tuned and updated to predict the next move and the eventual winner of the game.

The updated neural network is then recombined with the search algorithm to create a new, stronger version of AlphaGo Zero, and the process is repeated. In each iteration the performance of the system improves a little and the quality of the self-play games improves, which makes the neural network's predictions more and more accurate and yields ever stronger versions of AlphaGo Zero.
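To make this loop concrete, here is a minimal Python sketch of the iteration just described. It is an illustration only: the callables `new_game`, `run_mcts`, `sample_move`, and `update_network` are hypothetical placeholders standing in for a Go environment, an MCTS implementation, and a training step, not DeepMind's actual code.

```python
# Illustrative sketch of the self-play iteration loop. All callables passed in
# (new_game, run_mcts, sample_move, update_network) are hypothetical placeholders.

def self_play_game(network, new_game, run_mcts, sample_move, simulations=1600):
    """Play one game of the network against itself, guided by MCTS."""
    game = new_game()
    history = []                                     # (state, player to move, search probs)
    while not game.is_over():
        pi = run_mcts(game, network, simulations)    # improved move probabilities
        history.append((game.state(), game.to_play(), pi))
        game.play(sample_move(pi))
    winner = game.winner()
    # Record the outcome z from the perspective of the player to move at each step.
    return [(s, pi, 1.0 if player == winner else -1.0)
            for s, player, pi in history]

def training_loop(network, new_game, run_mcts, sample_move, update_network,
                  iterations=100, games_per_iteration=25000):
    """Alternate between generating self-play data with the current network
    and updating the network so (p, v) better predicts (pi, z)."""
    for _ in range(iterations):
        data = []
        for _ in range(games_per_iteration):
            data.extend(self_play_game(network, new_game, run_mcts, sample_move))
        network = update_network(network, data)
    return network
```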

This technique is more powerful than previous versions of AlphaGo because it is no longer constrained by the limits of human knowledge. Instead, it can learn from a blank slate, from the strongest Go player in the world: AlphaGo itself.

AlphaGo Zero also differs from previous versions in other ways:

AlphaGo Zero uses only the black and white stones on the board as input, whereas previous versions of AlphaGo included a small number of hand-engineered features.

It uses a single neural network rather than two. Previous versions of AlphaGo used a "policy network" to select the next move and a "value network" to predict the winner of the game. In AlphaGo Zero these are combined, allowing the system to be trained and evaluated more efficiently.

AlphaGo Zero does not use "rollouts", the fast, random play-outs that other Go programs use to predict which side will win from the current position. Instead, it relies on its high-quality neural network to evaluate positions.

All of these differences help improve the performance of the system and make it more general. But it is the algorithmic change that makes the system so much more powerful and efficient.

After 3 days of self-play training, AlphaGo Zero defeated the previous version of AlphaGo by 100 games to 0, the same version that had beaten the 18-time world champion, Korean nine-dan professional Lee Sedol. After 40 days of self-play training, AlphaGo Zero became even stronger, surpassing the "Master" version of AlphaGo that had defeated the world's best player, the world number one Ke Jie.

After millions of games of AlphaGo versus AlphaGo, the system gradually learned Go from scratch, accumulating in a few days the knowledge humans had built up over thousands of years. AlphaGo Zero also discovered new knowledge, developing unconventional strategies and creative new moves that go beyond the novel techniques it played in its games against Ke Jie and Lee Sedol.

Although it is still early days, AlphaGo Zero is a key step toward that goal. Demis Hassabis, co-founder and CEO of DeepMind, said: "AlphaGo has made such amazing progress in just two years. Now, AlphaGo Zero is the most powerful version of our project, and it shows how much progress we can make with less computing power and without using any human data at all."

"Ultimately, we hope to use this algorithmic breakthrough to help solve the pressing problems of the real world, such as protein folding or new material design." If we can achieve the same progress on these issues as Alphago, it is possible to promote human understanding and have a positive impact on our lives. ”

AlphaGo Zero technical details: merging the value network and policy network into a single architecture, iterated with Monte Carlo tree search

The new method uses a deep neural network f_θ with parameters θ. This neural network takes the raw board representation s (the stone positions and their history) as input, and outputs both move probabilities and a value, (p, v) = f_θ(s).

The move probability vector p gives the probability of selecting each move (including passing). The value v is a scalar estimate of the probability that the current player will win from position s.

This neural network combines the policy network and value network of the earlier versions of AlphaGo (AlphaGo Fan and AlphaGo Lee, the versions that played Fan Hui and Lee Sedol respectively) into a single architecture built from many residual blocks of convolutional layers, using batch normalization and rectifier nonlinearities.
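As a concrete illustration of this architecture, here is a minimal PyTorch sketch of a residual tower feeding a combined policy head and value head. The layer sizes (17 input feature planes, 256 filters, the number of residual blocks, the shapes of the two heads) follow the paper's general description but are simplified assumptions; this is a sketch, not DeepMind's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with batch normalization and a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # skip connection

class PolicyValueNet(nn.Module):
    """A single network: a shared residual tower feeding a policy head
    (move probabilities p) and a value head (scalar v in [-1, 1])."""
    def __init__(self, board_size=19, in_planes=17, channels=256, blocks=19):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU())
        self.tower = nn.Sequential(*[ResidualBlock(channels) for _ in range(blocks)])
        # Policy head: logits for every board point plus one for "pass".
        self.policy_conv = nn.Sequential(
            nn.Conv2d(channels, 2, 1, bias=False), nn.BatchNorm2d(2), nn.ReLU())
        self.policy_fc = nn.Linear(2 * board_size * board_size,
                                   board_size * board_size + 1)
        # Value head: scalar estimate of the current player's chance of winning.
        self.value_conv = nn.Sequential(
            nn.Conv2d(channels, 1, 1, bias=False), nn.BatchNorm2d(1), nn.ReLU())
        self.value_fc = nn.Sequential(
            nn.Linear(board_size * board_size, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Tanh())

    def forward(self, s):
        x = self.tower(self.stem(s))
        p_logits = self.policy_fc(self.policy_conv(x).flatten(1))
        v = self.value_fc(self.value_conv(x).flatten(1)).squeeze(-1)
        return p_logits, v
```

A quick shape check: `PolicyValueNet()(torch.zeros(1, 17, 19, 19))` returns a (1, 362) tensor of move logits (361 board points plus a pass move) and a length-1 value tensor.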

AlphaGo Zero's neural network is trained from games of self-play, using a novel reinforcement learning algorithm. In each position s, the neural network f_θ guides a Monte Carlo tree search (MCTS). The MCTS outputs search probabilities π for each possible move. These search probabilities are usually much stronger than the raw move probabilities p of the neural network f_θ(s); MCTS can therefore be viewed as a powerful policy improvement operator.

Self-play with search, in which each move is selected with the improved MCTS-based policy and the game winner z is used as a sample of the value, can likewise be viewed as a powerful policy evaluation operator.

The core idea of the new reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure: the neural network's parameters are continually updated so that its move probabilities and value (p, v) = f_θ(s) more closely match the improved search probabilities and the self-play winner (π, z). The new parameters are then used in the next iteration of self-play, making the search even stronger. Figure 1 below shows the self-play training pipeline.

Figure 1: AlphaGo Zero's self-play training pipeline. a, The program plays games against itself, producing positions s_1, ..., s_T. In each position s_t, an MCTS α_θ is executed (see Figure 2) using the latest neural network f_θ. Each move is selected according to the search probabilities π_t computed by MCTS. The terminal position s_T is scored according to the rules of the game to determine the winner z. b, Neural network training in AlphaGo Zero. The neural network takes the board position s_t as input, passes it through many convolutional layers with parameters θ, and outputs both a vector p_t, representing a probability distribution over moves, and a scalar value v_t, representing the current player's probability of winning from position s_t.

MCTS uses the neural network f_θ to guide its simulations (see Figure 2). Each edge (s, a) in the search tree stores a prior probability P(s, a), a visit count N(s, a), and an action value Q(s, a). Each simulation starts from the root state and iteratively selects the move that maximizes the upper confidence bound Q(s, a) + U(s, a), until a leaf node s′ is reached.

The leaf node is then expanded and evaluated just once by the network, producing both prior probabilities and a value, (P(s′, ·), V(s′)) = f_θ(s′). Each edge (s, a) traversed in the simulation is updated to increment its visit count N(s, a), and its action value is updated to the mean evaluation over all simulations passing through it: Q(s, a) = (1/N(s, a)) Σ_{s′ | s,a→s′} V(s′), where s, a → s′ indicates that a simulation reached s′ after taking move a from position s.

MCTS can be viewed as a self-play algorithm: given the neural network parameters θ and a root position s, it computes a vector of search probabilities recommending which moves to play, π = α_θ(s), proportional to the exponentiated visit count of each move, π_a ∝ N(s, a)^(1/τ), where τ is a temperature parameter.
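The selection rule and the conversion from visit counts to search probabilities can be written compactly. The sketch below uses the PUCT form of the exploration bonus, U(s, a) = c_puct · P(s, a) · √(Σ_b N(s, b)) / (1 + N(s, a)); the value of c_puct and the per-node array representation are illustrative assumptions.

```python
import numpy as np

def select_action(P, N, Q, c_puct=1.0):
    """Pick the move maximizing Q(s,a) + U(s,a), where
    U(s,a) = c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))."""
    U = c_puct * P * np.sqrt(N.sum()) / (1.0 + N)
    return int(np.argmax(Q + U))

def backup(Q, N, a, v):
    """After a simulation evaluates the leaf as v, update the traversed edge
    so that Q(s,a) stays the running mean of all evaluations through it."""
    N[a] += 1
    Q[a] += (v - Q[a]) / N[a]

def search_probabilities(N, tau=1.0):
    """Search probabilities pi are proportional to N(s,a)^(1/tau);
    as tau -> 0 this approaches greedy selection of the most visited move."""
    counts = N ** (1.0 / tau)
    return counts / counts.sum()
```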

Figure 2: MCTS uses the neural network f_θ to guide the simulation and move-selection process.

The neural network is trained with this self-play reinforcement learning algorithm, which, as described above, uses MCTS to play each move. First, the neural network is initialized to random weights θ_0. At each subsequent iteration i ≥ 1, games of self-play are generated (see Figure 1a). At each time step t, an MCTS search π_t = α_θ(s_t) is run using the neural network f_θ from the previous iteration i−1, and the next move is sampled from the search probabilities π_t. A game ends at step T when both players pass or when the search value drops below a resignation threshold; the game is then scored to give a final reward r_T ∈ {−1, +1}.

The data for each time step t is stored as (s_t, π_t, z_t), where z_t = ±r_T is the game winner from the perspective of the player to move at step t.
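Stated in code, the sign convention for z_t looks like the short helper below. It is an illustrative assumption: players alternate, so the final reward r_T is flipped on every other time step to give the result as seen by the player to move.

```python
def per_move_outcomes(num_moves, r_T):
    """Return z_t for t = 0 .. num_moves-1, assuming r_T is the final reward
    from the perspective of the player who moved first. The sign flips on the
    opponent's moves, so each z_t is the outcome as seen by the player to move
    at step t. Illustrative helper, not DeepMind's code."""
    return [r_T if t % 2 == 0 else -r_T for t in range(num_moves)]
```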

In parallel (see Figure 1b), new network parameters θ_i are trained using the data (s, π, z) gathered from all time steps of the last iteration of self-play. The neural network (p, v) = f_θi(s) is adjusted to minimize the error between the predicted value v and the self-play winner z, and to maximize the similarity between the network's move probabilities p and the search probabilities π.

Specifically, the parameters θ are adjusted by gradient descent on a loss function l that combines a mean-squared error on the value and a cross-entropy loss on the policy, plus an L2 weight regularization term to prevent overfitting (Equation 1 in the paper): l = (z − v)² − πᵀ log p + c‖θ‖², where c is the parameter controlling the strength of the L2 regularization.
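Assuming p is produced as logits and π as a probability vector, this loss can be sketched in PyTorch as follows; the regularization coefficient c shown here is an illustrative value, not taken from the paper.

```python
import torch.nn.functional as F

def alphazero_loss(p_logits, v, pi, z, model, c=1e-4):
    """l = (z - v)^2 - pi^T log p + c * ||theta||^2, averaged over the batch."""
    value_loss = F.mse_loss(v, z)                                     # (z - v)^2
    policy_loss = -(pi * F.log_softmax(p_logits, dim=1)).sum(dim=1).mean()
    l2 = sum((param ** 2).sum() for param in model.parameters())      # ||theta||^2
    return value_loss + policy_loss + c * l2
```

In practice the same L2 effect is often obtained by passing a weight-decay setting to the optimizer rather than adding the term explicitly.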

Evaluation results: reaching the level of the Ke Jie-beating Master within 21 days

DeepMind's official blog describes the comparison between AlphaGo Zero and previous versions. Starting entirely from scratch, it surpassed the AlphaGo Lee version (which beat Lee Sedol) within 3 days and reached the level of Master within 21 days.

The computing power used by several different versions is compared as follows:

In the paper, to separate the contributions of architecture and algorithm, the DeepMind researchers compared the performance of AlphaGo Zero's neural network architecture with that of the architecture used in AlphaGo Lee, the version that played Lee Sedol (see Figure 4).

Four neural networks were built, using either the separate policy and value networks of AlphaGo Lee or the combined policy and value network of AlphaGo Zero, and using either the convolutional architecture of AlphaGo Lee or the residual architecture of AlphaGo Zero. Each network was trained to minimize the same loss function (Equation 1), on the same dataset of self-play games generated by AlphaGo Zero after 72 hours of self-play.

Using the residual network gave higher accuracy and lower error, and improved AlphaGo's playing strength as measured by Elo rating. Combining policy and value into a single network slightly reduced the move prediction accuracy, but also reduced the value error and further improved AlphaGo's Elo rating. This is partly due to improved computational efficiency, but more importantly the dual objective regularizes the network toward a representation that supports multiple use cases.

Figure 4: Comparison of the neural network architectures in AlphaGo Zero and AlphaGo Lee. Networks using separate policy and value networks are labeled "sep", and those using a combined policy-and-value network "dual"; networks using a convolutional architecture are labeled "conv", and those using a residual architecture "res". "dual-res" and "sep-conv" therefore denote the architectures used in AlphaGo Zero and AlphaGo Lee respectively. Each network was trained on the same dataset, generated by AlphaGo Zero's self-play. a, Each trained network was combined with AlphaGo Zero's search to obtain a different player; Elo ratings were computed from evaluation games between these players, with 5 seconds of thinking time per move. b, Prediction accuracy on professional players' moves (from the GoKifu dataset) for each network architecture. c, Mean squared error on the outcomes of professional games (from the GoKifu dataset) for each network architecture.

Knowledge learned by AlphaGo Zero. a, Five human joseki (common corner sequences) discovered during AlphaGo Zero's training. b, Five joseki favored in its own self-play games. c, The first 80 moves of three self-play games played at different stages of training, each using 1,600 simulations per search (about 0.4 s). At first, the system focused greedily on capturing stones, much like a human beginner. Later, it focused on influence and territory, the fundamentals of Go. Finally, the games show a fine balance, involving multiple battles and complicated fights, with the game eventually decided narrowly in White's favor.

A brief biography of AlphaGo

Name: AlphaGo (versions: Fan, Lee, Master, Zero)

Alias: Mr. Ah, Alpha Dog

Birthday: 2014

Place of birth: London, UK

1. Defeating Fan Hui

In October 2015, AlphaGo defeated Fan Hui, becoming the first computer Go program to defeat a human professional on a full 19×19 board and making history. The results were published in Nature in January 2016.

2. Defeating Lee Sedol

In March 2016, AlphaGo defeated the top professional player Lee Sedol in a five-game match, becoming the first computer Go program to defeat a nine-dan professional without handicap, again making history. After the match, the Korea Baduk Association awarded AlphaGo the first honorary professional nine-dan rank.

3. Briefly overtaking Ke Jie in the rankings

On July 18, 2016, AlphaGo ranked first in the world on the Go Ratings website, but was overtaken by Ke Jie again a few days later.

4. Sweeping the Go world as "Master"

From the end of 2016 to the beginning of 2017, a further strengthened AlphaGo, playing under the name "Master" without revealing its true identity, challenged top professionals from China, Japan, and Korea in informal fast online games and won 60 straight victories.

5. Defeating Ke Jie and becoming the world number one

At the Wuzhen Go Summit from May 23 to 27, 2017, the latest enhanced version of AlphaGo played the world's top-ranked player Ke Jie and won 3-0; it also came out on top in the pair-Go games played alongside eight-dan professionals and in the team match against five top nine-dan players. This version of AlphaGo consumed only one-tenth of the computing resources of the Lee Sedol version. After the match with Ke Jie, the Chinese Weiqi Association awarded AlphaGo the title of professional nine-dan.

With no human opponents left for AlphaGo, on May 25, 2017 AlphaGo's "father" Demis Hassabis announced AlphaGo's retirement. The AlphaGo research project began in 2014; going from the level of an amateur player to the strongest in the world took only about two years.

AlphaGo has retired, but the technology endures.

With this article, we salute AlphaGo and the people who created it.

Texas hold'em poker is an even more difficult game for AI than Go. The Xin Zhi Yuan AI World conference has invited CMU professor Tuomas Sandholm, creator of Libratus, the program that defeated professional human players.
