The clearest Chinese interpretation of the AlphaGo algorithm

Source: Internet
Author: User

China IDC Circle reported on June 3 that AlphaGo, a Go-playing AI built by Google's DeepMind team, defeated the top human professional player Li Shishi (Lee Sedol) 4:1. How on earth does it play?

Facing the current board position, AlphaGo runs n simulations (playing the game out in its head), then plays the move that was chosen most often in those simulations. That, to AlphaGo, is the best move.

For example, every empty point is in principle a possible move, but if in the simulations one particular move on the right side was chosen 79% of the time, AlphaGo plays that move. It is that simple. Later you will see that the most-simulated move is, statistically, the best move found.
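As a toy illustration (hypothetical numbers, not AlphaGo's actual data structures), picking the most-simulated move is just an argmax over visit counts:

```python
# Hypothetical visit counts after n = 1000 simulations.
visit_counts = {"right-side move": 790, "corner move": 130, "center move": 80}
best_move = max(visit_counts, key=visit_counts.get)
print(best_move)   # -> "right-side move", the move simulated 79% of the time
```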

1. What is simulation?

A simulation is AlphaGo playing against itself. It is the equivalent of a player reading out variations in their head, what Go players call "calculation" or "reading."

Facing the current position, AlphaGo plays against itself using some strategy (described below). There are two options: play only a few moves ahead and stop early (possible because AlphaGo has some ability to judge the resulting position), or play all the way to the end of the game (the final position is relatively simple to score; easy for a human, still somewhat hard for a machine, but that problem has essentially been solved). For a human player, this is simply reading out the game.

AlphaGo simulates many times, not just once. The more it simulates, the deeper its reading goes (one move ahead at the start, perhaps dozens of moves later on), and the more accurate its judgement of the current position becomes: because it now knows how the later variations turn out, it carries that knowledge back to earlier positions and updates its judgement of them. That in turn makes the subsequent simulations stronger and stronger (closer to the correct line of play). The more it simulates, the stronger it gets. How does it manage that? Look at how it simulates.

Note that the simulation described here is simulation during play (online). There is also simulation used during learning; do not confuse the two.

2. How does AlphaGo simulate?

In each simulation, AlphaGo plays against itself. At every move, a function decides which move to simulate next. The function weighs several things: how this position would probably be played (move selection: the policy net); what position the candidate move leads to and how likely a win is from there (position judgement: the value net, plus a small "rollout" simulation); and a bonus that encourages exploring moves that have not been simulated yet. These English terms are explained below.

After a simulation, AlphaGo remembers the positions it produced, for example the position reached a few moves later, and computes the policy and value for them. The value is more accurate than before (relative to earlier simulations of the same position) because the simulation got closer to the end of the game. AlphaGo then uses these more accurate values to update the function above, so the function becomes more and more accurate, each simulated move gets closer to the correct (best) move, and the whole simulation gets closer to the optimal sequence for both sides (the main line, or principal variation), just like the main diagram in a Go book. At this point you already roughly know how AlphaGo works; what follows is details and a little mathematics.
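A schematic sketch of one such simulation, in Python, is below. The game helpers and networks (legal_moves, play, policy_net, value_net, rollout_to_end) are trivial hypothetical stand-ins so the sketch runs; they are not AlphaGo's code, and details such as handling whose turn it is are omitted.

```python
import math
import random

def legal_moves(state):    return ["A", "B", "C"]
def play(state, move):     return state + move
def policy_net(state):     return {m: 1.0 / 3 for m in legal_moves(state)}   # prior P
def value_net(state):      return random.random()                            # P(win) guess
def rollout_to_end(state): return random.choice([0.0, 1.0])                  # fast playout result

def simulate(state, tree, c_puct=5.0, lam=0.5):
    """One simulation: select with Q + u, expand, evaluate, back up."""
    path = []
    # 1. Selection: walk down the known tree, always taking argmax(Q + u).
    while state in tree:
        node = tree[state]
        total_n = sum(node["N"].values()) + 1
        def score(a):
            q = node["W"][a] / node["N"][a] if node["N"][a] else 0.0
            u = c_puct * node["P"][a] * math.sqrt(total_n) / (1 + node["N"][a])
            return q + u
        move = max(node["P"], key=score)
        path.append((state, move))
        state = play(state, move)
    # 2. Expansion: remember the new position, with policy-net priors for its moves.
    tree[state] = {"P": policy_net(state),
                   "N": {a: 0 for a in legal_moves(state)},
                   "W": {a: 0.0 for a in legal_moves(state)}}
    # 3. Evaluation: mix the value net's judgement with a fast rollout to the end.
    v = (1 - lam) * value_net(state) + lam * rollout_to_end(state)
    # 4. Backup: update N and W (hence Q = W / N) along the path, so later
    #    simulations start from a more accurate judgement of these positions.
    for s, a in path:
        tree[s]["N"][a] += 1
        tree[s]["W"][a] += v

tree, root = {}, ""
for _ in range(1000):
    simulate(root, tree)
best_move = max(tree[root]["N"], key=tree[root]["N"].get)   # the most-visited move
```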

3. What is this function, and why is it so magical?

This function is divided into two parts.

Q is the action value and u is a bonus. Q is, roughly, the winning probability AlphaGo has computed for a move after several simulations; it is based on simulations of the future game (the small simulations inside the large one) plus an estimate. The bonus u has two parts: on one hand it uses the shape of the position to guess which few moves are plausible; on the other hand it penalizes moves that have already been simulated too often, encouraging exploration of other moves, so that one move is not simulated over and over while better moves are ignored.

4. What exactly is Q (the action value)?

Q looks a bit complicated, but it is simply the average winning probability AlphaGo assigns to a move after it has been simulated n times.

The denominator is the number of times this move has been simulated.

The numerator is the sum of the winning-probability estimates (V) from each of those simulations.

V has two parts: the value net's judgement of the position, and the winning probability from a quick simulation played out to the end of the game.

The value net looks at the position and judges the probability of winning directly, without being allowed to read even a few moves ahead. The value net is described in detail below.

The fast simulation (rollout) looks at the position, plays it out against itself, and sees which of Black and White is more likely to win. The fast simulation is the small simulation inside our large simulation.

So Q looks at the present (the value net) and at the future (the fast rollout) to judge the winning probability of a move for the side to play (in a simulation both Black and White are AlphaGo), and uses that to decide which move to simulate next. For a person this is deciding where to think; for a Go player it is deciding which candidate moves to read out.
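Written out (roughly in the notation of DeepMind's published paper, which this article paraphrases), the action value averages the mixed evaluations from every simulation that passed through the move:

$$Q(s,a) = \frac{1}{N(s,a)} \sum_{i=1}^{n} \mathbf{1}(s,a,i)\, V(s_L^i), \qquad V(s_L) = (1-\lambda)\, v_\theta(s_L) + \lambda\, z_L$$

Here N(s,a) (the denominator above) counts the simulations that played move a in position s, the indicator 1(s,a,i) says whether simulation i did so, v_θ(s_L) is the value net's judgement of the position s_L the simulation reached, z_L is the outcome of the fast rollout from it, and λ is the mixing weight between the two parts described above.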

5. What exactly is u (the bonus)?

The bonus u has two parts.

The numerator is AlphaGo's judgement from the current position alone (the policy net), with no simulation, much like a player who, from the shape, roughly knows which few moves are playable.

The denominator grows with the number of times this move has already been simulated: the larger it gets, the less likely the next simulation is to try this move again.

In a word, Q + u decides where the side to move will play next in the simulation.
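In the same notation (the square-root term and the constant c_puct are details from the paper that the article glosses over), the bonus and the move chosen inside a simulation are roughly:

$$u(s,a) = c_{\text{puct}}\, P(s,a)\, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}, \qquad a_t = \arg\max_a \big( Q(s_t,a) + u(s_t,a) \big)$$

P(s,a) is the policy net's prior (the numerator: which moves look plausible from the shape alone), and the 1 + N(s,a) in the denominator shrinks the bonus for moves that have already been simulated many times, pushing later simulations toward unexplored moves.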

At this point we have roughly met AlphaGo's two great weapons: the value net (position judgement: in the simulation, if I play this move, what is my probability of winning?) and the policy net (move selection inside the simulation: in this position, which few moves are strongest?). Their veil of mystery is lifted below.

6. Why choose the move that was simulated the most times?

Under the function above, the most-simulated move is in fact the one that, accumulated (or averaged, dividing by the total number of simulations) over many simulations, AlphaGo considers most likely to win.

7. Why split it into a policy net (move selection) and a value net (position judgement)? Aren't move selection and position judgement the same thing?

Indeed, move selection and position judgement are nested in each other. First of all, judging a Go position is very hard. In live commentary we often see that even professional 9-dans cannot accurately assess the current position unless the territories are already settled and there is nothing left to fight over, which usually means the game is close to the end (the endgame and counting stage). Even for professional players, move selection and judgement are more qualitative than quantitative. It is said that Gu Li, one of China's top players, can read some 50 moves ahead, which is already extremely strong.

Then there is the nesting problem: precise, quantitative move selection and judgement require calculation (in the head for a human, by simulation for a machine). In the reading, the move I choose determines my winning probability after that move; that probability in turn depends on which move the opponent chooses (I assume the opponent will play the move strongest for them and worst for me); which move the opponent chooses depends on their judgement of the position, which should be best for them, and that in turn depends on where I play my next move (the third move), since the opponent likewise assumes I will play the move worst for them and best for me. The nesting goes on and on, and this "knot" can only be untied at (or near) the end of the game, where the position is relatively simple to judge. That is why judging the position is so hard, even for professional 9-dans. It is also the key reason Go is harder than chess: Go has no simple way to evaluate a position, while chess does.

The answer to question 7 follows below.

8. How does AlphaGo untie this knot?

AlphaGo does not judge the position directly, that is, it does not learn the value net first. It first builds a move-selection program: the policy net. Move selection can be treated as a local problem at one point in the sequence of moves: judging, from the current position alone, which moves might be played, with no reading ahead (that is, none of the simulation work). A human player would read ahead when selecting moves; the basic policy net here does not. The move selection we saw earlier in the online simulation, using Q + u, does read ahead.

So the policy net is used inside every single simulation, to search the moves both sides might play; judging which move is actually best is the job of the n simulations taken together, not of the policy net. In addition, the policy net is used to train the value net. In other words, the value net comes from the policy net: first there is policy, then there is value.

Can move selection (the policy net) actually be learned? If not, none of this works.

9. How does the first weapon, the policy net, work?

Look at this picture first. It is Black to play, and the numbers on the board are the probabilities AlphaGo assigns to Black playing each point. Notice that only a few moves (two in this figure) have substantial probability, and every other move has a very small one. This is just like a professional player. Anyone who has learned Go knows that a beginner feels that every point is playable; that is a policy (selection) with no selectivity. As playing strength increases, the range of candidate moves narrows. A professional locks onto the few most plausible moves and only then reads out the future variations.

Through learning, AlphaGo predicts professional players' moves with 57% accuracy. Keep in mind that this is AlphaGo's judgement "at a glance," before it has started any reading (simulation). And the moves it predicts "wrongly" are not necessarily worse than the professional's.

How does the policy net learn, and what does it learn?

First, the policy net is a model. Its input is the current board (19*19 points, each in one of 3 states: black, white, or empty); its output is the most likely (best) moves, a probability for every empty point. Fortunately, move selection is not as trackless as position judgement; humans have been playing Go for thousands of years. The policy net first learned from strong human players: from the KGS Go server it learned the next move played in 30 million positions. In other words, AlphaGo already knows by heart roughly how strong players move. The point of the learning is not simply to memorize those positions but to generalize, so that in similar positions it also knows what to play; once it has seen enough positions, it can handle almost any position. This kind of learning is called supervised learning: the games of human players are its teacher (the supervision).

One reason AlphaGo is so strong is that its policy net is built with deep learning, a machine-learning approach inspired by the human brain that has risen in recent years. It lets AlphaGo learn a much more accurate policy; earlier Go AIs did not have such strong learning ability.
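A minimal sketch of what such a policy model can look like, using PyTorch purely as an illustration. AlphaGo's real policy net is a much deeper convolutional network with many more input features; this toy only shows the input/output shape and one supervised training step on a hypothetical (position, expert move) pair.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPolicyNet(nn.Module):
    """Toy policy model: 3 board planes in, one score per intersection out."""
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, kernel_size=1),
        )

    def forward(self, planes):               # planes: (batch, 3, 19, 19)
        return self.body(planes).flatten(1)  # logits, one per board point (361)

net = TinyPolicyNet()
opt = torch.optim.SGD(net.parameters(), lr=0.01)

# One supervised step on a hypothetical example, the kind AlphaGo saw 30 million of.
planes = torch.zeros(1, 3, 19, 19)           # planes: black stones, white stones, empty
planes[0, 2] = 1.0                           # empty board: every point is "empty"
expert_move = torch.tensor([3 * 19 + 3])     # the human played the 4-4 point here

opt.zero_grad()
loss = F.cross_entropy(net(planes), expert_move)
loss.backward()
opt.step()

probs = torch.softmax(net(planes), dim=1)    # a probability per point, as in the figure
```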

Even more impressive, after learning from human players AlphaGo found there was nothing more to learn from them. To surpass its teachers and itself, it was left to play its left hand against its right: through self-play it searches for a better policy. Say it has learned a policy p0 from supervised learning.

AlphaGo then makes a copy of the model, p1. p1 starts out identical to p0 (same parameters). It perturbs p1's parameters slightly and lets p1 play against p0, for example Black choosing moves with p1 and White with p0, all the way to the end of the game. After many such games, if p1 is stronger (wins more), it keeps the new parameters; otherwise it goes back and tries a different change from the original ones. The result is a p1 slightly stronger than p0. Note that moves are sampled from the policy's probabilities, so every game is different. After learning like this many times, AlphaGo keeps surpassing itself and grows stronger and stronger. This kind of learning is called reinforcement learning. There is no direct supervision; instead the model is placed in an environment (a game of Go), interacts with it, and the environment feeds back how well the task was done (win or lose), so the model updates itself (its parameters) to do the task better (win more). After reinforcement learning, AlphaGo beat its former self in 80% of games.
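A toy hill-climbing sketch of the self-play scheme as this article describes it: nudge the challenger's parameters, let it play the current model, and keep the change only if it wins more. DeepMind's actual update used policy-gradient reinforcement learning, and play_match here is a random stand-in rather than real games of Go; this only mirrors the article's description.

```python
import copy
import random

def play_match(challenger, incumbent, games=100):
    return random.random()                  # stand-in for the challenger's win rate

p0 = {"w": 0.0}                             # parameters from supervised learning
p1 = copy.deepcopy(p0)                      # the self-improving copy starts identical

for _ in range(50):
    candidate = {k: v + random.gauss(0, 0.1) for k, v in p1.items()}
    if play_match(candidate, p1) > 0.5:     # keep the change only if it is stronger
        p1 = candidate
# After many rounds of this, the article says, AlphaGo beat its former
# (purely supervised) self in about 80% of games.
```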

Finally, AlphaGo also has a mini policy net called the rollout policy. It is what plays the fast simulation to the end of the game described above. Its input features are simpler than the normal policy net's and its model is much smaller, so choosing a move takes about 2 microseconds versus about 3 milliseconds for the full policy net. It is less accurate, but fast.

To summarize the policy net: it predicts roughly where the next move should be. It is built with deep learning, supervised learning, and reinforcement learning. It is used mainly in the bonus term of each simulation (where should I probably play?) and to train the value net (the focus of what follows).

If you simply played the policy net's prediction as the best move, without the value net and the simulation described above, you could not beat a professional. But simply playing the policy net's prediction is already enough to defeat all previous Go AIs (roughly amateur 5-dan strength). That shows how powerful the three learning methods above are.

AlphaGo just glances at the board and, without any reading, you are already beaten. The policy net is the first step in untying that knot. Now for the second "weapon": the value net.

10. How does the second weapon, the value net, work?

As said before, position judgement seems completely trackless; even professional 9-dans cannot do it reliably. With the policy net, everything changes. The core of AlphaGo's soul is the formula below.

v*(s) = v^p*(s) ≈ v^p(s)

Here s is the state of the board: the 19*19 grid, each intersection in one of 3 states.

v is the evaluation of that state, meaning the probability that Black wins.

v* is the true value of that evaluation.

p* is the perfect policy (the policy that produces perfect play).

p is the strongest policy net AlphaGo has learned so far.

If every move in the simulation were the perfect move from p*, the result would be v*; that explains the equals sign.

If you knew the function v*, then in the current position you would only need to evaluate the state s after each possible next move (on average about 250 of them in Go) and play the one with the largest v*. Go would be perfectly solved. But, as said before, v* does not exist in practice, and neither does p* (in theory they exist, but the search space is so large that the computation needed to find them is impossible; on a 5*5 board it can be done).

AlphaGo's stroke of genius is to use its strongest policy p to approximate the perfect policy p*, and therefore to use v^p, estimated by simulating with p, to approximate v*. Even though v^p is only an approximation, it is already better than today's professional 9-dans. Remember that p was learned from human games: the moves you can think of, it has already considered. And it keeps making p more accurate. Top professionals read perhaps 20-40 moves into the future and still make mistakes (misreads); AlphaGo simulates all the way to the end of the game and makes very few. How is anyone supposed to compete with that?

The game of Go is really a tree-search problem. The current position is the root; from the root grow branches (the possible next moves, every empty point on the board), which is the breadth of the tree; the tree keeps growing (reading, simulation) until the leaf nodes (the end of the game, or some later position). The number of moves from root to leaf is the depth of the tree. The larger the average breadth and depth, the harder the search and the more computation it takes. Go has an average breadth of about 250 and depth of about 150; chess has an average breadth of about 35 and depth of about 80. Traversing the Go tree, searching roughly 250^150 lines, is completely impractical. That is one reason Go is more complex than chess. But the more important reason is that for chess a fairly simple value function can be written by hand: capturing the king is worth infinity, capturing a rook 100 points, and so on. Deep Blue, which defeated the world chess champion in 1997, used exactly such a hand-designed value function. A value function for Go is much harder than for chess; it cannot be written by hand and can only be learned with deep learning.
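A quick back-of-the-envelope comparison of those two trees (breadth raised to the power of depth); neither can be traversed, but Go's tree is astronomically larger:

```python
import math

# Rough size of the full game trees mentioned above (breadth ** depth).
go_leaves    = 150 * math.log10(250)    # log10(250 ** 150) ≈ 360
chess_leaves = 80 * math.log10(35)      # log10(35 ** 80)   ≈ 124
print(f"Go    ~ 10^{go_leaves:.0f} leaf positions")
print(f"Chess ~ 10^{chess_leaves:.0f} leaf positions")
```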

Before explaining how the value net works, look qualitatively at what it produces. As shown in the figure, this is AlphaGo using the value net to predict, for each possible next move, its probability of winning. Empty points are shaded blue; the deeper the blue, the higher the probability AlphaGo assigns to winning there. This matches the Go theory we learn: when there is no fighting, the first and second lines (the edge) and the centre have low probability because they are inefficient, and most points sit near 50%, which is why a game is hard to win and hard to lose. This of course excludes fierce fighting between the two sides.

Here is how the value net is obtained from the policy net. With the policy net, value is no longer so elusive; the knot can be untied. AlphaGo can simulate (play against itself, both Black and White using the strongest policy) all the way to the end of the game. Note that these simulations are somewhat different from the ones at the start of this article: those happened while playing a game (online) and were used for prediction; these happen while it is still learning (offline). The final result v* (who won) is easy to judge at the end. Of course it is not that easy for a machine, but judging the final position is easy relative to judging a midgame position.

The value net is also a supervised deep-learning model. The results of many simulations (who won) provide its supervision. Its model structure is similar to the policy net's, but the target is different: the policy net outputs where to play next, while the value net outputs the probability of winning after a given move.

To sum up: the value net predicts the probability of winning after the next move. It cannot be learned directly on its own. Instead, the strongest policy is used to approximate the perfect policy, simulations with that policy approximate the main line (the "correct" variation of a Go book), and the results of those simulations approximate the accurate position judgement v*. The value net then learns those simulation results with supervised deep learning. The value net is used mainly during play (online), in the simulations, where the Q value it feeds is the averaged position judgement.
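A minimal sketch of that training idea, again using PyTorch purely as an illustration and not AlphaGo's real architecture: a position sampled from a self-play game of the strongest policy, labelled by who eventually won, is regressed onto a single winning-probability output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyValueNet(nn.Module):
    """Toy value model: 3 board planes in, one estimated winning probability out."""
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, kernel_size=1), nn.Flatten(),
            nn.Linear(19 * 19, 1), nn.Sigmoid(),
        )

    def forward(self, planes):               # planes: (batch, 3, 19, 19)
        return self.body(planes)              # (batch, 1) winning probability

net = TinyValueNet()
opt = torch.optim.SGD(net.parameters(), lr=0.01)

# One supervised step on a hypothetical (position, final result) pair taken
# from a self-play game of the strongest policy.
position = torch.zeros(1, 3, 19, 19)
position[0, 2] = 1.0                          # empty board: every point is "empty"
outcome = torch.tensor([[1.0]])               # 1.0 = this side went on to win

opt.zero_grad()
loss = F.mse_loss(net(position), outcome)
loss.backward()
opt.step()
```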

To recap the simulations: every simulated move balances several things: the value net's judgement of the position the simulation has reached, the judgement from the fast rollout played to the end, the policy net's move selection for the current position, and the penalty on moves simulated too often (which encourages exploration). After many simulations the search tree grows wider and deeper, and because results are backed up the tree, the Q values get more and more accurate, so the later searches get stronger and stronger. Because each Q value is the best the simulations have found so far (ignoring the exploration bonus, which largely cancels out over many simulations), the most-simulated move (tree branch) is the move the whole accumulated simulation considers best.

At this point AlphaGo's veil of mystery has been lifted. Its basic framework is shown in the figure below: the online process of playing a game is the red arrows, and the offline preparation (the learning process) is the blue arrows. In short: AlphaGo plays (online) by simulating. In each simulation, every move is chosen not by the policy net alone but with reference to the earlier simulations, combining the value net, the fast rollout to the end (the small simulation), the exploration bonus, and the policy prior, i.e. Q + u, which is more accurate than the policy alone. It then plays the most-simulated move (the best on average). That is all it takes to play. Beforehand (offline), it has to train the policy model, the rollout model, and the value model. The policy and rollout models can be learned from human games and from self-play; the value model can be learned from the results of games simulated with the learned policy. This solves both the problem that the perfect value cannot be learned and the knot of policy and value being nested in each other. It is not yet possible to learn the value net directly from human game records.

11. What technologies does AlphaGo use?

Within a tree-search framework, AlphaGo uses deep learning, supervised learning, and reinforcement learning.

The previously strongest Go AIs used the Monte Carlo tree search method. A Monte Carlo algorithm uses some kind of "experiment" to estimate the expectation of a random variable and thereby obtain the solution to a problem, and the experiment can be a computer simulation. Here is how the Monte Carlo tree search models the game: the algorithm takes two "Go fools" (computers) that know only where it is legal to play (empty points, subject to the basic rules) and lets them play randomly all the way to the end of the game, where it is easy to judge who won. By simulating m such games (m >> n), the algorithm estimates the probability that Black wins with a given move. You can see this is obviously unreasonable, because every move is played at random; some of the moves are simply impossible. Even so, algorithms built on this idea reach roughly amateur 5-dan level.
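A toy version of that pure Monte Carlo idea is sketched below. The game helpers are trivial stand-ins (a "game" that ends after 20 random moves, with a coin flip for the winner), not real Go rules; only the structure (random playouts feeding a per-move win-rate estimate) matches the description above.

```python
import random

def legal_moves(state):  return list(range(9))
def play(state, move):   return state + [move]
def game_over(state):    return len(state) >= 20
def winner(state):       return random.choice(["black", "white"])

def random_playout(state):
    # The two "Go fools" simply play random legal moves until the end.
    while not game_over(state):
        state = play(state, random.choice(legal_moves(state)))
    return winner(state)

def win_rate(state, move, m=2000):
    # Estimate Black's winning probability for `move` from m random playouts.
    wins = sum(random_playout(play(state, move)) == "black" for _ in range(m))
    return wins / m

root = []
best = max(legal_moves(root), key=lambda a: win_rate(root, a))
```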

AlphaGo does not play at random; it has learned from strong human play. Its search is therefore more like a beam search (only a few promising lines are followed, not a full sweep). You can also see that AlphaGo considers only a few candidate moves, not 250 at random. The tree is cut from 250^150 down to roughly a handful (<10) of candidates per move over N moves (N << 150, since the search can stop early thanks to the value net), a huge reduction in computation. Although each of AlphaGo's simulations takes longer (because predicting with the deep policy and value models is slower than playing at random), AlphaGo can get by with fewer simulations, less than 1/15000 as many as a Monte Carlo tree search. In other words, AlphaGo's search is far more purposeful: it roughly knows where to play. Commentators say it plays more like a human; I would say it plays more like a professional, or even better than one. The offline learning gives its behaviour (its simulations) a very strong sense of purpose toward the final goal: winning the game.

12. What is a ko ("robbery")?

A ko is a situation in which Black and White each surround one of the other's stones in such a way that, if it is White's turn, White can capture a black stone, and if it is Black's turn, Black can capture a white stone. Because this could repeat back and forth forever with no resolution, Go forbids immediate repetition: under the rules, after one side captures in the ko, the other side may not recapture at once, but must first play elsewhere (a ko threat) and only then take the ko back. As shown in the picture:

Because the same point is played repeatedly, a ko increases the depth of the search tree; positions elsewhere on the board affect the value of the ko, kos influence one another, and fighting one ko can create new kos. In short, the ko rule makes Go more complex.

Because no ko appeared in the first two games, some people suspected that DeepMind and Li Shishi had agreed not to play kos. In the later games AlphaGo did start kos of its own accord. And at the algorithmic level, a ko does not break its simulation framework (it may cause some minor trouble).

13. Why does AlphaGo sometimes look strong and sometimes look weak?

AlphaGo's play seems brilliantly strong at some moments and strangely slack at others. This probably comes from the supervision signal it learns from. The policy net, the value net, and the rollout simulations are all trained on who wins (the probability), not on by how much (how many points). So when AlphaGo is in the lead, which is nearly always, it does not overreach: it only has to make sure it wins, not to win by as much as possible the way a person might. Even if there is a chance to kill a large dragon (a big group of stones), it will not necessarily take it; it may play calm moves and let you lose slowly. Presumably only when AlphaGo judges itself to be far behind will it take risks and play overplays (which would look unusual).

14. Why does it cost money for AlphaGo to play?

AlphaGo exists in a single-machine version and a multi-machine (distributed) version, and the distributed version is significantly stronger. Last year's distributed version used 40 search threads, 1202 CPUs, and 176 GPUs (graphics cards); the version that played Li Shishi may have used even more. Running and maintaining that many machines burns money.

15. Does AlphaGo have any weaknesses?

AlphaGo solves a tree-search problem without traversing every possible move; its answer only approaches the perfect solution and is not guaranteed to be the perfect solution.

The simplest way to beat AlphaGo would be to change the rules, for example by enlarging the board. Humans could adapt relatively easily, but with a larger search space AlphaGo might not.

As things stand, a human player's main line of attack is the move-selection function used in AlphaGo's simulations: play whole-board, globally entangled games (many kos, many unsettled groups), keeping the position as complex as possible, rather than settling one local area at a time (playing each line out to the end). Of course, even for professionals this is not easy.

16. What technical breakthroughs enable AlphaGo to beat top human players?

It inherits the Monte Carlo tree search framework and simulates within it.

In learning the policy it uses supervised learning, making effective use of existing human game records to learn their move-selection strategy.

In learning the policy it also uses reinforcement learning, improving itself by playing its left hand against its right.

It uses the policy net (the move-selection model) to approximate perfect play, and uses the results of games simulated with the policy net to approximate correct position judgement, thereby breaking the knot of position judgement and move selection being nested in each other. That is: learn policy first, then learn value.

It uses deep-learning models for the policy, the value, and the rollout. Deep learning has a very strong capacity to learn, making the move selection and the position judgement unprecedentedly accurate (compare Monte Carlo search, which selects moves at random; now it is as if professional players are picking the candidate moves for it). Because these two accurate estimates are used in every simulation, the tree search (the reading) becomes far more purposeful: the tree is drastically pruned and only the better moves are simulated.

And of course the machine has its usual advantages: it does not tire, it is not affected by emotion or psychology, it does not misremember, and so on.
