The AlphaGo Side Story - Machine Learning and Algorithmic Intelligence

AlphaGo, a distant relative of Dog Math, needs no introduction. It is an artificial intelligence program developed by a team led by Demis Hassabis, a co-founder of Google-owned DeepMind. The program swept through the Go world in 2016 and 2017, leaving Go masters at a loss, onlookers sighing in dismay, and great minds worried about the future. Here we dissect AlphaGo's brain-like mechanisms to see just how high its IQ is and why it has achieved such a record.

In October 2015, AlphaGo Fan, the initial version of AlphaGo, defeated the three-time European Go champion Fan Hui 5-0. It was the program's first victory over a Go professional, and it put the talented Fan's defeat into the history books.

In March 2016, the upgraded AlphaGo Lee played Go world champion and 9-dan professional Lee Sedol and won 4-1. From late 2016 to early 2017, AlphaGo Master, a further upgrade, played fast games against dozens of Chinese, Japanese and Korean Go masters on Chinese online Go websites under the registered account "Master" and won 60 games in a row without a single loss.

In May 2017, AlphaGo Master played three games against the world's top-ranked player Ke Jie at the Wuzhen Go Summit in China and won all three. Quitting while ahead, the AlphaGo team announced that it would no longer compete in Go tournaments, withdrawing from the human Go scene it had swept. The general consensus in the Go community is that AlphaGo has surpassed the level of top professional human players. However, the story we have to tell is far from over.

On October 18, 2017, DeepMind released the latest version of AlphaGo, AlphaGo Zero. After three days and nights of training, Zero crushed Lee with a lopsided 100-0 record. After nearly forty days of training, it defeated Master as well. What is most striking about Zero is not only its record, but also the fact that it is completely detached from human games, relying solely on self-play training and outperforming its predecessors with less computational effort. The secret of AlphaGo's amazing record is hidden in its brain tissue - the algorithm.

According to the first paper published by the AlphaGo team, any perfect-information game can in principle be solved by search, and the complexity of the search is determined by the width and depth of the search space: a game with about b legal moves per position and about d moves per game has on the order of b^d possible move sequences. For Go, the search width is about 250 and the depth is about 150. The versions of AlphaGo before Zero, including Lee and Master, were built on three key techniques: deep learning, reinforcement learning and Monte Carlo tree search.
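
To get a feel for these numbers, the size of a full search tree with the width and depth quoted above can be estimated in a few lines of Python. This is only a back-of-the-envelope sketch; the function name is made up for illustration.

```python
import math

def log10_tree_size(width: int, depth: int) -> float:
    """Return log10(width ** depth), the order of magnitude of a full search tree."""
    return depth * math.log10(width)

# Figures quoted in the text for Go: width about 250, depth about 150.
go_exponent = log10_tree_size(250, 150)
print(f"Go search tree: roughly 10^{go_exponent:.0f} move sequences")
```

A tree of this size cannot be enumerated, which is why both the width and the depth of the search have to be cut.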

Deep learning is used primarily to train two network models: the value network and the strategy network (more commonly called the policy network). AlphaGo uses the value network to cut the depth of the search and the strategy network to cut its width, thus greatly narrowing the search space.

A value network is a neural network that evaluates the current state of the board. Its input is the state of the 19×19 points on the board and its output is the expected win rate. In theory there always exists some function that computes this expectation exactly. Unfortunately, no such function has been found in explicit form, so one has to fit it with a multilayer neural network. For certain board states the winner is easy to determine, so their successor states need not be explored further. The purpose of the value network is to cut the search depth by identifying those board states.
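
The idea of cutting depth with an evaluator can be sketched on a toy game rather than Go. In the sketch below, which is only an illustration under stated assumptions, two players alternately take 1 to 3 stones and whoever takes the last stone wins. The function toy_value is a hypothetical stand-in for a trained value network, and results are scored in [-1, 1] rather than as a win rate.

```python
from typing import Callable

def legal_moves(stones: int) -> list[int]:
    """Toy game (not Go): a player may take 1, 2 or 3 stones."""
    return [k for k in (1, 2, 3) if k <= stones]

def search(stones: int, depth: int, value_fn: Callable[[int], float]) -> float:
    """Result for the player to move, from -1 (loss) to +1 (win)."""
    if stones == 0:
        return -1.0  # the opponent took the last stone, so the mover has lost
    if depth == 0:
        return value_fn(stones)  # stop descending and trust the evaluator
    # Negamax: my result is the best of the negated results of my opponent.
    return max(-search(stones - k, depth - 1, value_fn) for k in legal_moves(stones))

def toy_value(stones: int) -> float:
    """Hypothetical stand-in for a value network (exact for this toy game)."""
    return -1.0 if stones % 4 == 0 else 1.0

shallow = search(10, depth=2, value_fn=toy_value)   # cut off after two moves
full = search(10, depth=10, value_fn=toy_value)     # searched to the end
print(shallow, full)
```

With a good evaluator the shallow search reaches the same verdict as the full one while visiting far fewer positions.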

A strategy network is a neural network that, given the current state of the board, evaluates each possible move and favors the most promising ones. More precisely, its output is a probability distribution over the possible moves. As with the value network, a function that computes this distribution exists in theory, and for the same reason it can only be fitted with a multilayer neural network. Since some moves have an extremely low probability of winning and can be ignored, the strategy network can be used to identify those moves and so cut the search width.
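
A minimal sketch of cutting the width follows. The raw scores are made-up numbers standing in for the output of a strategy network, and the 5% cut-off is an arbitrary assumption, not a value taken from AlphaGo.

```python
import math

def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Turn raw move scores into a probability distribution over moves."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {move: math.exp(s - peak) for move, s in scores.items()}
    total = sum(exps.values())
    return {move: e / total for move, e in exps.items()}

def prune_moves(probs: dict[str, float], min_prob: float) -> list[str]:
    """Keep only moves whose probability reaches min_prob, best first."""
    ranked = sorted(probs.items(), key=lambda item: -item[1])
    return [move for move, p in ranked if p >= min_prob]

# Made-up scores for five candidate moves.
raw_scores = {"D4": 2.0, "Q16": 1.5, "K10": 0.5, "A1": -3.0, "T19": -3.5}
probs = softmax(raw_scores)
candidates = prune_moves(probs, min_prob=0.05)
print(candidates)
```

Only the surviving candidates need to be searched further, which is what narrows the tree.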

AlphaGo specifically uses deep convolutional neural networks (CNNs) for the value network and the strategy network. Neural networks loosely imitate the human or animal brain, with many neurons jointly approximating some complex function. Any kind of value judgment can be understood as a multivariate function from input to output, and it has been shown mathematically that a broad class of such functions can be approximated to arbitrary accuracy by neural networks. Since the Go board can be seen as a 19×19 image, the AlphaGo team chose deep (i.e., multilayer) convolutional neural networks, which are well suited to processing images, to build the two networks.
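
To illustrate what treating the board as an image means, here is a single 3x3 convolution filter applied to one 19x19 input plane in plain Python. It is a sketch only: a real network stacks many layers with many learned filters, whereas the kernel here is hand-written for illustration.

```python
BOARD = 19

def conv3x3(plane, kernel):
    """Slide one 3x3 filter over a 19x19 plane (zero padding, same-size output)."""
    out = [[0.0] * BOARD for _ in range(BOARD)]
    for r in range(BOARD):
        for c in range(BOARD):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < BOARD and 0 <= cc < BOARD:
                        total += plane[rr][cc] * kernel[dr + 1][dc + 1]
            out[r][c] = max(0.0, total)  # ReLU non-linearity
    return out

# Input plane: 1.0 where a black stone sits, 0.0 elsewhere.
plane = [[0.0] * BOARD for _ in range(BOARD)]
plane[3][3] = 1.0
plane[3][4] = 1.0

# Hand-written kernel that counts orthogonally adjacent stones.
neighbour_kernel = [[0, 1, 0],
                    [1, 0, 1],
                    [0, 1, 0]]
features = conv3x3(plane, neighbour_kernel)
print(features[3][3], features[3][5], features[0][0])
```

Each output point depends only on its local neighbourhood, which is what makes convolution a natural fit for board patterns.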

Based on deep convolutional neural networks, AlphaGo trains the strategy network first and then the value network. Strategy network training is divided into two steps. The first step is supervised learning from past human games, namely about 30 million positions from the KGS Go platform. Positions are sampled at random from these games, and the network is trained to predict the move the human player made in each position, which gives a probability for each move. The prediction accuracy is 55.7% when using only the board position and move history as input; adding other features raises it to 57%. The second step is reinforcement learning: starting from the supervised strategy network, the current strategy network repeatedly plays against earlier versions of itself, and the feedback from wins and losses is used to optimize it. Value network training is similar to strategy network training, except that the output is the probability of winning.
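
Both training steps can be viewed as nudging the move probabilities up or down. The sketch below is a simplified illustration, not AlphaGo's training code: the "network" is just a list of three scores, and the learning rate and the +1/-1 signal are assumptions chosen for the example.

```python
import math

def softmax(logits):
    """Convert a list of scores into probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def update(logits, move, signal, lr=0.5):
    """One gradient-ascent step on signal * log p(move).

    signal = +1 reinforces the move (a human expert's move in the supervised
    step, or a move from a won game in the reinforcement step);
    signal = -1 discourages it (a move from a lost self-play game).
    """
    probs = softmax(logits)
    # The gradient of log p(move) with respect to logit i is (1 if i == move else 0) - p_i.
    return [x + lr * signal * ((1.0 if i == move else 0.0) - p)
            for i, (x, p) in enumerate(zip(logits, probs))]

logits = [0.0, 0.0, 0.0]                      # three candidate moves, all equal
imitated = update(logits, move=1, signal=+1)
discouraged = update(logits, move=1, signal=-1)
print(softmax(imitated)[1], softmax(logits)[1], softmax(discouraged)[1])
```

The same update rule serves both steps; only the source of the signal changes, from human game records to the outcome of self-play.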

In addition to the strategy network and value network, AlphaGo has another important component: the fast move analysis module (the fast rollout policy). Like the supervised learning strategy network, this module is trained on human games; it can be seen as a simplified version of that network, used to quickly obtain simulated win rates for nodes during Monte Carlo tree search.

After building the value network, strategy network and fast move analysis module through deep learning and reinforcement learning, AlphaGo carries out its thinking, i.e. its search, with Monte Carlo tree search, which proceeds roughly as follows.

First, let the current state of the game be S, and for each possible move a let there be a move value Q(S,a), an initial probability P(S,a) and a visit count N(S,a). From this position, game simulations are run repeatedly by Monte Carlo tree search.

During a game simulation, suppose the state reached at the t-th node, counting from the root node, is S(t). At that point the move a(t) is chosen among all possible moves a so that Q(S(t),a) + u(S(t),a) is maximal. Here Q(S(t),a) is the move value of the current node, obtained by averaging the final win rates of all previous game simulations that passed through it, and u(S(t),a) is proportional to P(S(t),a)/[1+N(S(t),a)]. The initial probability P(S(t),a) is the probability assigned to the move by the strategy network, and the visit count N(S(t),a) is the number of times the node (S(t),a) has been passed through in all game simulations so far. Dividing by the visit count encourages trying new nodes.
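
The selection rule can be written down directly. In this sketch the Edge class, the example numbers and the proportionality constant c are all assumptions made for illustration; the text only states that u is proportional to P/(1+N).

```python
from dataclasses import dataclass

@dataclass
class Edge:
    P: float          # initial probability from the strategy network
    N: int = 0        # number of visits so far
    Q: float = 0.0    # move value: average final win rate of past simulations

def select_move(edges: dict[str, Edge], c: float = 1.0) -> str:
    """Pick the move maximising Q + u, with u = c * P / (1 + N)."""
    def score(move: str) -> float:
        edge = edges[move]
        return edge.Q + c * edge.P / (1 + edge.N)
    return max(edges, key=score)

# Made-up statistics for three moves at one node.
edges = {
    "D4": Edge(P=0.5, N=10, Q=0.55),
    "Q16": Edge(P=0.3, N=0, Q=0.0),
    "K10": Edge(P=0.2, N=1, Q=0.60),
}
print(select_move(edges))          # small c: trust the measured move values
print(select_move(edges, c=5.0))   # large c: favour the unvisited move
```

A larger c shifts the choice toward moves that the strategy network likes but that have rarely been tried.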

The move a(t) then leads to the next node in the game simulation, and the process is repeated until it reaches a node that has not yet been expanded, i.e. one whose children's initial probabilities were not calculated in any previous game simulation. The final win rate and initial probabilities are then calculated for the children of this node, where the final win rate is a weighted average of the win probability given by the value network and the simulated win rate obtained from multiple game simulations by the fast move analysis module. After the best move is selected, the move value and visit count of every node passed in the current game simulation are updated using the final win rate of that move. This ends the game simulation, and the next one begins.
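
The evaluation and update steps can be sketched as follows. The equal weighting (mix=0.5) and the dictionary layout of a node are assumptions for illustration, and the sketch ignores the change of perspective between the two players.

```python
def leaf_win_rate(value_net_estimate, rollout_results, mix=0.5):
    """Weighted average of the value network output and the fast-playout win rate.

    rollout_results is a list of 1 (win) and 0 (loss) from fast simulated games.
    """
    rollout_rate = sum(rollout_results) / len(rollout_results)
    return (1 - mix) * value_net_estimate + mix * rollout_rate

def back_up(path, win_rate):
    """Update the visit count and move value of every node on the simulated path."""
    for node in path:
        node["N"] += 1
        # Incremental mean: Q stays the average of all win rates seen so far.
        node["Q"] += (win_rate - node["Q"]) / node["N"]

root_edge = {"N": 0, "Q": 0.0}
child_edge = {"N": 0, "Q": 0.0}

estimate = leaf_win_rate(0.6, [1, 0, 1, 1])
back_up([root_edge, child_edge], estimate)
back_up([root_edge], 0.0)
print(estimate, root_edge, child_edge)
```

Because Q is kept as a running average, no history of individual simulations has to be stored at the nodes.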

Because every game simulation keeps selecting the best node, at the end of the Monte Carlo tree search AlphaGo plays the child of the root node with the most visits. When the search terminates depends on how much time AlphaGo has for its next move.
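
A time-limited search loop of this kind might look like the sketch below. The simulation is faked by a hypothetical helper that simply visits one move twice as often as the other; in a real search each call would run one full game simulation.

```python
import itertools
import time

def choose_move(visits, simulate, time_budget_s):
    """Run simulations until the time budget is used up, then play the most visited move."""
    deadline = time.monotonic() + time_budget_s
    while time.monotonic() < deadline:
        simulate(visits)
    return max(visits, key=visits.get)

def make_fake_simulation():
    """Hypothetical stand-in for one game simulation: visits D4 twice as often as Q16."""
    pattern = itertools.cycle(["D4", "D4", "Q16"])
    def simulate(visits):
        visits[next(pattern)] += 1
    return simulate

visits = {"D4": 0, "Q16": 0}
best = choose_move(visits, make_fake_simulation(), time_budget_s=0.01)
print(best, visits)
```

A longer time budget simply means more simulations and therefore more reliable visit counts.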

These are the algorithms used by the versions of AlphaGo before Zero. Zero, launched in October 2017, is a much stronger player than its predecessors, and the algorithms it uses are of course much improved. Here we look at what improvements were made in Zero's algorithm.

First, instead of separate strategy and value networks, Zero uses a single neural network that takes the game state and move history as input and outputs both the probability of winning the current game and the probability distribution over possible moves. Second, Zero performs no supervised learning at all, only reinforcement learning, discarding human game records. Third, unlike the earlier reinforcement learning networks, which had to be trained before play, Zero trains itself directly by playing.
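
The idea of one network with two outputs can be sketched with plain lists. Everything here is an assumption for illustration: the shared features would come from a deep convolutional trunk, and the weights below are made-up numbers rather than trained parameters.

```python
import math

def dual_head(features, policy_weights, value_weights):
    """Map shared features to (move probabilities, win probability)."""
    # Policy head: one score per move, turned into a distribution by softmax.
    logits = [sum(w * f for w, f in zip(row, features)) for row in policy_weights]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    move_probs = [e / total for e in exps]
    # Value head: a single score squashed into (0, 1) by a sigmoid.
    value_logit = sum(w * f for w, f in zip(value_weights, features))
    win_prob = 1.0 / (1.0 + math.exp(-value_logit))
    return move_probs, win_prob

features = [1.0, -0.5]                                  # made-up shared features
policy_weights = [[0.2, 0.1], [0.0, 0.3], [-0.4, 0.5]]  # three candidate moves
value_weights = [0.7, 0.2]
move_probs, win_prob = dual_head(features, policy_weights, value_weights)
print(move_probs, win_prob)
```

Both outputs are computed from the same features, so whatever the network learns for one task is available to the other.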

Finally, Zero uses a modified Monte Carlo tree search to implement its thinking process. Whereas a game simulation previously ended only when an unexpanded node was reached, the improved game simulation also ends in any of the following three cases: the opponent concedes defeat, the final win rate falls below a threshold, or the game simulation reaches the specified maximum length. At the end of each game simulation, Zero adjusts the parameters of the neural network based on the results of the simulation and the outputs of the network, updates the network and starts the next round of game simulation.
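
One common way to express such a parameter adjustment is as a loss that compares the network outputs with the search results. The sketch below follows the general shape of the loss described in the Zero paper listed at the end of this article, with the regularisation term left out; it assumes the value output v lies in [-1, 1] and the game result z is +1 for a win and -1 for a loss.

```python
import math

def zero_style_loss(p, v, pi, z):
    """Squared value error plus cross-entropy between search and network move distributions.

    p  : move probabilities predicted by the network
    v  : predicted game result, assumed to lie in [-1, 1]
    pi : move distribution produced by the tree search (the training target)
    z  : actual game result, +1 for a win and -1 for a loss
    """
    value_error = (z - v) ** 2
    cross_entropy = -sum(t * math.log(q) for t, q in zip(pi, p) if t > 0)
    return value_error + cross_entropy

search_pi = [0.7, 0.2, 0.1]   # made-up search result for three moves
good = zero_style_loss(p=[0.7, 0.2, 0.1], v=0.9, pi=search_pi, z=1)
bad = zero_style_loss(p=[0.1, 0.2, 0.7], v=-0.5, pi=search_pi, z=1)
print(good, bad)
```

Training lowers this loss, pulling the network's move probabilities toward the search results and its value toward the actual outcomes.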

In the process of constantly playing against itself, Zero gained a great deal of new knowledge about Go and corrected some human misconceptions about the game. In other words, Zero greatly improved its own playing strength while exploring Go, and its techniques and strategies have surpassed those of humans in some respects. How Zero reached such a high level in such a short time has not yet been satisfactorily explained.

Although AlphaGo has surpassed humans in the field of Go, it has firmly retired from the Go world, and we will not see it play against humans again. However, there is every reason to believe that other AI programs like AlphaGo, such as BetaOx, will emerge and defeat many geniuses in other fields in the near future. More worryingly, will AI force many professionals to give up, helplessly, the work they love? Should we welcome or resist the advent of the age of artificial intelligence?

(This article draws on two papers by the AlphaGo team, "Mastering the game of Go with deep neural networks and tree search" and "Mastering the game of Go without human knowledge", as well as "From the success of AlphaGo" by Mengdi Zhang and others, "Copying AlphaGo's algorithm and talking about your own thinking" by Haotong Zhao, and "AlphaGo is actually quite 'stupid'" by Mr. Chen and the pending word. Images in the text are from the internet and the listed references.)