**26. Viterbi and his Viterbi algorithm**

The Viterbi algorithm is one of the most commonly used algorithms in modern digital communication, and it is also the decoding algorithm behind many natural language processing applications.

First, the chapter discusses the **Viterbi algorithm**. Andrew Viterbi and Irwin Jacobs co-founded Qualcomm, which proposed the CDMA standard. The Viterbi algorithm was proposed to find the shortest path in the directed graph of a lattice (trellis) network. It is a special but very widely used dynamic programming algorithm, and it can decode any problem described by a hidden Markov model. The chapter then describes the Viterbi algorithm in detail and notes that its complexity is O(N·D²), where N is the length of the lattice and D is its width.
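A minimal sketch of Viterbi decoding over a hidden Markov model may make the O(N·D²) structure concrete: for each of the N time steps, every one of the D states examines D predecessors. The states, observations, and probability tables below are made-up toy values, not an example from the book.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations."""
    # best[t][s] = probability of the best path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):          # N steps ...
        best.append({})
        back.append({})
        for s in states:                  # ... times D states ...
            # ... times D predecessors: keep the most probable one.
            prev, p = max(
                ((r, best[t - 1][r] * trans_p[r][s]) for r in states),
                key=lambda x: x[1],
            )
            best[t][s] = p * emit_p[s][obs[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
print(viterbi(["walk", "shop", "clean"], states, start_p, trans_p, emit_p))
```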

Next comes **the basis of CDMA technology: 3G mobile communication**. The two people who contributed most to the invention and popularization of CDMA (Code-Division Multiple Access) technology are Hedy Lamarr and Viterbi. Before CDMA, mobile communication used two technologies: frequency-division multiple access (FDMA) and time-division multiple access (TDMA). Spread-spectrum transmission sends the signal over a wide, spread-out frequency band, which has three advantages: strong resistance to interference, a signal that is hard to intercept, and fuller utilization of the bandwidth.

**27. God's algorithm: Expectation maximization algorithm**

First, the chapter discusses **self-converging classification of text**. Classifying new text into predefined categories requires category definitions and category centers in advance, while clustering texts pairwise from the bottom up takes much longer. Self-converging classification instead picks the centers of some categories at random and then refines them so that they agree as closely as possible with the real cluster centers. The specific process is:

1) Randomly select K points (K is the number of classes) as the starting centers c1(0), …, ck(0).

2) Calculate the distance from every point to these cluster centers, and assign each point to the nearest class.

3) Recalculate the centers of each category.

4) Repeat this process until the offset between the new centers and the old centers is very small, i.e. the process converges.
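Steps 1–4 above can be sketched as a one-dimensional k-means loop. The data and k = 2 are made-up toy values, kept in one dimension for brevity:

```python
import random

def kmeans(points, k, iters=100):
    centers = random.sample(points, k)          # step 1: random starting centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # step 2: assign to nearest center
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centers[i]  # step 3: recompute centers
               for i, c in enumerate(clusters)]
        if all(abs(a - b) < 1e-9 for a, b in zip(new, centers)):
            break                               # step 4: stop once centers settle
        centers = new
    return sorted(centers)

random.seed(0)
data = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]
print(kmeans(data, 2))                          # two centers, near 1.0 and 10.0
```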

Finally, **extended reading: the expectation-maximization algorithm and the inevitability of convergence**. In the general case, given very many observed data points, a computer can iteratively learn a model in a way similar to the method above. First, using the current model, compute the result of feeding each observation into the model; this is called the expectation step, or E step. Then recompute the model parameters to maximize that expectation; this is called the maximization step, or M step. Algorithms of this type are called EM algorithms. If the objective function being optimized is a convex function, the global optimum is guaranteed.
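As a hedged illustration of the E and M steps, here is a tiny EM loop for a two-component Gaussian mixture in one dimension. The data, starting means, and the simplifications (equal weights, unit variances) are all assumptions of this sketch, not the book's example:

```python
import math

def em(data, mu, iters=50):
    # Fixed equal weights and unit variances keep the example minimal.
    for _ in range(iters):
        # E step: expected responsibility of component 0 for each point.
        r = []
        for x in data:
            p0 = math.exp(-0.5 * (x - mu[0]) ** 2)
            p1 = math.exp(-0.5 * (x - mu[1]) ** 2)
            r.append(p0 / (p0 + p1))
        # M step: re-estimate the means to maximize the expected likelihood.
        mu = [sum(ri * x for ri, x in zip(r, data)) / sum(r),
              sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)]
    return mu

data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
print(em(data, [0.0, 6.0]))   # the two means settle near 1.0 and 5.0
```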

**28. Logistic regression and search ads**

First, the chapter discusses the **development of search advertising**, which has gone through three stages. The first stage was ranking ads purely by bid. In the second stage, ad placement is determined by a combination of the bid, the click-through rate (CTR), and other factors; the key technology is predicting the probability that a user will click a candidate ad, known as CTR estimation. The third stage is further global optimization. The straightforward way to estimate CTR is from past statistics, but this approach has many shortcomings, so the logistic regression model is now widely used in industry.

Next comes the **logistic regression model**. It fits the probability of an event to a logistic curve (an S-shaped curve that changes quickly at first, gradually slows down, and finally saturates); its input variable ranges over (−∞, +∞) while its value lies in [0, 1]. A simple form of the logistic regression function is f(z) = 1/(1 + e^(−z)). The chapter then gives a simple example to explain the logistic regression model, which involves two key points: first, the information (features) related to ad clicks, and second, training the parameter values. The logistic regression model is an exponential model that combines the different factors affecting the probability, and like many exponential models it can be trained with methods such as GIS and IIS.
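A minimal sketch of fitting f(z) = 1/(1 + e^(−z)) in the spirit of CTR estimation may help. The book mentions GIS and IIS; the per-sample gradient ascent used below is a simpler, commonly used substitute, and the single feature, labels, and learning rate are all made up:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.5, epochs=2000):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)       # predicted click probability
            w += lr * (y - p) * x        # gradient ascent on the log-likelihood
            b += lr * (y - p)
    return w, b

# One feature (say, a relevance score); label 1 = clicked, 0 = not clicked.
xs = [0.1, 0.4, 0.5, 1.2, 1.5, 2.0]
ys = [0,   0,   0,   1,   1,   1]
w, b = train(xs, ys)
print(sigmoid(w * 2.0 + b), sigmoid(w * 0.1 + b))   # high vs. low probability
```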

**29. Divide-and-conquer and the foundation of Google's cloud computing**

One of the keys to cloud computing is how to automatically decompose a very large computational problem so that it can run on many computers of modest computing power. Google's answer is a framework called MapReduce, whose fundamental principle is the divide-and-conquer algorithm, which this book calls the "conquer" method.

First, the chapter discusses the **principle of the divide-and-conquer algorithm**: split a complex problem into several simpler subproblems, then merge the subproblem results to obtain the solution of the original problem.

Next comes the step **from divide and conquer to MapReduce**. The chapter first explains the example of multiplying large matrices, which leads to the basic principle of MapReduce: splitting a large task into small subtasks and computing those subtasks is called the map step, and merging the intermediate results into the final result is called the reduce step. Ensuring that the servers are load-balanced and that the returned values are combined correctly is what MapReduce does on the engineering side.
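The map/reduce split described above can be sketched with the classic word-count task on a single machine (the input chunks are made up; in a real system each map task would run on a separate server):

```python
from collections import Counter
from functools import reduce

def map_task(chunk):
    # Each map task counts words in its own small slice of the input.
    return Counter(chunk.split())

def reduce_task(a, b):
    # Reduce merges intermediate counts into the final result.
    return a + b

chunks = ["the quick brown fox", "the lazy dog", "the fox"]
partials = [map_task(c) for c in chunks]      # could run on separate servers
total = reduce(reduce_task, partials, Counter())
print(total["the"], total["fox"])
```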

**30. Google brain and Artificial neural network**

First, the chapter discusses the **artificial neural network**. It is a special kind of directed graph, special in that:

1) All nodes are organized in layers; the nodes of each layer may point through directed arcs to the nodes of the layer above, but there are no arcs between nodes in the same layer, and no node may skip a layer to connect to a node further up.

2) Every arc carries a weight; according to these weights, the value of the node an arc points to can be computed with a very simple formula.

The bottom layer is called the input layer and the top layer the output layer; the layers in between are middle layers, also called hidden layers because they are not visible from outside. An example from speech recognition follows. The basic principle of an artificial neural network is this: the values of the input nodes (x1, …, xn) are linearly weighted by the weights of their outgoing arcs (w1, …, wn) according to the formula G = w0 + x1·w1 + … + xn·wn; then a function transformation f(G) is applied (exactly once) to obtain the value of the second-layer node y, and the second-layer nodes pass their values onward in the same way until the last layer, the output layer, is reached. When classifying a pattern, the pattern's feature values are fed in at the input layer, and whichever output node ends up with the maximum value determines the category the input pattern is assigned to.

In an artificial neural network, only two parts need to be designed: first its structure, i.e. how many layers the network has, how many nodes per layer, how the nodes are connected, and so on; second the nonlinear function f(·), for which exponential functions are a common choice. If the values obtained at the different output nodes are viewed as a probability distribution, the network is equivalent to a probabilistic model.

Next comes **training artificial neural networks**, of which there are two kinds: supervised and unsupervised. In supervised training, a set of parameters w is found from the training data so that the outputs given by the model agree as closely as possible with the outputs specified in the training data. A cost function can be introduced to turn this into an optimization problem, and the most common solution method is gradient descent. In unsupervised training, a new cost function must be defined so that samples of the same class end up close to each other and samples of different classes end up far apart; gradient descent can then be used for training in the same way.
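The supervised recipe above, define a cost and descend its gradient, can be sketched on the simplest possible model. A one-parameter model y = w·x with a squared-error cost stands in for the full network here; the data and learning rate are made up:

```python
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # training outputs, generated by w = 2

w, lr = 0.0, 0.05
for _ in range(200):
    # Gradient of the cost C(w) = sum (w*x - y)^2 with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys))
    w -= lr * grad            # step against the gradient
print(round(w, 3))            # converges to the value that fits the data
```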

The chapter also discusses the **relationship between artificial neural networks and Bayesian networks**. The similarities are:

1) Both are directed graphs and follow the Markov assumption (the value of each node depends only on the nodes of the preceding level).

2) The training method is similar.

3) For many pattern classification problems, they perform similarly.

4) Both require a particularly large amount of computation to train.

The difference is:

1) Artificial neural networks are completely standardized in structure, while Bayesian networks are more flexible.

2) An artificial neural network performs only a linear combination followed by one nonlinear transformation, so it is easier to implement; a Bayesian network is unconstrained in this respect and therefore more complicated.

3) The outputs of an artificial neural network are relatively independent of one another, so it has difficulty handling a sequence; it is therefore often used to estimate the parameters of a probabilistic model rather than as a decoder. A Bayesian network more readily takes correlations into account, so it can decode an input sequence.

Finally, **extended reading: the "Google Brain"**. Early artificial neural networks were limited: small networks were of limited use, while large ones required more computation than was feasible. The rise of cloud computing around 2010 made it both necessary and possible to change the way artificial neural networks are trained. The innovation of the Google Brain is its use of the parallel processing technology of cloud computing. The reasons the Google Brain uses artificial neural networks are:

1) In theory, they can draw pattern-classification boundaries of various shapes, so they have good versatility.

2) The algorithm is very stable and has not changed much.

3) Easy to implement in parallel.

The Google Brain trains the parameters of an artificial neural network in parallel and makes two improvements to reduce the amount of computation: first, it uses stochastic gradient descent, which computes the cost function from only a small randomly drawn sample of the data, greatly reducing computation; second, it uses the L-BFGS method to reduce the number of training iterations. As for storage, the input training data is stored locally on the input side, and the model parameters of each server are stored separately by another group of servers.
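The first improvement, estimating the cost gradient from a small random sample instead of all the data, can be sketched as follows (the data, batch size, and learning rate are made up, and the one-parameter model is a stand-in for a real network):

```python
import random

random.seed(1)
xs = [float(i) for i in range(1, 101)]
ys = [3.0 * x for x in xs]            # training outputs, generated by w = 3

w, lr = 0.0, 1e-4
for _ in range(500):
    batch = random.sample(range(len(xs)), 5)   # tiny random subset of the data
    # Gradient of the squared-error cost, estimated from the batch only.
    grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / 5
    w -= lr * grad
print(round(w, 2))                    # approaches the true value 3
```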

**31. The power of Big data: talking about the importance of data**

First, the chapter discusses **the importance of data**. Three examples show that data matters not only in scientific research but in every aspect of life.

Next comes **data in statistics and information technology**. The chapter introduces Chebyshev's inequality and shows that in information processing, handling probability problems requires the support of a large amount of data. Besides a sufficient volume of data, statistics also requires that the sampled data be representative, which is illustrated with an example from a US presidential election. Data has now become the first factor determining the quality of a search engine, with the algorithm second. Among the many kinds of search data, the two most important are the web pages themselves and user click data, which is why a Matthew effect has formed in the search industry. Companies entering the search market later have taken other routes to acquire data quickly: first, buying traffic; second, collecting users' click behavior through search toolbars, browsers, and even input methods. The competition over search quality has thus turned into competition over the market share of browsers and other client software. The rapid progress of Google's machine translation system then shows that in data-driven statistical models, more data yields better translation, and an example from speech recognition likewise shows that the major improvements in that field came from large amounts of data.
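The point about data volume can be made concrete with a small simulation: by Chebyshev's inequality, P(|X̄ − p| ≥ ε) ≤ σ²/(nε²), so the estimate of a probability tightens as the sample size n grows. The "true" click probability and the sample sizes below are made up:

```python
import random

random.seed(0)
p = 0.1                                   # assumed true click-through rate
estimates = {}
for n in [100, 10_000, 1_000_000]:
    # Simulate n user impressions; each is a click with probability p.
    clicks = sum(random.random() < p for _ in range(n))
    estimates[n] = clicks / n
    print(n, round(estimates[n], 4))      # the estimate tightens as n grows
```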

Finally, the chapter discusses **why big data is needed**. Big data requires not only a large quantity of data but also multiple dimensions and completeness. Many examples involving genetics and disease then illustrate the importance of big data to the healthcare industry. Finally, the author summarizes:

1) Only when some combination of random events appears many times can we obtain meaningful statistical laws.

2) The acquisition of big data is a natural process, which helps eliminate subjective bias.

3) Only big data with multiple dimensions can reveal new laws.

4) Big data can solve some of the challenges outside the IT industry.

**Appendix. Computational Complexity**

If a problem has an algorithm of polynomial complexity, it is called a P problem (P for polynomial), and such problems are considered "effectively" solvable by computer. If the amount of computation an algorithm requires grows faster than any polynomial in n, then even though there is theoretically enough time to compute it, doing so is not feasible in practice; such problems are called non-polynomial problems. Among the non-polynomial problems, one class, the nondeterministic polynomial (NP) problems, has received great attention. Problems whose computational complexity is at least that of the NP-complete problems, or even greater, are called NP-hard.
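The gap between polynomial and non-polynomial running times can be made concrete with a tiny table (the problem sizes are arbitrary): n³ stays manageable while 2ⁿ explodes.

```python
# Compare a polynomial cost n^3 with an exponential cost 2^n as n grows.
rows = [(n, n ** 3, 2 ** n) for n in [10, 30, 60]]
for n, poly, exp in rows:
    print(n, poly, exp)
```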

At this point the book ends. Overall, this is a popular-science book that explains quite advanced technology in very accessible language; both ordinary readers and technical experts can benefit from it. I personally find the book very well written, and I will often look back at it in the future.

Copyright notice: this is the blogger's original article; please do not reproduce it without the blogger's permission.

The beauty of Mathematics (second edition) (vi)