I read Wu Jun's "The Beauty of Mathematics" a while ago. Although it is a popular-science book, there is still plenty to gain from it. Here is a summary based on memory and on the brief notes I wrote down at the time.
1. The role of information is to eliminate uncertainty, and a great many natural language processing problems come down to finding relevant information.
2. On search: technology can be divided into two levels, technique (术) and principle (道). The specific way of doing something is the technique; the underlying principle behind it is the dao. Only by grasping the essence and principles of search can it truly be mastered.
3. The workflow of a search engine. A search engine roughly needs to do three things: automatically download as many web pages as possible, build fast and efficient indexes, and rank the pages fairly and accurately by their relevance to the query.
4. The index mentioned above is divided into tiers according to the importance, quality, and access frequency of the web pages.
5. To automatically download pages from across the Internet, graph traversal algorithms are used: depth-first traversal and breadth-first traversal.
6. The downloading itself is done by a web crawler. Starting from any web page and following a graph traversal algorithm, it automatically visits every reachable page and saves it.
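A minimal sketch of how such a crawl works, using breadth-first traversal over a toy in-memory link graph (the URLs and the TOY_WEB mapping are made-up stand-ins for real downloading and link extraction):

```python
from collections import deque

# A toy "web": each URL maps to the links found on that page.
# In a real crawler these would come from downloading and parsing the page.
TOY_WEB = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["a.com", "d.com"],
    "c.com": ["d.com"],
    "d.com": [],
}

def crawl_bfs(seed):
    """Visit every page reachable from `seed` using breadth-first traversal."""
    visited = {seed}
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)            # here a real crawler would save the page
        for link in TOY_WEB.get(url, []):
            if link not in visited:  # the visited set prevents re-downloading
                visited.add(link)
                queue.append(link)
    return order

print(crawl_bfs("a.com"))  # ['a.com', 'b.com', 'c.com', 'd.com']
```

A real crawler would replace TOY_WEB with actual page downloads and persist the pages, but the visited-set bookkeeping that prevents endless revisiting is the same.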
7. Google's PageRank algorithm: simply put, it is a democratic vote. If a web page is linked to by many other pages on the Internet, it is generally recognized and trusted, so it should rank high. Moreover, links coming from pages that themselves rank high contribute more weight.
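The idea can be illustrated with a small power-iteration sketch (the link graph, handling of dangling pages, damping factor, and iteration count are illustrative choices, not the book's exact formulation):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a small link graph.
    `links` maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}       # start from a uniform ranking
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for p, outgoing in links.items():
            if not outgoing:                          # dangling page: spread its rank evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / len(pages)
            else:
                for q in outgoing:                    # a page passes its rank to the pages it links to
                    new_rank[q] += damping * rank[p] / len(outgoing)
        rank = new_rank
    return rank

toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(toy_links))   # "c", which three pages link to, ends up with the highest rank
```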
8. Relevance between web pages and queries: the scientific way to weight keywords is the TF-IDF technique.
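A rough sketch of the TF-IDF computation on a tiny hand-made corpus (the documents are invented and assumed to be tokenized already):

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute TF-IDF weights for each word of each document.
    documents: list of token lists."""
    n_docs = len(documents)
    # document frequency: how many documents contain each word
    df = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return weights

docs = [["atomic", "energy", "applications", "of", "atomic", "energy"],
        ["applications", "of", "search", "engines"]]
for w in tf_idf(docs):
    print(w)
```

Words such as "of" and "applications", which occur in every document, get an IDF of log(1) = 0 and thus zero weight, which is exactly the intuition behind inverse document frequency.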
9. Address recognition uses a finite state machine, which consists of states (nodes) and arcs connecting those states. A probability-based finite state machine can be applied to fuzzy matching.
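A toy deterministic finite state machine for addresses of the hypothetical form "city, road, house number"; the states, vocabulary, and token classifier below are all invented for illustration:

```python
# Transition table: (current state, input class) -> next state.
TRANSITIONS = {
    ("START", "city"): "CITY",
    ("CITY", "road"): "ROAD",
    ("ROAD", "number"): "NUMBER",
}
ACCEPTING = {"NUMBER"}

CITIES = {"Beijing", "Shanghai"}
ROADS = {"Zhongguancun Street", "Nanjing Road"}

def token_type(token):
    if token in CITIES:
        return "city"
    if token in ROADS:
        return "road"
    if token.isdigit():
        return "number"
    return "other"

def is_address(tokens):
    """Run the tokens through the FSM; accept only if we end in an accepting state."""
    state = "START"
    for token in tokens:
        state = TRANSITIONS.get((state, token_type(token)))
        if state is None:          # no arc for this input: reject
            return False
    return state in ACCEPTING

print(is_address(["Beijing", "Zhongguancun Street", "27"]))   # True
print(is_address(["Zhongguancun Street", "Beijing", "27"]))   # False
```

The probabilistic version mentioned in the note would attach probabilities to the arcs so that slightly malformed addresses can still be matched to the most likely path.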
10. Global navigation uses dynamic programming algorithms.
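One way to picture the dynamic programming idea: break the route into stages and keep, for every intermediate city, only the shortest distance found so far. All cities and road lengths below are made up for illustration:

```python
# Dynamic programming for route planning over a staged graph.
stages = [["Zhengzhou", "Jinan"], ["Wuhan", "Nanjing"], ["Guangzhou"]]
distance = {                       # road length between cities in adjacent stages
    ("Beijing", "Zhengzhou"): 700, ("Beijing", "Jinan"): 400,
    ("Zhengzhou", "Wuhan"): 500,   ("Zhengzhou", "Nanjing"): 700,
    ("Jinan", "Wuhan"): 900,       ("Jinan", "Nanjing"): 600,
    ("Wuhan", "Guangzhou"): 1000,  ("Nanjing", "Guangzhou"): 1400,
}

best = {"Beijing": 0}              # shortest known distance from the start
for stage in stages:
    # best distance to a city = min over previous-stage cities of
    # (best distance to the previous city + road length)
    best = {
        city: min(best[prev] + distance[(prev, city)]
                  for prev in best if (prev, city) in distance)
        for city in stage
    }

print(best)   # {'Guangzhou': 2200}, i.e. Beijing -> Zhengzhou -> Wuhan -> Guangzhou
```

The key point is that the full set of routes never has to be enumerated: only the best partial results per stage are carried forward.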
11. Classifying news. The book cleverly connects news classification with the cosine formula. Treat each news article as a feature vector: the number of dimensions equals the size of the vocabulary (say 64,000 words), and each component measures how much the corresponding word contributes to the topic of the article. Then compute the angle between two feature vectors x and y:
cos(theta) = <x, y> / (|x| * |y|) = (x1*y1 + x2*y2 + ... + x64000*y64000) / (sqrt(x1^2 + x2^2 + ... + x64000^2) * sqrt(y1^2 + y2^2 + ... + y64000^2))
(xi and yi denote the i-th components of x and y, sqrt() is the square root, and x1^2 is the square of x1.) The smaller the angle, the more similar and the more relevant the two articles are. To classify a new article, compute the angle between its feature vector and the representative feature vector of each news category, and assign it to the category with the smallest angle.
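A small sketch of the cosine computation on bag-of-words vectors (in practice each component would be a TF-IDF weight over the whole vocabulary; the toy articles here are invented):

```python
import math
from collections import Counter

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse word-count vectors (dicts)."""
    dot = sum(vec_a[w] * vec_b.get(w, 0) for w in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b)

# Toy "articles" represented as bags of words.
finance = Counter("stock market shares stock bank".split())
sports  = Counter("match goal team match coach".split())
query   = Counter("bank shares stock".split())

for name, vec in [("finance", finance), ("sports", sports)]:
    print(name, round(cosine_similarity(query, vec), 3))
# The query has the largest cosine (smallest angle) with the finance category.
```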
12. Information fingerprint: extract the characteristic content of a piece of information as a string, then map that string to a (pseudo-)random integer. As long as the algorithm producing the number is good enough, it is almost impossible for two different strings to end up with the same fingerprint. The benefits of information fingerprints: they save space, since storing an integer takes far less room than storing the whole string, and they improve lookup efficiency, since comparing strings requires matching them character by character, while integers can be sorted once and then searched with binary search.
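A possible sketch, using the first 8 bytes of an MD5 digest as a 64-bit fingerprint to deduplicate URLs (the choice of MD5 and of 64 bits is mine for illustration, not the book's):

```python
import hashlib

def fingerprint(text):
    """Map a string to a 64-bit integer fingerprint."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/a",   # duplicate
]
seen = set()
for url in urls:
    fp = fingerprint(url)
    if fp in seen:
        print("duplicate:", url)
    else:
        seen.add(fp)          # store an 8-byte integer instead of the whole URL
```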
13. Pseudo-random number generators: the Mersenne Twister algorithm.
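As a side note, CPython's standard random module is itself implemented with the Mersenne Twister (MT19937), so it can serve as a quick demonstration; it is not suitable for cryptographic use, which is exactly the point of the next note:

```python
import random

# CPython's random module uses the Mersenne Twister (MT19937) internally.
rng = random.Random(42)               # seeding makes the sequence reproducible
print([rng.random() for _ in range(3)])
print(rng.getrandbits(32))            # 32 raw bits from the generator
```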
14. Encryption on the Internet requires cryptographically strong pseudo-random number generation. Commonly used algorithms include MD5 and SHA-1; they turn a message of arbitrary length into a fixed-length 128-bit or 160-bit value that behaves like a random number. SHA-1 was shown to have a flaw by Professor Wang Xiaoyun of Shandong University. A digression: the book was published before Professor Wang's MD5 result, so it does not mention that she in fact also proved that MD5 is flawed. Contrary to what is widely circulated, this does not mean MD5 has been completely broken: what was found is a strong collision, i.e. MD5 is not strongly collision-resistant. To really crack MD5, one would need to break its weak collision resistance. Let f(x) be the hash function. Weak collision resistance means: given a number x, it is hard to find another number y such that f(x) = f(y). Strong collision resistance means: it is hard to find any pair of numbers x and y such that f(x) = f(y). A good cryptographic hash algorithm should make it infeasible to find either strong or weak collisions.
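A quick demonstration with Python's hashlib that the digest length is fixed no matter how long the input is:

```python
import hashlib

for message in [b"short", b"a much, much longer message" * 100]:
    md5 = hashlib.md5(message).hexdigest()
    sha1 = hashlib.sha1(message).hexdigest()
    # MD5 always yields 128 bits (32 hex characters), SHA-1 always 160 bits
    # (40 hex characters), regardless of the input length.
    print(len(message), len(md5) * 4, len(sha1) * 4)
```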
15. Search engine cheating: the top-ranked sites in a search engine are not necessarily high-quality, highly relevant sites. For example, a page may stuff large numbers of keywords into hidden fields to boost its ranking, or buy large numbers of inbound links. In essence, cheating adds noise to the page-ranking signal, so the key to anti-cheating is removing that noise.
16. The maximum entropy model and the maximum entropy principle. The popular way of putting it is: do not put all your eggs in one basket. It is a simple and elegant model, the only kind that both satisfies the constraints imposed by every information source and keeps the model as smooth (assumption-free) as possible, but the amount of computation is very large. The best way to combine many different features is the maximum entropy model.
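For reference, the usual general form of the maximum entropy model, written from the standard definition rather than quoted from the book (the f_i are feature functions, the lambda_i their weights, and Z(x) the normalization factor):

```latex
% Maximum entropy model: the conditional probability is a log-linear
% combination of feature functions f_i with weights \lambda_i,
% normalized by Z(x) so that the probabilities sum to one.
P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big),
\qquad
Z(x) = \sum_{y'} \exp\Big(\sum_i \lambda_i f_i(x, y')\Big)
```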
17. Input methods. The speed of typing Chinese characters depends on the average length of their encoding, that is, the number of keystrokes multiplied by the time needed to find each key. Encoding Chinese characters involves both encoding the pinyin and resolving ambiguity. The Wubi input method reduces the number of keystrokes per character but ignores the time spent locating each key, so its effective average code length is longer and typing is slower.
18. Shannon's first theorem (the source coding theorem): the average length of any encoding of an information source cannot be less than its information entropy.
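Notes 17 and 18 can be summarized with the standard formulation, where the p_i are the symbol probabilities, the l_i the code lengths, L the average code length, and H the entropy:

```latex
% Average code length L of a code with symbol probabilities p_i and codeword
% lengths l_i, and Shannon's source coding bound: L can never fall below the
% entropy H of the source.
L = \sum_i p_i\, l_i \;\ge\; H = -\sum_i p_i \log_2 p_i
```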
19. The pinyin-to-characters algorithm of a pinyin input method: each pinyin syllable corresponds to several Chinese characters, and connecting the candidate characters of a pinyin string from left to right forms a graph, called a lattice (or fence graph). The input method's task is to find the most probable sentence given the pinyin context. By transforming the probabilities in the lattice (taking logarithms), the product of probabilities becomes a sum, which turns the task into a shortest-path problem, and the shortest path can be found with a dynamic programming algorithm.
20. A Bloom filter is used to test whether an element is in a set. It maps each element to positions in a bit array using hash functions, so to check membership one only needs to look at whether those positions are all 1. A Bloom filter is fast and space-efficient, but because hash functions inevitably collide, it has a certain false-positive rate. The remedy is to maintain a small whitelist storing the elements that would otherwise be misidentified.
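A minimal Bloom filter sketch; the array size, number of hash functions, and the use of MD5 to derive positions are arbitrary illustrative choices:

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: k hash positions per element in a bit array.
    A real filter would size the array from the expected number of elements
    and the target false-positive rate."""

    def __init__(self, size=1024, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)   # one byte per "bit" for simplicity

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # If any position is 0 the item is definitely absent;
        # if all are 1 it is *probably* present (false positives are possible).
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("spammer@example.com")
print(bf.might_contain("spammer@example.com"))   # True
print(bf.might_contain("friend@example.com"))    # almost certainly False
```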
21. A Markov chain describes a sequence of states in which each state depends only on a limited number of preceding states. For many practical problems this is a very crude simplification, because in real life the relationships between things cannot always be strung together in a single chain; they may cross and intertwine. In such cases a Bayesian network should be used; it is a weighted directed graph. A Markov chain is a special case of a Bayesian network, and a Bayesian network is a generalization of the Markov chain. Training a Bayesian network is an NP-hard problem; in practice a greedy method can be used, and to avoid getting stuck in local optima, Monte Carlo methods can be applied.
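A tiny first-order Markov chain simulation, with made-up weather states and transition probabilities, just to show that each step depends only on the current state:

```python
import random

# Transition probabilities of a toy first-order Markov chain.
TRANSITIONS = {
    "sunny": [("sunny", 0.8), ("rainy", 0.2)],
    "rainy": [("sunny", 0.4), ("rainy", 0.6)],
}

def simulate(start, steps, seed=0):
    """Sample a state sequence; the next state depends only on the current one."""
    rng = random.Random(seed)
    state, path = start, [start]
    for _ in range(steps):
        next_states, probs = zip(*TRANSITIONS[state])
        state = rng.choices(next_states, weights=probs)[0]
        path.append(state)
    return path

print(simulate("sunny", 10))
```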
22. Good methods are often simple in form.
23. The Viterbi algorithm is a special but very widely used dynamic programming algorithm. It is used to find the Viterbi path, i.e. the hidden state sequence most likely to have produced the observed sequence of events. It solves the shortest-path problem on the directed graph of a lattice (fence network) and can decode any problem that can be described by a hidden Markov model.
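A minimal Viterbi decoder over a toy hidden Markov model; the states, observations, and probabilities are the classic made-up weather example, not anything taken from the book:

```python
states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def viterbi(observations):
    """Return the most likely hidden state sequence for the observations."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # extend into state s from whichever previous state gives the best path
            prob, prev = max((best[p][0] * trans_p[p][s] * emit_p[s][obs], p)
                             for p in states)
            new_best[s] = (prob, best[prev][1] + [s])
        best = new_best
    return max(best.values())[1]

print(viterbi(["walk", "shop", "clean"]))   # ['Sunny', 'Rainy', 'Rainy']
```

At each step only the best path into each state is kept, which is exactly the shortest-path-over-a-lattice view described in notes 19 and 23.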
24. A logistic regression model is an exponential model that combines the different factors affecting a probability. As with other exponential models (such as the maximum entropy model), its training is similar and can be carried out with the iterative scaling algorithm GIS or its improved version IIS.
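For reference, the standard form of the logistic regression model (written from the usual definition, not quoted from the book):

```latex
% Logistic regression: the probability of the outcome is a sigmoid of a linear
% combination of the factors x_1, ..., x_n with learned weights \beta_i.
P(y = 1 \mid x_1, \dots, x_n)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}}
```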
25. MapReduce uses divide and conquer. A large task is split into small subtasks, and carrying out the computation of the subtasks is the step called map; merging the intermediate results into the final result is the step called reduce.
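A toy word count written in the MapReduce style, run sequentially in one process; real MapReduce distributes the map and reduce steps over many machines:

```python
from collections import defaultdict

def map_phase(chunk):
    """Turn one input chunk into (key, value) pairs."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(word, counts):
    """Merge all values for one key into the final result."""
    return word, sum(counts)

chunks = ["the beauty of mathematics", "the beauty of simple methods"]

# map: process each chunk independently (the part that parallelizes)
intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]

# shuffle: group intermediate values by key
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# reduce: merge each group into a final (word, total) pair
result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)   # {'the': 2, 'beauty': 2, 'of': 2, 'mathematics': 1, 'simple': 1, 'methods': 1}
```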
The above is only a superficial review of the book. Compared with its full contents, it is like a few shells picked up on a beach compared with the whole beach. To really understand the book, it is better to read it from cover to cover.
"The beauty of mathematics" Reading notes