The Beauty of Mathematics, Series 16 (Part 2): Don't Put All Your Eggs in One Basket. Last time we saw how the maximum entropy model combines various sources of information. We left one question unanswered: how to construct, that is, train, a maximum entropy model. All maximum entropy models take the form of exponential functions, so all that remains is to determine the parameters of those exponential functions. This process is called model training.
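For concreteness, a maximum entropy model of an outcome y given a context x takes the standard exponential (log-linear) form below, where the f_i are feature functions encoding the individual pieces of information and the weights \lambda_i are the parameters that training must determine (this notation is the standard one from the literature, not from the original text):

$$
P(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big),
\qquad
Z_\lambda(x) = \sum_{y'} \exp\Big(\sum_i \lambda_i f_i(x, y')\Big).
$$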
The original training method for the maximum entropy model was an iterative algorithm called GIS (Generalized Iterative Scaling). The principle behind GIS is not complicated and can be roughly summarized as follows:
1. Start, at the zeroth iteration, from the uniform model in which all outcomes are equally probable.
2. Use the model from the Nth iteration to estimate the expected value of each information feature on the training data. If a feature's expected value exceeds its actual (empirical) value, decrease the corresponding model parameter; otherwise, increase it.
3. Repeat step 2 until the parameters converge. (A code sketch of this procedure follows the list.)
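Below is a minimal sketch of GIS in Python, assuming binary feature functions and a small in-memory training set; the function names, data layout, and fixed iteration count are illustrative assumptions, not taken from the original papers.

```python
import math

def train_gis(samples, features, labels, iterations=100):
    """samples:  list of (context, label) pairs.
    features: list of functions f(context, label) -> 0 or 1.
    labels:   list of all possible labels."""
    n = len(samples)
    # GIS assumes every (context, label) pair activates the same total
    # number of features; C is that constant (real implementations add
    # a slack feature to enforce this).
    C = max(sum(f(x, yy) for f in features)
            for x, _ in samples for yy in labels)
    lam = [0.0] * len(features)   # step 1: start from the uniform model

    def prob(x, y):
        # the current exponential model P(y | x)
        scores = {yy: math.exp(sum(l * f(x, yy)
                                   for l, f in zip(lam, features)))
                  for yy in labels}
        return scores[y] / sum(scores.values())

    # actual (empirical) expectation of each feature in the training data
    empirical = [sum(f(x, y) for x, y in samples) / n for f in features]

    for _ in range(iterations):   # step 3: repeat until convergence
        # step 2: expectation of each feature under the current model
        model = [sum(prob(x, yy) * f(x, yy)
                     for x, _ in samples for yy in labels) / n
                 for f in features]
        for i, (emp, mod) in enumerate(zip(empirical, model)):
            if emp > 0 and mod > 0:
                # if the model overestimates the feature, this lowers
                # lambda_i; if it underestimates, this raises it
                lam[i] += math.log(emp / mod) / C
    return lam
```

Each update moves \lambda_i by (1/C) ln(empirical/model), which is exactly the "decrease if too high, increase if too low" rule of step 2.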
GIS was first proposed by Darroch and Ratcliff in the 1970s. However, the two did not give a good explanation of the physical meaning of the algorithm; that was later done by the mathematician Csiszár. Therefore, whenever people discuss this algorithm, they cite the papers of Darroch and Ratcliff and of Csiszár together. Each GIS iteration takes a long time, many iterations are needed to converge, and the algorithm is not numerically stable: it can overflow even on 64-bit computers. GIS is therefore rarely used in real applications; most people study it only to understand the principles of maximum entropy training.
In the 1980s, the very talented twin brothers Stephen and Vincent Della Pietra improved the GIS algorithm at IBM and proposed IIS (Improved Iterative Scaling), which shortened the training time of the maximum entropy model by one to two orders of magnitude. Only then did the maximum entropy model become potentially practical. Even so, at the time only IBM had the computing resources to use it.
Because the maximum entropy model is mathematically perfect and therefore very tempting to scientists, many researchers tried to fit their own problems with approximations to it. But once approximated, the maximum entropy model is no longer perfect, and the results were predictably not much better than the ad hoc patchwork methods it was meant to replace. Many enthusiasts consequently gave up on the approach. The first to demonstrate the advantages of the maximum entropy model in a real information-processing application was Adwait Ratnaparkhi, a student of Marcus at the University of Pennsylvania who later became a researcher at IBM. Ratnaparkhi's cleverness lay in not approximating the maximum entropy model; instead, he found several natural language processing problems that suit the model well and have relatively modest computational demands, such as part-of-speech tagging and syntactic parsing. Ratnaparkhi used the maximum entropy model to combine contextual information, parts of speech (noun, verb, adjective, and so on), and sentence constituents (subject, predicate, object), and built what were then the world's best part-of-speech tagger and syntactic parser. Ratnaparkhi's papers were refreshing to read, and his part-of-speech tagging system is still among the best in use. From Ratnaparkhi's results, scientists saw new hope of using the maximum entropy model to solve complex text information processing problems.
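To make "combining information" concrete, the features in such a tagger are typically binary indicator functions over the context and the candidate tag, along the lines of the hypothetical examples below (illustrative only, not Ratnaparkhi's actual feature set):

```python
# Hypothetical indicator features for a maximum entropy POS tagger.
def f_det_then_noun(context, tag):
    # fires when the previous word was tagged as a determiner and the
    # candidate tag for the current word is "noun"
    return 1 if context.get("prev_tag") == "DT" and tag == "NN" else 0

def f_ing_suffix(context, tag):
    # fires when the current word ends in "-ing" and the candidate tag
    # is "gerund / present participle"
    return 1 if context.get("word", "").endswith("ing") and tag == "VBG" else 0
```

The model assigns each such feature a weight and combines them all in the exponential form shown earlier.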
However, the computational cost of the maximum entropy model remained an obstacle. In graduate school I spent a long time thinking about how to simplify its computation. One day I told my advisor that I had found a mathematical transformation that could cut the training time of most maximum entropy models by two orders of magnitude relative to IIS. I worked through the derivation on the blackboard for over an hour, and he found no flaw in it. He went home, thought about it for two days, and then told me my algorithm was correct. From then on we built some very large maximum entropy models, which performed far better than the patchwork methods. Even with my fast training algorithm, training a language model that incorporated context, topic, and grammatical information took three months running in parallel on twenty of the fastest Sun workstations of the day. This shows the complex side of the maximum entropy model. Implementing fast maximum entropy algorithms is also very tricky; to this day, probably fewer than a hundred people in the world have implemented them effectively. Readers interested in implementing a maximum entropy model can read my paper.
The maximum entropy model thus unites the simple and the complex: simple in form, complex in implementation. It is worth mentioning that quite a few Google products, such as machine translation, use the maximum entropy model directly or indirectly.
At this point the reader may ask: what have the Della Pietra brothers, who first improved the maximum entropy training algorithm, been doing all these years? They left IBM in the early 1990s and have since withdrawn from academia. Together with many colleagues from IBM's speech recognition group, they joined Renaissance Technologies, the most successful hedge fund company in the world. We know that dozens or even hundreds of factors may determine a stock's movements, and the maximum entropy method can find a model that satisfies thousands of different conditions simultaneously. Scientists such as the Della Pietra brothers use the maximum entropy model and other advanced mathematical tools there to predict stock prices, and they have been very successful. Since the fund was founded in 1988, its net return has averaged 34 percent per year. That is, a dollar invested in the fund in 1988 would be worth about two hundred dollars today. This performance far exceeds that of Berkshire Hathaway, Buffett's flagship company, whose total return over the same period was 16-fold.
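As a quick sanity check of that compounding claim (assuming the "today" of the original writing is roughly 18 years after 1988):

$$
\$1 \times 1.34^{18} \approx \$194 \approx \$200.
$$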
It is worth mentioning that many mathematical tools of information processing, including the hidden Markov model, wavelet transforms, and Bayesian networks, have found direct application on Wall Street. This shows the power of mathematical models.