Contents
- Requirements
- Method
- Lab materials
Information theory can be applied to some simple natural language processing tasks.
For example, relative entropy (also called the Kullback-Leibler or KL distance) can be used for classification, because it measures the gap between two probability distributions. When the two distributions are identical, the relative entropy is 0; as the difference between them grows, the relative entropy grows as well. The following experiment looks at the differences between the character probability distributions of texts.
Requirements:
1. Take any text and calculate the relative frequency of every character in it. Treat these relative frequencies as the probabilities of the characters (that is, relative frequencies stand in for probabilities).
2. Take another text and calculate its character probability distribution in the same way.
3. Calculate the KL distance between the character distributions of the two texts.
4. Show by example (using any two distributions P and Q) that the KL distance is asymmetric, that is, D(P || Q) != D(Q || P).
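Steps 1 and 2 can be sketched in a few lines of Python. This is only a minimal illustration, not the code used in the experiment: the function name `char_distribution` and the sample string are made up for this example, and it assumes every character (including whitespace and punctuation) is counted.

```python
from collections import Counter


def char_distribution(text):
    """Map each character of `text` to its relative frequency.

    Assumption for this sketch: every character counts, including
    whitespace and punctuation. An empty text gives an empty distribution.
    """
    counts = Counter(text)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


# Hypothetical sample text, used only to show the shape of the result.
dist = char_distribution("abracadabra")
for ch, prob in sorted(dist.items()):
    print(ch, round(prob, 3))
```

The relative frequencies sum to 1, so the result can be used directly as the probability distribution that steps 3 and 4 call for.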
Method:
D(P || Q) = sum over x of P(x) * log(P(x) / Q(x)), where P(x) and Q(x) are two probability distributions.
By convention, 0 * log(0 / Q(x)) = 0 and P(x) * log(P(x) / 0) = infinity.
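The formula and its two conventions translate fairly directly into code. The sketch below is an assumption-laden illustration rather than the original implementation: the name `kl_divergence` is hypothetical, distributions are assumed to be dictionaries mapping a symbol to its probability, and the natural logarithm is used because the text does not fix a base.

```python
import math


def kl_divergence(p, q):
    """Return D(P || Q) for two dicts mapping symbol -> probability.

    Assumption: natural logarithm (the source does not specify a base).
    """
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # convention: 0 * log(0 / Q(x)) = 0
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf  # convention: P(x) * log(P(x) / 0) = infinity
        total += px * math.log(px / qx)
    return total


# Two made-up distributions over the symbols "a" and "b".
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}

print(kl_divergence(p, p))  # identical distributions
print(kl_divergence(p, q))  # D(P || Q)
print(kl_divergence(q, p))  # D(Q || P), a different value
print(kl_divergence(p, {"a": 1.0}))  # Q gives "b" zero probability
```

With these toy numbers the two directions come out at roughly 0.51 and 0.37, which illustrates the asymmetry. The last call shows why real texts can be awkward: as soon as one text contains a character that the other lacks, the distance in that direction is infinite.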
Lab materials:
The three news articles taken from Phoenix News are:
1. What "secrets" of Zhang Ailing does "Small Reunion" actually reveal?
2. "Small Reunion": a dream of Zhang Ailing
3. 1945: the secret intelligence war between Mao Zedong and Chiang Kai-shek before the Chongqing summit
All three articles are encoded in UTF-8, are about 11 KB in size, and span multiple pages.
As the titles show, the first and second articles are both about Zhang Ailing's book "Small Reunion", while the third is about the civil war between the Kuomintang and the Communist Party of China. We would therefore expect the character distributions of the first and second articles to be the most similar. Does the experiment bear this out? Let's wait and see.