Java implementation of relative entropy (Kullback-Leibler divergence, KL distance) (1)

Source: Internet
Author: User
Article directory
    • Requirements:
    • Method:
    • Lab materials:
Information theory can be used for some simple natural language processing tasks.

For example, relative entropy can be used for classification, or to measure the gap between two probability distributions. When the two distributions are identical, the relative entropy is 0; as the difference between them grows, the relative entropy grows as well. The following experiment measures the difference between probability distributions.

Requirements:

1. Take any piece of text and calculate the relative frequency of every character in it. Treat these relative frequencies as the probabilities of the characters (that is, relative frequencies stand in for probabilities);

2. Take another piece of text and calculate its character probability distribution in the same way;

3. Calculate the KL distance between the character distributions of the two texts;

4. Show by example (any two distributions p and q will do) that the KL distance is asymmetric, that is, D(p || q) != D(q || p).
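
Steps 1 and 2 can be sketched roughly as follows. This is only a minimal illustration, not the implementation used in the experiment: the class name CharFrequencies and the method relativeFrequencies are hypothetical, and counting Unicode code points (rather than Java char values) is an assumption made so that characters outside the Basic Multilingual Plane are counted once.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; the name is not from the original article.
final class CharFrequencies {

    /** Returns the relative frequency of each Unicode code point in the text. */
    static Map<Integer, Double> relativeFrequencies(String text) {
        Map<Integer, Integer> counts = new HashMap<>();
        int total = 0;
        // Walk the string by code point so surrogate pairs count as one character.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            counts.merge(cp, 1, Integer::sum);
            total++;
            i += Character.charCount(cp);
        }
        // Divide each count by the total to get a relative frequency.
        Map<Integer, Double> freq = new HashMap<>();
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            freq.put(e.getKey(), e.getValue() / (double) total);
        }
        return freq;
    }

    public static void main(String[] args) {
        Map<Integer, Double> p = relativeFrequencies("aab");
        System.out.println("a: " + p.get((int) 'a'));
        System.out.println("b: " + p.get((int) 'b'));
    }
}
```

An empty text simply yields an empty map, so no division by zero occurs.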

Method:

D(p || q) = sum over x of p(x) * log(p(x) / q(x)), where p(x) and q(x) are two probability distributions.

By convention, 0 * log(0 / q(x)) = 0 and p(x) * log(p(x) / 0) = infinity.
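
The formula and its two conventions can be sketched as below, together with a small check of the asymmetry. This is a hedged illustration under stated assumptions: the class name KlDivergence and method kl are hypothetical, distributions are represented as maps from a symbol to its probability, and the natural logarithm is used because the article does not specify a base.

```java
import java.util.Map;

// Hypothetical helper class; the name is not from the original article.
final class KlDivergence {

    /** D(p || q) = sum over x of p(x) * log(p(x) / q(x)), using the natural log (an assumption). */
    static double kl(Map<Integer, Double> p, Map<Integer, Double> q) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : p.entrySet()) {
            double px = e.getValue();
            if (px == 0.0) {
                continue; // convention: 0 * log(0 / q(x)) = 0
            }
            double qx = q.getOrDefault(e.getKey(), 0.0);
            if (qx == 0.0) {
                return Double.POSITIVE_INFINITY; // convention: p(x) * log(p(x) / 0) = infinity
            }
            sum += px * Math.log(px / qx);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Two made-up distributions over the symbols 0 and 1.
        Map<Integer, Double> p = Map.of(0, 0.5, 1, 0.5);
        Map<Integer, Double> q = Map.of(0, 0.9, 1, 0.1);
        System.out.println("D(p || q) = " + kl(p, q));
        System.out.println("D(q || p) = " + kl(q, p));
        System.out.println("D(p || p) = " + kl(p, p));
    }
}
```

For these two distributions, D(p || q) is about 0.511 and D(q || p) is about 0.368, so the two directions differ, while D(p || p) is 0. Symbols that appear only in q are skipped, since their p(x) is 0 and they contribute nothing to the sum.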

Lab materials:

The three news articles, taken from Phoenix news, are:

What "secrets" of Zhang Ailing does "Small Reunion" actually reveal?

"Small Reunion": a dream of Zhang Ailing

1945: the secret intelligence war between Mao Zedong and Chiang Kai-shek before the Chongqing summit

All three articles are encoded in UTF-8, are about 11 KB in size, and are multi-page news articles.

From the titles above, it is clear that the first and second articles are both about Zhang Ailing's book "Small Reunion", while the third is about the conflict between the Kuomintang and the Communist Party of China. We would therefore expect the character distributions of the first two articles to be closer to each other than either is to the third. Does the experiment bear this out? Let's wait and see.
