Analysis of the "dream of Red Mansions" with Python: witness the rise and fall of Jia's mansion, whether you can "smile" the Vicissitudes of the world __python

Source: Internet
Author: User

Even without having read "A Dream of Red Mansions", you can still ask whether its last 40 chapters were written by the original author. Some time ago, the data scientist Li Chen used machine learning to analyze the novel and concluded that the last 40 chapters differ noticeably from the first 80. The data blogger Data Knight Building, however, disagreed: he felt the original method was not rigorous. So he redid the analysis with dictionary-free word segmentation, tried to exclude the influence of the plot, and applied machine learning to this literary masterpiece once again.


This article is reposted with permission from DT Data Man (ID: dtdatahero)

Author | Data Knight Building



Building a Full-Text Index and a Dictionary


Over the past two months, I taught myself some text processing from online resources and used natural language processing and machine learning algorithms to analyze "A Dream of Red Mansions". I made some interesting discoveries along the way.


I started this project because I came across a very interesting article. Roughly, its author used the open-source segmenter jieba to count how often each word occurs in "A Dream of Red Mansions" (that is, the word frequency), used those frequencies as features of each chapter, and finally applied principal component analysis (PCA) to map each chapter into three-dimensional space, in order to compare how similar the chapters' wording is. (DT Jun's note: this refers to Li Chen's original article, "Never read A Dream of Red Mansions? Use machine learning to decide whether the last 40 chapters were written by Cao Xueqin".) The author's conclusion was that the wording of the last 40 chapters differs noticeably from that of the first 80.


I think the article has two small problems. First, jieba's dictionary is built from a corpus of modern Chinese, while "A Dream of Red Mansions" is written in semi-classical Chinese, so the accuracy of the segmentation is questionable. Second, although the author ran the same comparison on "Romance of the Three Kingdoms", there is still no strong evidence that the differences in wording are not caused by changes in the plot. So I decided to run the experiment myself, segmenting the text without a pre-built dictionary and trying to eliminate the influence of the plot, to see whether the results would turn out differently.


Before working on the text, I need to create a full-text index. The index lets us quickly look up content in the original and speeds up the subsequent computation. I used a suffix tree as the index and built the suffix tree of the entire novel with Ukkonen's algorithm (Ukkonen's algorithm is very fast; in technical terms, its time complexity is O(n)). With that, we have a full-text index.


Next we are going to build a dictionary.


Wait, weren't we going to segment without a dictionary? Why are we building one? In fact, "dictionary-free" segmentation does not mean no dictionary at all; it means the dictionary is derived from the text itself rather than prepared in advance. To segment the text, we still need to find out which fragments of the text behave like words, so that we know where to cut.


So how do we know what looks like a word? The simplest idea is to treat every fragment that occurs frequently as a word. That sounds reasonable, so let's try it: use the suffix tree to find all repeated fragments in the novel, then sort them by number of occurrences:


Bao Yu (3983), laughed (2458), wife (1982), what (1836), Fengjie (1741), one (1697), Jia Mu (1675), one (1520), also not (1448), Mrs. (1437), Dai Yu (1370), we (1233), there (1182), Xiren (1144), the girl (1142), went to (1090), Bao Chai (1079), do not know (1074), Mrs. Wang (1061), up (1059)


Above are the 20 most frequent fragments, with the number of occurrences in parentheses. The result is not bad: many of the fragments really are words. However, the fragment in sixth place, "one", is clearly not a word, yet it occurs more often than "Jia Mu". So this simple filter still has problems: quite a few fragments that are not words, such as "also not" or "one", are mistaken for words.
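As a rough illustration, the same frequency table can be produced without a suffix tree by brute-force counting of short substrings. The sketch below is not the author's code; the file name and the length cap are assumptions.

```python
from collections import Counter

def fragment_counts(text, max_len=4):
    """Count every substring of length 1..max_len in the text.

    A brute-force Counter standing in for the suffix-tree index the
    article describes; slower, but it yields the same frequency table.
    Single characters are kept because later measures need them.
    """
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

# Usage sketch (the file name is an assumption):
# text = open("hongloumeng.txt", encoding="utf-8").read()
# repeated = [(f, c) for f, c in fragment_counts(text).most_common()
#             if len(f) > 1 and c > 1]
# print(repeated[:20])
```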


To exclude such combinations, we can use the "degree of solidification" (cohesion) for further screening. The degree of solidification removes the effect that the frequency of the individual parts has on the frequency of the combination. After experimenting, I found the overall effect to be quite good.


DT Jun's note: the degree of solidification measures how much more often a fragment occurs than the product of the frequencies of its left and right parts would predict. (Note that frequency here means the proportion of occurrences, while count means the number of occurrences.) The idea is that if a fragment appears far more often than it would if its parts were combined at random, the combination is unlikely to be accidental; the parts are associated. That association is most likely because the fragment is an indivisible whole, that is, a word.
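A minimal sketch of this solidification measure, assuming a frequency table like the one built above; for fragments longer than two characters, taking the minimum over all split points is my assumption, not something the article spells out.

```python
def cohesion(fragment, freq, total):
    """Solidification degree: how much more often the fragment occurs
    than chance would predict from its parts.

    freq  -- dict mapping fragment -> occurrence count
    total -- total number of characters in the corpus

    For fragments longer than two characters we take the minimum over
    all split points (an assumption; the article does not spell this out).
    """
    p = lambda s: freq.get(s, 0) / total
    ratios = []
    for i in range(1, len(fragment)):
        left, right = fragment[:i], fragment[i:]
        denom = p(left) * p(right)
        if denom > 0:
            ratios.append(p(fragment) / denom)
    return min(ratios) if ratios else 0.0
```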


However, the degree of solidification also has problems. We find many fragments that are only half a word yet still have a high degree of solidification, for example "Xiang Yuan" (the complete word is "Li Xiang Yuan", Pear Fragrance Court) or "madam" (the complete word is "the old madam").


This makes sense: although these fragments are only half a word, they really are "frozen" together inside the complete word. So looking at solidification alone is not enough; we also need the context to decide whether a word is complete.


To eliminate incomplete words, we can filter further using the "degree of freedom". The degree of freedom describes how varied the characters adjacent to a fragment are. A fragment that is a real, complete word can be preceded and followed by many different characters, rather than being locked to fixed neighbours as in the examples above. In other words, the higher a fragment's degree of freedom, the more likely the word is complete.


DT Jun's note: the idea behind the degree of freedom is that if a combination is an incomplete word, it always appears as part of a complete word, so its neighbouring characters are relatively fixed. For example, "Xiang Yuan" appears 23 times in the original text, while "Li Xiang Yuan" appears 22 times; that is, the character "Li" (pear) appears to the left of "Xiang Yuan" 95.7% of the time, so we can be fairly sure that "Xiang Yuan" is not a complete word. The degree of freedom describes how varied the characters adjacent to a fragment are.
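The article does not say exactly how the degree of freedom is computed; a common choice is the entropy of the neighbouring characters, and the sketch below assumes that definition.

```python
import math
from collections import Counter

def freedom(fragment, text):
    """Left and right degrees of freedom of a fragment.

    Implemented here as the Shannon entropy of the characters that
    appear immediately before / after each occurrence (a common choice;
    the article only says it measures how varied the neighbours are).
    """
    left, right = Counter(), Counter()
    start = text.find(fragment)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(fragment)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(fragment, start + 1)

    def entropy(counter):
        n = sum(counter.values())
        return -sum(c / n * math.log(c / n) for c in counter.values()) if n else 0.0

    return entropy(left), entropy(right)
```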


With these criteria in place, we can filter the candidate words. The criteria I finally chose were: the fragment occurs more than 5 times, and its degree of solidification, left degree of freedom, and right degree of freedom are all greater than 1. This standard alone was still too loose, so I designed a formula to combine these numbers into a single score.

Put simply, I crudely multiply the degree of solidification and the degrees of freedom together as the fragment's score. That way, if any one of the values is low, the total score is also low. This gives one more criterion: the total score must be at least 100.
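Putting the criteria together, a hedged sketch of the filter might look like this. It assumes the cohesion and freedom helpers sketched earlier; multiplying cohesion by both freedoms is one reading of "multiply the solidification and the freedom", not the author's exact formula.

```python
def build_word_list(fragments, freq, text, total):
    """Filter candidate fragments into a word list.

    Thresholds follow the article: count > 5, cohesion > 1, both
    freedoms > 1, and a combined score >= 100.  The combined score here
    multiplies cohesion by both freedoms (an assumption).
    """
    words = {}
    for frag in fragments:
        if freq[frag] <= 5:
            continue
        c = cohesion(frag, freq, total)
        lf, rf = freedom(frag, text)
        if c > 1 and lf > 1 and rf > 1:
            score = c * lf * rf
            if score >= 100:
                words[frag] = score
    return words
```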


After these layers of filtering, a word list took shape. I randomly sampled 100 entries from the final result, of which 47 were genuine words, meaning the precision of the word list is only about half. However, many of the wrong entries were split in the right places, just with several words stuck together. So there is no need to tighten the filter further: the segmentation algorithm in the next step will take care of the pieces that have not yet been cut apart.


In addition, from the dictionary's precision and its size, I estimate that the vocabulary of "A Dream of Red Mansions" is about 16,000 words.



Finding the Best Segmentation with the Viterbi Algorithm


When selecting words, the approach was to judge candidates by several numerical criteria. For the seemingly harder problem of segmentation, the approach is similar: define a scoring standard for evaluating a segmentation scheme, then find the segmentation scheme with the highest score.


What should the scoring standard be? The simplest choice is to multiply together the probabilities that each fragment after the split is a word, and take that product as the probability that the split is correct; this is the score. We approximate the probability that a fragment is a word by the fragment's frequency in the original text.


With the scoring standard settled, another question remains: how do we find the highest-scoring segmentation? We certainly cannot try every possible split one by one; that would be far too slow. Instead we can use a mathematical method that simplifies the computation: the Viterbi algorithm.


The Viterbi algorithm is essentially dynamic programming. The idea is this: for a prefix of the sentence, the best segmentation is fixed and does not change with the rest of the sentence, so if we save this best segmentation we avoid a great deal of repeated computation. We can start from the first character, then compute and save the best segmentation of the first two characters, the first three, the first four, and so on.


Because we compute these in order, whenever we add one more character we only need to try the possible positions of the last word. Everything before that position has already been computed, so we can obtain the score by looking up the previously saved segmentations.


When I built the word list, I already computed a score for how word-like each fragment is. The segmentation algorithm described above, however, only uses the fragment's frequency and ignores that score. So I crudely worked the score into the algorithm: I multiply the fragment's frequency by its score to obtain a weighted frequency. Fragments that look more like words thus get more weight and are more likely to be cut out.


There is one more small optimization. The average Chinese word is no longer than four characters, so when the program enumerates splits it only needs to try the last four cut positions. This limits the longest segment to four characters and also avoids many pointless attempts on long spans.
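A minimal sketch of the dynamic programming just described: for each prefix we keep its best segmentation, try only the last four characters as the final word, and score a split by the product of (weighted) word frequencies. Summing log-probabilities instead of multiplying, and falling back to a tiny probability for unknown single characters, are implementation choices of this sketch, not details from the article.

```python
import math

def viterbi_segment(sentence, word_freq, max_len=4, unk_freq=1e-8):
    """Best segmentation of `sentence` under a unigram model.

    word_freq maps word -> (weighted) relative frequency; unknown
    single characters fall back to `unk_freq` so the recursion never
    gets stuck (an assumption, not spelled out in the article).
    best[i] holds (log-score, segmentation) for the prefix sentence[:i].
    """
    best = [(0.0, [])] + [(-math.inf, None)] * len(sentence)
    for i in range(1, len(sentence) + 1):
        for j in range(max(0, i - max_len), i):
            piece = sentence[j:i]
            p = word_freq.get(piece, unk_freq if len(piece) == 1 else 0.0)
            if p <= 0 or best[j][1] is None:
                continue
            score = best[j][0] + math.log(p)
            if score > best[i][0]:
                best[i] = (score, best[j][1] + [piece])
    return best[len(sentence)][1]

# e.g. viterbi_segment("贾母笑道", {"贾母": 0.01, "笑道": 0.02})
# -> ["贾母", "笑道"]
```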


Checking the program's output, the segmentation achieves a precision of 85.71% (of the cuts the program makes, how many should indeed be cuts) and a recall of 75% (of the cuts that should be made, how many the program actually makes). That does not look very high, since most open-source segmenters reach over 90% precision, some even above 97%. But considering that this is dictionary-free segmentation and the algorithm is quite simple, I am fairly satisfied.
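For reference, the two metrics can be computed by comparing cut positions; a small sketch (the helper name and the example are illustrative only):

```python
def boundary_precision_recall(predicted, gold):
    """Compare two segmentations of the same sentence by the positions
    where they place cuts (the metric described above).

    predicted, gold: lists of words whose concatenation is the sentence.
    """
    def cut_positions(words):
        pos, cuts = 0, set()
        for w in words[:-1]:
            pos += len(w)
            cuts.add(pos)
        return cuts

    p, g = cut_positions(predicted), cut_positions(gold)
    if not p or not g:
        return 0.0, 0.0
    correct = len(p & g)
    return correct / len(p), correct / len(g)

# e.g. boundary_precision_recall(["贾母", "笑", "道"], ["贾母", "笑道"])
# -> (0.5, 1.0): precision 0.5, recall 1.0
```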


Poetry, however, is much harder to segment: on those passages the precision is more than 10% lower than on the rest of the text. This is understandable, because poetry contains many unusual words, some of which appear only once, so it is hard for the computer to extract information from the statistics.



The Statistics Say: The People of the Jia Household Love to Laugh


Once segmentation is done, counting word frequencies is simple. We just split the text according to the segmentation result, drop the fragments of length one (single characters), and count the occurrences of each remaining fragment.
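A sketch of that counting step, assuming the chapters have already been run through a segmenter like the one above:

```python
from collections import Counter

def word_frequencies(chapters, segment):
    """chapters: list of chapter texts; segment: a function such as
    viterbi_segment bound to the word table (both assumptions).
    Single characters are dropped, as described in the article."""
    counts = Counter()
    for chapter in chapters:
        counts.update(w for w in segment(chapter) if len(w) > 1)
    return counts

# word_frequencies(chapters, my_segmenter).most_common(20)
```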


These are the 20 words with the highest number of occurrences:


Bao Yu (3940), laughed (2314), Fengjie (1521), what (1432), Jia Mu (1308), Xiren (1144), one (1111), Dai Yu (1102), we (1068), Mrs. Wang (1059), now (1016), Bao Chai (1014), listened to (938), came out (934), the old lady (908), you (890), went to (879), how (867), the wife (856), the girl (856)


Judging from these frequencies, the characters of "A Dream of Red Mansions", ranked from most to least mentioned, are Baoyu, Fengjie, Jia Mu, Xiren, Dai Yu, Mrs. Wang, and Bao Chai. But this ranking is problematic: the full name "Lin Daiyu" occurs another 267 times, and those occurrences need to be added to Dai Yu's count, so Dai Yu in fact appears more often than Xiren.


Similarly, "old lady" generally refers to the Jia Mu, so the play of the MU is more than the Phoenix sister. The correct ranking should be Baoyu, Jia Mu, Fengjie, Dai Yu, assaulting people, Mrs. Wang and Bao-Chai.


In addition, we find that the characters in "A Dream of Red Mansions" are very fond of laughing: apart from names, the most frequent word is "laughed".


I made the complete word frequency table into a web page; if you are interested, have a look: the Red Mansion Thesaurus.

With segmentation finally done, we are one big step closer to the goal. Now I can use the PCA algorithm mentioned at the beginning to analyze the differences between chapters. But before that, I want to think about which words' frequencies should be used for the analysis.


In many blog posts that use PCA to analyze "A Dream of Red Mansions", the most frequent words are used as features. The problem is that the frequencies of the most common words are tied to the development of the plot. To eliminate the influence of the plot, I decided instead to choose the words whose frequencies change the least as the features of each chapter. To measure how much a word's frequency changes, I compute its frequency in every chapter and then take the standard deviation. To remove the effect of how common the word is overall, I divide the standard deviation by the word's average per-chapter frequency, obtaining a corrected standard deviation, and use that criterion to select the feature words.
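A sketch of this feature selection, assuming per-chapter word counts are available; the function name and data layout are assumptions.

```python
import numpy as np

def select_stable_words(chapter_word_counts, chapter_lengths, k=50, max_cv=0.85):
    """Pick the k words whose per-chapter frequency varies least.

    chapter_word_counts: list of Counters, one per chapter
    chapter_lengths:     list of word counts per chapter (for frequencies)
    The "corrected" deviation is the standard deviation of a word's
    per-chapter frequency divided by its mean frequency.
    """
    vocab = sorted(set().union(*chapter_word_counts))
    freqs = np.array([[c[w] / n for w in vocab]
                      for c, n in zip(chapter_word_counts, chapter_lengths)])
    mean = freqs.mean(axis=0)
    cv = np.where(mean > 0,
                  freqs.std(axis=0) / np.where(mean > 0, mean, 1),
                  np.inf)
    order = np.argsort(cv)
    return [vocab[i] for i in order[:k] if cv[i] < max_cv]
```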


In the end I chose the 50 words with the smallest variation in frequency as features; each has a corrected standard deviation below 0.85. In theory, with these features we can compare how similar the chapters are. The problem is that we now have 50 features, meaning the data lives in a 50-dimensional space, which is impossible to visualize for humans who already struggle to imagine four dimensions. For visualizing high-dimensional data, PCA is a very useful mathematical tool.
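Before looking at the result, here is a minimal sketch of the compression step using scikit-learn's PCA and matplotlib (the article does not say which implementation the author used; the function name and 120-chapter layout are assumptions):

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def pca_plot(freq_matrix):
    """freq_matrix: (120, 50) array, one row per chapter, one column per
    feature word.  Compress to 2 dimensions and colour the three
    40-chapter blocks as in the figure described below."""
    pca = PCA(n_components=2)
    points = pca.fit_transform(freq_matrix)
    colors = ["red"] * 40 + ["green"] * 40 + ["blue"] * 40
    plt.scatter(points[:, 0], points[:, 1], c=colors)
    for i, (x, y) in enumerate(points, start=1):
        plt.annotate(str(i), (x, y), fontsize=6)
    plt.xlabel("component 1")
    plt.ylabel("component 2")
    plt.show()
    return pca
```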


I used PCA to compress the 50-dimensional word-frequency vectors onto a two-dimensional plane. Plotting the compressed data points, I got the following:

∆ Each circle in the diagram represents one chapter; the number inside is the chapter number, counting from 1. Red circles are chapters 1-40, green circles chapters 41-80, and blue circles chapters 81-120.


The chapters after 80 (blue) are mostly concentrated in a narrow region in the lower left corner, clearly separated from the other chapters. Does this prove that the last 40 chapters of "A Dream of Red Mansions" really were not written by the same author?


Not so fast; the analysis is not over yet. An important advantage of PCA is that its results are highly interpretable, because we can see how much weight each original feature carries in the compressed features. From the figure above, the last 40 chapters differ mainly in the value of component 2. So let us look at the ranking of the feature words by their weight in component 2 (weights in parentheses):


laughed (0.883), we (0.141), one (0.133), you (0.128), two (0.113), said (0.079), we (0.076), this (0.063), listened to (0.052), and (0.046), side (0.045), come (0.037), all (0.032), but (0.028), go (0.027), not (0.025), go out (0.021), such (0.018), now (0.016), here (0.016), not (0.011), see him (0.011), come out (0.010), is (0.010), temporarily (0.008), up (0.005), see (0.002), not (0.002), let's (0.000), not (-0.001), nor (-0.001), say (-0.002), the person (-0.005), I do not know (-0.007), there (-0.009), call him (-0.011), dare (-0.011), own (-0.011), not (-0.017), what (-0.019), so (-0.020), just (-0.023), know (-0.026), come in (-0.036), say (-0.046), how (-0.050), only (-0.056), not (-0.077), hear (-0.092), Baoyu (-0.312)
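These weights can be read straight off a fitted PCA model; a usage sketch, assuming the `pca` object and the 50-word list `feature_words` from the sketches above:

```python
# `pca` is the fitted object returned by pca_plot; `feature_words` is the
# list of 50 selected words in column order (both assumptions).
weights = sorted(zip(feature_words, pca.components_[1]),
                 key=lambda t: t[1], reverse=True)
for word, w in weights:
    print(f"{word}: {w:.3f}")
```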


I found that the word "smile" is not only the most frequent words in addition to names, but also the weight of the PCA result is unusually high (0.88), even exceeding the absolute value of the weight of "Baoyu" (0.31). In order to understand why the word has such a large weight, I put "smile" the frequency of the changes are drawn out:

∆ The horizontal axis is the chapter number; the vertical axis is the frequency of the word "laughed".
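A minimal sketch that draws such a curve from the per-chapter counts assumed earlier (the function name is illustrative):

```python
import matplotlib.pyplot as plt

def plot_word_trend(chapter_word_counts, chapter_lengths, word):
    """Plot the per-chapter relative frequency of a single word,
    e.g. plot_word_trend(counts, lengths, "笑道") for "laughed"."""
    freqs = [c[word] / n for c, n in zip(chapter_word_counts, chapter_lengths)]
    plt.plot(range(1, len(freqs) + 1), freqs)
    plt.xlabel("chapter")
    plt.ylabel(f'frequency of "{word}"')
    plt.show()
```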


We can see that the frequency of "laughed" first rises and then falls, which reminded me of the rise and decline of the Jia household. Could the frequency of "laughed" be tracking the fortunes of the Jia family?


Interestingly, the frequency of "laughed" peaks around chapter 50, and some readers hold, from the standpoint of the story, that the Jia family's heyday begins around chapters 48 and 49, which coincides nicely.


Perhaps the "smile" this seemingly ordinary vocabulary does reflect the fall of Jia's house. Although causality has yet to be verified, there is a point to think about, after all, only when the days are good, people will love to laugh.


The word "Smile" seems to have a bigger relationship with the plot, and it seriously affects our analysis. In addition, "Bao Yu" as a person's name, its weight of the absolute value is also relatively large, may also be affected by the plot. Therefore, I decided to put these two words "black", using the word frequency of the remaining 48 words to do the characteristics of the PCA analysis again.


I found that with the modified features, the last 40 chapters no longer cluster as tightly as before, though there is still a tendency to cluster. This shows that the earlier PCA result was indeed partly due to the interference of "laughed" and the plot.


and remove "laughed after 40 back still has the trend of aggregation, that remove interference after these chapters time is a certain similarity." Therefore, I am a little grasp that the "dream of Red Mansions" in the first 80 back and the following 40 back to the use of words are some differences. However, because it is difficult to completely exclude the impact of the plot, so I do not dare to make a conclusion.


Although this does not completely settle whether "A Dream of Red Mansions" was written by a single author, the discoveries along the way were great fun, such as the intriguing coincidence between the frequency of "laughed" and the fall of the Jia household. More importantly, seemingly dry mathematical formulas can produce such playful analyses.


Math is fun.


Note: This article is an edited version of the author's original post on analyzing "A Dream of Red Mansions" with Python; the images are from the author.


This is Data Knight Building, a tech guy who loves technology. A former OIer, now retired from competitive programming, he is also interested in machine learning, web development, and photography, and is currently studying in the United States.



