The association between "beaten" and "Peking University": Interesting Data Mining, Part 1 (Tang Changjie)
(Note: This is the first of the 12 articles in the Interesting Data Mining series on the science blog. I discussed reposting it with friends: a blog post is not a formal publication, and I have moved only the first article of the series, which also helps publicize the science blog, so this should not count as a foul. Links to the other articles are listed at the end.)
As a child I liked reading about interesting data, so I have long wanted to write a series of popular-science blog posts on interesting data mining. Data mining concepts should be explained simply and entertainingly, and that requires good examples. While I was searching, an interesting example well suited to explaining association rules turned up.
Three bloggers on the science blog, Zhou Tao, Lu Tao, and Cheng Zhi, have given qualitative analyses in blog posts 1, 2, and 3 of the "Wolf Dad" whose three children were beaten and went on to Peking University. This post uses that example to explain the concepts of support, confidence, and interestingness of association rules in data mining. Along the way it gives a quantitative analysis of the issue, and it serves as the opening of a series of interesting data mining blog posts.
The association rule can be written as follows:
R1: beaten --> Peking University, support s = ?, confidence c = ?
or, in reverse,
R2: Peking University --> beaten, support s = ?, confidence c = ? (the causal direction considered differs from that of R1)
Upper bounds on the support and confidence can be computed quite simply, as shown below. Some rough assumptions and slightly inflated estimates are used.
1. Computing the support
About 10 million people sit the national college entrance examination each year (2008: 10.5 million; 2010: 9.57 million). The three "Wolf Dad" children give a support count of 3 (inflated by treating them as if they entered Peking University in the same year). Assume that 3K people entered Peking University that year and had all been "beaten" (the support count is thus inflated about K times).
As a result, the support s for "beaten" together with "entering Peking University" among candidates nationwide is:
Support $s = 3K / 10^{7} = 3K \times 10^{-7}$
The "Wolf Dad" story shows that K ≥ 1 here. Common sense suggests that K < 10 (inflating K rashly would draw protests from the Peking University Student Union; fortunately this is only an assumption made for the sake of argument), so:
Support $s < 3 \times 10^{-6}$ (support has no causal direction, so it applies to both R1 and R2)
Seasoned lottery players treat events of such small probability as mere entertainment; they are not worth a media fuss.
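The estimate above is easy to check with a few lines of arithmetic. The following is only a minimal sketch: the function name `support` and its parameter names are made up for illustration, and the figures (10 million candidates, three children, K below 10) are the rough assumptions stated above, not measured data.

```python
def support(k, candidates=10_000_000, wolf_dad_children=3):
    """Support of {beaten, Peking University} among all exam candidates.

    k is the assumed amplification factor: 3*k candidates were both
    beaten and admitted to Peking University in the same year.
    """
    return wolf_dad_children * k / candidates


# K >= 1 from the Wolf Dad story; K < 10 by the common-sense assumption,
# so K = 10 marks the upper bound that is never actually reached.
for k in (1, 5, 10):
    print(f"K = {k:2d}: support = {support(k):.1e}")
```

Whatever value of K below 10 is chosen, the support stays under the bound of 3 in a million derived above.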
2. Computing the confidence
2.1 Computing within Peking University
The confidence of rule R1, "beaten --> Peking University", is a little harder to compute and is left to section 2.2. Let us first compute the confidence of R2, "Peking University --> beaten", which also reflects the association to some extent. Peking University has about 14,000 undergraduates, so on average about 3500 students are admitted each year. If 3K of them (0 ≤ K < 10) were beaten by their parents, and more than 3470 were not beaten, then:
Peking University --> beaten, confidence: 3K/3500 < 0.86%
Peking University --> not beaten, confidence: 3470/3500 > 99.14%
This shows that the association between "beaten" and "Peking University" is very weak and cannot be trusted.
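The two bounds can be reproduced directly. This is a sketch under the same rough assumptions as the text (3500 admissions a year, fewer than 30 beaten students among them); the helper name `confidence` and the constant names are illustrative only.

```python
def confidence(both, antecedent_count):
    """Confidence of a rule X --> Y: count(X and Y) / count(X)."""
    if antecedent_count == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    return both / antecedent_count


ADMITTED_PER_YEAR = 3500  # rough yearly intake used in the text

# Upper bound: K stays below 10, so fewer than 30 admitted students were beaten.
beaten_upper = 3 * 10
conf_beaten = confidence(beaten_upper, ADMITTED_PER_YEAR)
conf_not_beaten = confidence(ADMITTED_PER_YEAR - beaten_upper, ADMITTED_PER_YEAR)

print(f"Peking University --> beaten:     < {conf_beaten:.2%}")
print(f"Peking University --> not beaten: > {conf_not_beaten:.2%}")
```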
2.2 Computing the confidence of "beaten --> Peking University"
Assume that N people nationwide were beaten in the same year, of whom 3K entered Peking University (estimated as above, 0 ≤ K < 10). Then R1: beaten --> Peking University has confidence 3K/N. If N is large and K > 0, the confidence is small. (The actual value of N is hard to estimate, but one hopes it is not large; a large N would be a tragedy of education.) If N is not too large and K > 0, the confidence is relatively large. If K is 0 in some year, then no matter how large N is, the confidence of "beaten --> Peking University" for that year is 0.
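How the confidence 3K/N reacts to N can be seen in a short sketch. The values of N tried below are purely hypothetical, since the text itself says N is hard to estimate; the function name `confidence_r1` is likewise made up for illustration.

```python
def confidence_r1(k, n_beaten):
    """Confidence of 'beaten --> Peking University' = 3K / N."""
    if n_beaten <= 0:
        raise ValueError("N must be positive")
    return 3 * k / n_beaten


# Hypothetical values of N (people beaten nationwide in one year).
for n in (1_000, 100_000, 1_000_000):
    print(f"N = {n:>9,}: K=1 -> {confidence_r1(1, n):.6f}, "
          f"K=0 -> {confidence_r1(0, n):.6f}")
```

The confidence shrinks as N grows, and it is exactly 0 whenever K is 0.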
2.3 Computing within the family, which raises the question of a rule's interestingness
"Wolf Dad" has four children (how he came to have so many is not known). Presumably all four were beaten, and three of them went to Peking University.
beaten --> Peking University, support 0.75, confidence 0.75 (1)
This rule stops holding as soon as it leaves that home. A more accurate statement is therefore:
(this family, beaten) --> Peking University, support 0.75, confidence 0.75 (2)
To show how trivial this is, we can find another association rule with exactly the same numbers:
(this family's child, eats every day) --> Peking University, support 0.75, confidence 0.75 (3)
If "eats every day" is replaced by any health-care product, the rule still holds; it is more attractive than "beaten" and might even bring economic benefits.
Meaningless association rules like these show why the notion of interestingness has to be introduced. The concept is a little complicated, so only the general idea is given here. When there are several items on the left side of an association rule, as in rule (3), the contribution of each item can be tested by removing the items one at a time, much as an allergy patient identifies allergens by elimination; the left side can even be reduced to the empty set. In rule (3):
(a) Removing "eats every day" does not reduce the support or the confidence.
(b) Removing "this family's child" is equivalent to mining the nationwide data set. The support and confidence drop sharply at once, which shows that this item is crucial.
If every item in an association rule is important, the rule is basically meaningful.
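The elimination test on rule (3) can be sketched as follows. The record layout, the field names, and the function `support_confidence` are hypothetical choices made for this illustration. Step (b) additionally assumes that every candidate in the country eats every day, and it reuses the rough figures from section 1 and section 2.1 (3500 admissions among 10 million candidates).

```python
# Hypothetical encoding of the four children: every child has each
# left-hand item; three of the four entered Peking University.
family = [
    {"beaten": True, "eats_every_day": True, "peking_university": True},
    {"beaten": True, "eats_every_day": True, "peking_university": True},
    {"beaten": True, "eats_every_day": True, "peking_university": True},
    {"beaten": True, "eats_every_day": True, "peking_university": False},
]


def support_confidence(records, lhs, rhs):
    """Return (support, confidence) of the rule lhs --> rhs.

    lhs is a list of item names; an empty lhs matches every record.
    """
    matches_lhs = [r for r in records if all(r[item] for item in lhs)]
    matches_both = [r for r in matches_lhs if r[rhs]]
    support = len(matches_both) / len(records)
    confidence = len(matches_both) / len(matches_lhs) if matches_lhs else 0.0
    return support, confidence


# Rules (2) and (3): inside this family both give the same numbers.
print(support_confidence(family, ["beaten"], "peking_university"))
print(support_confidence(family, ["eats_every_day"], "peking_university"))

# (a) Drop "eats every day": the left side inside the family is now empty,
# yet support and confidence are unchanged, so the item contributes nothing.
print(support_confidence(family, [], "peking_university"))

# (b) Drop "this family": the rule is now judged on all candidates.
# Assuming everyone eats every day, the confidence is just the admission
# rate, using the text's rough figures of 3500 admissions per 10 million.
national_confidence = 3500 / 10_000_000
print(national_confidence)
```

Inside the family, "beaten" and "eats every day" are indistinguishable, and only the restriction to this one family keeps the numbers high.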
3. Incorrect mining conclusion
Here are a few guesses: (1) the so-called "beating" is really a byword for strictness, with the hand raised high and brought down gently, and not the kind of beating that destroys a person's dignity and confidence (that would be a tragedy); (2) the eldest child is sensible; (3) the eldest child's influence on the younger siblings far outweighs the father's. When mining association rules, "Wolf Dad" ignored this factor, credited the result to the father's beating, committed the "no interestingness" error (a slightly complicated concept), and reached the wrong mining conclusion.
4. An association rule with high support and confidence
In error-correction technology for typed text, attention often goes to associations between characters and their pronunciation, that is, homophone associations. In Mandarin, "beaten" and "Peking University" are both pronounced "beida", so the two are easily confused when typing with a Pinyin input method. For example, when I typed the last subtitle of this post, "It takes seven years to identify talent", I once entered a homophone in place of the intended word for "talent" (thanks to the commenter on the 22nd floor for the correction). Correction software ranks similar-sounding words by how close they are. In terms of phonetic similarity:
beaten --> Peking University, support 100%, confidence 100%
As a result, when "beaten" is typed with a Pinyin input method and passed through an error-correction check, the top candidate listed by the software is "Peking University", which may be some comfort to a middle school student who has just been beaten. The technique is also useful for processing web text and mining microblogs, for example for normalizing "tragedy" and its homophone "cup", "p2p" and "p-to-p", "U" and "YOU", and many other Internet abbreviations.
5. Beer and diapers
In the past, association rules were usually introduced with the beer-and-diapers story, which has three main points:
(A) Surface analysis: by mining its sales data, Wal-Mart found that beer and diapers were often bought together by male customers, and obtained several rules of the form Xi --> Yi, s = ?, c = ?;
(B) Underlying connection (this is management, not data mining): a survey found that fathers of babies, when buying diapers for their children after work, also bought the beer they love;
(C) Promotion measures (a sales tactic): put beer and diapers on the same shelf, or go further and lower the price of beer while raising the price of diapers, to attract spending by the babies' fathers.
People now think that this is just a story. Perhaps the "Wolf Dad" example is closer to home and better at clearing up misunderstandings of the concepts.
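Step (A) of the story amounts to counting co-occurrences in shopping baskets. The sketch below uses a handful of invented baskets, not real sales data, and the function name `rule_stats` is hypothetical; it only shows how s and c for a rule X --> Y would be computed.

```python
# Hypothetical toy baskets; not real sales data.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"diapers", "chips"},
    {"beer", "diapers", "milk"},
    {"milk", "bread"},
]


def rule_stats(transactions, x, y):
    """Support and confidence of the rule x --> y over a list of item sets."""
    n_x = sum(1 for t in transactions if x in t)
    n_xy = sum(1 for t in transactions if x in t and y in t)
    support = n_xy / len(transactions)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


s, c = rule_stats(baskets, "diapers", "beer")
print(f"diapers --> beer: s = {s:.2f}, c = {c:.2f}")
s, c = rule_stats(baskets, "beer", "diapers")
print(f"beer --> diapers: s = {s:.2f}, c = {c:.2f}")
```

In these toy baskets the support is the same in both directions while the confidence differs, which matches the earlier remark that support has no causal direction.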
6. Data mining is a last resort for guessing the mysteries of nature, yet it is effective
Before people grasped the laws of planetary motion, they could only look for patterns in historical observation data and fit them. Tycho Brahe was an observational astronomer; over 40 years of observation he accumulated a large amount of data on planetary motion.
Working by hand, Kepler mined Tycho's forty years of data for ten years and discovered the three laws of planetary motion. Candida Ferreira used Gene Expression Programming (GEP): with 10 individuals and 50 generations of evolution, far less data is needed and the job finishes in a few seconds (see reference [1], pp. 253-257). Once the law is known, data mining is no longer needed to compute a planet's position; the formula is used directly. Data mining is therefore what one falls back on when the mysteries of nature have to be guessed without knowing the laws.
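To give a feel for recovering a law from data, here is a minimal sketch. It is not Ferreira's GEP method and not Kepler's procedure: it simply assumes the law has the power-law form T = c * a^p and fits p by ordinary least squares on a log-log scale. The planetary figures are modern rounded values, not Tycho's observations, and the function name `fit_power_law` is made up for illustration. Kepler's third law corresponds to p = 1.5 with T in years and a in astronomical units.

```python
import math

# Semi-major axis (AU) and orbital period (years), modern rounded values.
planets = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}


def fit_power_law(points):
    """Least-squares fit of log(T) = p * log(a) + log(c); returns (p, c)."""
    xs = [math.log(a) for a, _ in points]
    ys = [math.log(t) for _, t in points]
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    p = sxy / sxx
    c = math.exp(mean_y - p * mean_x)
    return p, c


exponent, constant = fit_power_law(list(planets.values()))
print(f"T ~ {constant:.3f} * a^{exponent:.3f}")
```

A method such as GEP searches over the form of the expression as well, whereas this sketch fixes the form in advance and estimates only two numbers.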
Many mysteries of nature remain unsolved today. Data mining may be a last resort, but it is very effective: first find the correct formulas and laws, then try to explain them dynamically or constructively with a theory or model.
The above analysis shows that data mining can uncover surprising results from ordinary facts. For this reason some countries treat data mining as a sensitive field of study. Students going abroad to study data mining are often checked and reviewed when they apply for a student visa, and one occasionally hears of a visa being denied.
7. It takes seven years to identify talent
Entering Peking University does not yet mean that the three "Wolf Dad" children have succeeded. They will still have to do research, find jobs, or continue their studies and write papers ...; only if they prove more competitive there will they count as successful.
As the saying goes: to test jade, burn it for a full three days; to identify talent, wait seven years. I hope they will have become real talents in seven or ten years. Success at that point will have nothing to do with today's "beating".
Fellow bloggers have asked: how are association rules mined, and how are they used in applications? That is left for the next article.
References
[1] Candida Ferreira, "Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence", second, revised and extended edition, Springer, 2006, pp. 253-257. ISSN print edition: 1860-949X; ISSN electronic edition: 1860-9503; Library of Congress Control Number: 2006921791.
Related blog posts
1. The association between "beaten" and "Peking University": Interesting Data Mining, Part 1
2. The simple association between roast duck, pancakes, and sweet bean sauce: Interesting Data Mining, Part 2
3. It cited tens of thousands of papers, and data consensus: Interesting Data Mining, Part 3
4. Skill mining: the average hitting formula of a science blog and the intervention rules: Interesting Data Mining, Part 4
5. Let my mom tell the story of the past, house separation, and classification: Interesting Data Mining, Part 5
6. Using the Water Margin story to explain the decision tree idea: Interesting Data Mining, Part 6
7. Clustering at a banquet: Interesting Data Mining, Part 7
8. Rural middle school relocation site selection, K-means clustering, and the laying hens paradox: Interesting Data Mining, Part 8
9. Riddles, alien colonies, Yugong moving the mountains, and evolutionary computing: Interesting Data Mining, Part 9
10. Darwin, Mendel, and old yugong umeng: Gene Expression Programming: Interesting Data Mining, Part 10
11. Top 10 algorithms and top 10 problems: Interesting Data Mining, Part 11
12. Interesting philosophy in data mining: Interesting Data Mining, Part 12