Data Mining Series (3): Evaluation of Association Rules

Source: Internet
Author: User

The association rules we discussed earlier are evaluated with support and confidence. If a rule clears the minimum thresholds for both, we call it a strong rule. However, support and confidence sometimes fail to capture what a rule actually means or what the business cares about.

A strong rule that misleads us

Consider an example: using shopping basket data, we analyze the relationship between buying a game disc and buying a video disc. The transaction dataset has 10,000 records: 6,000 contain a game disc, 7,500 contain a video disc, and 4,000 contain both. The dataset is shown in the following table:

                     Buys game disc    No game disc    Total
Buys video disc           4,000            3,500        7,500
No video disc             2,000              500        2,500
Total                     6,000            4,000       10,000

Suppose the minimum support is 30% and the minimum confidence is 60%. From these counts we get support(buy game disc -> buy video disc) = 4000/10000 = 40% and confidence(buy game disc -> buy video disc) = 4000/6000 = 66.7%, because confidence divides by the 6,000 transactions that contain a game disc. Both values meet the thresholds, so we are excited to have found a strong rule, and we advise the supermarket to shelve video discs and game discs together to increase sales.
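The arithmetic above can be checked with a minimal Python sketch. The helper names `support` and `confidence` are illustrative only and do not come from any library. Confidence is computed against the number of transactions containing the antecedent (the game disc, 6,000 records).

```python
# Counts taken from the example in the text.
N_TOTAL = 10_000  # all transactions
N_GAME = 6_000    # transactions containing a game disc
N_VIDEO = 7_500   # transactions containing a video disc
N_BOTH = 4_000    # transactions containing both


def support(n_both: int, n_total: int) -> float:
    """Fraction of all transactions that contain both items."""
    return n_both / n_total


def confidence(n_both: int, n_antecedent: int) -> float:
    """Fraction of antecedent transactions that also contain the consequent."""
    return n_both / n_antecedent


sup = support(N_BOTH, N_TOTAL)
conf = confidence(N_BOTH, N_GAME)  # rule: game disc -> video disc

# Thresholds from the text: minimum support 30%, minimum confidence 60%.
is_strong = sup >= 0.30 and conf >= 0.60

print(f"support={sup:.1%} confidence={conf:.1%} strong={is_strong}")
```

With these counts the rule passes both thresholds, which is exactly why it looks convincing at first.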

We might reason that someone who likes playing games also has time to watch films, so the rule seems unproblematic. In fact, the rule misleads us. The probability of buying a video disc across the entire dataset is P(buy video disc) = 7500/10000 = 75%, while the probability that a game disc buyer also buys a video disc is only about 66.7%. Since 66.7% < 75%, buying a game disc actually suppresses the purchase of a video disc: people who buy a game disc are less likely than the average customer to buy a video disc. That is what the data really shows.

As the example shows, support and confidence did not filter out a rule we are not interested in, so we need additional evaluation criteria. There are six: lift (correlation), the chi-square coefficient, all-confidence, max-confidence, Kulczynski (Kulc), and cosine.

Lift (correlation measure)

The game disc and video disc example shows that the two purchases are not positively correlated, so a correlation measure can filter out such rules. For a rule A -> B or B -> A, lift(A, B) = P(A ∩ B) / (P(A) * P(B)). If lift(A, B) > 1, A and B are positively correlated; if lift(A, B) < 1, they are negatively correlated; if lift(A, B) = 1, they are independent. In practice, both positive and negative correlations deserve attention, while independence usually does not: two products that do not influence each other do not make a strong rule. Lift is rarely exactly 1, so a value close to 1 is generally treated as independence.

Note that lift only measures correlation, and correlation is not causation, so the rules A -> B and B -> A have the same lift. In addition, lift(A, B) = P(A ∩ B) / (P(A) * P(B)) = P(A) * P(B|A) / (P(A) * P(B)) = P(B|A) / P(B) = confidence(A -> B) / support(B) = confidence(B -> A) / support(A). For the game disc and video disc example, lift = 0.40 / (0.60 * 0.75) = 0.89 < 1, which confirms the negative correlation.
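The symmetry of lift and its relation to confidence can be verified numerically. The following is a small sketch using the counts from the game disc and video disc example; the function name `lift` is an illustrative helper, not a library API.

```python
import math

# Counts from the example in the text.
N_TOTAL, N_GAME, N_VIDEO, N_BOTH = 10_000, 6_000, 7_500, 4_000

p_game = N_GAME / N_TOTAL    # support(game)
p_video = N_VIDEO / N_TOTAL  # support(video)
p_both = N_BOTH / N_TOTAL    # support(game and video)


def lift(p_ab: float, p_a: float, p_b: float) -> float:
    """lift(A, B) = P(A and B) / (P(A) * P(B))."""
    return p_ab / (p_a * p_b)


lift_value = lift(p_both, p_game, p_video)

# The same number, reached through either rule's confidence.
conf_game_to_video = p_both / p_game
conf_video_to_game = p_both / p_video
via_game_rule = conf_game_to_video / p_video
via_video_rule = conf_video_to_game / p_game

same = math.isclose(lift_value, via_game_rule) and math.isclose(lift_value, via_video_rule)
print(f"lift={lift_value:.3f} identities_hold={same}")
```

A lift below 1 here flags the rule as negatively correlated even though it passed the support and confidence thresholds.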

Chi-Square coefficient

The chi-square distribution is an important distribution in mathematical statistics, and the chi-square coefficient lets us test whether two variables are correlated. It is defined as:

χ² = Σ (observed - expected)² / expected

Here observed is the actual count in the data and expected is the count we would expect if the two variables were independent; the sum runs over every cell of the contingency table. If this is not yet clear, the following example should help.

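                     Buys game disc    No game disc     Total
Buys video disc      4,000 (4,500)     3,500 (3,000)     7,500
No video disc        2,000 (1,500)       500 (1,000)     2,500
Total                6,000             4,000            10,000

The values in parentheses in the table above are the expected values. For the cell (buy video disc, buy game disc), the expected value is E = 6000 * (7500/10000) = 4500: 75% of all transactions contain a video disc and 6,000 transactions contain a game disc, so under independence we expect 75% of those 6,000 (that is, 4,500) to contain a video disc as well. The other three values are computed the same way. Now we calculate the chi-square coefficient for buying game discs and buying video discs: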

Chi-square coefficient χ² = (4000 - 4500)^2/4500 + (3500 - 3000)^2/3000 + (2000 - 1500)^2/1500 + (500 - 1000)^2/1000 = 555.6.
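The same calculation can be written as a short Python sketch that derives the expected counts from the row and column totals and then sums the per-cell terms. The function name `chi_square` and the row/column layout (rows: video disc yes/no, columns: game disc yes/no) are choices made for this illustration.

```python
# Observed counts. Rows: buys video disc / no video disc.
# Columns: buys game disc / no game disc.
observed = [
    [4000, 3500],
    [2000, 500],
]


def chi_square(table):
    """Return (statistic, expected counts) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    expected = []
    for i, row in enumerate(table):
        expected_row = []
        for j, obs in enumerate(row):
            # Expected count if the row and column variables were independent.
            exp = row_totals[i] * col_totals[j] / n
            expected_row.append(exp)
            statistic += (obs - exp) ** 2 / exp
        expected.append(expected_row)
    return statistic, expected


stat, expected = chi_square(observed)
print(f"expected={expected}")
print(f"chi-square={stat:.1f}")
```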

To interpret the chi-square coefficient, compare it with a critical value from a chi-square table, which depends on the confidence level and the degrees of freedom. Here the degrees of freedom are (r - 1) * (c - 1) = (rows - 1) * (columns - 1) = 1. At a confidence level of 1 - 0.01 = 99%, the critical value is 6.63 (at 1 - 0.001 = 99.9% it is 10.83). Since 555.6 is far greater than either value, we reject the hypothesis that A and B are independent and conclude that they are correlated. And because expected(buy video disc, buy game disc) = 4500 is greater than the observed 4000, A and B are negatively correlated. This requires some background in probability and statistics; if you find it hard to follow, you can use the other evaluation criteria.
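Instead of looking up a printed table, the test can be sketched in Python. This relies on a shortcut that holds only for one degree of freedom, where a chi-square variable is the square of a standard normal variable, so the tail probability reduces to `math.erfc`. The helper name `chi2_p_value_df1` is hypothetical, and larger tables would need a statistics library instead.

```python
import math


def chi2_p_value_df1(statistic: float) -> float:
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom.

    For 1 degree of freedom, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)),
    where Z is a standard normal variable.
    """
    return math.erfc(math.sqrt(statistic / 2.0))


stat = 555.6                       # chi-square coefficient from the text
observed_both = 4000               # observed (video disc, game disc) count
expected_both = 4500               # expected count under independence

p_value = chi2_p_value_df1(stat)
reject_independence = p_value < 0.01

# The direction of the correlation comes from observed vs. expected.
direction = "negative" if observed_both < expected_both else "positive"

print(f"reject_independence={reject_independence} direction={direction}")
```

As a sanity check, the same function gives a tail probability of about 0.01 for a statistic of 6.635, which matches the critical value used in the text.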
