Data Mining Series (1): Basic Concepts of Association Rule Mining and the Apriori Algorithm

Last Update:2017-02-27 Source: Internet

Author: User

I plan to write up the basic concepts and algorithms of data mining, including common algorithms for association rule mining, classification, and clustering, so stay tuned. Today we cover the most basic knowledge of association rule mining.

Association rule mining is widely used in e-commerce, retail, atmospheric physics, and biomedicine. This article introduces some basic concepts and the Apriori algorithm.

The story of beer and diapers has become a classic case of association rule mining, and there is even a book called "Beer and Diapers". Although the story was made up by the Harvard Business School, it explains the principles of association rule mining well. Here we use a small supermarket basket dataset to explain the basic concepts of association rule mining:

Each row in the table represents one purchase list. Note that buying 10 cartons of milk in one purchase counts only once; that is, a record only notes whether a product appears or not. The collection of all items across the data records is called the total item set. For the table above, the total item set is S = {milk, bread, diapers, beer, eggs, cola}.
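A minimal sketch of this representation in Python. The five purchase lists below are an assumption: they are hypothetical rows chosen to be consistent with the item names and counts quoted in this article, and the repeated "milk" in the first row is added only to illustrate that quantity is ignored.

```python
# Hypothetical raw purchase lists (assumed data, not taken from the article's table).
# A raw row may repeat a product, e.g. several cartons of milk in one purchase.
raw_purchases = [
    ["bread", "milk", "milk", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]

# Turning each row into a set keeps only whether a product appears or not.
transactions = [frozenset(row) for row in raw_purchases]

# The total item set is the union of the items in all records.
total_item_set = set().union(*transactions)

print(sorted(total_item_set))
# ['beer', 'bread', 'cola', 'diapers', 'eggs', 'milk']
```

Using `frozenset` makes each record hashable and makes subset tests such as `{"beer"} <= record` straightforward.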

Definitions of association rule, support, and confidence

An association rule is defined as follows: given two disjoint non-empty item sets X and Y, an implication of the form X-->Y is an association rule. For example, in the table above we find that customers who buy beer always buy diapers as well, so {beer}-->{diaper} is an association rule. The strength of an association rule is described by its support and its confidence.

Definition of support: support(X-->Y) = |X ∪ Y| / N = (number of records in which the items of X and the items of Y all appear together) / (total number of data records). Here |X ∪ Y| denotes the number of records that contain every item of X and Y. For example: support({beer}-->{diaper}) = (number of records containing both beer and diapers) / (number of data records) = 3/5 = 60%.

Definition of confidence: confidence(X-->Y) = |X ∪ Y| / |X| = (number of records in which the items of X and the items of Y all appear together) / (number of records in which the items of X appear). For example: confidence({beer}-->{diaper}) = (number of records containing both beer and diapers) / (number of records containing beer) = 3/3 = 100%; confidence({diaper}-->{beer}) = (number of records containing both beer and diapers) / (number of records containing diapers) = 3/4 = 75%.

The support and confidence defined here are relative values, not absolute counts. The absolute support is the raw number of records: abs_support = N * support, where N is the number of data records.
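The definitions of support and confidence can be checked with a short Python sketch. The records below are an assumption: a hypothetical five-row dataset chosen to match the counts quoted in the examples (5 records, beer in 3, diapers in 4, both in 3). The function names are illustrative, not part of any library.

```python
# Hypothetical transactions (assumed data consistent with the counts in the text).
TRANSACTIONS = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer", "eggs"}),
    frozenset({"milk", "diapers", "beer", "cola"}),
    frozenset({"bread", "milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers", "cola"}),
]


def count_records(itemset, transactions):
    """Absolute support: number of records containing every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for record in transactions if items <= record)


def support(x, y, transactions):
    """Relative support of the rule X --> Y."""
    return count_records(set(x) | set(y), transactions) / len(transactions)


def confidence(x, y, transactions):
    """Confidence of the rule X --> Y."""
    both = count_records(set(x) | set(y), transactions)
    return both / count_records(x, transactions)


print(support({"beer"}, {"diapers"}, TRANSACTIONS))      # 0.6
print(confidence({"beer"}, {"diapers"}, TRANSACTIONS))   # 1.0
print(confidence({"diapers"}, {"beer"}, TRANSACTIONS))   # 0.75
print(count_records({"beer", "diapers"}, TRANSACTIONS))  # 3 (absolute support)
```

Note that support is symmetric in X and Y, while confidence depends on the direction of the rule.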

The higher the support and confidence, the stronger the rule. Association rule mining means finding the rules that meet a given strength.

Definition and steps of association rule mining

Definition of association rule mining: given a transaction dataset T, find all association rules with support >= min_support and confidence >= min_confidence.

There is a simple brute-force way to find the required rules: enumerate every combination of items and test whether each combination satisfies the conditions. An item set with n elements has 2^n - 1 non-empty subsets, so the time complexity is clearly O(2^n). Even an ordinary supermarket stocks more than 10,000 distinct products, so an algorithm with exponential time complexity cannot solve the problem in an acceptable time. How to quickly mine the association rules that satisfy the conditions is the main problem that association rule mining has to solve.
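To get a feel for the size of this search space, the sketch below enumerates every non-empty subset of the six-item total item set and then estimates how many decimal digits the count would have for 10,000 items. The helper name `non_empty_subsets` is hypothetical.

```python
import math
from itertools import combinations


def non_empty_subsets(items):
    """Yield every non-empty subset of `items` as a frozenset."""
    ordered = sorted(items)
    for size in range(1, len(ordered) + 1):
        for combo in combinations(ordered, size):
            yield frozenset(combo)


items = {"milk", "bread", "diapers", "beer", "eggs", "cola"}
candidates = list(non_empty_subsets(items))
print(len(candidates))  # 63, i.e. 2**6 - 1

# For n = 10,000 items the number of candidates, 2**10000 - 1,
# has this many decimal digits:
digits = math.floor(10_000 * math.log10(2)) + 1
print(digits)  # 3011
```

Six items are easy to enumerate, but a count with thousands of digits is far beyond what any exhaustive search can cover.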

On reflection, we find that the two rules {beer}-->{diaper} and {diaper}-->{beer} have the same support, and computing it only requires the support of the item set {diaper, beer}, that is, how often the two items occur together. So we split association rule mining into two steps:

1) Generating frequent itemsets

This phase finds all item sets that meet the minimum support; these are called frequent itemsets.

2) Generating Rules

Rules that satisfy the minimum confidence are then generated from the frequent itemsets found in the previous step; the resulting rules are called strong rules.

Most of the time in association rule mining is spent generating frequent itemsets. Usually only a few frequent itemsets are found, so generating rules from them does not take much time, whereas generating frequent itemsets requires testing a large number of candidate item sets; without optimization, the time required is O(2^n).
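The two steps can be sketched as follows. This is the unoptimized brute-force version described above, not the Apriori algorithm. The dataset and the thresholds (min_support = 0.6, min_confidence = 0.8) are assumptions chosen for illustration, and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical transactions (assumed data consistent with the counts in the text).
TRANSACTIONS = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer", "eggs"}),
    frozenset({"milk", "diapers", "beer", "cola"}),
    frozenset({"bread", "milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers", "cola"}),
]


def frequent_itemsets(transactions, min_support):
    """Step 1 (brute force): map every frequent item set to its support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            sup = sum(1 for t in transactions if candidate <= t) / n
            if sup >= min_support:
                result[candidate] = sup
    return result


def strong_rules(frequent, min_confidence):
    """Step 2: split each frequent item set into X --> Y and keep strong rules."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                x = frozenset(lhs)
                y = itemset - x
                # Every subset of a frequent item set is itself frequent,
                # so frequent[x] is always present.
                conf = sup / frequent[x]
                if conf >= min_confidence:
                    rules.append((x, y, sup, conf))
    return rules


frequent = frequent_itemsets(TRANSACTIONS, min_support=0.6)
rules = strong_rules(frequent, min_confidence=0.8)
for x, y, sup, conf in rules:
    print(sorted(x), "-->", sorted(y), f"support={sup:.2f}", f"confidence={conf:.2f}")
```

With these assumed thresholds, step 1 keeps eight item sets (four single items and four pairs), and step 2 keeps only {beer}-->{diapers}. Step 2 reuses the supports computed in step 1, which is why rule generation is cheap compared with the search for frequent itemsets.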

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
