Data Mining Series (1): Basic Concepts of Association Rule Mining and the Apriori Algorithm


I plan to organize the basic concepts and algorithms of data mining, covering the common algorithms for association rule mining, classification, and clustering, so please look forward to it. Today we cover the most basic knowledge of association rule mining.

Association rule mining has been widely applied in e-commerce, retail, atmospheric physics, and biomedicine. This article introduces some basic concepts and the Apriori algorithm.

The story of beer and diapers has become a classic case of association rule mining; there is even a book titled "Beer and Diapers". Although the story is said to have been coined by the Harvard Business School, it explains the principles of mining association rules well. Here we use a small supermarket basket dataset to explain the basic concepts of association rule mining:

Each row in the table represents one purchase list (note that even if you buy 10 cartons of milk in one purchase, it is recorded only once; we record only whether a product appears). The collection of all items across the data records is called the total item set; in the table above it is S = {milk, bread, diapers, beer, eggs, cola}.

Definitions of association rule, support and confidence

An association rule is defined as follows: given two disjoint non-empty itemsets X and Y, the implication X --> Y is called an association rule. For example, in the table above we find that customers who buy beer always buy diapers, so {beer} --> {diapers} is an association rule. The strength of an association rule is described by its support and its confidence.

Definition of support: support(X --> Y) = |X ∪ Y| / N, i.e., the number of records in which the items of X and Y appear together, divided by the total number of data records N. For example: support({beer} --> {diapers}) = (number of records containing both beer and diapers) / (number of records) = 3/5 = 60%.

Definition of self-reliability: confidence (x-->y) = | X and y|/| X| = Number of occurrences/set X in a record for the items in the collection X and in the set Y. For example: Confidence ({beer}-->{diaper}) = number of concurrent beers and diapers/number of times beer appears =3/3=100%;confidence ({diaper}-->{beer}) = number of concurrent beers and diapers/diaper occurrences = 3/ 4 = 75%.

The support defined here is relative support, not absolute support; absolute support is abs_support = N × support, where N is the number of data records.
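The definitions above can be checked directly in code. The `transactions` list below is a hypothetical reconstruction of the article's five-record basket table, chosen to be consistent with the figures it quotes (beer appears 3 times, always together with diapers; diapers appear 4 times); the actual table did not survive extraction.

```python
# Hypothetical 5-record basket dataset, consistent with the article's
# quoted figures: beer in 3 records (always with diapers), diapers in 4.
transactions = [
    {"milk", "bread"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of records containing every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """support(X ∪ Y) / support(X) for the rule X --> Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

print(support({"beer", "diapers"}, transactions))       # 0.6
print(confidence({"beer"}, {"diapers"}, transactions))  # 1.0
print(confidence({"diapers"}, {"beer"}, transactions))  # ≈ 0.75
```

Note that `itemset <= t` tests set containment, which matches the "record only the appearance or otherwise of a product" convention: quantities are ignored.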

The higher the support and confidence, the stronger the rule; mining association rules means finding all rules that meet a given strength.

The definition and steps of mining association rules

Definition of association rule mining: given a transaction dataset T, find all association rules with support >= min_support and confidence >= min_confidence.

A simple brute-force way to find the required rules is to enumerate every combination of the total item set and test whether each satisfies the conditions. A set of n elements has 2^n - 1 non-empty subsets, so the time complexity is clearly O(2^n). An ordinary supermarket carries upwards of 10,000 items, and an algorithm with exponential time complexity cannot solve the problem in acceptable time. How to quickly mine the association rules that satisfy the conditions is the main problem association mining must solve.
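The exponential blow-up is easy to verify for a small universe. This sketch enumerates every non-empty subset of the article's six items and confirms the 2^n - 1 count:

```python
from itertools import combinations

def n_candidate_itemsets(n):
    """Number of non-empty subsets of an n-item universe: 2**n - 1."""
    return 2 ** n - 1

items = ["milk", "bread", "diapers", "beer", "eggs", "cola"]

# Enumerate all non-empty subsets explicitly for n = 6.
subsets = [c for k in range(1, len(items) + 1)
           for c in combinations(items, k)]

print(len(subsets), n_candidate_itemsets(len(items)))  # 63 63
```

At n = 6 enumeration is trivial; at n = 10,000 the count 2^10000 - 1 is astronomically beyond any feasible computation, which is why brute force is a non-starter.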

On reflection, we find that for the two rules {beer} --> {diapers} and {diapers} --> {beer}, computing the support only requires computing the support of {beer, diapers}, the itemset they share. So we split association rule mining into two steps:

1) Generating frequent itemsets

This phase identifies all itemsets that meet the minimum support; these are called frequent itemsets.

2) Generating Rules

On the basis of the frequent itemsets generated in the previous step, generate all rules that satisfy the minimum confidence; the resulting rules are called strong rules.

The time spent mining association rules goes mainly into generating the frequent itemsets: since there are usually not many frequent itemsets, generating rules from them does not take much time, but generating the frequent itemsets requires testing a great many candidate itemsets, and without optimization the time required is O(2^n).
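The two steps above can be sketched end to end. This is a minimal level-wise implementation in the Apriori spirit (it grows candidates one item at a time from the previous level, without the full Apriori pruning step), run on the same hypothetical 5-record dataset assumed earlier:

```python
from itertools import combinations

# Hypothetical basket dataset consistent with the article's figures.
transactions = [
    {"milk", "bread"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets(min_support):
    """Step 1: level-wise generation of all itemsets with enough support."""
    items = sorted(set().union(*transactions))
    frequent = []
    level = [frozenset([i]) for i in items if support([i]) >= min_support]
    while level:
        frequent.extend(level)
        # Candidate (k+1)-itemsets: unions of frequent k-itemsets.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= min_support]
    return frequent

def strong_rules(min_support, min_confidence):
    """Step 2: split each frequent itemset into rules X --> Y
    and keep those meeting the minimum confidence."""
    out = []
    for itemset in frequent_itemsets(min_support):
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for x in map(frozenset, combinations(itemset, k)):
                y = itemset - x
                conf = support(itemset) / support(x)
                if conf >= min_confidence:
                    out.append((set(x), set(y), conf))
    return out

for x, y, conf in strong_rules(min_support=0.6, min_confidence=0.8):
    print(x, "-->", y, round(conf, 2))  # {'beer'} --> {'diapers'} 1.0
```

With min_support = 60% and min_confidence = 80%, the only strong rule on this dataset is {beer} --> {diapers}: its mirror {diapers} --> {beer} shares the same frequent itemset but only reaches 75% confidence, illustrating why step 2 is needed on top of step 1.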

