I. Concepts
Association rule mining: discovering interesting and frequent patterns, associations, and correlations among itemsets in large amounts of data, such as transactional and relational databases.
Interestingness measures for association rules: support and confidence. For a rule A => B, support is the fraction of transactions that contain both A and B, and confidence is the fraction of transactions containing A that also contain B.
k-itemset: a set of k items.
Frequency of an itemset (its support count): the number of transactions that contain the itemset.
Frequent itemset: an itemset whose frequency is at least minimum support × total number of transactions.
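The following minimal Python sketch illustrates these definitions; the toy transaction database and the itemsets queried are made up for illustration:

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support_count(itemset, transactions):
    # Frequency of the itemset: number of transactions containing all its items.
    return sum(1 for t in transactions if itemset <= t)

n = len(transactions)
sup = support_count({"diapers", "beer"}, transactions) / n   # support of {diapers, beer}
conf = (support_count({"diapers", "beer"}, transactions)
        / support_count({"diapers"}, transactions))          # confidence of diapers => beer
print(f"support = {sup:.2f}, confidence = {conf:.2f}")       # support = 0.60, confidence = 0.75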
II. Classification of Association Rule Mining
1. Type of values handled in the rule: Boolean association rules and quantitative association rules
2. Number of data dimensions involved in the rule: single-dimensional association rules and multidimensional association rules
3. Levels of abstraction involved in the rule: single-level association rules and multilevel association rules
4. Extensions of association mining: mining maximal frequent patterns and mining frequent closed itemsets
III. Association Rule Mining Process in Large Databases
1. Find all frequent itemsets. Most of the computation is concentrated in this step.
2. Generate strong association rules from the frequent itemsets, i.e., rules that satisfy both the minimum support and the minimum confidence threshold (a sketch of this step follows below).
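Step 2 is straightforward once step 1 is done: for each frequent itemset, every split into an antecedent and a consequent is tested against the confidence threshold. A minimal Python sketch, where the freq dictionary of support counts is hypothetical example output of step 1:

from itertools import combinations

def generate_rules(freq, min_conf):
    # freq maps frequent itemsets (as sorted tuples) to their support counts.
    # By the Apriori property, every subset of a frequent itemset is in freq.
    rules = []
    for itemset, sup in freq.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                # confidence(A => B) = support(A and B) / support(A)
                conf = sup / freq[antecedent]
                if conf >= min_conf:
                    consequent = tuple(i for i in itemset if i not in antecedent)
                    rules.append((antecedent, consequent, conf))
    return rules

freq = {(1,): 6, (2,): 7, (5,): 2, (1, 2): 4, (1, 5): 2, (2, 5): 2, (1, 2, 5): 2}
for a, c, conf in generate_rules(freq, min_conf=0.7):
    print(a, "=>", c, f"(confidence {conf:.2f})")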
IV. Algorithm for Finding Frequent Itemsets: The Apriori Algorithm
Using prior knowledge of frequent itemsets, the Apriori algorithm performs an iterative, level-wise search in which the frequent k-itemsets are used to explore the frequent (k+1)-itemsets, until no more frequent itemsets can be found in the dataset.
To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
Steps:
1. The join step: to find Lk, a set of candidate k-itemsets, denoted Ck, is generated by joining Lk-1 with itself.
Two members l1 and l2 of Lk-1 can be joined when their first k-2 items are identical and l1[k-1] < l2[k-1], with the items of each itemset kept in sorted order (this condition appears explicitly in the pseudocode below).
The frequent itemsets in Ck form Lk.
2. The prune step: reduce the amount of computation by using the Apriori property: any candidate in Ck that has an infrequent (k-1)-subset cannot itself be frequent, so it is deleted from Ck before its support is ever counted. A concrete sketch of both steps follows.
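A minimal Python sketch of candidate generation with join and prune; the example L2 is made up, and itemsets are represented as sorted tuples:

from itertools import combinations

def apriori_gen(L_prev, k):
    # Generate candidate k-itemsets from the frequent (k-1)-itemsets L_prev.
    prev = set(L_prev)
    Ck = []
    for l1 in L_prev:
        for l2 in L_prev:
            # Join condition: first k-2 items equal, last item of l1 < last item of l2.
            if l1[:k - 2] == l2[:k - 2] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                # Prune: every (k-1)-subset of c must already be frequent.
                if all(s in prev for s in combinations(c, k - 1)):
                    Ck.append(c)
    return Ck

L2 = [(1, 2), (1, 3), (2, 3), (2, 4)]
print(apriori_gen(L2, 3))  # [(1, 2, 3)]; (2, 3, 4) is pruned because (3, 4) is not in L2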
Algorithm: Apriori. Find frequent itemsets using an iterative level-wise approach based on candidate generation.
Input:
D, a database of transactions;
min_sup, the minimum support count threshold.
Output: L, frequent itemsets in D.
Method:
L1 = find_frequent_1-itemsets(D);
for (k = 2; Lk-1 != ∅; k++) {
    Ck = apriori_gen(Lk-1);
    for each transaction t ∈ D {           // scan D for counts
        Ct = subset(Ck, t);                // get the subsets of t that are candidates
        for each candidate c ∈ Ct
            c.count++;
    }
    Lk = {c ∈ Ck | c.count >= min_sup}
}
return L = ∪k Lk;

procedure apriori_gen(Lk-1: frequent (k-1)-itemsets)
for each itemset l1 ∈ Lk-1
    for each itemset l2 ∈ Lk-1
        if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then {
            c = l1 ⋈ l2;                   // join step: generate candidates
            if has_infrequent_subset(c, Lk-1) then
                delete c;                  // prune step: remove unfruitful candidate
            else add c to Ck;
        }
return Ck;

procedure has_infrequent_subset(c: candidate k-itemset; Lk-1: frequent (k-1)-itemsets)  // use prior knowledge
for each (k-1)-subset s of c
    if s ∉ Lk-1 then
        return true;
return false;
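A runnable Python sketch that mirrors the pseudocode above (the helper names follow the pseudocode, min_sup is an absolute support count, the example database at the end is illustrative, and apriori_gen is the same as in the join/prune sketch earlier):

from itertools import combinations

def find_frequent_1_itemsets(D, min_sup):
    counts = {}
    for t in D:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    return {c: n for c, n in counts.items() if n >= min_sup}

def has_infrequent_subset(c, prev, k):
    # Prior knowledge: every (k-1)-subset of a frequent k-itemset must be frequent.
    return any(s not in prev for s in combinations(c, k - 1))

def apriori_gen(L_prev, k):
    prev = set(L_prev)
    Ck = []
    for l1 in L_prev:
        for l2 in L_prev:
            # Join step: first k-2 items equal and l1[k-1] < l2[k-1].
            if l1[:k - 2] == l2[:k - 2] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                if not has_infrequent_subset(c, prev, k):   # prune step
                    Ck.append(c)
    return Ck

def apriori(D, min_sup):
    D = [tuple(sorted(t)) for t in D]              # keep items of each transaction sorted
    L = find_frequent_1_itemsets(D, min_sup)       # all frequent itemsets, with counts
    L_prev, k = sorted(L), 2
    while L_prev:
        Ck = apriori_gen(L_prev, k)
        counts = {c: 0 for c in Ck}
        for t in D:                                # scan D once per level
            t_set = set(t)
            for c in Ck:
                if t_set.issuperset(c):            # subset(Ck, t), done naively here
                    counts[c] += 1
        Lk = {c: n for c, n in counts.items() if n >= min_sup}
        L.update(Lk)
        L_prev, k = sorted(Lk), k + 1
    return L                                       # L = union over all k of Lk

D = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3}, {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]
print(apriori(D, min_sup=2))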
Disadvantages of the Apriori algorithm:
1. It performs multiple scans of the database;
2. It can generate a large number of candidate itemsets;
3. Counting the support of all the candidates is tedious.
Solutions:
1. Reduce the number of database scans;
2. Reduce the number of candidates;
3. Improve the method of support counting.
Method 1: hash-based technique
While scanning the database, map each itemset into a bucket of a hash table via a hash function and keep a count per bucket. An itemset whose bucket count is below the minimum support count cannot be frequent, so it can be removed from the candidate set before its exact support is ever counted.
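A minimal sketch of this idea for 2-itemsets; the hash function (10a + b) mod 7, the bucket count of 7, and the toy database are illustrative choices:

from itertools import combinations

def hash_bucket_counts(D, num_buckets=7):
    # During the first scan (while counting 1-itemsets), also hash every
    # 2-itemset of each transaction into a bucket and count per bucket.
    buckets = [0] * num_buckets
    for t in D:
        for a, b in combinations(sorted(t), 2):
            buckets[(10 * a + b) % num_buckets] += 1
    return buckets

def may_be_frequent(a, b, buckets, min_sup, num_buckets=7):
    # A 2-itemset whose bucket count is already below min_sup cannot be
    # frequent, so it is excluded from C2 without an exact count.
    return buckets[(10 * a + b) % num_buckets] >= min_sup

D = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3}, {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]
buckets = hash_bucket_counts(D)                    # [2, 2, 4, 2, 2, 4, 4]
print(may_be_frequent(1, 5, buckets, min_sup=3))   # False: bucket count 2 < 3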
Method 2: transaction reduction
A transaction that does not contain any frequent k-itemset cannot contain any frequent (k+1)-itemset, so such transactions can be marked or removed from the set of transactions considered in later scans.
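A minimal sketch of transaction reduction; D and L2 below are made-up examples:

def reduce_transactions(D, Lk):
    # Keep only transactions containing at least one frequent k-itemset;
    # the others cannot contribute to any frequent (k+1)-itemset.
    return [t for t in D if any(set(c) <= t for c in Lk)]

D = [{1, 2, 5}, {2, 4}, {4, 5}]
L2 = [(1, 2), (2, 4)]
print(reduce_transactions(D, L2))  # [{1, 2, 5}, {2, 4}]; {4, 5} is dropped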
Method 3: partitioning
Divide the database into non-overlapping partitions that each fit in memory. Any itemset that is frequent in D must be frequent in at least one partition, so the union of the local frequent itemsets forms a candidate set that is verified with one additional scan of D (a sketch follows below).
Method 4: sampling
Mine a random sample of D with a lowered support threshold, then verify the result against the full database.
Method 5: dynamic itemset counting
Start counting new candidate itemsets at several start points during a scan, rather than only between complete scans.
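A minimal, self-contained sketch of the partitioning idea (two scans of D in total); the brute-force local miner, the partition count, and the data are illustrative simplifications:

from itertools import combinations

def local_frequent_itemsets(part, min_sup, max_k=3):
    # Brute-force miner; acceptable because each partition fits in memory.
    items = sorted({i for t in part for i in t})
    found = set()
    for k in range(1, max_k + 1):
        for c in combinations(items, k):
            if sum(1 for t in part if set(c) <= t) >= min_sup:
                found.add(c)
    return found

def partitioned_mine(D, min_sup_fraction, num_parts=2, max_k=3):
    # Phase 1: local frequent itemsets per partition (candidate generation).
    size = (len(D) + num_parts - 1) // num_parts
    candidates = set()
    for i in range(0, len(D), size):
        part = D[i:i + size]
        local_min = max(1, int(min_sup_fraction * len(part)))
        candidates |= local_frequent_itemsets(part, local_min, max_k)
    # Phase 2: one full scan of D to verify global support of each candidate.
    global_min = min_sup_fraction * len(D)
    result = {}
    for c in sorted(candidates):
        n = sum(1 for t in D if set(c) <= t)
        if n >= global_min:
            result[c] = n
    return result

D = [{1, 2}, {1, 2, 3}, {2, 3}, {1, 4}, {2, 3, 4}, {1, 2, 3}]
print(partitioned_mine(D, min_sup_fraction=0.5))
# {(1,): 4, (1, 2): 3, (2,): 5, (2, 3): 4, (3,): 4}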
The main overhead of the Apriori algorithm is generating a large number of candidate frequent itemsets. The FP-growth algorithm, which compresses the database into an FP-tree, can mine frequent patterns without candidate generation.