Data mining-learning notes: Mining Association Rules

I. Concepts

Association rule mining: discovering interesting frequent patterns, associations, and correlations among itemsets in large volumes of data, such as transactional databases and relational databases.

Measures of the interestingness of an association rule A => B: support, the fraction of transactions containing both A and B (support(A => B) = P(A ∪ B)), and confidence, the fraction of transactions containing A that also contain B (confidence(A => B) = P(B | A)).

k-itemset: a set of k items.

Frequency (support count) of an itemset: the number of transactions that contain the itemset.

Frequent itemset: an itemset whose frequency is at least the minimum support multiplied by the total number of transactions.
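
To make these definitions concrete, here is a minimal Python sketch (the transaction data and threshold are made up for illustration) that computes the support count, support, and confidence defined above:

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support_count(itemset, transactions):
    # Frequency of the itemset: number of transactions containing it.
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    # support(X) = P(X): fraction of transactions containing X.
    return support_count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # confidence(A => B) = support(A ∪ B) / support(A).
    return (support_count(antecedent | consequent, transactions)
            / support_count(antecedent, transactions))

print(support({"diapers", "beer"}, transactions))       # 0.6, frequent if min_sup <= 0.6
print(confidence({"diapers"}, {"beer"}, transactions))  # 0.75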

II. Classification of Association Rule Mining

1. By the type of values handled in the rule: Boolean association rules and quantitative association rules

2. By the number of data dimensions involved in the rule: single-dimensional association rules and multidimensional association rules

3. By the levels of abstraction involved in the rule: single-level association rules and multilevel association rules

4. By extensions of association mining: mining maximal frequent patterns and mining frequent closed itemsets

III. The Association Rule Mining Process in Large Databases

1. Find all frequent itemsets. Most of the computation is concentrated in this step.

2. Generate strong association rules from the frequent itemsets, that is, rules that satisfy both minimum support and minimum confidence.
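
A minimal Python sketch of step 2, assuming the frequent itemsets and their support counts have already been found in step 1 (the toy counts here are illustrative):

from itertools import combinations

def generate_rules(freq_itemsets, min_conf):
    # freq_itemsets maps each frequent itemset (a frozenset) to its support
    # count. Every rule derived from a frequent itemset already satisfies
    # minimum support, so only the confidence test remains. Because every
    # nonempty subset of a frequent itemset is also frequent (the Apriori
    # property, below), each antecedent is guaranteed to be in freq_itemsets.
    rules = []
    for itemset, count in freq_itemsets.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = count / freq_itemsets[antecedent]
                if conf >= min_conf:
                    rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

freq = {frozenset({"diapers"}): 4,
        frozenset({"beer"}): 3,
        frozenset({"diapers", "beer"}): 3}
print(generate_rules(freq, min_conf=0.7))
# [({'diapers'}, {'beer'}, 0.75), ({'beer'}, {'diapers'}, 1.0)] (order may vary)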

IV. An Algorithm for Finding Frequent Itemsets: the Apriori Algorithm

Using prior knowledge of the properties of frequent itemsets, the Apriori algorithm employs an iterative, level-wise search in which frequent k-itemsets are used to explore frequent (k+1)-itemsets, until all frequent itemsets in the dataset have been found.

To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space.

Apriori property: all nonempty subsets of a frequent itemset must also be frequent. Equivalently, if an itemset is not frequent, none of its supersets can be frequent.

Steps:

1. The join step: to find Lk, a set of candidate k-itemsets, denoted Ck, is generated by joining Lk-1 with itself.

Two itemsets l1 and l2 in Lk-1 are joinable on the condition that their first k-2 items are identical and l1[k-1] < l2[k-1] (the ordering prevents generating duplicate candidates). For example, joining the frequent 2-itemsets {I1, I2} and {I1, I3} yields the candidate 3-itemset {I1, I2, I3}.

The members of Ck that turn out to be frequent form Lk.

2. The prune step: the Apriori property is used to reduce the counting workload: any candidate in Ck that has an infrequent (k-1)-subset cannot be frequent, so it is removed before the database is scanned. In the example above, {I1, I2, I3} is kept only if {I2, I3} is also frequent.

 

Algorithm: Apriori. Find frequent itemsets using an iterative, level-wise approach based on candidate generation.

Input:

D, a database of transactions;

min_sup, the minimum support count threshold.

Output: L, frequent itemsets in D.

Method:

L1 = find_frequent_1-itemsets(D);
for (k = 2; Lk-1 ≠ ∅; k++) {
    Ck = apriori_gen(Lk-1);
    for each transaction t ∈ D {        // scan D to count the candidates
        Ct = subset(Ck, t);             // the candidates in Ck contained in t
        for each candidate c ∈ Ct
            c.count++;
    }
    Lk = {c ∈ Ck | c.count >= min_sup}
}
return L = ∪k Lk;

procedure apriori_gen(Lk-1: frequent (k-1)-itemsets)
for each itemset l1 ∈ Lk-1
    for each itemset l2 ∈ Lk-1
        if (l1[1] = l2[1]) and (l1[2] = l2[2]) and ... and (l1[k-2] = l2[k-2]) and (l1[k-1] < l2[k-1]) then {
            c = l1 join l2;             // join step: generate candidates
            if has_infrequent_subset(c, Lk-1) then
                delete c;               // prune step: remove unfruitful candidate
            else add c to Ck;
        }
return Ck;

procedure has_infrequent_subset(c: candidate k-itemset; Lk-1: frequent (k-1)-itemsets)  // use prior knowledge
for each (k-1)-subset s of c
    if s ∉ Lk-1 then
        return true;
return false;
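
For reference, here is a compact, runnable Python version of the same procedure. The toy transaction data is made up, and the union-based join below is a simpler (if less efficient) equivalent of the prefix-based join in the pseudocode:

from itertools import combinations

def apriori(transactions, min_sup):
    # Return every frequent itemset together with its support count.
    # transactions: iterable of sets of items; min_sup: minimum support count.
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                 # first scan: frequent 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_sup}
    all_frequent = dict(L)
    k = 2
    while L:
        Ck = apriori_gen(set(L), k)
        counts = {c: 0 for c in Ck}
        for t in transactions:             # scan D to count the candidates
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        L = {s: c for s, c in counts.items() if c >= min_sup}
        all_frequent.update(L)
        k += 1
    return all_frequent

def apriori_gen(prev_frequent, k):
    # Join step: union any two frequent (k-1)-itemsets that form a k-itemset.
    candidates = set()
    for l1 in prev_frequent:
        for l2 in prev_frequent:
            c = l1 | l2
            if len(c) == k and not has_infrequent_subset(c, prev_frequent):
                candidates.add(c)          # prune step already applied
    return candidates

def has_infrequent_subset(c, prev_frequent):
    # Apriori property: prune c if any (k-1)-subset of c is infrequent.
    return any(frozenset(s) not in prev_frequent
               for s in combinations(c, len(c) - 1))

transactions = [{"bread", "milk"},
                {"bread", "diapers", "beer", "eggs"},
                {"milk", "diapers", "beer", "cola"},
                {"bread", "milk", "diapers", "beer"},
                {"bread", "milk", "diapers", "cola"}]
for itemset, count in apriori(transactions, min_sup=3).items():
    print(set(itemset), count)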

 

Disadvantages of the Apriori algorithm:

1. It performs multiple scans of the database;

2. It can generate a very large number of candidate itemsets;

3. Counting the support of the candidate sets is tedious.

Directions for improvement:

1. Reduce the number of database scans;

2. Shrink the set of candidates;

3. Improve how candidate support is counted.

Method 1: hash-based technique

A hash function maps each itemset to a bucket of a hash table during a database scan. If a bucket's total count falls below the minimum support count, no itemset hashed to that bucket can be frequent, so those candidates can be eliminated early.
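
A minimal Python sketch of the idea for 2-itemsets, reusing the toy transactions from the listing above (the bucket count and the use of Python's built-in hash are arbitrary choices for illustration):

from itertools import combinations

def build_pair_filter(transactions, min_sup, n_buckets=101):
    # While scanning the database (e.g. during the 1-itemset count),
    # hash every 2-itemset occurring in a transaction into a bucket.
    # A pair whose whole bucket totals less than min_sup cannot itself
    # reach min_sup, so it can be dropped from the candidate 2-itemsets.
    buckets = [0] * n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    def may_be_frequent(pair):
        return buckets[hash(tuple(sorted(pair))) % n_buckets] >= min_sup
    return may_be_frequent

may_be_frequent = build_pair_filter(transactions, min_sup=3)
print(may_be_frequent(("beer", "diapers")))  # False means the pair is pruned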

Method 2: transaction reduction

A transaction that does not contain any frequent k-itemset cannot contain any frequent (k+1)-itemset, so such a transaction can be marked or removed from consideration in subsequent scans.
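
In code, the reduction is just a filter between passes; a sketch, assuming each transaction is a set and Lk holds the frequent k-itemsets as frozensets:

def reduce_transactions(transactions, Lk):
    # Keep only transactions that contain at least one frequent k-itemset;
    # the others cannot contribute to any frequent (k+1)-itemset.
    return [t for t in transactions if any(itemset <= t for itemset in Lk)]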

Method 3: partitioning

Divide the database into non-overlapping partitions that fit in memory. Any itemset frequent in the whole database must be frequent in at least one partition, so two scans suffice: one to mine local frequent itemsets in each partition, and one to count the union of the local results against the full database (a sketch follows this list).

Method 4: sampling

Mine a random sample of the database, typically with a lowered support threshold, and then verify the discovered itemsets against the full database.

Method 5: dynamic itemset counting

Divide the database into blocks and allow new candidate itemsets to be added at any block boundary during a scan, rather than only before a completely new scan.
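
A sketch of the two-scan partitioning scheme, reusing the apriori function from the listing above (the partition count is an arbitrary choice):

def partitioned_apriori(transactions, min_sup_fraction, n_parts=4):
    # Scan 1: mine each partition locally; every globally frequent itemset
    # must be locally frequent somewhere, so the union of the local results
    # is a superset of the answer. Scan 2: count that union globally.
    n = len(transactions)
    size = (n + n_parts - 1) // n_parts
    candidates = set()
    for i in range(0, n, size):
        part = transactions[i:i + size]
        local_min_sup = max(1, int(min_sup_fraction * len(part)))
        candidates |= set(apriori(part, local_min_sup))
    counts = {c: 0 for c in candidates}
    for t in map(frozenset, transactions):
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return {c: k for c, k in counts.items() if k >= min_sup_fraction * n}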

The main overhead of the Apriori algorithm lies in generating and counting a large number of candidate frequent itemsets. The FP-growth algorithm, which compresses the database into an FP-tree, can discover frequent patterns without generating candidates at all.
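
In practice, library implementations of both approaches exist; for example, assuming the third-party mlxtend package is installed, the two algorithms can be compared on the toy data roughly like this:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth

dataset = [["bread", "milk"],
           ["bread", "diapers", "beer", "eggs"],
           ["milk", "diapers", "beer", "cola"],
           ["bread", "milk", "diapers", "beer"],
           ["bread", "milk", "diapers", "cola"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(dataset).transform(dataset), columns=te.columns_)

# Same frequent itemsets; fpgrowth builds an FP-tree instead of
# generating and counting candidates level by level.
print(apriori(df, min_support=0.6, use_colnames=True))
print(fpgrowth(df, min_support=0.6, use_colnames=True))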

 
