I. Concepts
Association rule mining: discovering interesting and frequent patterns, associations, and correlations among itemsets in large amounts of data, such as transactional and relational databases.
Interestingness measures for association rules: support and confidence. For a rule A => B, support is the fraction of transactions that contain both A and B, and confidence is the fraction of transactions containing A that also contain B.
k-itemset: a set of k items.
Frequency of an itemset (its support count): the number of transactions that contain the itemset.
Frequent itemset: an itemset whose frequency is at least minimum support × total number of transactions.
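The following minimal Python sketch illustrates these definitions; the toy transaction database and the itemsets queried are made up for illustration:

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support_count(itemset, transactions):
    # Frequency of the itemset: number of transactions containing all its items.
    return sum(1 for t in transactions if itemset <= t)

n = len(transactions)
sup = support_count({"diapers", "beer"}, transactions) / n   # support of {diapers, beer}
conf = (support_count({"diapers", "beer"}, transactions)
        / support_count({"diapers"}, transactions))          # confidence of diapers => beer
print(f"support = {sup:.2f}, confidence = {conf:.2f}")       # support = 0.60, confidence = 0.75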
II. Classification of Association Rule Mining
1. Type of values handled in the rule: Boolean association rules and quantitative association rules
2. Number of data dimensions involved in the rule: single-dimensional association rules and multidimensional association rules
3. Levels of abstraction involved in the rule: single-level association rules and multilevel association rules
4. Extensions of association mining: mining maximal frequent patterns and mining frequent closed itemsets
III. Association Rule Mining Process in Large Databases
1. Find all frequent itemsets. Most of the computation is concentrated in this step.
2. Generate strong association rules from the frequent itemsets, i.e., rules that satisfy both the minimum support and the minimum confidence threshold (a sketch of this step follows below).
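Step 2 is straightforward once step 1 is done: for each frequent itemset, every split into an antecedent and a consequent is tested against the confidence threshold. A minimal Python sketch, where the freq dictionary of support counts is hypothetical example output of step 1:

from itertools import combinations

def generate_rules(freq, min_conf):
    # freq maps frequent itemsets (as sorted tuples) to their support counts.
    # By the Apriori property, every subset of a frequent itemset is in freq.
    rules = []
    for itemset, sup in freq.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                # confidence(A => B) = support(A and B) / support(A)
                conf = sup / freq[antecedent]
                if conf >= min_conf:
                    consequent = tuple(i for i in itemset if i not in antecedent)
                    rules.append((antecedent, consequent, conf))
    return rules

freq = {(1,): 6, (2,): 7, (5,): 2, (1, 2): 4, (1, 5): 2, (2, 5): 2, (1, 2, 5): 2}
for a, c, conf in generate_rules(freq, min_conf=0.7):
    print(a, "=>", c, f"(confidence {conf:.2f})")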
IV. Algorithm for Finding Frequent Itemsets: The Apriori Algorithm
Using prior knowledge of frequent itemsets, the Apriori algorithm performs an iterative, level-wise search in which the frequent k-itemsets are used to explore the frequent (k+1)-itemsets, until no more frequent itemsets can be found in the dataset.
To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
Steps:
1. The join step: to find Lk, a set of candidate k-itemsets, denoted Ck, is generated by joining Lk-1 with itself.
Two members l1 and l2 of Lk-1 can be joined when their first k-2 items are identical and l1[k-1] < l2[k-1], with the items of each itemset kept in sorted order (this condition appears explicitly in the pseudocode below).
The frequent itemsets in Ck form Lk.
2. The prune step: reduce the amount of computation by using the Apriori property: any candidate in Ck that has an infrequent (k-1)-subset cannot itself be frequent, so it is deleted from Ck before its support is ever counted. A concrete sketch of both steps follows.
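A minimal Python sketch of candidate generation with join and prune; the example L2 is made up, and itemsets are represented as sorted tuples:

from itertools import combinations

def apriori_gen(L_prev, k):
    # Generate candidate k-itemsets from the frequent (k-1)-itemsets L_prev.
    prev = set(L_prev)
    Ck = []
    for l1 in L_prev:
        for l2 in L_prev:
            # Join condition: first k-2 items equal, last item of l1 < last item of l2.
            if l1[:k - 2] == l2[:k - 2] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                # Prune: every (k-1)-subset of c must already be frequent.
                if all(s in prev for s in combinations(c, k - 1)):
                    Ck.append(c)
    return Ck

L2 = [(1, 2), (1, 3), (2, 3), (2, 4)]
print(apriori_gen(L2, 3))  # [(1, 2, 3)]; (2, 3, 4) is pruned because (3, 4) is not in L2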
Algorithm: Apriori. Find frequent itemsets using an iterative level-wise approach based on candidate generation.
Input:
D, a database of transactions;
min_sup, the minimum support count threshold.
Output: L, frequent itemsets in D.
Method:
L1 = find_frequent_1-itemsets(D);
for (k = 2; Lk-1 != ∅; k++) {
    Ck = apriori_gen(Lk-1);
    for each transaction t ∈ D {           // scan D for counts
        Ct = subset(Ck, t);                // get the subsets of t that are candidates
        for each candidate c ∈ Ct
            c.count++;
    }
    Lk = {c ∈ Ck | c.count >= min_sup}
}
return L = ∪k Lk;

procedure apriori_gen(Lk-1: frequent (k-1)-itemsets)
for each itemset l1 ∈ Lk-1
    for each itemset l2 ∈ Lk-1
        if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then {
            c = l1 ⋈ l2;                   // join step: generate candidates
            if has_infrequent_subset(c, Lk-1) then
                delete c;                  // prune step: remove unfruitful candidate
            else add c to Ck;
        }
return Ck;

procedure has_infrequent_subset(c: candidate k-itemset; Lk-1: frequent (k-1)-itemsets)  // use prior knowledge
for each (k-1)-subset s of c
    if s ∉ Lk-1 then
        return true;
return false;
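A runnable Python sketch that mirrors the pseudocode above (the helper names follow the pseudocode, min_sup is an absolute support count, the example database at the end is illustrative, and apriori_gen is the same as in the join/prune sketch earlier):

from itertools import combinations

def find_frequent_1_itemsets(D, min_sup):
    counts = {}
    for t in D:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    return {c: n for c, n in counts.items() if n >= min_sup}

def has_infrequent_subset(c, prev, k):
    # Prior knowledge: every (k-1)-subset of a frequent k-itemset must be frequent.
    return any(s not in prev for s in combinations(c, k - 1))

def apriori_gen(L_prev, k):
    prev = set(L_prev)
    Ck = []
    for l1 in L_prev:
        for l2 in L_prev:
            # Join step: first k-2 items equal and l1[k-1] < l2[k-1].
            if l1[:k - 2] == l2[:k - 2] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                if not has_infrequent_subset(c, prev, k):   # prune step
                    Ck.append(c)
    return Ck

def apriori(D, min_sup):
    D = [tuple(sorted(t)) for t in D]              # keep items of each transaction sorted
    L = find_frequent_1_itemsets(D, min_sup)       # all frequent itemsets, with counts
    L_prev, k = sorted(L), 2
    while L_prev:
        Ck = apriori_gen(L_prev, k)
        counts = {c: 0 for c in Ck}
        for t in D:                                # scan D once per level
            t_set = set(t)
            for c in Ck:
                if t_set.issuperset(c):            # subset(Ck, t), done naively here
                    counts[c] += 1
        Lk = {c: n for c, n in counts.items() if n >= min_sup}
        L.update(Lk)
        L_prev, k = sorted(Lk), k + 1
    return L                                       # L = union over all k of Lk

D = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3}, {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]
print(apriori(D, min_sup=2))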
Disadvantages of the Apriori algorithm:
1. It performs multiple scans of the database;
2. It can generate a large number of candidate itemsets;
3. Counting the support of all the candidates is tedious.
Solutions:
1. Reduce the number of database scans;
2. Reduce the number of candidates;
3. Improve the method of support counting.
Method 1: hash-based technique
While scanning the database, map each itemset into a bucket of a hash table via a hash function and keep a count per bucket. An itemset whose bucket count is below the minimum support count cannot be frequent, so it can be removed from the candidate set before its exact support is ever counted.
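A minimal sketch of this idea for 2-itemsets; the hash function (10a + b) mod 7, the bucket count of 7, and the toy database are illustrative choices:

from itertools import combinations

def hash_bucket_counts(D, num_buckets=7):
    # During the first scan (while counting 1-itemsets), also hash every
    # 2-itemset of each transaction into a bucket and count per bucket.
    buckets = [0] * num_buckets
    for t in D:
        for a, b in combinations(sorted(t), 2):
            buckets[(10 * a + b) % num_buckets] += 1
    return buckets

def may_be_frequent(a, b, buckets, min_sup, num_buckets=7):
    # A 2-itemset whose bucket count is already below min_sup cannot be
    # frequent, so it is excluded from C2 without an exact count.
    return buckets[(10 * a + b) % num_buckets] >= min_sup

D = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3}, {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]
buckets = hash_bucket_counts(D)                    # [2, 2, 4, 2, 2, 4, 4]
print(may_be_frequent(1, 5, buckets, min_sup=3))   # False: bucket count 2 < 3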
Method 2: transaction reduction
A transaction that does not contain any frequent k-itemset cannot contain any frequent (k+1)-itemset, so such transactions can be marked or removed from the set of transactions considered in later scans.
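A minimal sketch of transaction reduction; D and L2 below are made-up examples:

def reduce_transactions(D, Lk):
    # Keep only transactions containing at least one frequent k-itemset;
    # the others cannot contribute to any frequent (k+1)-itemset.
    return [t for t in D if any(set(c) <= t for c in Lk)]

D = [{1, 2, 5}, {2, 4}, {4, 5}]
L2 = [(1, 2), (2, 4)]
print(reduce_transactions(D, L2))  # [{1, 2, 5}, {2, 4}]; {4, 5} is dropped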
Method 3: partitioning
Divide the database into non-overlapping partitions that each fit in memory. Any itemset that is frequent in D must be frequent in at least one partition, so the union of the local frequent itemsets forms a candidate set that is verified with one additional scan of D (a sketch follows below).
Method 4: sampling
Mine a random sample of D with a lowered support threshold, then verify the result against the full database.
Method 5: dynamic itemset counting
Start counting new candidate itemsets at several start points during a scan, rather than only between complete scans.
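A minimal, self-contained sketch of the partitioning idea (two scans of D in total); the brute-force local miner, the partition count, and the data are illustrative simplifications:

from itertools import combinations

def local_frequent_itemsets(part, min_sup, max_k=3):
    # Brute-force miner; acceptable because each partition fits in memory.
    items = sorted({i for t in part for i in t})
    found = set()
    for k in range(1, max_k + 1):
        for c in combinations(items, k):
            if sum(1 for t in part if set(c) <= t) >= min_sup:
                found.add(c)
    return found

def partitioned_mine(D, min_sup_fraction, num_parts=2, max_k=3):
    # Phase 1: local frequent itemsets per partition (candidate generation).
    size = (len(D) + num_parts - 1) // num_parts
    candidates = set()
    for i in range(0, len(D), size):
        part = D[i:i + size]
        local_min = max(1, int(min_sup_fraction * len(part)))
        candidates |= local_frequent_itemsets(part, local_min, max_k)
    # Phase 2: one full scan of D to verify global support of each candidate.
    global_min = min_sup_fraction * len(D)
    result = {}
    for c in sorted(candidates):
        n = sum(1 for t in D if set(c) <= t)
        if n >= global_min:
            result[c] = n
    return result

D = [{1, 2}, {1, 2, 3}, {2, 3}, {1, 4}, {2, 3, 4}, {1, 2, 3}]
print(partitioned_mine(D, min_sup_fraction=0.5))
# {(1,): 4, (1, 2): 3, (2,): 5, (2, 3): 4, (3,): 4}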
The main overhead of the Apriori algorithm is generating a large number of candidate frequent itemsets. The FP-growth algorithm, which compresses the database into an FP-tree, can mine frequent patterns without candidate generation.