Introduction to DM: Summary of Apriori

Source: Internet
Author: User

Apriori algorithm: use candidate itemsets to find frequent itemsets

The Apriori algorithm is a basic algorithm in association analysis, used to mine the frequent itemsets of Boolean association rules. Principle: it exploits prior knowledge about frequent itemsets in a level-wise, iterative search, using frequent k-itemsets to explore (k+1)-itemsets. Here we look at the two-dimensional case (databases are generally two-dimensional tables anyway).

The Apriori property: all non-empty subsets of a frequent itemset must also be frequent. (Anti-monotonicity: equivalently, if a set is infrequent, so is every superset of it.)

Algorithm description:
1. Join step: generate the candidate k-itemsets C_k by self-joining L_{k-1}.
2. Prune step: C_k is a superset of L_k. By the Apriori property, if any (k-1)-subset of a candidate k-itemset is not in L_{k-1}, that candidate is removed from C_k. (This subset test can be done quickly using a hash tree of all frequent itemsets.)
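As a sketch of the join and prune steps, here is a minimal Python version using the classic L3 example from the data-mining literature (the function name and the sorted-tuple representation are my own choices):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Join step: self-join L_{k-1} on the first k-2 items.
    Prune step: drop any candidate with an infrequent (k-1)-subset.
    Itemsets are represented as sorted tuples."""
    prev = set(L_prev)
    candidates = set()
    for a in L_prev:
        for b in L_prev:
            # join two (k-1)-itemsets that agree on everything but the last item
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                c = a + (b[-1],)
                # prune: every (k-1)-subset of c must itself be frequent
                if all(s in prev for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates

L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
print(apriori_gen(L3, 4))  # {(1, 2, 3, 4)}; (1, 3, 4, 5) is pruned since (1, 4, 5) is not in L3
```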

Pseudocode:

Apriori
Input: transaction database D; minimum support threshold min_sup.
Output: frequent itemsets L in D.

L_1 = find_frequent_1-itemsets(D);
for (k = 2; L_{k-1} != empty; k++) {
    C_k = apriori_gen(L_{k-1}, min_sup);
    for each transaction t in D {
        C_t = subset(C_k, t);   // the candidates in C_k contained in t
        for each candidate c in C_t
            c.count++;
    }
    L_k = { c in C_k | c.count >= min_sup };
}
return L = union of all L_k;

procedure apriori_gen(L_{k-1}, min_sup)
for each itemset l1 in L_{k-1}
    for each itemset l2 in L_{k-1}
        if (l1[1] = l2[1]) and (l1[2] = l2[2]) and ... and (l1[k-2] = l2[k-2]) and (l1[k-1] < l2[k-1]) then {
            c = l1 join l2;   // join step
            if has_infrequent_subset(c, L_{k-1}) then
                delete c;
            else
                add c to C_k;
        }
return C_k;

procedure has_infrequent_subset(c, L_{k-1})
for each (k-1)-subset s of c
    if s not in L_{k-1} then
        return true;
return false;
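The pseudocode above can be turned into a runnable sketch. Here is a minimal Python translation (the toy transaction database and the min_sup value are made up for illustration; support counts are absolute):

```python
from collections import defaultdict
from itertools import combinations

def apriori(transactions, min_sup):
    """Frequent-itemset mining following the pseudocode above.
    min_sup is an absolute support count; itemsets are sorted tuples."""
    # L1: frequent 1-itemsets
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[(item,)] += 1
    L = {c: n for c, n in counts.items() if n >= min_sup}
    all_frequent = dict(L)
    k = 2
    while L:
        prev = set(L)
        # apriori_gen: join L_{k-1} with itself, then prune
        Ck = set()
        for a in prev:
            for b in prev:
                if a[:-1] == b[:-1] and a[-1] < b[-1]:
                    c = a + (b[-1],)
                    if all(s in prev for s in combinations(c, k - 1)):
                        Ck.add(c)
        # scan D and count the candidates contained in each transaction
        counts = defaultdict(int)
        for t in transactions:
            items = set(t)
            for c in Ck:
                if items.issuperset(c):
                    counts[c] += 1
        L = {c: n for c, n in counts.items() if n >= min_sup}
        all_frequent.update(L)
        k += 1
    return all_frequent

D = [('bread', 'milk'),
     ('bread', 'diapers', 'beer', 'eggs'),
     ('milk', 'diapers', 'beer', 'cola'),
     ('bread', 'milk', 'diapers', 'beer'),
     ('bread', 'milk', 'diapers', 'cola')]
freq = apriori(D, min_sup=3)
print(freq[('beer', 'diapers')])  # 3
```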

Generating association rules from frequent itemsets:

Confidence(A => B) = P(B | A)
= support(A ∪ B) / support(A) = support_count(A ∪ B) / support_count(A)

For each frequent itemset l found by the Apriori run above, generate all of its non-empty subsets. (Recall the Apriori property: every such subset is itself frequent, so its support count is already known.)
For each non-empty proper subset s, if support_count(l) / support_count(s) >= min_conf, output the rule "s => (l - s)". (Why support_count(l) rather than support_count(l - s)? Because a transaction supports the rule only when it contains both s and l - s, i.e., all of l; the support of the rule is therefore support(l).)
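A sketch of this rule-generation step, assuming `freq` maps each frequent itemset (as a sorted tuple) to its absolute support count; the example counts are hypothetical and the function name is my own:

```python
from itertools import combinations

def generate_rules(freq, min_conf):
    """Emit rules s => (l - s) for each frequent itemset l and each
    non-empty proper subset s, keeping those whose confidence
    support_count(l) / support_count(s) reaches min_conf."""
    rules = []
    for l, sup_l in freq.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):
            for s in combinations(l, r):
                conf = sup_l / freq[s]  # subsets of l are frequent, so freq has them
                if conf >= min_conf:
                    rest = tuple(x for x in l if x not in s)
                    rules.append((s, rest, conf))
    return rules

# hypothetical counts from a 5-transaction database
freq = {('beer',): 3, ('diapers',): 4, ('beer', 'diapers'): 3}
for s, rest, conf in generate_rules(freq, min_conf=0.7):
    print(s, '=>', rest, round(conf, 2))
```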

How can the efficiency of Apriori be improved? By improving the data structures and reducing the number of database scans.

1. Hash-based itemset counting increases efficiency.
A simple analogy: storing counts in a HashMap<key, val> in Java. A trie (prefix tree) can also be used; I will cover C++ implementations later.
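One way to sketch the hash-based idea (in the spirit of the hash-bucket technique for pruning candidate 2-itemsets; the bucket count, names, and toy data here are my own assumptions):

```python
from itertools import combinations

def hash_bucket_prune(transactions, min_sup, n_buckets=7):
    """While scanning for 1-itemsets, also hash every 2-itemset of each
    transaction into a bucket. A pair can only be frequent if its bucket's
    total count reaches min_sup, so light buckets prune C2 early."""
    buckets = [0] * n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    # a candidate pair survives only if its bucket is heavy enough
    return lambda pair: buckets[hash(pair) % n_buckets] >= min_sup

D = [('bread', 'milk'),
     ('bread', 'diapers', 'beer'),
     ('milk', 'diapers', 'beer'),
     ('bread', 'milk', 'diapers')]
keep = hash_bucket_prune(D, min_sup=2)
# ('bread', 'milk') occurs twice, so its bucket count is at least 2 and it survives
```

Note that the test is one-sided: a truly frequent pair always survives, but an infrequent pair may also survive if it collides into a heavy bucket; its exact count is still checked later.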

2. Transaction compression: a transaction that contains no frequent k-itemset cannot contain any frequent (k+1)-itemset. Such a transaction can be marked as deleted so that it is not considered in later iterations.

3. Partitioning (not partitioning into equivalence classes; just a pragmatic split for the program, e.g., by how many transactions can be loaded into memory at a time).
Two scans. In the first, divide the transactions of D into n non-overlapping partitions. If the minimum support threshold for D is min_sup (as a fraction), the minimum support count for each partition is min_sup * (the number of transactions in that partition). Find the locally frequent itemsets of each partition.
A locally frequent itemset is not necessarily frequent in the whole database D, but any itemset frequent in D must be locally frequent in at least one partition. (Why? If it fell below the local threshold in every partition, summing over all partitions would put its total count below min_sup * |D|, contradicting its global frequency.) The union of all locally frequent itemsets therefore forms the global candidate set. The second scan of D computes the actual support of each candidate.
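A minimal sketch of the two-scan partition scheme, restricted to pairs for brevity (partitioning by slicing, the function name, and the toy database are my own choices; min_sup here is an absolute global count):

```python
from collections import Counter
from itertools import combinations

def partitioned_frequent_pairs(transactions, min_sup, n_parts=2):
    """Scan 1: each partition reports its locally frequent pairs, using a
    local threshold scaled to the partition's size. Scan 2: count the union
    of local results over the whole database and keep the truly frequent."""
    n = len(transactions)
    candidates = set()
    for i in range(n_parts):
        part = transactions[i::n_parts]          # non-overlapping partitions
        local_min = min_sup * len(part) / n      # scaled local threshold
        local = Counter(p for t in part for p in combinations(sorted(t), 2))
        candidates |= {p for p, c in local.items() if c >= local_min}
    # second full scan: exact global counts for the candidates only
    total = Counter(p for t in transactions
                    for p in combinations(sorted(t), 2) if p in candidates)
    return {p: c for p, c in total.items() if c >= min_sup}

D = [('bread', 'milk'),
     ('bread', 'diapers', 'beer', 'eggs'),
     ('milk', 'diapers', 'beer', 'cola'),
     ('bread', 'milk', 'diapers', 'beer'),
     ('bread', 'milk', 'diapers', 'cola')]
print(partitioned_frequent_pairs(D, min_sup=3))  # the four pairs with global support >= 3
```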

4. Sampling: sample + verify, trading some accuracy for speed.

5. Dynamic itemset counting: unlike Apriori, where new candidate sets are determined only just before each complete scan of the database, new candidates can be added at any marked start point during a scan. (How best to implement this?)
