Introduction to DM: Summary of Apriori

Source: Internet
Author: User

Apriori algorithm: use candidate itemsets to find frequent itemsets

The Apriori algorithm is a basic algorithm in association analysis, used to mine the frequent itemsets of Boolean association rules. Principle: it exploits prior knowledge about frequent itemsets in a level-wise, iterative search, using frequent k-itemsets to explore (k+1)-itemsets. Here we look at the two-dimensional case (databases are generally two-dimensional tables anyway).

The Apriori property: all non-empty subsets of a frequent itemset must also be frequent. (Anti-monotonicity: equivalently, if a set is infrequent, so is every superset of it.)

Algorithm description:
1. Join step: generate the candidate k-itemsets C_k by self-joining L_{k-1}.
2. Prune step: C_k is a superset of L_k. By the Apriori property, if any (k-1)-subset of a candidate k-itemset is not in L_{k-1}, that candidate is removed from C_k. (This subset test can be done quickly using a hash tree of all frequent itemsets.)
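As a sketch of the join and prune steps, here is a minimal Python version using the classic L3 example from the data-mining literature (the function name and the sorted-tuple representation are my own choices):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Join step: self-join L_{k-1} on the first k-2 items.
    Prune step: drop any candidate with an infrequent (k-1)-subset.
    Itemsets are represented as sorted tuples."""
    prev = set(L_prev)
    candidates = set()
    for a in L_prev:
        for b in L_prev:
            # join two (k-1)-itemsets that agree on everything but the last item
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                c = a + (b[-1],)
                # prune: every (k-1)-subset of c must itself be frequent
                if all(s in prev for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates

L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
print(apriori_gen(L3, 4))  # {(1, 2, 3, 4)}; (1, 3, 4, 5) is pruned since (1, 4, 5) is not in L3
```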

Pseudocode:

Apriori
Input: transaction database D; minimum support threshold min_sup.
Output: frequent itemsets L in D.

L_1 = find_frequent_1-itemsets(D);
for (k = 2; L_{k-1} != empty; k++) {
    C_k = apriori_gen(L_{k-1}, min_sup);
    for each transaction t in D {
        C_t = subset(C_k, t);   // the candidates in C_k contained in t
        for each candidate c in C_t
            c.count++;
    }
    L_k = { c in C_k | c.count >= min_sup };
}
return L = union of all L_k;

procedure apriori_gen(L_{k-1}, min_sup)
for each itemset l1 in L_{k-1}
    for each itemset l2 in L_{k-1}
        if (l1[1] = l2[1]) and (l1[2] = l2[2]) and ... and (l1[k-2] = l2[k-2]) and (l1[k-1] < l2[k-1]) then {
            c = l1 join l2;   // join step
            if has_infrequent_subset(c, L_{k-1}) then
                delete c;
            else
                add c to C_k;
        }
return C_k;

procedure has_infrequent_subset(c, L_{k-1})
for each (k-1)-subset s of c
    if s not in L_{k-1} then
        return true;
return false;
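The pseudocode above can be turned into a runnable sketch. Here is a minimal Python translation (the toy transaction database and the min_sup value are made up for illustration; support counts are absolute):

```python
from collections import defaultdict
from itertools import combinations

def apriori(transactions, min_sup):
    """Frequent-itemset mining following the pseudocode above.
    min_sup is an absolute support count; itemsets are sorted tuples."""
    # L1: frequent 1-itemsets
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[(item,)] += 1
    L = {c: n for c, n in counts.items() if n >= min_sup}
    all_frequent = dict(L)
    k = 2
    while L:
        prev = set(L)
        # apriori_gen: join L_{k-1} with itself, then prune
        Ck = set()
        for a in prev:
            for b in prev:
                if a[:-1] == b[:-1] and a[-1] < b[-1]:
                    c = a + (b[-1],)
                    if all(s in prev for s in combinations(c, k - 1)):
                        Ck.add(c)
        # scan D and count the candidates contained in each transaction
        counts = defaultdict(int)
        for t in transactions:
            items = set(t)
            for c in Ck:
                if items.issuperset(c):
                    counts[c] += 1
        L = {c: n for c, n in counts.items() if n >= min_sup}
        all_frequent.update(L)
        k += 1
    return all_frequent

D = [('bread', 'milk'),
     ('bread', 'diapers', 'beer', 'eggs'),
     ('milk', 'diapers', 'beer', 'cola'),
     ('bread', 'milk', 'diapers', 'beer'),
     ('bread', 'milk', 'diapers', 'cola')]
freq = apriori(D, min_sup=3)
print(freq[('beer', 'diapers')])  # 3
```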

Generating association rules from frequent itemsets:

Confidence(A => B) = P(B | A)
= support(A ∪ B) / support(A) = support_count(A ∪ B) / support_count(A)

For each frequent itemset l found by the Apriori run above, generate all of its non-empty subsets. (Recall the Apriori property: every such subset is itself frequent, so its support count is already known.)
For each non-empty proper subset s, if support_count(l) / support_count(s) >= min_conf, output the rule "s => (l - s)". (Why support_count(l) rather than support_count(l - s)? Because a transaction supports the rule only when it contains both s and l - s, i.e., all of l; the support of the rule is therefore support(l).)
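A sketch of this rule-generation step, assuming `freq` maps each frequent itemset (as a sorted tuple) to its absolute support count; the example counts are hypothetical and the function name is my own:

```python
from itertools import combinations

def generate_rules(freq, min_conf):
    """Emit rules s => (l - s) for each frequent itemset l and each
    non-empty proper subset s, keeping those whose confidence
    support_count(l) / support_count(s) reaches min_conf."""
    rules = []
    for l, sup_l in freq.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):
            for s in combinations(l, r):
                conf = sup_l / freq[s]  # subsets of l are frequent, so freq has them
                if conf >= min_conf:
                    rest = tuple(x for x in l if x not in s)
                    rules.append((s, rest, conf))
    return rules

# hypothetical counts from a 5-transaction database
freq = {('beer',): 3, ('diapers',): 4, ('beer', 'diapers'): 3}
for s, rest, conf in generate_rules(freq, min_conf=0.7):
    print(s, '=>', rest, round(conf, 2))
```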

How can the efficiency of Apriori be improved? By improving the data structures and reducing the number of database scans.

1. Hash-based itemset counting increases efficiency.
A simple analogy: storing counts in a HashMap<key, val> in Java. A trie (prefix tree) can also be used; I will cover C++ implementations later.
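One way to sketch the hash-based idea (in the spirit of the hash-bucket technique for pruning candidate 2-itemsets; the bucket count, names, and toy data here are my own assumptions):

```python
from itertools import combinations

def hash_bucket_prune(transactions, min_sup, n_buckets=7):
    """While scanning for 1-itemsets, also hash every 2-itemset of each
    transaction into a bucket. A pair can only be frequent if its bucket's
    total count reaches min_sup, so light buckets prune C2 early."""
    buckets = [0] * n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    # a candidate pair survives only if its bucket is heavy enough
    return lambda pair: buckets[hash(pair) % n_buckets] >= min_sup

D = [('bread', 'milk'),
     ('bread', 'diapers', 'beer'),
     ('milk', 'diapers', 'beer'),
     ('bread', 'milk', 'diapers')]
keep = hash_bucket_prune(D, min_sup=2)
# ('bread', 'milk') occurs twice, so its bucket count is at least 2 and it survives
```

Note that the test is one-sided: a truly frequent pair always survives, but an infrequent pair may also survive if it collides into a heavy bucket; its exact count is still checked later.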

2. Transaction compression: a transaction that contains no frequent k-itemset cannot contain any frequent (k+1)-itemset. Such a transaction can be marked as deleted so that it is not considered in later iterations.

3. Partitioning (not partitioning into equivalence classes; just a pragmatic split for the program, e.g., by how many transactions can be loaded into memory at a time).
Two scans. In the first, divide the transactions of D into n non-overlapping partitions. If the minimum support threshold for D is min_sup (as a fraction), the minimum support count for each partition is min_sup * (the number of transactions in that partition). Find the locally frequent itemsets of each partition.
A locally frequent itemset is not necessarily frequent in the whole database D, but any itemset frequent in D must be locally frequent in at least one partition. (Why? If it fell below the local threshold in every partition, summing over all partitions would put its total count below min_sup * |D|, contradicting its global frequency.) The union of all locally frequent itemsets therefore forms the global candidate set. The second scan of D computes the actual support of each candidate.
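A minimal sketch of the two-scan partition scheme, restricted to pairs for brevity (partitioning by slicing, the function name, and the toy database are my own choices; min_sup here is an absolute global count):

```python
from collections import Counter
from itertools import combinations

def partitioned_frequent_pairs(transactions, min_sup, n_parts=2):
    """Scan 1: each partition reports its locally frequent pairs, using a
    local threshold scaled to the partition's size. Scan 2: count the union
    of local results over the whole database and keep the truly frequent."""
    n = len(transactions)
    candidates = set()
    for i in range(n_parts):
        part = transactions[i::n_parts]          # non-overlapping partitions
        local_min = min_sup * len(part) / n      # scaled local threshold
        local = Counter(p for t in part for p in combinations(sorted(t), 2))
        candidates |= {p for p, c in local.items() if c >= local_min}
    # second full scan: exact global counts for the candidates only
    total = Counter(p for t in transactions
                    for p in combinations(sorted(t), 2) if p in candidates)
    return {p: c for p, c in total.items() if c >= min_sup}

D = [('bread', 'milk'),
     ('bread', 'diapers', 'beer', 'eggs'),
     ('milk', 'diapers', 'beer', 'cola'),
     ('bread', 'milk', 'diapers', 'beer'),
     ('bread', 'milk', 'diapers', 'cola')]
print(partitioned_frequent_pairs(D, min_sup=3))  # the four pairs with global support >= 3
```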

4. Sampling: sample + verify, trading some accuracy for speed.

5. Dynamic itemset counting: unlike Apriori, where new candidate sets are determined only just before each complete scan of the database, new candidates can be added at any marked start point during a scan. (How best to implement this?)
