Data Mining Algorithms: Association Analysis II (Apriori)


II. The Apriori algorithm

  

As mentioned above, most association rule mining algorithms adopt a strategy that decomposes the problem into two steps:

  1. Frequent itemset generation: find all itemsets that meet the minimum support threshold; these are called frequent itemsets.

  2. Rule generation: extract high-confidence rules from the frequent itemsets found in the previous step; these are called strong rules. In practice, the computation required for frequent itemset generation is usually far greater than the cost of rule generation.
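To make the rule-generation step concrete, below is a minimal sketch (illustrative code, not from the original article). The names support_counts and min_confidence are assumptions: support_counts is taken to map every frequent itemset to its support count, and the confidence of a rule X -> Y is computed as support(X ∪ Y) / support(X).

    from itertools import combinations

    def generate_rules(support_counts, min_confidence):
        """support_counts: dict mapping frozenset(itemset) -> support count of
        every frequent itemset; min_confidence: threshold for strong rules."""
        rules = []
        for itemset, count in support_counts.items():
            if len(itemset) < 2:
                continue
            # Try every non-empty proper subset of the itemset as the antecedent.
            for r in range(1, len(itemset)):
                for antecedent in map(frozenset, combinations(itemset, r)):
                    consequent = itemset - antecedent
                    # The antecedent is itself frequent (Apriori principle),
                    # so its support count is available in support_counts.
                    confidence = count / support_counts[antecedent]
                    if confidence >= min_confidence:
                        rules.append((antecedent, consequent, confidence))
        return rules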

  A naive way to find frequent itemsets is to compute the support of every candidate itemset in the itemset lattice, but the amount of work involved is enormous (a brute-force sketch is shown after the list below). There are several ways to reduce the computational complexity of frequent itemset generation:

    1. Reduce the number of candidate itemsets. The a priori (Apriori) principle, for example, removes some candidate itemsets without ever computing their support.
    2. Reduce the number of comparisons. More advanced data structures can be used to store the candidate itemsets or to compress the dataset, reducing the number of comparisons needed.
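As a point of reference for the two strategies above, here is a brute-force sketch of frequent itemset generation (illustrative code, not from the original article): it enumerates every itemset in the lattice and scans all transactions to count its support. With d distinct items there are 2^d - 1 candidates, which is exactly the exponential cost those strategies try to avoid.

    from itertools import combinations

    def brute_force_frequent_itemsets(transactions, min_support_count):
        transactions = [set(t) for t in transactions]
        items = sorted({item for t in transactions for item in t})
        frequent = {}
        # Enumerate every itemset of every length in the lattice (2^d - 1 in total).
        for k in range(1, len(items) + 1):
            for candidate in map(frozenset, combinations(items, k)):
                # One full scan of the transactions per candidate.
                count = sum(1 for t in transactions if candidate <= t)
                if count >= min_support_count:
                    frequent[candidate] = count
        return frequent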
1. Algorithm analysis

The Apriori algorithm was the first association rule mining algorithm, and it pioneered the use of support-based pruning to control the exponential growth of candidate itemsets. It produces frequent itemsets in two alternating steps: first, find all the frequent itemsets among the current candidates; second, generate new candidate itemsets one item longer from the current frequent itemsets.
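The two-step structure described above can be sketched as follows (an illustrative simplification, not the article's code). Candidates of length k+1 are generated by joining pairs of frequent k-itemsets; the subset-based pruning of the Apriori principle, discussed next, is omitted here for brevity.

    def apriori_frequent_itemsets(transactions, min_support_count):
        transactions = [set(t) for t in transactions]

        # Length 1: count individual items and keep the frequent ones.
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        current = {s: c for s, c in counts.items() if c >= min_support_count}
        all_frequent = dict(current)

        k = 1
        while current:
            # Join frequent k-itemsets pairwise to form length-(k+1) candidates.
            keys = list(current)
            candidates = {a | b for i, a in enumerate(keys)
                          for b in keys[i + 1:] if len(a | b) == k + 1}
            # Count the support of each candidate and keep the frequent ones.
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            current = {s: c for s, c in counts.items() if c >= min_support_count}
            all_frequent.update(current)
            k += 1
        return all_frequent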

First, let's look at the two important properties at the core of the Apriori algorithm:

If an itemset is frequent, then all of its subsets are also frequent.

If an itemset is infrequent, then all of its supersets are also infrequent. The strategy of pruning the exponential search space based on the support measure is called support-based pruning. It relies on the property that the support of an itemset never exceeds the support of any of its subsets; this property is known as the anti-monotone property of the support measure.

If an itemset is infrequent, then none of its supersets need to be considered, because every superset of an infrequent itemset must itself be infrequent. A superset of an itemset is a set that contains all of its items (and possibly more); in the market-basket transaction database, {Milk, Beer} is a superset of {Milk}. The principle is easy to see: if {Milk} appears in 3 transactions, then {Milk, Beer} can appear together in at most 3 transactions. So if the support of an itemset falls below the minimum support threshold, the support of every superset must also fall below that threshold, and those supersets no longer need to be considered.
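A quick numeric check of this property on a handful of made-up transactions (illustrative data, not the dataset from the earlier part of this series):

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
    ]

    def support_count(itemset, transactions):
        return sum(1 for t in transactions if itemset <= t)

    print(support_count({"Milk"}, transactions))          # 3
    print(support_count({"Milk", "Beer"}, transactions))  # 2, never more than 3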

Below is a brief description of how the Apriori algorithm identifies all frequent itemsets in the market-basket example.

First, set the minimum support count to 3. Scanning the itemsets of length 1, we find that {Coke} and {Eggs} do not meet the minimum support count, so they are removed. The remaining 4 frequent itemsets of length 1 generate C(4, 2) = 6 candidate itemsets of length 2. Recomputing the support counts, we find that the itemsets {Bread, Milk} and {Milk, Beer} are infrequent, so they are removed before generating the candidate itemsets of length 3. It is important to note that the candidate {Milk, Beer, Diaper} never needs to be generated, because one of its subsets, {Milk, Beer}, is infrequent; by the Apriori principle, the set itself must be infrequent.
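The pruning decision at the end of the walkthrough can be sketched as a subset check (illustrative code; the itemsets in frequent_2 simply follow the walkthrough above): a candidate of length k is kept only if every one of its length-(k-1) subsets is frequent.

    from itertools import combinations

    def prune_candidates(candidates, frequent_prev):
        """Keep a length-k candidate only if all of its (k-1)-subsets are frequent."""
        kept = []
        for c in candidates:
            if all(frozenset(sub) in frequent_prev
                   for sub in combinations(c, len(c) - 1)):
                kept.append(c)
        return kept

    # {Milk, Beer} is not among the frequent 2-itemsets, so the candidate
    # {Milk, Beer, Diaper} is discarded without ever counting its support.
    frequent_2 = {frozenset(s) for s in ({"Bread", "Diaper"}, {"Bread", "Beer"},
                                         {"Milk", "Diaper"}, {"Beer", "Diaper"})}
    print(prune_candidates([frozenset({"Milk", "Beer", "Diaper"})], frequent_2))  # []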

2. Evaluation of advantages and disadvantages:

The advantage of the Apriori algorithm is that it produces a relatively small set of candidates; its disadvantage is that it scans the database repeatedly, with the number of scans determined by the length of the longest frequent itemset. Apriori is therefore best suited to datasets whose longest frequent itemsets are relatively short.

Using a hash tree to improve the efficiency of the Apriori algorithm when counting the support of candidate itemsets:

From the description of the Apriori algorithm above, we know that the algorithm repeatedly produces candidate itemsets from frequent itemsets, and that for each transaction it must find all the candidate itemsets contained in that transaction before generating candidates of the next length. Doing this by comparing every transaction against every candidate is inefficient; a hash tree can be used to speed up the search for the candidates contained in each transaction.
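Below is a simplified hash tree sketch under the following assumptions (illustrative code, not the article's exact data structure): candidates and transactions are sorted tuples of items, each internal level of the tree hashes on one item, and the leaves hold the candidate itemsets. To count support, the transaction's items are hashed level by level so that only branches the transaction can actually reach are visited, and the candidates stored in the reached leaves are then checked for containment.

    class HashTree:
        """Simplified hash tree for candidate k-itemsets. Candidates and
        transactions are assumed to be sorted tuples of items."""

        def __init__(self, k, branching=3):
            self.k = k
            self.branching = branching
            self.root = {}      # nested dicts of buckets; the last level holds lists
            self.counts = {}    # candidate tuple -> support count

        def _bucket(self, item):
            return hash(item) % self.branching

        def insert(self, candidate):
            self.counts[candidate] = 0
            node = self.root
            for item in candidate[:-1]:               # descend k-1 internal levels
                node = node.setdefault(self._bucket(item), {})
            node.setdefault(self._bucket(candidate[-1]), []).append(candidate)

        def _matches(self, node, transaction, start, depth, found):
            # Hash each remaining transaction item and follow only existing branches.
            for i in range(start, len(transaction)):
                key = self._bucket(transaction[i])
                if key not in node:
                    continue
                if depth == self.k - 1:               # leaf: verify real containment
                    tset = set(transaction)
                    for cand in node[key]:
                        if set(cand) <= tset:
                            found.add(cand)
                else:
                    self._matches(node[key], transaction, i + 1, depth + 1, found)

        def count_transaction(self, transaction):
            found = set()
            self._matches(self.root, transaction, 0, 0, found)
            for cand in found:                        # each matched candidate counted once
                self.counts[cand] += 1

    # Illustrative usage with length-2 candidates and a few made-up transactions:
    tree = HashTree(k=2)
    for cand in [("Beer", "Bread"), ("Bread", "Diaper"),
                 ("Beer", "Diaper"), ("Diaper", "Milk")]:
        tree.insert(cand)
    for t in [("Bread", "Milk"),
              ("Beer", "Bread", "Diaper", "Eggs"),
              ("Beer", "Coke", "Diaper", "Milk")]:
        tree.count_transaction(t)
    print(tree.counts)   # ("Beer", "Diaper") is contained in two of the transactions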

  
