Association mining algorithms: Apriori and FP-Tree Learning

Source: Internet
Author: User

Both the Apriori algorithm and the FP-Tree algorithm are association rule mining algorithms in data mining. Both handle the simplest case: single-dimensional Boolean association rules.

Apriori algorithm

The Apriori algorithm is the most influential algorithm for mining the frequent itemsets of Boolean association rules. It is built on prior knowledge about the nature of frequent itemsets (the Apriori property). Apriori uses an iterative, level-wise search in which frequent k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found and denoted L1. L1 is used to find the set of frequent 2-itemsets, L2; L2 is used to find L3; and so on, until no frequent k-itemset can be found. Finding each Lk requires one full scan of the database.

The core idea, stated simply: if a set I is not a frequent itemset, then no larger set containing I can be a frequent itemset.

The sample transaction data used throughout this article is as follows:

TID     List of item IDs
----    ----------------
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3
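The first pass of the algorithm can be checked against this data directly. The sketch below counts every item in a single database scan and filters by a minimum support count of 2 (the threshold used in this article's examples); variable names are illustrative:

```python
from collections import Counter

# The nine transactions from the table above
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUP = 2  # minimum support count

# One database scan: count item occurrences (the candidate 1-itemsets C1)
counts = Counter(item for t in transactions for item in t)

# Keep only items meeting the support threshold: the frequent 1-itemsets L1
L1 = {item: c for item, c in counts.items() if c >= MIN_SUP}
```

With this data every single item is frequent (I1: 6, I2: 7, I3: 6, I4: 2, I5: 2), so L1 contains all five items.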

The basic process of the algorithm is as follows:

First, scan all transactions to obtain the candidate 1-itemsets C1, then filter out the itemsets that do not meet the minimum support requirement to obtain the frequent 1-itemsets.

Then repeat the following step:

Given the frequent k-itemsets (starting from the known frequent 1-itemsets), join the itemsets to generate all possible candidate (k+1)-itemsets, then prune: if any k-item subset of a candidate (k+1)-itemset fails the support condition, that candidate is cut off. Count the support of the remaining candidates and filter out those below the minimum support to obtain the frequent (k+1)-itemsets. When the resulting set is empty, the algorithm ends.

Join step: assuming the items in every itemset are kept in a fixed order, two k-itemsets I and J are joinable if their first k-1 items are identical and their k-th items differ. For example, {I1, I2} and {I1, I3} are joinable, and joining them yields {I1, I2, I3}; but {I1, I2} and {I2, I3} are not joinable, otherwise duplicate candidate itemsets would be generated.
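The join step can be sketched as follows; `apriori_join` is a hypothetical helper name, and itemsets are compared in sorted order as the text assumes:

```python
def apriori_join(frequent_k):
    """Join two frequent k-itemsets whose first k-1 items agree
    (in sorted order) into a candidate (k+1)-itemset."""
    itemsets = sorted(sorted(s) for s in frequent_k)
    candidates = set()
    for i in range(len(itemsets)):
        for j in range(i + 1, len(itemsets)):
            a, b = itemsets[i], itemsets[j]
            # Joinable: identical first k-1 items, different k-th item
            if a[:-1] == b[:-1] and a[-1] != b[-1]:
                candidates.add(frozenset(a) | frozenset(b))
    return candidates

# {I1, I2} and {I1, I3} join into {I1, I2, I3};
# {I1, I2} and {I2, I3} do not join
candidates = apriori_join([{"I1", "I2"}, {"I1", "I3"}, {"I2", "I3"}])
```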

Here is an example of pruning. On the sample data, the join step over the frequent 2-itemsets produces the candidate 3-itemsets {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, and {I2, I4, I5}. Because the 2-itemsets {I3, I4}, {I3, I5}, and {I4, I5} are not frequent, the candidates {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, and {I2, I4, I5} are pruned, leaving only {I1, I2, I3} and {I1, I2, I5}.
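A sketch of the prune step (`apriori_prune` is an illustrative name), applied to the candidates just listed:

```python
from itertools import combinations

def apriori_prune(candidates, frequent_k):
    """Drop a candidate (k+1)-itemset if any of its k-item subsets
    is not frequent (the Apriori property)."""
    frequent_k = {frozenset(s) for s in frequent_k}
    return {c for c in candidates
            if all(frozenset(sub) in frequent_k
                   for sub in combinations(c, len(c) - 1))}

# Frequent 2-itemsets of the sample data
L2 = [{"I1", "I2"}, {"I1", "I3"}, {"I1", "I5"},
      {"I2", "I3"}, {"I2", "I4"}, {"I2", "I5"}]
# Candidate 3-itemsets produced by the join step
C3 = {frozenset(s) for s in [{"I1", "I2", "I3"}, {"I1", "I2", "I5"},
                             {"I1", "I3", "I5"}, {"I2", "I3", "I4"},
                             {"I2", "I3", "I5"}, {"I2", "I4", "I5"}]}
# Only {I1, I2, I3} and {I1, I2, I5} survive pruning
surviving = apriori_prune(C3, L2)
```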

The time and space costs of the Apriori algorithm cannot be ignored for massive data volumes.

Space: the number of candidate itemsets can explode combinatorially; when there are many frequent 1-itemsets, the number of candidate 2-itemsets grows roughly quadratically with that count.

Time: one full database scan is needed at every level k of the search.

FP-Tree Algorithm

The FP-Tree algorithm (FP-growth) accomplishes what the Apriori algorithm does without generating candidate itemsets.

The basic data structures of the FP-Tree algorithm are the FP-tree itself and an item header table. Each entry in the header table points, through a chain of node links, to that item's occurrences in the tree. Note that the item header table must be sorted in descending order of support: in the FP-tree, a node with higher support can only be an ancestor of nodes with lower support.

In addition, we need to explain several basic concepts in the fptree algorithm:

FP-tree: a prefix tree. The items of each transaction are sorted in descending order of support, each transaction's items are inserted in that order into a tree rooted at null, and each node records the support count accumulated at that node.

Conditional pattern base: the set of prefix paths in the FP-tree that co-occur with a given suffix pattern, i.e. the set of ancestor paths of all nodes of the same frequent item in the FP-tree. For example, I3 appears three times in the FP-tree, with ancestor paths {I2, I1: 2} (the 2 is the path's count), {I2: 2}, and {I1: 2}. The set of these three ancestor paths is the conditional pattern base of the frequent item I3.

Conditional FP-tree: a new FP-tree built from a conditional pattern base according to the same construction rules as the original tree. For example, the conditional FP-tree of I3 is built from its conditional pattern base {(I2 I1: 2), (I2: 2), (I1: 2)}.

 

1. Create the item header table: scan the database once to obtain the set F of frequent items and the support of each. Sort F in descending order of support and denote the result L.

2. Construct the initial FP-tree: sort the frequent items of each transaction in the database according to the order of L, then insert each transaction's frequent items, in that order, into an FP-tree rooted at null. If a node for the frequent item already exists along the insertion path, add 1 to its support count; if not, create a node with support count 1 and link it into the corresponding node chain of the item header table.
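Steps 1 and 2 can be sketched in Python as follows (class and function names are illustrative; ties in support are broken alphabetically here, which the original text leaves unspecified):

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fptree(transactions, min_sup):
    # Step 1: one scan to find frequent items, sorted by descending support
    counts = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in counts.items() if c >= min_sup}
    order = sorted(freq, key=lambda i: (-freq[i], i))
    # Step 2: insert each transaction's frequent items, in that order,
    # into a tree rooted at null, bumping counts along the path
    root = FPNode(None, None)
    header = {i: [] for i in order}  # item header table: node-link chains
    for t in transactions:
        node = root
        for item in (i for i in order if i in t):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)  # link into the header table
            child.count += 1
            node = child
    return root, header

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
root, header = build_fptree(transactions, 2)
```

On the sample data this yields the textbook tree: the I2 branch under the root carries count 7, its I1 child carries count 4, and I3 appears at three positions in the tree (hence three entries in its node chain).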

3. Call FP-growth(Tree, null) to start mining. The pseudocode is as follows:

Procedure FP_growth(Tree, A)
    if Tree contains a single path P then {
        for each combination B of the nodes in path P {
            generate pattern B ∪ A with support = the minimum support count of the nodes in B;
        }
    } else {
        for each item a_i in the header table of Tree (scanned in ascending order of support) {
            generate pattern B = a_i ∪ A with support = a_i.support;
            construct B's conditional pattern base, and from it B's conditional FP-tree Tree_B;
            if Tree_B is not empty then
                call FP_growth(Tree_B, B);
        }
    }

FP-growth is the core of the entire algorithm.

FP-growth function input: Tree is the original FP-tree or the conditional FP-tree of some pattern; A is the pattern suffix (in the first call A = null; in subsequent recursive calls A is the current pattern suffix).

FP-growth function output: all patterns and their supports, emitted during the recursive calls (for example, the support of {I1, I2, I3} is 2). Every pattern output by a call to FP_growth contains the pattern suffix passed into that call.

Let's simulate the execution process of FP-growth.

1. In the first level of the recursion, A = null, and the patterns generated are exactly the frequent 1-itemsets.

2. FP-growth is then called recursively for each frequent 1-item, yielding the longer frequent itemsets.

The following two examples illustrate the execution process of FP-growth.

1. The conditional pattern base of I5 is {(I2 I1: 1), (I2 I1 I3: 1)}, and the conditional FP-tree constructed for I5 is a single path. FP-growth is then called recursively with pattern suffix I5. Because this conditional FP-tree is a single path, FP_growth simply enumerates all combinations of {I2: 2, I1: 2, I3: 1} and joins each with the suffix I5, yielding all patterns with support ≥ 2: {I2 I5: 2, I1 I5: 2, I2 I1 I5: 2}.
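The single-path case for I5 can be checked directly: every non-empty combination of the path's nodes, joined with the suffix, forms a pattern whose support is the smallest count among the chosen nodes. A sketch (variable names are illustrative):

```python
from itertools import combinations

# I5's conditional FP-tree is the single path <I2:2, I1:2, I3:1>
path = [("I2", 2), ("I1", 2), ("I3", 1)]
suffix = "I5"
MIN_SUP = 2

patterns = {}
for r in range(1, len(path) + 1):
    for combo in combinations(path, r):
        # Support of the combination = minimum count of its nodes
        sup = min(c for _, c in combo)
        if sup >= MIN_SUP:
            items = tuple(i for i, _ in combo) + (suffix,)
            patterns[items] = sup
# Every combination involving I3 falls below MIN_SUP, so only
# {I2 I5: 2}, {I1 I5: 2}, and {I2 I1 I5: 2} remain
```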

2. I5 is relatively simple because its conditional FP-tree is a single path. Now consider the slightly more complex case of I3. The conditional pattern base of I3 is {(I2 I1: 2), (I2: 2), (I1: 2)}, and the conditional FP-tree generated from it is a multi-path tree, so FP-growth is called recursively with pattern suffix I3. First, each item in the header table of this conditional FP-tree is joined with the suffix I3, which yields the patterns {I2 I3: 4, I1 I3: 4}. But these are not yet all the patterns ending in I3: FP-growth must be called recursively again, with pattern suffixes {I1, I3} and {I2, I3}. The conditional pattern base of {I1, I3} is {(I2: 2)}, and the conditional FP-tree built from it is a single path, so FP_growth joins I2 with the suffix {I1, I3} to obtain the pattern {I1 I2 I3: 2}. In principle the patterns with suffix {I2, I3} should also be computed, but the conditional pattern base of {I2, I3} is empty, so that branch of the recursion ends. The final set of patterns with suffix I3 and support ≥ 2 is {I2 I3: 4, I1 I3: 4, I1 I2 I3: 2}.

 

Running the FP-growth algorithm to completion, the frequent patterns with support ≥ 2 are:

Item    Conditional pattern base           Conditional FP-tree        Frequent patterns generated
----    ------------------------           -------------------        ---------------------------
I5      {(I2 I1: 1), (I2 I1 I3: 1)}        <I2: 2, I1: 2>             I2 I5: 2, I1 I5: 2, I2 I1 I5: 2
I4      {(I2 I1: 1), (I2: 1)}              <I2: 2>                    I2 I4: 2
I3      {(I2 I1: 2), (I2: 2), (I1: 2)}     <I2: 4, I1: 2>, <I1: 2>    I2 I3: 4, I1 I3: 4, I2 I1 I3: 2
I1      {(I2: 4)}                          <I2: 4>                    I2 I1: 4
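These results can be reproduced with a compact pattern-growth sketch. Note that the sketch below recursively projects the transaction list instead of materializing an explicit FP-tree; it is a simplification, not the literal FP-growth algorithm, but it mines the same frequent itemsets by the same divide-and-conquer idea (all names here are illustrative):

```python
from collections import Counter

def mine(db, min_sup, suffix=()):
    """db is a list of (items, count) pairs; returns {itemset: support}
    for every frequent itemset extending `suffix`."""
    counts = Counter()
    for items, c in db:
        for i in items:
            counts[i] += c
    freq = sorted(i for i in counts if counts[i] >= min_sup)
    results = {}
    for idx, item in enumerate(freq):
        pattern = tuple(sorted(suffix + (item,)))
        results[pattern] = counts[item]
        # Project the database onto transactions containing `item`,
        # keeping only frequent items that sort after it (this avoids
        # enumerating the same itemset twice)
        later = set(freq[idx + 1:])
        proj = [(tuple(i for i in items if i in later), c)
                for items, c in db if item in items]
        results.update(mine([p for p in proj if p[0]], min_sup,
                            suffix + (item,)))
    return results

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
patterns = mine([(tuple(t), 1) for t in transactions], 2)
```

On the sample data this returns 13 frequent itemsets: the five frequent items, six frequent pairs, and the two frequent triples {I1, I2, I3} and {I1, I2, I5}, with the supports shown in the table above.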

The FP-growth algorithm is roughly an order of magnitude faster than the Apriori algorithm, and its space cost is likewise improved by about an order of magnitude. For truly massive data volumes, however, the time and space costs of FP-growth remain high; common mitigations include database partitioning and data sampling.
