Association Rule Mining Algorithm FP-tree: Mining Frequent Patterns Without Generating Candidate Sets


The previous blog post described the idea behind the Apriori algorithm and a Java implementation: http://blog.csdn.net/u010498696/article/details/45641719. Apriori is a classical association rule algorithm, but as mentioned in that post, it has two fatal performance bottlenecks: self-joining frequent itemsets to generate candidate sets may produce a huge number of candidates, and deriving the frequent itemsets from those candidates requires scanning the database repeatedly.

In 2000, Han et al. proposed the FP-tree algorithm, which effectively solves both problems. It scans the database only twice and uses no candidate sets: it builds a frequent pattern tree (FP-tree) that compresses all the information the database carries about frequent patterns, and finally generates association rules from this tree.

The FP-tree algorithm has two main steps: 1) construct the FP-tree from the data in the database; 2) mine frequent patterns from the FP-tree. The algorithm is described below, followed by a worked example.

--------------------------------------------------------------------------------------------------------------------------------------------------------

Algorithm: Fp-tree Mining frequent patterns

Input: Transaction database D; Minimum support threshold min_sup.

Output: Frequent patterns

Method: 1. Construct the FP-tree

(a) Scan the database once to obtain the set of frequent 1-itemsets F1, and sort F1 in descending order of support. (Apart from the sorting, this step is exactly the same as the way Apriori obtains frequent 1-itemsets. A Java sketch of this scan-and-sort step is given after the algorithm descriptions below.)

(b) Create the root node of the FP-tree, labeled "root". For each transaction in the transaction database D, do the following:

Select the frequent items in the transaction and sort them by descending support. Let the sorted list of frequent items be [p|P], where p is the first element and P is the list of the remaining elements. Call Insert_tree([p|P], root) to insert p into the tree, and repeat until the list of frequent items is empty.

2. Mine the FP-tree

Procedure FP_growth(Tree, a)

If Tree contains a single path P then

For each combination of the nodes in path P (denote the combination b) do

Generate pattern b ∪ a with support = the minimum support of the nodes in b;

Else for each item ai in the header table of the FP-tree do

Generate pattern b = ai ∪ a with support = ai.support;

Construct b's conditional pattern base, and from it construct b's conditional FP-tree Tree_b;

If Tree_b is not empty then call FP_growth(Tree_b, b);
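To make the single-path case concrete, here is a small Java sketch of just that branch (my own illustration, not code from the post; it assumes Java 16+ for the record syntax). Given the nodes of a single path as (item, count) pairs, it enumerates every non-empty combination b and assigns it the minimum count among the chosen nodes; in the full procedure each such combination would additionally be unioned with the current suffix a. All names here are hypothetical.

import java.util.*;

public class SinglePathPatterns {

    // One node on the single path: its item name and its count in the (conditional) FP-tree.
    record PathNode(String item, int count) {}

    // Enumerate every non-empty combination b of the path's nodes; the support of b is the
    // minimum count among the nodes it contains.
    static Map<Set<String>, Integer> enumerateSinglePath(List<PathNode> path) {
        Map<Set<String>, Integer> patterns = new LinkedHashMap<>();
        int n = path.size();
        for (int mask = 1; mask < (1 << n); mask++) {        // every bitmask = one combination
            Set<String> items = new TreeSet<>();
            int support = Integer.MAX_VALUE;
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    items.add(path.get(i).item());
                    support = Math.min(support, path.get(i).count());
                }
            }
            patterns.put(items, support);
        }
        return patterns;
    }

    public static void main(String[] args) {
        // A made-up single path a:4 -> f:3 -> g:3, purely for illustration.
        List<PathNode> path = List.of(
                new PathNode("a", 4), new PathNode("f", 3), new PathNode("g", 3));
        enumerateSinglePath(path).forEach((items, sup) ->
                System.out.println(items + "  support = " + sup));
    }
}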

--------------------------------------------------------------------------------------------------------------------------------------------------------

Algorithm: Insert_tree([p|P], root) — insert the frequent item list [p|P] into the frequent pattern tree rooted at root

Input: The frequent item list [p|P] to insert; the root of the FP-tree.

Output: The updated FP-tree rooted at root.

Method: if root has a child node child such that child.name = p.name

then child.count++;   // increase the support count of child by 1

else

create a new node child with count = 1, link it to its parent node root, and link it (via the node-link) to the existing nodes carrying the same item name;

in either case, if P is not empty, call Insert_tree(P, child) recursively.

--------------------------------------------------------------------------------------------------------------------------------------------------------
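To make step 1(a) and the per-transaction ordering of step 1(b) concrete, the following is a minimal Java sketch, written for this post rather than taken from it: one pass counts the support of every item, items with support >= min_sup are kept, and each transaction's frequent items are reordered by descending support so they can be handed to Insert_tree. The class name FirstScan and both method names are hypothetical.

import java.util.*;
import java.util.stream.Collectors;

public class FirstScan {

    // Scan 1: count how many transactions contain each item (its support count).
    static Map<String, Integer> countSupports(List<List<String>> transactions) {
        Map<String, Integer> support = new HashMap<>();
        for (List<String> t : transactions) {
            for (String item : new HashSet<>(t)) {   // count each item at most once per transaction
                support.merge(item, 1, Integer::sum);
            }
        }
        return support;
    }

    // Keep only the frequent items of one transaction and sort them by descending support
    // (ties broken alphabetically), which is the order Insert_tree expects.
    static List<String> sortedFrequentItems(List<String> transaction,
                                            Map<String, Integer> support, int minSup) {
        return transaction.stream()
                .distinct()
                .filter(item -> support.getOrDefault(item, 0) >= minSup)
                .sorted(Comparator.comparingInt((String item) -> -support.get(item))
                        .thenComparing(Comparator.naturalOrder()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical transactions (the post's real example table is shown only as an image);
        // min_sup = 1 here just so every item survives in this tiny illustration.
        List<List<String>> db = List.of(
                List.of("a", "b", "d", "e", "f", "g"),
                List.of("a", "f", "g"));
        Map<String, Integer> support = countSupports(db);
        db.forEach(t -> System.out.println(sortedFrequentItems(t, support, 1)));
    }
}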

Clearly the FP-tree algorithm needs to scan the database only twice: the first scan generates the frequent 1-itemsets, and the second scan uses those frequent 1-itemsets to fold each transaction's item set, together with its association and frequency information, into the FP-tree.

When the frequent pattern tree is constructed, items with higher frequency are always placed closer to the root, which keeps the tree compact. During mining, processing starts from the last item in the header table, i.e. the least frequent item among the frequent 1-itemsets, and works upward until the first item is reached.

Advantages of the FP-tree structure: 1) Completeness: it preserves all the information needed for frequent pattern mining and never breaks a long pattern of any transaction.

2) Compactness: it removes irrelevant information (infrequent items are discarded), so the tree is never larger than the original database.

In a Java implementation, the node class is defined as follows:

import java.util.List;

public class TreeNode {
    private String name;              // item name of this node
    private int count;                // support count
    private TreeNode parent;          // parent node
    private List<TreeNode> children;  // child nodes
    private TreeNode nexthomonym;     // next node in the tree with the same item name (node-link)
    // methods omitted
    ...
}
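Based on this class, the Insert_tree step might be implemented roughly as follows. This is only a sketch under a few assumptions of my own: the method is written as if it lived inside TreeNode (so it can touch the private fields directly), children is assumed to be initialized, a default constructor is assumed, and a header table map is used to maintain the nexthomonym links. The name insertTransaction and the headerTable parameter are not from the original post.

// Sketch: insert one transaction's sorted frequent item list into the subtree rooted at
// this node, and keep the header table's nexthomonym chains up to date.
public void insertTransaction(java.util.List<String> sortedItems,
                              java.util.Map<String, TreeNode> headerTable) {
    if (sortedItems.isEmpty()) return;
    String first = sortedItems.get(0);
    TreeNode child = null;
    for (TreeNode c : children) {                 // look for an existing child with the same name
        if (c.name.equals(first)) { child = c; break; }
    }
    if (child != null) {
        child.count++;                            // reuse the node: just raise its support count
    } else {
        child = new TreeNode();                   // otherwise create a new node with count 1
        child.name = first;
        child.count = 1;
        child.parent = this;
        child.children = new java.util.ArrayList<>();
        children.add(child);
        // link the new node into the chain of nodes carrying the same item name
        TreeNode head = headerTable.get(first);
        if (head == null) {
            headerTable.put(first, child);
        } else {
            while (head.nexthomonym != null) head = head.nexthomonym;
            head.nexthomonym = child;
        }
    }
    // recurse on the rest of the item list ([p|P] -> P), as in Insert_tree
    child.insertTransaction(sortedItems.subList(1, sortedItems.size()), headerTable);
}

A driver would compute each transaction's sorted frequent item list first (as in the scan sketch above) and then call insertTransaction on the root node once per transaction.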

The following worked example uses the data in the database shown below:


The first scan of the database yields the frequent 1-itemsets, sorted by support:


Next the FP-tree is constructed, starting with the first row and adding each transaction's frequent items to the tree.

The tree after the transaction with TID=100 has been inserted: node A is added first, followed by B, D, E, F, and G.


After the first row of data is added, one branch of the tree has been formed. The second row, a, f, g, is added next. Since node A already exists, no new node is created for it; its support count is simply incremented from 1 to 2.

When F is added, node A has no child named F, so a new F node with support 1 is created; G is then added as a child of this F node. After the insertion the tree looks as follows:


Continuing in this way, the remaining transactions are added to the tree, producing the following FP-tree:




The five transactions form a total of four branches, and each entry in the header table links together all tree nodes that carry the same item name. The mining process is as follows:

First, starting from the least frequent item, G, find the conditional pattern base of G (made up of the prefix paths in the FP-tree from the root down to, and ending at, the G nodes).

As shown, there are three paths ending with G. Computing the support of each node along them, only node A has support greater than or equal to 3 (the minimum support threshold).

So the conditional pattern base of node G reduces to just node A.


For node F, the conditional pattern base turns out to be empty (null):


The conditional pattern bases of nodes E, D, B, and A are found in the same way and are summarized below (a Java sketch of how a conditional pattern base can be collected from the tree follows the result list):


Finally, the frequent patterns obtained are:

<a, b> with support 3

<b, d> with support 3

<b, e> with support 3

<a, g> with support 3
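To show how a conditional pattern base can be read off the tree, here is one more minimal sketch under the same assumptions as before (written as a static method inside TreeNode so it can read the private fields; the name collectConditionalPatternBase is hypothetical). It follows an item's nexthomonym chain from the header table and, for each occurrence, climbs the parent pointers to collect the prefix path, which enters the pattern base with that occurrence's count.

// Sketch: for one item, collect its conditional pattern base from the tree.
// Each map entry is a prefix path (from just below the root down to the item's parent)
// together with the count of the item occurrence at the end of that path.
// The root node is recognized by parent == null and is never included in a prefix.
public static java.util.Map<java.util.List<String>, Integer>
        collectConditionalPatternBase(String item, java.util.Map<String, TreeNode> headerTable) {
    java.util.Map<java.util.List<String>, Integer> base = new java.util.LinkedHashMap<>();
    // follow the chain of nodes that share this item name
    for (TreeNode node = headerTable.get(item); node != null; node = node.nexthomonym) {
        java.util.List<String> prefix = new java.util.ArrayList<>();
        // climb parent pointers up to (but not including) the root
        for (TreeNode p = node.parent; p != null && p.parent != null; p = p.parent) {
            prefix.add(0, p.name);                 // keep root-to-leaf order
        }
        if (!prefix.isEmpty()) {
            base.merge(prefix, node.count, Integer::sum);   // identical prefixes: add the counts
        }
    }
    return base;
}

For the example above, calling this for item g would return the prefix paths ending at the G nodes together with their counts; only items whose accumulated count reaches the minimum support (here only A) survive into G's conditional FP-tree.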

Summary: The FP-tree method converts the problem of finding long frequent patterns into recursively finding shorter patterns and then concatenating them with a suffix. It uses the least frequent items as suffixes, which offers good selectivity, and this greatly reduces the search cost.

Its main problems are:

1) When mining frequent patterns, conditional FP-trees must be generated recursively; every frequent pattern produced gives rise to a conditional FP-tree.

2) When the support threshold is small, even a very small database can yield on the order of 100,000 frequent patterns, and dynamically generating and releasing that many conditional FP-trees consumes a great deal of time and space.

3) In addition, the FP-tree and the conditional FP-trees are generated top-down, while frequent pattern mining proceeds bottom-up, and the conditional FP-trees are built recursively, so the FP-tree and conditional FP-trees must be traversable in both directions. This requires extra pointers in the FP-tree and conditional FP-tree nodes, and hence more memory to maintain them.

A Java implementation of the FP-tree algorithm can be found in this blog post: http://www.cnblogs.com/zhangchaoyang/articles/2198946.html

Reference: Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber.

