FP-Growth Algorithm for Association Rules

Source: Internet
Author: User

The Apriori algorithm exploits two properties of frequent itemsets to prune many irrelevant candidates, which improves efficiency considerably. However, Apriori is a candidate generate-and-test algorithm: every round of candidate elimination requires a full scan of the data records, which leaves it powerless on large data sets. Today we introduce a different algorithm for mining frequent itemsets whose efficiency is much higher than Apriori's.

The FP-Growth algorithm compresses the data records by building a tree structure, so mining frequent itemsets requires only two scans of the data, and the algorithm generates no candidate sets at all, which makes it much more efficient. We will again use the data set from the previous article:

TID Items
T1 {milk, bread}
T2 {bread, diapers, beer, eggs}
T3 {milk, diapers, beer, coke}
T4 {bread, milk, diapers, beer}
T5 {bread, milk, diapers, coke}

First, constructing the FP-tree

The FP-tree is a tree structure, defined as follows:

public class FpNode {
    String idName;         // item name (ID)
    List<FpNode> children; // child nodes
    FpNode parent;         // parent node
    FpNode next;           // next node with the same item name
    long count;            // number of occurrences
}

Each node of the tree represents one item. Rather than dwelling on the structure definition, let us walk through the construction of an FP-tree; the structure will become clear along the way. Assume a minimum absolute support of 3.

  Step 1: scan the data records once, find the frequent 1-itemsets, and sort them by occurrence count in descending order:

Item Count
milk 4
bread 4
diapers 4
beer 3

Note that eggs and coke do not appear in the table: coke occurs only 2 times and eggs only once, below the minimum support, so they are not frequent. By the Apriori property, any superset of a non-frequent itemset is also non-frequent, so coke and eggs need not be considered further.
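Step 1 can be sketched in Java (the class and method names here are illustrative, not from the article's actual code):

```java
import java.util.*;

public class ItemCounter {
    // Count item occurrences across all transactions, drop items below
    // minSupport, and return the survivors sorted by descending count.
    public static LinkedHashMap<String, Long> frequentItems(
            List<List<String>> transactions, long minSupport) {
        Map<String, Long> counts = new HashMap<>();
        for (List<String> t : transactions)
            for (String item : t)
                counts.merge(item, 1L, Long::sum);
        LinkedHashMap<String, Long> result = new LinkedHashMap<>();
        counts.entrySet().stream()
              .filter(e -> e.getValue() >= minSupport)
              .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
              .forEach(e -> result.put(e.getKey(), e.getValue()));
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> data = Arrays.asList(
            Arrays.asList("milk", "bread"),
            Arrays.asList("bread", "diapers", "beer", "eggs"),
            Arrays.asList("milk", "diapers", "beer", "coke"),
            Arrays.asList("bread", "milk", "diapers", "beer"),
            Arrays.asList("bread", "milk", "diapers", "coke"));
        System.out.println(frequentItems(data, 3));
    }
}
```

Items tied on count (milk, bread, diapers) may appear in any relative order; any fixed tie-breaking rule works as long as it is applied consistently when sorting the records later.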

 Step 2: scan the data records a second time. For each record, keep only the items that appear in the table from Step 1, sorted in the table's order. Initially, create a root node, marked as null.

  1) First record: {milk, bread}. After filtering and sorting by the Step 1 table it is still {milk, bread}. Create a new node with idName {milk}, insert it under the root node and set its count to 1, then create a new {bread} node and insert it under the {milk} node. The tree after this insertion is shown below.

  2) Second record: {bread, diapers, beer, eggs}; after filtering and sorting: {bread, diapers, beer}. The root node has no {bread} child (it has a {bread} grandchild, but not a child), so create a {bread} node and insert it under the root, which now has two children. Then insert a new {diapers} node under {bread}, and a new {beer} node under {diapers}, as shown below.

  3) Third record: {milk, diapers, beer, coke}; after filtering and sorting: {milk, diapers, beer}. This time the root already has a {milk} child, so no new node is needed: just increment the {milk} node's count by 1. Going down, {milk} has no {diapers} child, so create a new {diapers} node and insert it under {milk}, then insert a new {beer} node under {diapers}, as shown below.

  4) Fourth record: {bread, milk, diapers, beer}; after filtering and sorting: {milk, bread, diapers, beer}. The root already has a {milk} child, so increment that node's count by 1. Going down, {milk} already has a {bread} child, so increment its count as well. This {bread} node has no children yet, so create a new {diapers} node under it, and then a new {beer} node under {diapers}, as shown below.

  

  5) Fifth record: {bread, milk, diapers, coke}; after filtering and sorting: {milk, bread, diapers}. The root has a {milk} child, {milk} has a {bread} child, and {bread} has a {diapers} child, so this insertion creates no new nodes and only updates counts, as shown below.

  

Following the above steps, we have essentially constructed the FP-tree (Frequent Pattern tree). Every path in the tree represents an itemset. Because many itemsets share common items, and the more frequent an item is the more likely it is to be a shared ancestor, inserting items in descending order of frequency saves space and achieves compressed storage. In addition, we need a header table, with all nodes sharing the same idName threaded together as a linked chain for later use. These chains are built during construction, but were omitted above to keep the walkthrough simple; they are reflected in the code. The FP-tree with the header table and node chains added is shown below.
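The insertion steps above, together with the header-table threading, can be sketched as follows. This is a minimal version: it stores children in a map for fast lookup (a design choice; the article's class uses a list), and the `count` parameter lets one call stand in for several identical records:

```java
import java.util.*;

// Minimal FP-tree with insertion and header-table threading.
// Names (FpTree, FpNode, insert) are illustrative, not the article's code.
public class FpTree {
    static class FpNode {
        String idName;      // item name; null for the root
        long count;         // occurrences along this path
        FpNode parent;
        Map<String, FpNode> children = new HashMap<>();
        FpNode next;        // next node with the same item name

        FpNode(String idName, FpNode parent) {
            this.idName = idName;
            this.parent = parent;
        }
    }

    FpNode root = new FpNode(null, null);         // root marked as null
    Map<String, FpNode> header = new HashMap<>(); // head of each item's chain

    // Insert one filtered-and-sorted record, adding `count` at each node.
    void insert(List<String> sortedItems, long count) {
        FpNode cur = root;
        for (String item : sortedItems) {
            FpNode child = cur.children.get(item);
            if (child == null) {
                child = new FpNode(item, cur);
                cur.children.put(item, child);
                // thread the new node onto the front of the item's chain
                child.next = header.get(item);
                header.put(item, child);
            }
            child.count += count;
            cur = child;
        }
    }

    public static void main(String[] args) {
        FpTree t = new FpTree();
        t.insert(Arrays.asList("milk", "bread"), 1);
        t.insert(Arrays.asList("bread", "diapers", "beer"), 1);
        t.insert(Arrays.asList("milk", "diapers", "beer"), 1);
        t.insert(Arrays.asList("milk", "bread", "diapers", "beer"), 1);
        t.insert(Arrays.asList("milk", "bread", "diapers"), 1);
        System.out.println(t.root.children.keySet()); // two children of root
    }
}
```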

  At this point the entire FP-tree has been constructed; in the mining below we will see the role of the header table and node chains.

Second, mining frequent itemsets with the FP-tree

Once the FP-tree is built, we can mine the frequent itemsets. The mining procedure is called the FP-Growth (Frequent Pattern Growth) algorithm, and it starts from the last item in the header table.

  1) Start with {beer}. Follow {beer}'s node chain to find all {beer} nodes, then take the branch ending at each {beer} node: {milk, bread, diapers, beer: 1}, {milk, diapers, beer: 1}, {bread, diapers, beer: 1}, where ": 1" means the branch occurs once. Note that although {milk} occurs 4 times, {milk, bread, diapers, beer} occurs only once: a branch's count is determined by its suffix node {beer}. Removing {beer}, we get the corresponding prefix paths {milk, bread, diapers: 1}, {milk, diapers: 1}, {bread, diapers: 1}. From these prefix paths we can build a conditional FP-tree, constructed exactly as before, with the data records now being:

TID Items
T1 {milk, bread, diapers}
T2 {milk, diapers}
T3 {bread, diapers}

The absolute support threshold is still 3, and the resulting conditional FP-tree is shown below.

Once the conditional tree is built, it is mined recursively. When the conditional tree consists of a single path, every combination of the items on that path is a conditional frequent set. If {beer}'s conditional frequent sets are {S1, S2, S3}, then {beer}'s frequent sets are {S1+{beer}, S2+{beer}, S3+{beer}}; that is, every frequent set of {beer} shares the same suffix {beer}. Here the conditional frequent sets are {{}, {diapers}}, so {beer}'s frequent sets are {{beer}, {diapers, beer}}.
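The single-path case can be sketched directly: enumerate every subset of the path's items and append the suffix (class and method names here are illustrative):

```java
import java.util.*;

public class PathCombos {
    // For a single-path conditional tree, every subset of the path's items,
    // with the suffix appended, is a frequent itemset.
    public static List<List<String>> frequentSets(List<String> path, String suffix) {
        List<List<String>> result = new ArrayList<>();
        int n = path.size();
        for (int mask = 0; mask < (1 << n); mask++) {   // each bitmask = one subset
            List<String> set = new ArrayList<>();
            for (int i = 0; i < n; i++)
                if ((mask & (1 << i)) != 0) set.add(path.get(i));
            set.add(suffix);
            result.add(set);
        }
        return result;
    }

    public static void main(String[] args) {
        // the {beer} conditional tree's single path holds only {diapers}
        System.out.println(frequentSets(Arrays.asList("diapers"), "beer"));
        // → [[beer], [diapers, beer]]
    }
}
```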

  2) Next take the second-to-last item in the header table, {diapers}. Following its node chain, the prefix paths of {diapers} are: {bread: 1}, {milk: 1}, {milk, bread: 2}, so the data set for the conditional FP-tree is:

TID Items
T1 {bread}
T2 {milk}
T3 {milk, bread}
T4 {milk, bread}

Note {milk, bread: 2}: the count of {milk, bread} is 2, so {milk, bread} is repeated twice here. Doing so lets us reuse the FP-tree construction algorithm from before unchanged, but it lowers efficiency. Imagine {milk, bread} with a count of 20000: it would expand into 20,000 records and require 20,000 count updates, when in fact a single update adding 20000 suffices. This is an implementation-level optimization worth attending to in practice. The conditional FP-tree constructed is shown below.
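A toy illustration of that optimization (the method names are made up for this sketch): adding a prefix path's count in one update yields exactly the same totals as expanding it into `count` separate records.

```java
import java.util.*;

public class WeightedInsert {
    // Expand {path: count} into `count` records, updating one at a time.
    static Map<String, Long> expandedUpdate(String path, long count) {
        Map<String, Long> m = new HashMap<>();
        for (long i = 0; i < count; i++)
            m.merge(path, 1L, Long::sum);   // count separate updates
        return m;
    }

    // Add the whole count in a single update instead.
    static Map<String, Long> weightedUpdate(String path, long count) {
        Map<String, Long> m = new HashMap<>();
        m.merge(path, count, Long::sum);    // one update
        return m;
    }

    public static void main(String[] args) {
        System.out.println(expandedUpdate("milk,bread", 20000)
                .equals(weightedUpdate("milk,bread", 20000)));  // true
    }
}
```

The same idea applies to the tree itself: an `insert` that takes a count parameter adds the weight once per node on the path instead of replaying the record count times.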


Mining this conditional tree recursively yields the conditional frequent sets {{}, {milk}, {bread}} (note that {milk, bread} occurs only twice in the conditional data, below the support threshold of 3, so it is not frequent). Appending {diapers} gives the frequent itemsets {{diapers}, {milk, diapers}, {bread, diapers}}. All of these share the same suffix {diapers} and none contains {beer}, so this group does not overlap with the previous one.

Repeating the steps above for every item in the header table yields the complete set of frequent itemsets. It can be proved (see reference [1] for the rigorous algorithm and proof) that the frequent itemsets are neither duplicated nor omitted.

The implementation code is on my GitHub; here are the results of a run:

Absolute support: 3
Frequent itemsets:
bread diapers    3
diapers milk     3
milk             4
bread milk       3
diapers beer     3
bread            4

I also downloaded a larger market-basket data set to test FP-Growth, and its efficiency held up well. On average FP-Growth is much more efficient than Apriori, but high efficiency is not guaranteed: it depends on the data set. When the frequent itemsets in the data share no common items, all itemsets hang directly off the root node, no compression is achieved, and the FP-tree itself adds overhead and requires more storage. Before using FP-Growth, analyze the data to see whether the algorithm is a good fit.

Reference [1]: http://www.cnblogs.com/fengfenggirl/p/associate_fpgowth.html
