FP-growth algorithm

The FP-growth algorithm is an association analysis algorithm proposed by Jiawei Han in 2000. It differs from the Apriori algorithm in two main respects:

First, it does not generate candidate sets; second, it only needs to traverse the database twice, which greatly improves efficiency. On a test of 31,646 records with a minimum support of 2, the Apriori algorithm took half an hour, while FP-growth needed only about 6 minutes; the difference in efficiency is obvious.

Its core is the FP-tree, a tree data structure whose key property is that identical elements are represented by the same node wherever possible, which greatly reduces space; the BIRCH algorithm uses a similar idea. Take the following data as an example.


Each line represents a transaction; there are 9 lines, i.e. 9 transactions. The left column is the transaction ID and the right column lists the items. With a minimum support of 22%, an item must appear at least 9 × 22% ≈ 2 times to be counted as frequent. Scan the database a first time, count the occurrences of each item, and sort the items by count in descending order. Then scan the database a second time and reorder the items of each transaction according to that order; if an item appears fewer than the threshold of 2 times, it is deleted from the transaction. This gives:
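As a rough sketch of these two preliminary scans (not code from the original post), the fragment below counts item occurrences and then filters and reorders each transaction; the function names count_items and sort_and_filter, and the representation of a transaction as a vector of integer item IDs, are assumptions for illustration.

#include <algorithm>
#include <map>
#include <vector>

// First scan: count the occurrences of each item across all transactions.
std::map<int, int> count_items(const std::vector<std::vector<int>> &db) {
    std::map<int, int> freq;
    for (const auto &t : db)
        for (int item : t) freq[item]++;
    return freq;
}

// Second scan: drop items below the minimum count and sort the remaining
// items of each transaction by descending global count.
std::vector<std::vector<int>> sort_and_filter(
        const std::vector<std::vector<int>> &db,
        const std::map<int, int> &freq, int min_count) {
    std::vector<std::vector<int>> result;
    for (auto t : db) {
        t.erase(std::remove_if(t.begin(), t.end(),
                               [&](int i) { return freq.at(i) < min_count; }),
                t.end());
        std::sort(t.begin(), t.end(), [&](int a, int b) {
            return freq.at(a) != freq.at(b) ? freq.at(a) > freq.at(b) : a < b;
        });
        if (!t.empty()) result.push_back(t);
    }
    return result;
}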


What remains is the construction of the FP-tree, which is the core of the algorithm. The structure of each tree node is as follows:

// Storage structure of an FP-tree node
typedef struct Csnode {
    int item;                                   // item (product) ID
    int count;                                  // occurrence count
    Csnode *parent, *firstchild, *nextsibling;  // parent, first child, next sibling
    // Predecessor and successor nodes of the same item, used to chain together
    // all nodes holding the same item; these two pointers stay empty for
    // direct children of the root
    Csnode *pre, *next;
} *cstree;
Here item, *firstchild and *nextsibling are the usual attributes of a tree structure. count records how many times the item appears, and *parent is set so that the tree can be walked back from a leaf toward the root. The purpose of *pre and *next is clear from the comments. The principle of tree construction is that each record is treated as a path from the root to a leaf: if an item on that path already exists in a node, the corresponding count is incremented by 1 (equivalently, all of its prefixes are incremented by 1); if an item in the record does not yet exist, a new branch is opened for it. The following shows how the FP-tree is constructed.
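As a minimal sketch of this insertion rule (not the author's original code), the fragment below inserts one already filtered and sorted transaction into a tree built from the Csnode structure above; the function name insert_transaction and the header array (the left-hand array of current pointers in the figures) are assumptions for illustration.

#include <vector>

// Sketch: insert one filtered, sorted transaction into the FP-tree.
// header[item] holds the current pointer of each item and must be sized so
// that every item ID is a valid index; it is used to chain nodes of the same
// item via pre/next.
void insert_transaction(cstree root, const std::vector<int> &items,
                        std::vector<Csnode *> &header) {
    Csnode *cur = root;
    for (int item : items) {
        // Look for an existing child that already holds this item.
        Csnode *child = cur->firstchild;
        while (child != nullptr && child->item != item)
            child = child->nextsibling;
        if (child != nullptr) {
            child->count++;              // shared prefix: just increase the counter
        } else {
            // Open a new branch for this item.
            child = new Csnode{item, 1, cur, nullptr, cur->firstchild,
                               nullptr, nullptr};
            cur->firstchild = child;
            // Direct children of the root are not linked into the item chains.
            if (cur != root) {
                child->pre = header[item];
                if (header[item] != nullptr) header[item]->next = child;
                header[item] = child;    // update the item's current pointer
            }
        }
        cur = child;
    }
}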

The third pass over the database constructs the FP-tree. The first record, I2,I1,I5, gives:


The parent pointers are not drawn, and the root is an empty node. "2:1" means that item 2 has appeared 1 time, and the other labels are read in the same way. The array on the left lists the items in descending order of count and stores the current pointer of each item, so that nodes with the same item can be found later when that item is used as a suffix; same-item nodes are connected with a single-arrow dashed line, although it is actually a doubly linked list. At this point the item-1 node and the item-5 node are saved as the current pointers of item 1 and item 5; the left-hand array also reserves current pointers for items 2, 3, and 4. Note that direct children of the root do not need to be linked; the reason is explained later. The second record, I2,I4, gives:


This record shares the prefix I2 with the first record, so the count of item 2 is incremented by 1, and item 4 becomes a new child of the item-2 node (the sibling pointer is not drawn). The current pointer of item 4 in the left-hand array now points to this new item-4 node. The third record, I2,I3, is handled similarly, and the result is:


The fourth record, I2,I1,I4:


When item 4 is added, the current pointer of item 4 is updated to point to the new item-4 node, and at this point the two red dashed lines connect the nodes whose suffix is item 4. The fifth record, I1,I3, gives:


Item 1 starts a different path as a new direct child of the root (previously only item 2 was a child of the root). The current pointer of item 3 is updated to point to the new item-3 node (the yellow dotted line in the figure). By now the general structure of the FP-tree is apparent. Adding the remaining records, the final FP-tree is:


This FP-tree stores occurrences of the same item in the same node as much as possible, which maximizes the space savings. The remaining job is to mine the FP-tree.
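To make the walkthrough concrete, here is a hypothetical driver that feeds in the five records listed so far (already filtered and ordered by the global counts of the full 9-transaction database, so the preliminary scans are not repeated) using the insert_transaction sketch above; all names are illustrative.

#include <vector>

int main() {
    std::vector<std::vector<int>> records = {
        {2, 1, 5},    // record 1: I2, I1, I5
        {2, 4},       // record 2: I2, I4
        {2, 3},       // record 3: I2, I3
        {2, 1, 4},    // record 4: I2, I1, I4
        {1, 3}};      // record 5: I1, I3
    Csnode root_node{};                        // empty root node
    std::vector<Csnode *> header(6, nullptr);  // current pointers, indexed by item ID 1..5
    for (const auto &t : records)
        insert_transaction(&root_node, t, header);
    return 0;
}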

The purpose of mining is to find the frequent itemsets along the paths of the FP-tree. There are two ways to do this: one is to traverse the tree from the root toward the leaves, the other from the leaves toward the root. The first way is quite troublesome; fortunately we set the *parent pointer, which makes the second way very convenient. We mine the tree starting from the items with the fewest occurrences, so we begin with item 5. Thanks to the *pre and *next pointers, all paths that end in an item-5 node are easy to find, and from each such node the *parent pointers lead back up toward the root (the root is empty and is not included). The conditional pattern base of I5 is: {(I2 I1:1), (I2 I1 I3:1)}. The trailing 1 is the number of times the items I2, I1, I5 occur together.

Now we can explain why direct children of the root are not connected with the *pre and *next pointers: if they were, such a node would have no prefix when used as a suffix, meaning its frequent itemset would have length 1, which in most cases is not interesting.

Next the conditional FP-tree is constructed. Note that each entry of the conditional pattern base is already sorted in this order, because the items were sorted by global count at the beginning. If one entry A of the conditional pattern base is a subset of another entry B, then the count of B is added to the count of A, because every occurrence of B is also an occurrence of A. The simplest, most direct way to implement this is to match the entries pairwise; if the conditional pattern base has n entries, the time complexity is O(n^2). If the entries are first sorted by increasing length, e.g. {(I2 I1:1), (I2 I1 I3:1)}, the sort costs n*log(n), and since only a shorter entry can be a subset of a longer one, the total number of comparisons is (n-1) + (n-2) + ... + 1 = n(n-1)/2. Adding the sorting time gives n*log(n) + n(n-1)/2, which is less than n^2 when n is greater than 4; in practice n is usually greater than 4.

In the end we obtain the frequent itemsets with I5 as the suffix: {I2 I5:2}, {I1 I5:2}, {I2 I1 I5:2}; each occurs at least the minimum support number of times. The frequent itemsets for the other suffixes are obtained in the same way.
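A hedged sketch of this step (not the author's code): the fragment below collects the conditional pattern base of one item by walking its *pre chain from the current pointer and following *parent up to the empty root; the function name collect_conditional_pattern_base is an assumption.

#include <algorithm>
#include <utility>
#include <vector>

// Sketch: collect the conditional pattern base of one item. `last` is that
// item's current pointer from the header array, i.e. the tail of its pre/next
// chain. For every node holding the item, follow *parent up to (but not
// including) the empty root and record the prefix path with the node's count.
std::vector<std::pair<std::vector<int>, int>>
collect_conditional_pattern_base(Csnode *last, cstree root) {
    std::vector<std::pair<std::vector<int>, int>> base;
    for (Csnode *node = last; node != nullptr; node = node->pre) {
        std::vector<int> prefix;
        for (Csnode *p = node->parent; p != root; p = p->parent)
            prefix.push_back(p->item);
        std::reverse(prefix.begin(), prefix.end());  // root-to-leaf order
        if (!prefix.empty())
            base.push_back({prefix, node->count});
    }
    return base;
}

Run on item I5 in the example above, this would return {(I2 I1):1, (I2 I1 I3):1}, matching the conditional pattern base given in the text.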

The FP-growth algorithm does not generate candidate sets and, as described here, only needs to traverse the database three times (two of these scans can be combined, which is where the usual figure of two passes comes from), a great improvement over Apriori. This actually fits the way the field developed: Apriori was proposed back in 1993, when data mining was just getting started; by 2000 the field had developed considerably, and FP-growth was invented standing on Apriori's shoulders. This kind of progression is common.

There are two common frequent-itemset mining algorithms: Apriori and FP-growth. Apriori mines frequent itemsets by repeatedly constructing and filtering candidate sets, which requires scanning the original data many times; when the data is large, disk I/O becomes excessive and efficiency is low. The FP-growth algorithm scans the original data only twice and compresses it with the FP-tree data structure, which is much more efficient.

Someone might ask: what if the database is so large that the FP-tree built from it cannot fit entirely in memory? This is indeed a real problem. Han Jiawei's paper also offers an idea: partition the original large database into several small databases (called projection databases) and run the FP-growth algorithm on each of them.
Taking the example above, we put all database records that contain p into a separate database, called the p-projection database; similarly, corresponding projection databases can be generated for m, b, a, c and f. The FP-tree built from a projection database is comparatively small and can fit entirely in memory.
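Following the post's description (whole records containing the item are copied; the function name projection_database is illustrative), a minimal sketch of building one projection database might look like this:

#include <algorithm>
#include <vector>

// Sketch: the p-projection database holds every transaction that contains item p.
std::vector<std::vector<int>> projection_database(
        const std::vector<std::vector<int>> &db, int p) {
    std::vector<std::vector<int>> proj;
    for (const auto &t : db)
        if (std::find(t.begin(), t.end(), p) != t.end())
            proj.push_back(t);
    return proj;
}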
In modern data mining tasks the data volume keeps growing, so the demand for parallelization keeps growing too, and the problem raised above becomes more and more pressing. In the next blog post, the blogger will analyze how FP-growth is parallelized in the MapReduce framework.
[1] Jiawei Han, Jian Pei, Yiwen Yin. Mining Frequent Patterns without Candidate Generation. SIGMOD 2000.
