Using the FP-Growth Algorithm to Efficiently Discover Frequent Itemsets

Preface
You have probably used this search-engine feature: type a word or part of a word, and the engine automatically completes the query. Users will often follow these suggestions even when they did not know the recommended phrase existed; typing "why" into Baidu, for example, surfaces absurd suggestions such as "Why can't my body-swapping machine swap back?". To generate these suggested queries, search-engine researchers look for pairs of words that frequently appear together in text on the Internet, and that requires an efficient method for discovering frequent sets. The algorithm introduced in this article, FP-growth, is such a method. It is built on ideas from Apriori but accomplishes the same task with different techniques: instead of Apriori's generate-and-test approach, FP-growth stores the dataset in a special structure called an FP-tree (a tree encoding the sets of items that frequently occur together) and then discovers the frequent itemsets or frequent item pairs from that structure. As a result it runs faster than Apriori, often by about two orders of magnitude.
The FP-Tree Representation
An FP-tree is a compressed representation of the input data. It is built by reading the transactions one by one and mapping each transaction onto a path in the tree. Because different transactions can share several of the same items, their paths may partially overlap, and the more the paths overlap, the better the compression achieved by the FP-tree structure. If the resulting FP-tree is small enough to fit in memory, the frequent itemsets can be extracted directly from the in-memory structure instead of repeatedly scanning the data stored on disk.
The figure shows a dataset containing 10 transactions and 5 items. (A transaction can be understood intuitively as one customer's purchase record at a supermarket; the algorithm discovers which items, or sets of items, customers frequently buy together.)
The figure also draws the structure of the FP-tree after the first three transactions have been read, as well as the finished tree. Initially the FP-tree contains only a root node, labeled null. The tree is then extended as follows:
- Scan the data once to determine the support count of each item, discard the infrequent items, and sort the frequent items in descending order of support. For the dataset given above, a is the most frequent item, followed by b, c, d, and e.
- The algorithm scans the data a second time to build the FP-tree. Reading the first transaction, {a,b}, creates the nodes labeled a and b and forms the path null->a->b, which encodes the transaction; every node on this path has a frequency count of 1.
- Reading the second transaction, {b,c,d}, creates new nodes for the items b, c, and d and links them into the path null->b->c->d, which represents that transaction; each node on the path again has a frequency count of 1. Although the two transactions share the item b, their paths do not intersect, because the transactions have no common prefix.
- The third transaction, {a,c,d,e}, shares the prefix item a with the first transaction, so its path null->a->c->d->e overlaps with the first transaction's path null->a->b. Because the paths overlap, the frequency count of node a increases to 2, while the newly created nodes c, d, and e each get a count of 1.
- The process continues until every transaction has been mapped onto a path in the FP-tree; once all transactions have been read in, the tree is complete.
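The two-pass construction described above can be sketched in Python. Note that the 10-transaction dataset below is reconstructed from the transactions the text quotes ({a,b}, {b,c,d}, {a,c,d,e}, ...), so treat the full table as an assumption; the node-link chains threaded through `header` correspond to the dashed pointers discussed later.

```python
from collections import Counter

MINSUP = 2

# Assumed 10-transaction toy dataset, consistent with the transactions
# quoted in the text; your figure's exact data may differ.
TRANSACTIONS = [
    {'a', 'b'}, {'b', 'c', 'd'}, {'a', 'c', 'd', 'e'}, {'a', 'd', 'e'},
    {'a', 'b', 'c'}, {'a', 'b', 'c', 'd'}, {'a'}, {'a', 'b', 'c'},
    {'a', 'b', 'd'}, {'b', 'c', 'e'},
]

class Node:
    """One FP-tree node: an item, its count, child links, and a node-link
    to the next node carrying the same item (the dashed pointers)."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}
        self.link = None

def build_fp_tree(transactions, minsup=MINSUP):
    # Pass 1: count supports and keep only the frequent items.
    support = Counter(i for t in transactions for i in t)
    frequent = {i for i, c in support.items() if c >= minsup}
    root = Node(None, None)   # the null root
    header = {}               # item -> head of its node-link chain
    # Pass 2: insert each transaction, items in descending-support order.
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-support[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:           # new path segment: create node
                child = Node(item, node)
                node.children[item] = child
                child.link, header[item] = header.get(item), child
            else:                       # shared prefix: just bump the count
                child.count += 1
            node = child
    return root, header, support

root, header, support = build_fp_tree(TRANSACTIONS)
```

Following e's node-link chain from `header['e']` and summing the counts gives e's total support without traversing the whole tree.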
In general, an FP-tree is smaller than the uncompressed data, because the transactions in market-basket data often share common items. In the best case, all transactions contain the same set of items and the FP-tree collapses into a single path of nodes. The worst case occurs when every transaction has a unique set of items: since no items are shared, the FP-tree is essentially as large as the original data, and its storage requirement is actually higher, because each node needs extra space for the pointers between nodes and the counts.
The FP-tree also maintains, for each item, a list of pointers connecting the nodes that carry that item (drawn as dashed lines in the figure). These node links make it fast to access all occurrences of an item in the tree.
Frequent Itemset Generation in the FP-Growth Algorithm
FP-growth generates frequent itemsets by exploring the FP-tree bottom-up. Given the tree built above, the algorithm first looks for the frequent itemsets ending in e, then those ending in d, c, and b, and finally those ending in a. Since every transaction is mapped to a path in the FP-tree, the frequent itemsets ending in a particular item, say e, can be found by examining only the paths that contain node e, and those paths can be accessed quickly through the node links associated with e. The extracted paths are shown in the figure; how they are processed to obtain frequent itemsets is explained in detail below.
The diagram above illustrates how the problem of finding frequent itemsets is decomposed into sub-problems, each of which involves discovering the frequent itemsets ending in e, d, c, b, or a.
After the frequent itemsets ending in e have been found, the algorithm moves on to the itemsets ending in d by processing the paths associated with node d, and it continues in this way until the paths associated with c, b, and a have all been processed. The diagram above shows the paths for these items, and the corresponding frequent itemsets are summarized in the table below.
FP-growth uses a divide-and-conquer strategy: it splits the problem into smaller sub-problems, each of which discovers all frequent itemsets ending in a particular suffix. Suppose, for example, that you want to discover all frequent itemsets ending in e. First check whether the itemset {e} is itself frequent; if it is, consider the sub-problem of finding the frequent itemsets ending in de, followed by those ending in ce, be, and ae. Each sub-problem can in turn be divided into smaller sub-problems, and by merging their results we obtain all frequent itemsets ending in e. This divide-and-conquer scheme is the key strategy of the FP-growth algorithm.
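A minimal sketch of this divide-and-conquer strategy, using plain transaction sets rather than a compressed tree (the 10-transaction dataset and the global item order are assumptions reconstructed from the text): each sub-problem restricts the data to the transactions containing the suffix item, keeping only items ranked above it so no itemset is generated twice.

```python
from collections import Counter

MINSUP = 2
ORDER = ['a', 'b', 'c', 'd', 'e']   # items in descending support
RANK = {item: r for r, item in enumerate(ORDER)}

# Assumed 10-transaction toy dataset, consistent with the transactions
# quoted in the text.
TRANSACTIONS = [
    {'a', 'b'}, {'b', 'c', 'd'}, {'a', 'c', 'd', 'e'}, {'a', 'd', 'e'},
    {'a', 'b', 'c'}, {'a', 'b', 'c', 'd'}, {'a'}, {'a', 'b', 'c'},
    {'a', 'b', 'd'}, {'b', 'c', 'e'},
]

def mine(transactions, suffix=()):
    """For each frequent item i, report {i} + suffix, then recurse on
    the transactions containing i, restricted to items ranked above i."""
    counts = Counter(i for t in transactions for i in t)
    results = []
    for item in sorted(counts, key=RANK.get, reverse=True):  # e, d, c, b, a
        if counts[item] < MINSUP:
            continue                 # prune, e.g. b inside e's sub-problem
        itemset = (item,) + suffix
        results.append((itemset, counts[item]))
        # Sub-problem data: transactions with `item`, cut down to the
        # items that precede it in the global order.
        conditional = [{i for i in t if RANK[i] < RANK[item]}
                       for t in transactions if item in t]
        results.extend(mine([t for t in conditional if t], itemset))
    return results

frequent_itemsets = mine(TRANSACTIONS)
```

Because each sub-problem only ever extends an itemset with higher-ranked items, the sub-problems are disjoint and every frequent itemset is produced exactly once.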
To explain more concretely how these sub-problems are solved, consider the task of discovering all frequent itemsets ending in e.
- The first step is to collect all paths that contain node e; these are called the prefix paths of e and are shown in figure (a).
- From the prefix paths shown in figure (a), the support count of e is obtained by summing the counts associated with the e nodes. Assuming the minimum support is 2, {e} is a frequent itemset because its support count is 3.
Because {e} is frequent, the algorithm must solve the sub-problems of finding the frequent itemsets ending in de, ce, be, and ae. Before solving these sub-problems, the prefix paths must first be converted into a conditional FP-tree. A conditional FP-tree is structurally similar to an FP-tree, except that it is used to find the frequent itemsets ending in a particular suffix. It is obtained through the following steps:
- First, the support counts along the prefix paths must be updated, because some of the counts include transactions that do not contain item e. For example, the rightmost path, null->b:2->c:2->e:1, includes the transaction {b,c}, which does not contain e; the counts along this prefix path must therefore be adjusted to 1 to reflect the actual number of transactions containing {b,c,e}.
- Next, the prefix paths are truncated by deleting the nodes for e. These nodes can be removed because the support counts along the prefix paths have already been updated to reflect only the transactions containing e, and the sub-problems of finding the frequent itemsets ending in de, ce, be, and ae no longer need any information about node e.
- After the support counts along the prefix paths are updated, some items may no longer be frequent. For example, node b now appears only once, with a support count of 1, meaning that only one transaction contains both b and e. Since every itemset ending in be must therefore be infrequent, b can safely be ignored in the subsequent analysis.
- The conditional FP-tree of e is shown in figure (b). It looks different from the original prefix paths because the frequency counts have been updated and the nodes b and e have been deleted.
- FP-growth then uses the conditional FP-tree of e to solve the sub-problems of finding the frequent itemsets ending in de, ce, be, and ae. To find the frequent itemsets ending in de, it gathers all the prefix paths of d from e's conditional FP-tree and sums the frequency counts associated with the d nodes to obtain the support count of the itemset {d,e}. Since that count equals 2, {d,e} is frequent, and the algorithm constructs the conditional FP-tree of de using the same method as in the previous step. After the support counts are updated and the infrequent item c is removed, the conditional FP-tree of de, shown in figure (d), contains only the single item a, whose support equals the minimum support, so the algorithm extracts the itemset {a,d,e} and moves on to the next sub-problem, finding the frequent itemsets ending in ce. After the prefix paths of c are processed, only {c,e} turns out to be frequent. Finally, the algorithm tackles the last sub-problem and finds that {a,e} is the only remaining frequent itemset.
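The steps above can be sketched by representing the prefix paths of e as plain weighted lists, a flat stand-in for the conditional FP-tree; the exact paths are reconstructed from the text's example (counts already adjusted as in the first step), so treat them as an assumption.

```python
MINSUP = 2

# Prefix paths of e, read off the FP-tree via e's node-links; each path
# carries the adjusted count of its e node.
PREFIX_PATHS_E = [(['a', 'c', 'd'], 1), (['a', 'd'], 1), (['b', 'c'], 1)]

def mine_suffix(prefix_paths, suffix, minsup=MINSUP):
    """Mine the frequent itemsets ending in `suffix` from weighted
    prefix paths (the conditional-FP-tree stand-in)."""
    # Sum each item's count along the paths (= summing over node-links).
    counts = {}
    for path, count in prefix_paths:
        for item in path:
            counts[item] = counts.get(item, 0) + count
    results = []
    for item, count in counts.items():
        if count < minsup:
            continue                       # e.g. b (count 1) is pruned
        itemset = [item] + suffix
        results.append((tuple(itemset), count))
        # Conditional prefix paths of `item`: cut each path just before it.
        conditional = [(path[:path.index(item)], count)
                       for path, count in prefix_paths if item in path]
        results.extend(mine_suffix([(p, c) for p, c in conditional if p],
                                   itemset, minsup))
    return results

ending_in_e = mine_suffix(PREFIX_PATHS_E, ['e'])
```

On these paths the sketch reproduces the result described above: {d,e}, {a,d,e}, {c,e}, and {a,e}, each with support 2, while b is pruned from every sub-problem.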
This example illustrates the divide-and-conquer approach used by the FP-growth algorithm: at each recursive step, a conditional FP-tree is built by updating the support counts along the prefix paths and removing the infrequent items. Because the sub-problems are disjoint, FP-growth never generates a duplicate itemset. In addition, the counts associated with the nodes allow the algorithm to perform support counting while it generates the itemsets sharing a common suffix.
FP-growth is an interesting algorithm that shows how a compressed representation of a transaction dataset can be exploited to generate frequent itemsets efficiently. For some transaction datasets, FP-growth is several orders of magnitude faster than the standard Apriori algorithm. Its performance depends on the "compression factor" of the dataset: if the resulting FP-tree is very bushy (in the worst case, a full tree with no shared paths), performance degrades significantly, because the algorithm must generate a large number of sub-problems and merge the results each sub-problem returns.
The FP-Growth Algorithm