Chapter 12: Using the FP-growth Algorithm to Find Frequent Itemsets Efficiently
One Introduction
FP-growth is an algorithm for discovering frequent itemsets. It does not discover association rules by itself. What sets it apart is that it first builds an FP tree (frequent pattern tree) and then mines the frequent itemsets from that tree.
FP-growth is faster than the Apriori algorithm, often by about two orders of magnitude, because it scans the database only twice, whereas Apriori rescans the database each time it counts a new batch of candidate itemsets. The process is divided into two steps:
1. Building the FP tree
2. Discovering frequent itemsets from the FP tree
Two FP Tree
An FP tree looks much like an ordinary tree. Each node records an item and how often that item occurs on the path leading to it. The same item may appear at several places in the tree, possibly with a different count each time. These repeated occurrences are called similar items, the link between them is called a node link, and by following the node links we can quickly find every occurrence of an item.
Every item in an FP tree must be a frequent item, that is, it must satisfy the Apriori principle. Building the FP tree requires two scans of the database: the first scan counts the frequency of every item and filters out the items that do not meet the minimum support; the second scan considers only the frequent items and inserts each transaction into the tree, accumulating the counts along each path.
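The two scans can be illustrated on a handful of transactions before any tree is involved. The sketch below is not the chapter's code: the function name `orderTransactions` and the sample data are made up for illustration, and ties between equally frequent items are broken alphabetically, an assumption added only to make the result deterministic.

```python
from collections import Counter

def orderTransactions(transactions, minSup):
    """Count items, drop infrequent ones, and sort what is left.

    Hypothetical helper for illustration only.
    """
    # First scan: count in how many transactions each item appears.
    counts = Counter(item for trans in transactions for item in set(trans))
    # Keep only the items that reach the minimum support.
    frequent = {item: n for item, n in counts.items() if n >= minSup}
    # Second scan: filter each transaction and order it by descending count
    # (ties broken alphabetically, an assumption made for determinism).
    ordered = []
    for trans in transactions:
        kept = [item for item in set(trans) if item in frequent]
        kept.sort(key=lambda item: (-frequent[item], item))
        if kept:
            ordered.append(kept)
    return frequent, ordered

transactions = [
    ["r", "z", "h", "j", "p"],
    ["z", "y", "x", "w", "v", "u", "t", "s"],
    ["z"],
    ["r", "x", "n", "o", "s"],
    ["y", "r", "x", "z", "q", "t", "p"],
    ["y", "z", "x", "e", "q", "s", "t", "m"],
]
frequent, ordered = orderTransactions(transactions, minSup=3)
print(frequent)
for trans in ordered:
    print(trans)
```

Because every transaction is sorted by the same global order, transactions that share their most frequent items also share a prefix, which is what lets the tree merge them into common paths.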
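Three Building the FP Tree
1. To create a tree, we first need to define the tree's data structure.
This data structure has five fields. Of these, parent points to the parent node, children holds the child nodes, and similar links to the next node that holds the same item. The remaining two, as the constructor calls in the tree-building code suggest, are the item's name and its count.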
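The listing for this structure does not appear in the text, so here is a minimal sketch of what such a node could look like. The class name `TreeNode`, the method `inc`, and the constructor argument order follow the way the tree-building code calls them; the attribute names `name`, `count` and `similarNode`, and the `disp` helper for printing, are assumptions.

```python
class TreeNode:
    """One node of an FP tree (a sketch; attribute names partly assumed)."""

    def __init__(self, name, count, parent):
        self.name = name            # the item stored in this node
        self.count = count          # how often the item occurs on this path
        self.similarNode = None     # next node holding the same item (node link)
        self.parent = parent        # parent node, None for the root
        self.children = {}          # child nodes, keyed by item

    def inc(self, times):
        """Increase the count when another transaction shares this path."""
        self.count += times

    def disp(self, depth=0):
        """Return the subtree as indented text, one node per line."""
        lines = ["  " * depth + f"{self.name}: {self.count}"]
        for child in self.children.values():
            lines.extend(child.disp(depth + 1).splitlines())
        return "\n".join(lines)

# Build a two-level tree by hand to check the structure.
root = TreeNode("RootNode", 1, None)
root.children["z"] = TreeNode("z", 2, root)
root.children["z"].inc(3)
root.children["z"].children["r"] = TreeNode("r", 1, root.children["z"])
print(root.disp())
```

Keeping `children` as a dictionary keyed by item makes the "is this item already a child?" test a constant-time lookup, which the insertion routine relies on.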
2. Create the FP tree
    def createTree(dataSet, minSup=1):
        # dataSet maps each transaction (a frozenset of items) to its occurrence count
        headerTable = {}
        # First scan: collect every item and its frequency
        for transaction in dataSet:
            for item in transaction:
                headerTable[item] = headerTable.get(item, 0) + dataSet[transaction]
        # Delete the items that are not frequent
        # (iterate over a copy of the keys so that deleting is safe in Python 3)
        for key in list(headerTable.keys()):
            if headerTable[key] < minSup:
                del headerTable[key]
        frequentSet = set(headerTable.keys())
        # If no item is frequent, we can finish early
        if len(frequentSet) == 0:
            return None, None
        # Extend every entry to [count, first node of the node-link chain]
        for key in headerTable:
            headerTable[key] = [headerTable[key], None]
        # Initialize the FP tree
        retTree = TreeNode("RootNode", 1, None)
        # Second scan: reorder each transaction and add it to the tree
        for transData, times in dataSet.items():
            arrangeTrans = {}
            for item in transData:
                if item in frequentSet:
                    arrangeTrans[item] = headerTable[item][0]
            if len(arrangeTrans) > 0:
                # Sort the frequent items by descending global count
                sortTrans = [v[0] for v in sorted(arrangeTrans.items(), key=lambda p: p[1], reverse=True)]
                updateTree(sortTrans, retTree, headerTable, times)
        return headerTable, retTree

    def updateTree(sortTrans, retTree, headerTable, times):
        if sortTrans[0] in retTree.children:
            # The item is already a child of this node: just raise its count
            retTree.children[sortTrans[0]].inc(times)
        else:
            # Otherwise create a new child and append it to the item's node-link chain
            retTree.children[sortTrans[0]] = TreeNode(sortTrans[0], times, retTree)
            if headerTable[sortTrans[0]][1] is None:
                headerTable[sortTrans[0]][1] = retTree.children[sortTrans[0]]
            else:
                updateHeader(headerTable[sortTrans[0]][1], retTree.children[sortTrans[0]])
        # Insert the remaining items one level further down
        if len(sortTrans) > 1:
            updateTree(sortTrans[1:], retTree.children[sortTrans[0]], headerTable, times)

    def updateHeader(nodeToTest, targetNode):
        # Walk to the end of the node-link chain and attach the new node there
        while nodeToTest.similarNode is not None:
            nodeToTest = nodeToTest.similarNode
        nodeToTest.similarNode = targetNode
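The tree builder looks up `dataSet[transaction]`, so it expects a dictionary that maps each transaction to the number of times it occurs, not a plain list. A small helper along the following lines could prepare that input; the name `createInitSet` and the sample transactions are assumptions for illustration.

```python
def createInitSet(transactions):
    """Turn a list of transactions into {frozenset(items): occurrence count}."""
    initSet = {}
    for trans in transactions:
        # frozenset is hashable, so it can serve as a dictionary key;
        # identical transactions are merged and their counts added up.
        key = frozenset(trans)
        initSet[key] = initSet.get(key, 0) + 1
    return initSet

sample = [["z", "r"], ["z", "x", "y"], ["r", "z"], ["x"]]
initSet = createInitSet(sample)
for trans, times in initSet.items():
    print(sorted(trans), times)
```

Merging duplicates up front means the builder inserts each distinct transaction once with its weight, which is the same weighted form the conditional pattern bases take later.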
Four Mining Frequent Itemsets from the FP Tree
Mining frequent itemsets from the FP tree takes three steps. First, we find the conditional pattern base of each frequent item. Second, we use each conditional pattern base to build a conditional FP tree. Third, we repeat the first two steps until the tree contains only a single item.
First we need to find the conditional pattern base. What is a conditional pattern base? It is the collection of all prefix paths that end at the item we are looking for. Using the node links stored in headerTable, we can find every occurrence of an item in the tree, and from each occurrence we walk up toward the root to collect its prefix path.
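As a minimal sketch of that search, the functions below follow the node links for one item and walk up from each occurrence to the root. The names `ascendTree` and `findPrefixPath`, the bare `Node` class, and the small hand-built tree are illustrative assumptions, not the author's listing.

```python
class Node:
    """Bare-bones tree node, just enough for this example (hypothetical)."""
    def __init__(self, name, count, parent):
        self.name, self.count, self.parent = name, count, parent
        self.similarNode = None   # node link to the next node with the same item

def ascendTree(node):
    """Collect the items between node and the root, excluding both."""
    path = []
    node = node.parent
    while node is not None and node.parent is not None:
        path.append(node.name)
        node = node.parent
    return path

def findPrefixPath(firstNode):
    """Follow the node links from firstNode and return {prefix path: count}."""
    condPats = {}
    node = firstNode
    while node is not None:
        prefix = ascendTree(node)
        if prefix:
            # A prefix path occurs as often as the node it leads to.
            key = frozenset(prefix)
            condPats[key] = condPats.get(key, 0) + node.count
        node = node.similarNode
    return condPats

# Hand-built tree:  root -> z:5 -> r:1
#                   root -> z:5 -> x:3 -> r:1
#                   root -> x:1 -> r:1
root = Node("RootNode", 1, None)
z = Node("z", 5, root)
r1 = Node("r", 1, z)
x1 = Node("x", 3, z)
r2 = Node("r", 1, x1)
x2 = Node("x", 1, root)
r3 = Node("r", 1, x2)
r1.similarNode = r2
r2.similarNode = r3
x1.similarNode = x2

print(findPrefixPath(r1))
print(findPrefixPath(x1))
```

Note that the occurrence of x directly under the root contributes nothing, because its prefix path is empty.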
Now that we know how to find the conditional pattern base, the next step is to create a conditional FP tree. So what is a conditional FP tree? It is an FP tree built from the conditional pattern base of one item, where each prefix path is treated as a transaction weighted by its count.
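To show how the pieces fit together, here is a compact, self-contained sketch of the whole mining procedure. The node class and the tree builder are repeated in condensed form only so that the block runs on its own; `findPrefixPath`, `mineTree`, the alphabetical tie-breaking in the sort, and the sample transactions are assumptions and not necessarily the author's code.

```python
class TreeNode:
    def __init__(self, name, count, parent):
        self.name, self.count, self.parent = name, count, parent
        self.similarNode = None
        self.children = {}

def createTree(dataSet, minSup):
    """Build an FP tree from {frozenset(items): count}; return (headerTable, root)."""
    counts = {}
    for trans, times in dataSet.items():
        for item in trans:
            counts[item] = counts.get(item, 0) + times
    headerTable = {i: [n, None] for i, n in counts.items() if n >= minSup}
    if not headerTable:
        return None, None
    root = TreeNode("RootNode", 1, None)
    for trans, times in dataSet.items():
        items = [i for i in trans if i in headerTable]
        # Descending count; ties broken by item so every path uses one order.
        items.sort(key=lambda i: (-headerTable[i][0], i))
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += times
            else:
                child = TreeNode(item, times, node)
                node.children[item] = child
                # Append the new node to the item's node-link chain.
                if headerTable[item][1] is None:
                    headerTable[item][1] = child
                else:
                    last = headerTable[item][1]
                    while last.similarNode is not None:
                        last = last.similarNode
                    last.similarNode = child
            node = node.children[item]
    return headerTable, root

def findPrefixPath(node):
    """Conditional pattern base of an item: {prefix path: count}."""
    condPats = {}
    while node is not None:
        prefix, up = [], node.parent
        while up.parent is not None:
            prefix.append(up.name)
            up = up.parent
        if prefix:
            key = frozenset(prefix)
            condPats[key] = condPats.get(key, 0) + node.count
        node = node.similarNode
    return condPats

def mineTree(headerTable, minSup, prefix, freqItems):
    """Grow frequent itemsets recursively; results go into freqItems {itemset: support}."""
    # Visit the items from least to most frequent.
    for item, (count, firstNode) in sorted(headerTable.items(), key=lambda p: p[1][0]):
        newSet = prefix | {item}
        freqItems[frozenset(newSet)] = count
        # The conditional pattern base becomes the data set of the conditional tree.
        condBase = findPrefixPath(firstNode)
        condHeader, _ = createTree(condBase, minSup)
        if condHeader is not None:
            mineTree(condHeader, minSup, newSet, freqItems)

# Each string is one transaction and each letter one item (made-up sample data).
transactions = ["rzhjp", "zyxwvuts", "z", "rxnos", "yrxzqtp", "yzxeqstm"]
dataSet = {}
for trans in transactions:
    dataSet[frozenset(trans)] = dataSet.get(frozenset(trans), 0) + 1

headerTable, root = createTree(dataSet, minSup=3)
freqItems = {}
mineTree(headerTable, 3, set(), freqItems)
for itemset, support in sorted(freqItems.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(itemset), support)
```

The recursion stops on its own: once a conditional pattern base has no item that reaches the minimum support, `createTree` returns no header table and that branch ends.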
Five Example: Finding News Stories That Are Viewed Together
The file kosarak.dat contains about 1 million records, and we want the frequent itemsets with a minimum support of 100000. With the Apriori algorithm this takes a very long time: I waited several minutes without getting a result and gave up.
With the FP-growth algorithm from this chapter, the same task finished in just over one second.
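The listing for this experiment is missing from the text. As a rough sketch of the data-loading part only, assuming the file holds one transaction per line with item IDs separated by whitespace (an assumption about the file layout), the lines could be read into the weighted dictionary form like this. The function name `loadTransactions` is hypothetical, and the demonstration writes a tiny temporary file instead of touching the real data.

```python
import os
import tempfile

def loadTransactions(path):
    """Read one transaction per line into {frozenset(items): occurrence count}."""
    dataSet = {}
    with open(path) as handle:
        for line in handle:
            items = line.split()
            if not items:
                continue  # skip blank lines
            key = frozenset(items)
            dataSet[key] = dataSet.get(key, 0) + 1
    return dataSet

# Demonstration on a small temporary file standing in for the real data set.
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as tmp:
    tmp.write("1 2 3\n1 3\n\n3 1\n6\n")
    tmpPath = tmp.name
try:
    dataSet = loadTransactions(tmpPath)
finally:
    os.remove(tmpPath)

for trans, times in dataSet.items():
    print(sorted(trans), times)
```

With the real file, the resulting dictionary would be handed to the tree builder with a minimum support of 100000, and the tree would then be mined as described in section Four.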
Six Summary
As an algorithm dedicated to discovering frequent itemsets, FP-growth is more efficient than the Apriori algorithm. It scans the database only twice: the first scan builds headerTable, that is, it finds all frequent single items; the second scan inserts each transaction into the tree.
Discovering frequent itemsets is a useful and commonly needed operation, with applications in scenarios such as search and shopping transactions.