The Apriori algorithm described earlier has major drawbacks: it requires many full-table scans, and candidate generation involves costly self-join operations. As a result, it is rarely used in practice today.
The Mahout algorithm library includes the PFP algorithm, a distributed version of FP-Growth; its internal structure differs little from the standalone FP-Growth algorithm.
This article therefore first introduces the FP-Growth algorithm as it runs in the memory of a single machine.
We again use the shopping-cart data from the Apriori example, as shown below:
TID is the ID of the shopping-cart transaction, and I1 through I5 are the item IDs.
The first step of the FP-Growth algorithm is to scan the entire shopping-cart table once, count the support of each item, and sort the items by support in descending order, as shown in the table below.
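The first scan can be sketched in Python as follows. Since the original data table is an image not reproduced here, the transactions below are assumed to be the standard nine-transaction example from the referenced book; its counts match the node counts quoted later in this article (e.g. I2:7, I1:4).

```python
from collections import Counter

# Hypothetical shopping-cart data: the nine-transaction example from
# "Data Mining: Concepts and Techniques" (the original table image is
# missing here, so this data set is an assumption).
transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]

# First full scan: count the support of every item.
support = Counter(item for t in transactions for item in t)

# Drop items below the minimum support and sort by support, descending.
min_support = 2
ordered = [item for item, c in support.most_common() if c >= min_support]
print(support)   # I2 appears 7 times, I1 and I3 six times each, ...
print(ordered)   # the item order used when inserting transactions into the FP tree
```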
Next, build the FP tree: each transaction is inserted with its items in descending support order, so the items with the minimum support end up at the bottom of the tree.
The build process is as follows:
Finally, the completed FP tree looks like this:
This FP tree is then associated with a header table of item supports, like this:
Each item in the header table holds a pointer to its first corresponding node in the FP tree. For example, the first row points to the I2:7 node, and the second row points to the I1:4 node. Because I1 also appears elsewhere in the FP tree, the I1:4 node in turn stores a node link pointing to the I1:2 node.
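The tree construction with node links can be sketched as follows. This is a minimal illustration, not a reference implementation; the class and variable names are my own, and the transaction data is assumed to be the nine-transaction example from the referenced book.

```python
from collections import Counter

class Node:
    """One FP-tree node; `link` threads nodes of the same item together."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}
        self.link = None  # next node in the tree holding the same item

# Assumed data set (the original table image is missing).
transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
min_support = 2

# First scan: item supports, then a rank for descending-support ordering.
support = Counter(i for t in transactions for i in t)
order = {item: rank for rank, (item, c) in enumerate(support.most_common())
         if c >= min_support}

root = Node(None, None)
header = {}  # header table: item -> first node in its node-link chain

# Second scan: insert each transaction, items sorted by descending support.
for t in transactions:
    node = root
    for item in sorted((i for i in t if i in order), key=order.get):
        child = node.children.get(item)
        if child is not None:
            child.count += 1
        else:
            child = Node(item, node)
            node.children[item] = child
            if item not in header:
                header[item] = child          # first node for this item
            else:
                cur = header[item]            # append to the node-link chain
                while cur.link:
                    cur = cur.link
                cur.link = child
        node = child

print(root.children["I2"].count)                  # the I2:7 node
print(root.children["I2"].children["I1"].count)   # the I1:4 node
print(header["I1"].link.count)                    # the linked I1:2 node
```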
Once the FP tree is built, with just two full-table scans, the cart's irregular data has been turned into a traversable tree structure, and the huge computational cost of the self-joins used for candidate generation is eliminated.
Mining association rules from the FP tree:
From the FP tree we can derive, for each item, its conditional pattern base, its conditional FP tree, and the frequent patterns they generate.
Take I5 as an example.
As the FP tree shows, there are two paths from the root node to an I5:1 node:
I2:7 --> I1:4 --> I5:1
I2:7 --> I1:4 --> I3:2 --> I5:1
The prefixes I2:7 --> I1:4 and I2:7 --> I1:4 --> I3:2 are I5's conditional pattern bases; because the node each path finally reaches is always I5, the I5 itself is omitted.
We record them as {I2, I1: 1} and {I2, I1, I3: 1}. Why is the count of each conditional pattern base 1? Although the counts of I2 and I1 are large, the I5 node at the end of each path has a count of 1: each path reaches I5 only once. The count of a conditional pattern base is therefore determined by the minimum node count along the path.
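Extracting the conditional pattern bases amounts to following the header table's node-link chain for I5 and climbing each node's parent pointers to the root. A minimal sketch, assuming the same nine-transaction example data and illustrative names as before:

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children, self.link = {}, None

def build_tree(transactions, min_support=2):
    """Build the FP tree and header table (same scheme as above)."""
    support = Counter(i for t in transactions for i in t)
    order = {item: rank for rank, (item, c) in enumerate(support.most_common())
             if c >= min_support}
    root, header = Node(None, None), {}
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            child = node.children.get(item)
            if child is not None:
                child.count += 1
            else:
                child = Node(item, node)
                node.children[item] = child
                if item not in header:
                    header[item] = child
                else:
                    cur = header[item]
                    while cur.link:
                        cur = cur.link
                    cur.link = child
            node = child
    return root, header

# Assumed data set (the original table image is missing).
transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
root, header = build_tree(transactions)

# Walk the node-link chain for I5; each I5 node contributes one prefix
# path whose count is that I5 node's own count (the path minimum).
bases = []
node = header["I5"]
while node:
    path, p = [], node.parent
    while p.item is not None:
        path.append(p.item)
        p = p.parent
    bases.append((list(reversed(path)), node.count))
    node = node.link

print(bases)  # the two conditional pattern bases {I2, I1: 1} and {I2, I1, I3: 1}
```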
From the conditional pattern bases we can derive that item's conditional FP tree, for example for I5:
From the conditional FP tree we enumerate all combinations of its items to obtain the mined frequent patterns (the item itself, such as I5, is always included: every frequent pattern mined for an item must contain that item).
The complete table derived from the FP tree is as follows:
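For I5 the conditional FP tree is a single path, so its frequent patterns can be obtained by enumerating all combinations of the conditional items together with I5 itself (in the general case, FP-Growth recurses on each conditional tree instead). A sketch, starting from the pattern bases derived in the text:

```python
from collections import Counter
from itertools import combinations

# I5's conditional pattern bases, as derived in the text above.
bases = [(["I2", "I1"], 1), (["I2", "I1", "I3"], 1)]
min_support = 2

# Items of the conditional FP tree: those whose summed count over the
# pattern bases meets the minimum support (I3 is dropped: 1 < 2).
counts = Counter()
for path, n in bases:
    for item in path:
        counts[item] += n
cond_items = [i for i, c in counts.items() if c >= min_support]

def pattern_support(combo):
    # A pattern's support is the total count of the bases containing it.
    return sum(n for path, n in bases if set(combo) <= set(path))

# Every non-empty combination of the conditional items, joined with I5
# itself, is a frequent pattern for I5.
patterns = {}
for r in range(1, len(cond_items) + 1):
    for combo in combinations(cond_items, r):
        patterns[frozenset(combo) | {"I5"}] = pattern_support(combo)

print(patterns)  # {I2, I5}: 2, {I1, I5}: 2, {I2, I1, I5}: 2
```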
At this point, the output of the FP-Growth algorithm is the set of frequent patterns. FP-Growth follows a divide-and-conquer strategy: rather than processing one potentially huge tree, it constructs conditional FP subtrees and processes each of them separately.
However, when the item data is very large, the FP tree built by FP-Growth may be too big to fit in the computer's memory; in that case the distributed version of FP-Growth, the PFP algorithm, is used for the computation.
Reference book: "Data Mining: Concepts and Techniques"
Data Mining Algorithms for Mining Association Rules (II): The FP-Growth Algorithm