Distributed FP-tree
1. First, count and sort the shopping basket data. With min_sup=3, items whose support is below 3 are removed.
2. Following the FP-tree construction procedure, the entries of the second column (fcamp, fcabm, fb, cbp, fcamp) are used to build the FP-tree as follows:
3. The third column is obtained by traversing each entry of the second column from right to left, which gives the path leading to each item. For example, in fcamp the path to p is fcam, to m is fca, to a is fc, and to c is f. This step runs on the map side, where the basket data is stored on each node, and it produces the <k,v> pairs shown in the third column.
4. The shuffle phase sends these pairs to the reducers, where the frequent patterns are easy to find.
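The map, shuffle, and reduce steps above can be sketched in a few lines of Python. This is a minimal single-process simulation, not a Hadoop job: the function names `map_basket`, `shuffle`, and `reduce_item` are made up for illustration, and the reduce step only counts single items in each conditional pattern base rather than mining recursively.

```python
from collections import defaultdict

# Ordered, filtered baskets from step 2 (min_sup = 3).
BASKETS = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]

def map_basket(basket):
    """Map side: walk right to left and emit <item, path leading to item>."""
    pairs = []
    for i in range(len(basket) - 1, 0, -1):  # the first item has no prefix
        pairs.append((basket[i], basket[:i]))
    return pairs

def shuffle(pairs):
    """Group values by key, as the shuffle phase would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_item(prefixes, min_sup=3):
    """Reduce side: count items over the prefix paths, keep the frequent ones."""
    counts = defaultdict(int)
    for path in prefixes:
        for item in path:
            counts[item] += 1
    return {item: n for item, n in counts.items() if n >= min_sup}

pairs = [kv for basket in BASKETS for kv in map_basket(basket)]
grouped = shuffle(pairs)
frequent = {item: reduce_item(paths) for item, paths in grouped.items()}
```

Under these assumptions, `grouped["p"]` holds the three prefix paths fcam, cb, fcam, and `frequent["p"]` keeps only c with a count of 3.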
To verify the result above, mine the frequent patterns directly with the FP-tree:
p: On the first path p has a count of 2, which is less than min_sup, so the items with a count of 2 (f, a, m) are removed. c and p appear once more on the rightmost path; adding p:2, c:2 from the first path gives p:3, c:3.
m: The first path is f2-c2-a2-m2 and the second path is f1-c1-b1-m1. After b is filtered out, the totals are f3-c3-a3-m3, so the final pattern is f:3, c:3, a:3, m:3.
b: No frequent pattern can be mined.
Similarly:
a: f:3, c:3, a:3
c: f:3, c:3
This confirms the result.
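The verification can also be reproduced by building the FP-tree explicitly and reading the conditional pattern base of each item from it. The sketch below is a simplified illustration: `Node`, `build_tree`, `conditional_pattern_base`, and `frequent_in_base` are hypothetical names, and only single-item counts within each conditional pattern base are computed.

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_tree(baskets):
    """Insert each ordered basket as a path; shared prefixes share nodes."""
    root = Node(None, None)
    header = defaultdict(list)  # item -> every tree node carrying that item
    for basket in baskets:
        node = root
        for item in basket:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

def conditional_pattern_base(header, item):
    """For each node of `item`, collect (path from root, node count)."""
    base = []
    for node in header[item]:
        path = []
        parent = node.parent
        while parent.item is not None:  # stop at the root
            path.append(parent.item)
            parent = parent.parent
        base.append(("".join(reversed(path)), node.count))
    return base

def frequent_in_base(base, min_sup=3):
    """Sum item counts over the weighted paths and keep the frequent items."""
    counts = defaultdict(int)
    for path, n in base:
        for item in path:
            counts[item] += n
    return {item: n for item, n in counts.items() if n >= min_sup}

root, header = build_tree(["fcamp", "fcabm", "fb", "cbp", "fcamp"])
p_base = conditional_pattern_base(header, "p")
m_base = conditional_pattern_base(header, "m")
```

Here `p_base` contains the path fcam with count 2 and the path cb with count 1, so only c reaches min_sup, which matches the p:3, c:3 result above. For b, no item in the base reaches a count of 3.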
PFP Algorithm Bottleneck:
The bottleneck is on the reduce side: the map and shuffle phases send all of the data to the reducers, which can easily exhaust the memory of a reduce node.
http://infolab.stanford.edu/~echang/recsys08-69.pdf gives a method:
1. Suppose the items are divided into two groups, G1 and G2: G1 contains the items c, a, p; G2 contains the items f, b, m.
2. Process each basket:
First basket: f,c,a,m,p, split across the two groups G1 and G2. Previously the map step emitted many <k,v> pairs keyed by item; here the map is keyed by group instead. In the first basket, the rightmost item that belongs to G1 is p, so the basket is written down up to and including p, giving fcamp: the key is G1 and the value is fcamp. With G2 as the key, the rightmost item is m and the value is fcam, i.e. <G1,fcamp><G2,fcam>.
Second basket: f,c,a,b,m, split across G1 and G2: <G1,fca><G2,fcabm>
Third basket: f,b, split across G1 and G2: <G1,null><G2,fb>
Fourth basket: c,b,p, split across G1 and G2: <G1,cbp><G2,cb>
Fifth basket: f,c,a,m,p, split across G1 and G2: <G1,fcamp><G2,fcam>
The process looks like this:
3. The <k,v> pairs above are sent to two machines, one for G1 and one for G2, and each machine reconstructs an FP-tree from the pairs it receives.
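The group-based map in steps 1 to 3 can be sketched as follows. This is an illustrative single-process version, not the paper's implementation: `GROUPS` and `map_basket_by_group` are hypothetical names, and a basket with no item in a group simply emits nothing for that group instead of a <G1,null> pair.

```python
from collections import defaultdict

# Group assignment from step 1.
GROUPS = {"G1": set("cap"), "G2": set("fbm")}

def map_basket_by_group(basket, groups=GROUPS):
    """Emit <group, basket prefix ending at the rightmost item of that group>."""
    pairs = []
    for gid, members in groups.items():
        # Index of the rightmost basket item that belongs to this group.
        last = max(
            (i for i, item in enumerate(basket) if item in members),
            default=None,
        )
        if last is not None:  # skip groups with no item in this basket
            pairs.append((gid, basket[: last + 1]))
    return pairs

baskets = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]

# Simulated shuffle: every pair with the same group key goes to the same machine.
per_group = defaultdict(list)
for basket in baskets:
    for gid, value in map_basket_by_group(basket):
        per_group[gid].append(value)
```

With this grouping, the machine for G1 receives fcamp, fca, cbp, fcamp and the machine for G2 receives fcam, fcabm, fb, cb, fcam. Each machine sees only the baskets relevant to its own items rather than the whole data set.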
PFP (Parallel FP-Growth)