Frequent pattern mining is a very important application in data mining. This post covers some of the algorithms associated with frequent pattern mining.
Definition
What is frequent pattern mining? A frequent pattern is a pattern that occurs frequently in a sample data set. For example, a supermarket transaction system records a great many transactions, and the record for each transaction includes the list of items the customer purchased. An observant store manager would notice that diapers and beer appear together on many customers' shopping lists, and with very high frequency. Diapers and beer appearing together on a shopping list is a frequent pattern, and discovering such patterns is frequent pattern mining. Such mining is very valuable: the example above is a real discovery made in Wal-Mart supermarkets, and it is still talked about in the industry today.
The Apriori mining algorithm
The natural next question is: how can we efficiently mine all of these patterns? Let's start with the simplest and most natural approach. Before describing the algorithm, we first state a key property used in frequent pattern mining, the Apriori property.
The Apriori property:
The property says: if an itemset (a collection of items) is not a frequent itemset, then no itemset containing it can be frequent. An itemset here is a pattern. This property is natural and easy to understand. In the Wal-Mart example above, if beer appears on only 1 shopping list overall, then any combination of items containing beer, such as (beer, diapers), can appear at most once; so if we decide that an itemset must appear more than 2 times to be called frequent, no combination containing beer can possibly be frequent. Conversely, if an itemset is frequent, then every non-empty subset of it is also frequent.
With this property, we can eliminate impossible itemsets during mining and avoid unnecessary computation. The method consists of two operations: product (the join, or cross product) and prune (pruning). These two operations are the core of the whole approach.
First, the product operation:
A few definitions. L(k) is the candidate queue: a list of itemsets, all of the same length k (so the queue is a collection of itemsets). We call this length the rank, and itemsets of length k are called k-sets. Then L(k+1) = L(k) product L(k): the candidate queue of rank k generates the candidate queue of rank k+1 through the product operation (a self cross product). Note that the items inside every k-set are kept in alphabetical order (or some other predefined order). Now the key question: how does product work? The product operation joins the k-sets in the candidate queue pairwise. There is a precondition for two k-sets l1 and l2 to be joinable: their first k-1 items must be identical, and the k-th item of l1 must be smaller than the k-th item of l2 (this ordering requirement eliminates duplicate results). As a formula: (l1[1]=l2[1]) ∧ (l1[2]=l2[2]) ∧ ... ∧ (l1[k-1]=l2[k-1]) ∧ (l1[k]<l2[k]). The result of the join is the itemset of length k+1: l1[1], l1[2], ..., l1[k-1], l1[k], l2[k]. When every joinable pair of k-sets in L(k) has been joined, the resulting (k+1)-sets form the new candidate queue L(k+1) of rank k+1.
The prune operation:
This operation filters the candidate queue: with one scan of the database, every k-set in L(k) that is not a frequent itemset is removed. Why? Recall the Apriori property above: a k-set that is not frequent cannot generate any frequent itemset through the product operation, so it contributes nothing to the mining; keeping it in the candidate queue only adds complexity, so it is removed.
These two operations form the core of the algorithm. Starting from the candidate queue of rank 1, the product and prune operations generate the candidate queue of rank 2; the same two steps then generate the candidate queue of rank 3; and so on, looping until no new candidates can be generated. The support of every surviving k-set is counted along the way.
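As a concrete sketch, the product-and-prune loop might look like the following in Python. This is a minimal illustration of the idea, not an optimized implementation; the example database is the one used in the FP-tree discussion later in this post.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori sketch: alternate the prune step (one database
    scan that counts support and drops infrequent k-sets) with the
    product step (pairwise join of surviving k-sets)."""
    items = sorted({i for t in transactions for i in t})
    candidates = [(i,) for i in items]      # the rank-1 candidate queue
    frequent = {}
    while candidates:
        # Prune: one scan of the database counts each candidate's support.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            t = set(t)
            for c in counts:
                if t.issuperset(c):
                    counts[c] += 1
        survivors = sorted(c for c, n in counts.items() if n >= min_support)
        frequent.update((c, counts[c]) for c in survivors)
        # Product: join l1, l2 that share their first k-1 items; since the
        # survivors are sorted, l1[k] < l2[k] holds automatically.
        candidates = [l1 + (l2[-1],)
                      for l1, l2 in combinations(survivors, 2)
                      if l1[:-1] == l2[:-1]]
    return frequent

# The transaction list used in the FP-tree example later in this post.
db = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
freq = apriori(db, min_support=3)
```

With minimum support 3 this yields, among others, the itemsets (c, p) with support 3 and (a, c, f, m) with support 3, matching the results the FP-growth walkthrough derives later.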
Here is a concrete example that illustrates the idea of the algorithm well:
This algorithm is clear and direct, and relatively simple to implement. Its drawback is cost: every prune operation scans the database, and every product operation joins the k-sets in the queue pairwise, with complexity C(|L(k)|, 2).
To improve efficiency, Jiawei Han proposed the FP-growth algorithm, which improves the efficiency of frequent pattern mining by an order of magnitude.
FP-tree construction
The FP-growth algorithm uses an ingenious data structure to avoid most of the cost of the Apriori mining algorithm: it does not need to repeatedly generate candidate queues or repeatedly scan the whole database for comparisons. To achieve this, it uses a compact data structure called the frequent-pattern tree (FP-tree). The best way to see how this tree is constructed is through an example. Consider the following:
The table describes a list of transactions, where a, b, c, d, e, f, g, ... each represent an item. The "(ordered) frequent items" column lists each transaction's frequent items in descending order of frequency. This ordering is very important: every subsequent operation on items must follow it. Determining the order is simple; one scan of the database suffices. Infrequent items are excluded from this column, because infrequent items have no effect on the mining. In this example we set the minimum support threshold to 3.
Our goal is to build one tree for the whole transaction list. We first define the root of the tree to be null, and then scan each record of the database to construct the FP-tree.
Step 1: scan the first transaction in the database, the one with TID 100. This yields the first branch of the tree: <(f:1), (c:1), (a:1), (m:1), (p:1)>. Note that the branch must be arranged in descending order of frequency.
Step 2: scan the second transaction (TID=200), whose frequent itemset is <f,c,a,b,m>. Looking closely at this sequence, its first three items <f,c,a> are the same as the first three items of the path <f,c,a,m,p> generated in step 1, which means they can share a prefix. So we add 1 to the counts of the three nodes <f,c,a> on the existing path, then attach <(b:1), (m:1)> as a new branch under the (a:2) node, as its children. See:
Step 3: scan the third transaction (TID=300), whose itemset is <f,b>. Compared with the existing paths, only f is a common prefix, so we add 1 to the f node and attach a new child node (b:1) to it. This gives:
Step 4: the fourth transaction's itemset is <c,b,p>, and this time it is different again: its first item c differs from the first node f of the existing paths, so there is no common prefix at all. The itemset is attached directly as a branch under the root. This gives (Figure 1):
Step 5: the last transaction arrives with the itemset <f,c,a,m,p>. We are pleasantly surprised to find that it matches the leftmost path of the tree exactly, so the whole path is a shared prefix, and every node on it is simply incremented by 1. This gives the final tree (Figure 2).
With that, the FP-tree is essentially built. Wait, almost: the tree above is still one piece short of a complete FP-tree. To make later traversals convenient, we add one more structure, the header table. The header table holds all the frequent items in descending order of frequency, and each entry carries a node-link list pointing to the nodes in the tree with the same item name. That may still sound unclear after all these words, so here is a picture; one look and you will understand:
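The construction just described can be sketched in a few lines of Python. This is a minimal illustration: it assumes the transactions have already been reduced to their ordered frequent-item lists (the table's "(ordered) frequent items" column), and the header table keeps a plain list of same-named nodes instead of a linked list.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                  # item -> child FPNode

def build_fp_tree(ordered_transactions):
    """Insert each transaction under a null root, sharing common
    prefixes; the header table maps each item to every tree node
    bearing that item name."""
    root = FPNode(None, None)
    header = defaultdict(list)
    for t in ordered_transactions:
        node = root
        for item in t:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

# The five "(ordered) frequent items" rows of the example table.
db = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
root, header = build_fp_tree(db)
```

After the five insertions, the tree matches Figure 2: f:4 under the root with c:3, a:3 below it, and a separate c:1, b:1, p:1 branch for TID 400.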
That is the complete construction process of the FP-tree. From the example above, a smart reader can easily summarize the construction algorithm, so we will not repeat it here; the detailed algorithm is given in reference [1].
Mining the FP-tree
Now for the key part. We have a very compact data structure, and the next task is to mine it to obtain all the frequent itemsets we need, without ever touching the database again. Let's continue with the example above.
Step 1: our mining starts from the last item of the header table, p, so one obvious direct frequent set is (p:3). Following p's node-links, its 2 nodes lie on 2 paths: <f:4, c:3, a:3, m:2, p:2> and <c:1, b:1, p:1>. From the path <f:4, c:3, a:3, m:2, p:2> we can read that the p-containing path <f,c,a,m,p> occurred 2 times, <f,c,a> occurred 3 times, and <f> occurred 4 times; but we only care about <f,c,a,m,p>, since our aim is to find all frequent sets containing p. In the same way, we conclude that <c,b,p> occurred 1 time in the database. So p has 2 prefix paths: {(fcam:2), (cb:1)}. These two prefix paths are called p's sub-pattern base, also called p's conditional pattern base (conditional, because this sub-pattern base exists under the precondition that p is present). Next we build a conditional FP-tree for this conditional pattern base; recalling the FP-tree construction algorithm above, we easily get the following tree:
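Extracting a conditional pattern base can be sketched directly from the ordered transactions, which is equivalent to following the item's node-links in the tree and walking each node up to the root. A minimal sketch, with db again standing for the example's ordered frequent-item lists:

```python
def conditional_pattern_base(ordered_transactions, item):
    """Collect, with counts, the prefix of every transaction that
    contains `item` -- i.e. the items preceding it in the common
    frequency order."""
    base = {}
    for t in ordered_transactions:
        if item in t:
            prefix = tuple(t[:t.index(item)])
            if prefix:                      # an empty prefix adds nothing
                base[prefix] = base.get(prefix, 0) + 1
    return base

db = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
```

For p this yields {(f,c,a,m): 2, (c,b): 1}, exactly the {(fcam:2), (cb:1)} derived above.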
But since the frequent-set threshold is 3, the tree has only one branch left after pruning, (c:3), so only one frequent itemset {cp:3} can be derived from this conditional FP-tree. Together with the direct frequent set (p:3), that is the final result for p.
Step 2: we mine the second-to-last item of the header table, m. As in step 1, there is obviously a direct frequent set (m:3). The two paths m lies on in the FP-tree are <f:4, c:3, a:3, m:2> and <f:4, c:3, a:3, b:1, m:1>, so its conditional pattern base is {(fca:2), (fcab:1)}. Building the FP-tree for this pattern base and discarding branch b, which does not meet the minimum support threshold, leaves a single frequent path <f:3, c:3, a:3>. Since this FP-tree exists and is not the trivial single-node tree, we continue to mine it recursively. The subtree is a single-path subtree, and we can write the mining step as mine(FP-tree | m) = mine(<f:3, c:3, a:3> | m:3).
To mine this FP subtree we recurse. The recursion on the subtree again takes several steps:
1. The last node of the FP subtree's header table is a. Combined with the node m from before the recursion, we get the conditional pattern base {(fc:3)} for am. The FP-tree built from this pattern base (call it am's conditional tree) is again a single-path tree, <f:3, c:3>, so we continue recursively with mine(<f:3, c:3> | am:3). (We postpone the analysis of this sub-subtree for now, since expanding the recursion here would make the text too confusing.)
2. Similarly, the second-to-last node of the FP subtree's header table is c. Combined with the node m from before the recursion, the recursive call we need is mine(<f:3> | cm:3).
3. The third-to-last (and also last remaining) node is f. Combined with the node m from before the recursion, the call would be mine(null | fm:3); here the recursion terminates, because the subtree is already empty. So this case simply returns the frequent set <fm:3>.
Note: these three steps also yield the direct frequent sub-patterns <am:3>, <cm:3>, <fm:3>. The same holds at every level of the recursive call mine(FP-tree), so we will not repeat it each time.
This is in fact a very simple recursive process. Rather than continue the analysis here, a smart reader can carry the recursion forward from the above and will obtain the following results:
mine(<f:3, c:3> | am:3) ⇒ <cam:3>, <fam:3>, <fcam:3>
mine(<f:3> | cm:3) ⇒ <fcm:3>
mine(null | fm:3) ⇒ <fm:3>
Finally, adding m's direct frequent pattern <m:3>, we have the complete result of step 2, the mining of m. See:
Step 3: next comes the third item from the bottom of the header table, <b:3>. It lies on three paths, <f:4, c:3, a:3, b:1>, <f:4, b:1>, and <c:1, b:1>, so its conditional pattern base is {(fca:1), (f:1), (c:1)}. Every node in the FP-tree built from it has a count below 3, so the conditional FP-tree is empty and the recursion ends. The only frequent set from this step is the direct frequent set <b:3>.
Step 4: the fourth item from the bottom of the header table is <a:3>. It has one prefix path, <f:4, c:3>, giving the conditional pattern base {(fc:3)}, which forms a single-path FP-tree. In fact, some readers may already have noticed that mining such a single-path FP-tree does not require the full recursion: the final result can be produced directly by enumerating combinations. Indeed it can. The result of this step, by enumeration, is {(fa:3), (ca:3), (fca:3), (a:3)}.
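The single-path shortcut can be sketched as a direct enumeration: every subset of the path's nodes, combined with the suffix item, is a frequent pattern, and its support is the smallest count involved (in this example all counts are equal). A minimal sketch:

```python
from itertools import combinations

def single_path_patterns(path, suffix_item, suffix_count):
    """Enumerate a single-path conditional FP-tree without recursion:
    each subset of the path's (item, count) nodes, plus the suffix,
    forms a pattern whose support is the minimum count involved."""
    patterns = {(suffix_item,): suffix_count}
    for k in range(1, len(path) + 1):
        for combo in combinations(path, k):
            items = tuple(i for i, _ in combo) + (suffix_item,)
            patterns[items] = min(min(n for _, n in combo), suffix_count)
    return patterns

# Step 4 of the example: single path <f:3, c:3> with suffix a:3.
result = single_path_patterns([('f', 3), ('c', 3)], 'a', 3)
```

This produces exactly the four patterns listed above: (a):3, (fa):3, (ca):3, (fca):3.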
Step 5: the fifth item from the bottom of the header table is <c:4>. It has only one prefix path, under f, giving the conditional pattern base {(f:3)}, so the frequent sets from this step are obvious: {(fc:3), (c:4)}.
Step 6: the last item of the header table is <f:4>. It has no conditional pattern base, so there is only the direct frequent set {(f:4)}.
Adding up the results of these 6 steps gives all the frequent itemsets we need. The conditional pattern base of each step is summarized below.
In fact, as the example above shows, many readers will have noticed early on that mining a single-path FP-tree follows a fixed rule and needs no recursion: the results can be generated directly by enumerating combinations. That is true, and Jiawei Han optimized for this single-path case. If an FP-tree contains a long single path, we split it into two subtrees: one consisting of the single-path part of the original FP-tree, the other consisting of the remainder of the tree. We run the FP-growth algorithm on the two subtrees and then combine their results to get the final answer.
After the blogger's painstaking, tireless, and slightly verbose analysis above, you should now know the final secret of the FP-growth algorithm. The idea behind it is really quite simple: pack all the information the mining needs from the database into one compact data structure, and complete the entire frequent pattern mining by recursing over that structure. Since the structure is much smaller than the database, it can be kept in memory, which speeds up mining enormously.
Someone may ask: what if the database is so large that the resulting FP-tree is too big to fit entirely in memory? That is a real problem. In his paper, Jiawei Han also gives an approach: partition the original large database into several small databases (called projected databases) and apply the FP-growth algorithm to each of them.
Taking the example above, we put every database record containing p into one database of its own, which we call the p-projected database; similarly for m, b, a, c, f we can generate the corresponding projected databases. The FP-tree of a projected database is comparatively small and can fit entirely in memory.
In modern data mining tasks, data volumes keep growing, so the demand for parallelization keeps growing too, and the problem raised above becomes ever more pressing. In the next post, the blogger will analyze how FP-growth can be parallelized in the MapReduce framework.
The previous post analyzed a very important algorithm in association analysis, FP-growth. The algorithm builds a compact in-memory data structure from the database, the FP-tree, and obtains all frequent patterns by recursively mining it. But with today's massive data, the FP-tree can be too large to reside in a computer's memory, so parallelization is the only option. This post is about how to do parallel FP-growth mining under the MapReduce framework; the algorithm is described in detail in reference [1].
How do we parallelize FP-growth? A natural idea is to divide the original database into partitions on different machines, run FP-growth mining on each partition, and finally merge the results from the different machines. That is indeed the right idea, but the question is: how do we divide the database into chunks? If FP-growth is to be parallelized truly independently, the data partitions must be independent of each other; that is, each partition must be complete with respect to some subset of the items. So here is the approach: with one scan of the database, build the frequent item list f_list = {i1:count1, i2:count2, i3:count3, ...} with count1 ≥ count2 ≥ count3 ≥ ..., then divide f_list into several groups, each forming a group list (g_list). We then scan every transaction of the database: if a transaction contains an item in a g_list, the transaction is added to that group's database partition. This forms several database partitions, each corresponding to one group and one g_list, and it guarantees each partition is complete for the items in its g_list. The partitioning introduces redundancy, since one transaction may have copies in different partitions, but that is the unavoidable price of keeping the partitions independent.
Here is a brief outline of the algorithm's steps:
Step 1: database partitioning. Divide the database into contiguous partitions distributed across different machines. Each partition is called a shard.
Step 2: compute f_list, the support count of every item. This can be done with one MapReduce job; think of the classic word count example on Hadoop, which is essentially the same computation as this step.
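A toy simulation of that word-count-style job, to make the dataflow concrete. The function names are illustrative only; a real Hadoop job would implement Mapper/Reducer classes, but the map and reduce logic is the same.

```python
from collections import Counter
from itertools import chain

def support_map(shard):
    """Mapper: emit an (item, 1) pair for every item of every
    transaction in the shard."""
    for transaction in shard:
        for item in transaction:
            yield item, 1

def support_reduce(pairs):
    """Reducer: sum the counts per item, yielding the f_list supports."""
    counts = Counter()
    for item, n in pairs:
        counts[item] += n
    return counts

shards = [[list("fcamp"), list("fcabm"), list("fb")],   # shard 1
          [list("cbp"), list("fcamp")]]                 # shard 2
f_list = support_reduce(chain.from_iterable(support_map(s) for s in shards))
```

Sorting the resulting counts in descending order gives the f_list used in the grouping step below: f:4, c:4, a:3, b:3, m:3, p:3.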
Step 3: group the items. Divide the entries of f_list into Q groups, each forming a g_list; each g_list is assigned a group_id and contains a set of items.
Step 4: parallel FP-growth. This is the key step. It is also done with one MapReduce job. In detail:
Mapper:
The main job of this mapper is another round of database partitioning, different from the sharding of step 1. It takes the shards produced in step 1 and processes the transactions in each shard one by one, splitting each transaction into items and mapping each item to its group according to the g_list. In this way, the mapper gathers the transactions belonging to the same group onto one machine, forming the complete per-group data sets we discussed earlier, so that in the next step the reducers can run FP-growth in parallel.
Reducer:
Runs the local FP-growth algorithm over the complete per-group data set formed by the mappers.
Step 5: aggregation. The results from all the machines are aggregated into the final result we need.
The figure above gives a block diagram of the algorithm's steps, which should give you a rough understanding of them. The rest of this post analyzes each step of the MapReduce FP-growth algorithm in detail.
Sharding
There is not much to say about this step: divide the database into contiguous chunks of roughly equal size and place them on different machines. On Hadoop, the framework itself distributes the database across machines and creates the partitions, so we do not need to do anything extra.
f_list Calculation
Not much to say here either: it is a simple frequency count, the most basic application of MapReduce. Pseudo-code follows; the reader will see how it works at a glance.
Item grouping
This step is not difficult either: divide f_list into several groups. From here on, to better illustrate the detailed steps of the algorithm, let's work through an example; all the later steps are based on it.
We reuse the example from Han's paper, and suppose we run parallel FP-growth for this database on two machines. After step 2, we get the following f_list:
f_list = {(f:4), (c:4), (a:3), (b:3), (m:3), (p:3)}
In this step we divide it into 2 groups, forming the g_list:
g_list = {group1: (f:4), (c:4), (a:3)}, {group2: (b:3), (m:3), (p:3)}
Parallelization of FP-growth
Straight to the algorithm's pseudo-code.
Let's walk through it with the example. First the mapper; the g_list in the pseudo-code is the g_list above:
g_list = {group1: (f:4), (c:4), (a:3)}, {group2: (b:3), (m:3), (p:3)}
The hash table H looks like this:
Key | Val
----+-------
f   | group1
c   | group1
a   | group1
b   | group2
m   | group2
p   | group2
hashNum is the value looked up in the hash table, which is in fact the group id.
a[] is the array holding the transaction split into individual items. For our example:
T1 (tid=100): a[] = {f, c, a, m, p}
T2 (tid=200): a[] = {f, c, a, b, m}
T3 (tid=300): a[] = {f, b}
T4 (tid=400): a[] = {c, b, p}
T5 (tid=500): a[] = {f, c, a, m, p}
Suppose T1, T2, T3 form one shard and T4, T5 form another. Then the mapper's process is as follows:
Determine which groups the items of transaction T belong to, and send T to each of those groups. Take T1: the mapper traverses its items from the end; it first sees p, looks it up in the hash table, finds it belongs to group2, and outputs <group2, T1>. All entries with val=group2 are then deleted from the hash table, which becomes:
Key | Val
----+-------
f   | group1
c   | group1
a   | group1
Next it sees m and finds no entry for m in the hash table, meaning T1 was already sent to m's group in an earlier step of the traversal (m would map to group2, but when p was processed, T1 was already sent to group2, so it must not be sent there again). So nothing is done, and the traversal continues.
Then it traverses to b: not in the hash table, same as m.
Next comes a: the hash table says it belongs to group1, so T1 is sent to group1, and all entries with val=group1 are deleted from the hash table; this tells the later steps that T1 has already been sent to group1 and should not be sent there again. After this step the hash table is empty, which means every group containing an item of T1 has been handled.
The remaining c and f therefore require no action.
From this detailed (and admittedly wordy) walkthrough, the purpose of this mapper step is clear: send every transaction on the local machine to the appropriate groups, the rule being that a transaction is sent to each group owning one of its items. Deleting hash table entries ensures that a transaction is never sent to the same group twice.
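The walkthrough above can be sketched as follows. This is a simplified model, not the paper's pseudo-code: the per-transaction deletion of hash entries is modeled by a `seen` set of group ids, and the function name is illustrative.

```python
def pfp_map(transaction, item_to_group):
    """Walk an f_list-ordered transaction from its last item backwards;
    the first time a group id is met, emit that group with the prefix
    a[0..j]. Later hits on the same group are skipped, mirroring the
    deletion of that group's entries from the hash table."""
    out, seen = [], set()
    for j in range(len(transaction) - 1, -1, -1):
        gid = item_to_group[transaction[j]]
        if gid not in seen:
            seen.add(gid)
            out.append((gid, transaction[:j + 1]))
    return out

# The hash table H from the example.
H = {'f': 'group1', 'c': 'group1', 'a': 'group1',
     'b': 'group2', 'm': 'group2', 'p': 'group2'}
```

For T1 = {f, c, a, m, p} this emits <group2, f c a m p> and then <group1, f c a>: the truncated per-group outputs whose justification is discussed next.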
A careful reader may object at this point: the blogger is lying! The algorithm does not send the entire transaction record to the different groups; it sends different parts of the transaction to different groups. The output of the mapper pseudo-code is clearly:
output<hashNum, a[0]+...+a[j]> = output<group_id, a[0]+...+a[j]>
In the example above, the item sequence T1 sends to group2 is
{f, c, a, m, p}
and the item sequence it sends to group1 is
{f, c, a}
Why is that? Isn't the data sent to group1 incomplete, with part of it discarded? Doesn't that make the mining result incomplete?
In fact, once you notice that all the items of a transaction are ranked from high frequency to low according to f_list, you can see why this optimization is possible. In the example, we send {f, c, a} to group1's machine, and on that machine we want to mine every frequent pattern containing group1's items, i.e. every frequent pattern containing one or more items of group1. The data we discard consists of items whose frequency is lower than that of the group1 items themselves. Look at our group division:
group1: f, c, a | group2: b, m, p
The frequency dividing line between group1 and group2 is the frequency of a; in this example the cut-off is 3, meaning every item in group1 occurs at least 3 times and every item in group2 occurs at most 3 times.
From group2's FP mining we obtain all item sequences containing a subset of {b, m, p}, such as {b:3, bm:2, pm:2, fb:3, ...}. Clearly the frequencies of these patterns cannot exceed 3, because by the f_list statistics of step 2, b, m, and p each occur no more than 3 times.
Similarly, group1 mines out all item sequences containing a subset of group1's items, such as {f:4, fc:3, ca:3, ...}, whose individual items all have frequency at least 3. One might worry about group1's mining: since we discarded the item combinations containing {b, m, p}, will frequent patterns such as fb, fm, fp be lost? The answer is no.
If we set the support threshold below 3, say 2, then the results fb, fm, fp are already included in group2's mining; the data discarded from group1 has no effect.
If the support threshold equals 3, the frequencies of combinations such as fb, fm, fp are bounded by the frequencies of b, m, p, which are exactly 3; any such combination that does reach the threshold is found in group2's mining (its transactions were sent there in full prefix form), so again nothing is lost.
If the threshold is greater than 3, there is certainly no loss, because the frequencies of combinations such as fb, fm, fp are at most 3 (bounded by the frequencies of b, m, p) and fall below the threshold anyway.
So now you should understand why the mapper can shed some data when outputting to the groups: the root cause is that all transactions are arranged in one single, consistent order. The example groups items exactly by frequency order, but that is just a special case chosen for ease of explanation; any other order would work too. The grouping also need not put the high-frequency items in one group and the low-frequency items in another; any grouping strategy is possible. In essence, data can be shed because the prefix sent to one group already carries all the transaction's items up to that group's item, so another group does not need a duplicate of data that is already complete elsewhere. The premise of this inference is that the items of every transaction follow one single, consistent order (any order will do: alphabetical, by frequency, whatever), because that order is what guarantees that all transactions are truncated the same way when sent to the different groups, for example:
i1 → grp1, i2 → grp2, i3 → grp1, ..., in → grp1
Now look at the reducer's pseudo-code:
The reducer input on each machine is the transaction set corresponding to one group_id. It first obtains the g_list, as before. nowGroup is the set of items assigned to this machine; in our example, on the group1 machine nowGroup = {f, c, a}.
Next it builds the local FP-tree, localFPtree, which should be easy to understand by now. Then, for each item in nowGroup, the reducer defines a heap of size K, used to store the frequent patterns containing that item; the patterns themselves are mined with the classic FP-growth algorithm. (Recall the heap data structure: a complete binary tree in which every node is greater than, or in a min-heap smaller than, its children.) What accumulates in the heap is the top K frequent patterns, by frequency, containing the given item. Finally, the heap's contents are output.
At this stage the result is in fact already there: each machine holds, for each item of its nowGroup, the top K frequent patterns containing that item, and by the earlier analysis, since all processed transactions were consistently ordered, the results produced on different machines do not overlap. We could stop here, but to present the final result better, one more step can be added: result aggregation.
Result aggregation
The pseudo-code is as follows:
This MapReduce job essentially builds an index: it takes the previous step's output and indexes the frequent patterns by item, so that in the end each item maps to a set of frequent patterns, stored in a heap for easy access in descending order of frequency. The if-else in the pseudo-code is the heap insertion: when the heap is full, the new pattern is compared against the least frequent node (the root); if the new node's frequency is larger, the root is deleted and the new node inserted. The purpose is to keep the heap always holding the top K most frequent patterns for that item.
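The per-item top-K heap logic might look like this in Python. This is a sketch using `heapq` min-heaps (so the root is the least frequent entry, the one evicted when the heap is full); the stream of (support, pattern) pairs stands in for the previous step's output.

```python
import heapq

def aggregate(pattern_stream, k):
    """Index frequent patterns by item, keeping at most the k most
    frequent patterns per item in a min-heap: when a heap is full, a
    new pattern replaces the root only if its support is larger."""
    heaps = {}
    for support, pattern in pattern_stream:
        for item in pattern:
            h = heaps.setdefault(item, [])
            if len(h) < k:
                heapq.heappush(h, (support, pattern))
            elif support > h[0][0]:
                heapq.heapreplace(h, (support, pattern))
    return heaps

# A few (support, pattern) pairs in the spirit of the example's results.
patterns = [(3, ('c', 'p')), (3, ('p',)), (4, ('f',)), (2, ('f', 'p'))]
index = aggregate(patterns, k=2)
```

With k=2, each item ends up mapped to at most its two most frequent containing patterns, which is exactly the shape of the final output described above.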
That is the implementation of the FP-growth algorithm in the MapReduce framework. It has been implemented in Apache's open source project Mahout and can be used directly.
Finally, a bit of personal opinion: this MapReduce implementation of FP-growth could consider load-balancing strategies when grouping f_list. Without any strategy, the high-frequency items may all land in one group; the machine for that group would then receive a very large number of transactions and bear a particularly heavy processing load while the tasks on the other machines are too easy, which is very unfavorable for overall system efficiency. Spreading the high-frequency items evenly across the machines with some deliberate strategy would be more efficient.
Reposted from http://blog.sina.com.cn/s/articlelist_1761593252_0_2.html
Apriori algorithm, FP-growth algorithm