(Based on "An Efficient Approach for Maintaining Association Rules Based on Adjusting FP-Tree Structure"
by Jia-Ling Koh and Shui-Feng Shieh, Department of Information and Computer Education.
This post is long; if you want to go straight to the substance, start reading from section 3, the strategy for adjusting the FP-tree. @ouym)
1. Introduction
Data mining has attracted wide attention in database research because it is used in many fields. Mining association rules from customer transactions is an important data mining application. A customer transaction usually consists of a customer ID, the transaction time, and the items purchased in the transaction. Mining association rules from such a database means finding all rules of the form "n% of the customers who buy items X and Y also buy item Z in the same transaction", where n, X, Y and Z are initially unknown. Such rules are useful for targeted marketing decisions.
Several efficient algorithms, such as Apriori and DHP, have been proposed for finding frequent itemsets, from which association rules are then derived. These Apriori-like algorithms suffer from two costs: (1) handling a large number of candidate sets, and (2) scanning the database repeatedly. Jiawei Han proposed the frequent pattern tree (FP-tree), a compressed structure that stores the key information about frequent patterns. An algorithm called FP-growth was then developed to mine the complete set of frequent itemsets from the FP-tree. This approach avoids the cost of generating large candidate sets and of repeated database scans, and is regarded as the most effective strategy for mining frequent itemsets.
Updating a transactional database may invalidate existing rules or introduce new ones. The task of updating association rules is to find the new set of frequent itemsets in the updated database quickly. A simple solution is to re-mine the entire updated database, but this is clearly inefficient, since all the computation done in the previous mining is wasted. Earlier work studied how to keep the previously discovered association rules and update them incrementally, and proposed the FUP algorithm, which updates association rules when new transactions are added. To handle the general case, which includes inserting, deleting and modifying transactions, FUP was extended into the FUP2 algorithm. An incremental technique similar to FUP has also been proposed for mining multi-level association rules. Being Apriori-like, all algorithms in the FUP family must generate a large number of candidates and scan the database repeatedly.
Some proposed incremental update techniques retain additional information besides the frequent itemsets. One incremental mining algorithm searches for frequent sequential patterns and retains the pre-large sequences, that is, sequences whose support lies between a lower support threshold and an upper support threshold. A bound derived from the lower and upper thresholds determines when the original database must be rescanned. Another algorithm maintains the negative border together with the frequent itemsets. If an itemset outside the negative border is added to the frequent itemsets or to their negative border, that algorithm requires a full scan of the entire database.
In this article, we propose an algorithm called AFPIM (Adjusting FP-tree for Incremental Mining), which efficiently finds the new frequent itemsets with minimal recomputation when transactions are added to, deleted from, or modified in the transactional database. In our approach, the FP-tree structure of the original database is maintained in addition to the frequent itemsets. In most cases, the FP-tree of the updated database can be obtained without rescanning the entire database, by adjusting the previous FP-tree according to the inserted and deleted transactions. The frequent itemsets of the updated database are then mined from the new FP-tree, and the corresponding association rules are derived.
2. Problem description
Basic concepts:
In this article, the concept of "pre-large" sequences is applied to items. In addition to the minimum support threshold, a smaller threshold is specified, called the pre-minimum support. For each item X in the database, if its support count is not less than the minimum support, X is called a frequent item. If the support count of X is less than the minimum support but not less than the pre-minimum support, X is called a pre-frequent item. Otherwise, X is an infrequent item. Below, frequent items and pre-frequent items are referred to together as "frequent or pre-frequent items".
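The three categories can be made concrete with a short Python sketch. The function name `classify` and the use of exact fractions are assumptions made here for illustration; the thresholds 0.2 and 0.15 and the database size of 13 transactions are the ones used in the example of section 3.

```python
from fractions import Fraction


def classify(count, n_transactions, min_sup, pre_min_sup):
    """Label one item from its support count and the two relative thresholds."""
    if count >= min_sup * n_transactions:
        return "frequent"
    if count >= pre_min_sup * n_transactions:
        return "pre-frequent"
    return "infrequent"


# Exact fractions avoid floating-point rounding right at a threshold.
MIN_SUP, PRE_MIN_SUP = Fraction("0.2"), Fraction("0.15")

sample_counts = {"f": 7, "d": 3, "a": 2, "g": 1}
labels = {item: classify(c, 13, MIN_SUP, PRE_MIN_SUP)
          for item, c in sample_counts.items()}
print(labels)
```

With 13 transactions the two cut-offs are 2.6 and 1.95, so an item with count 2 is pre-frequent and an item with count 1 is infrequent.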
To handle updates efficiently, the following information is maintained after mining the original database DB.
1. All items in DB and their support counts in DB.
2. The FP-tree of DB, built on the frequent or pre-frequent items of DB.
An item X that is frequent, pre-frequent, or infrequent in DB may become frequent, pre-frequent, or infrequent in the updated database UD.
3. Strategy for adjusting the FP-tree:
Consider the original database DB, with a minimum support of 0.2 and a pre-minimum support of 0.15. The support counts of the items in DB are a:2, b:6, c:5, d:3, e:4, f:7, g:1 and h:1 (the number after ":" is the support count). An item with a support count of at least 2 (13 x 0.15 = 1.95) is a frequent or pre-frequent item in DB, so a, b, c, d, e and f are the frequent or pre-frequent items of DB. Sorted in descending order of support, they are f:7, b:6, c:5, e:4, d:3 and a:2. Next, 5 transactions are inserted into DB and 3 transactions are deleted from it; the inserted set is denoted db+ and the deleted set db-. The support counts of all items in db+ and db- are shown in the accompanying figure. For each item X in UD, the support count is obtained by a simple calculation: its count in DB plus its count in db+ minus its count in db-. The result is a:2, b:9, c:5, d:7, e:6, f:8, g:1 and h:2. UD contains 13 + 5 - 3 = 15 transactions, so a frequent or pre-frequent item of UD must have a support count of at least 3 (15 x 0.15 = 2.25). The frequent or pre-frequent items of UD, in descending order of support, are therefore b:9, f:8, d:7, e:6 and c:5.
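The two orderings in this example can be reproduced with a few lines of Python. This is only a sketch: the helper name `kept_in_order` is made up here, and the counts are the ones quoted in the paragraph above.

```python
from fractions import Fraction


def kept_in_order(counts, n_transactions, pre_min_sup):
    """Items reaching the pre-minimum support, in descending order of count."""
    threshold = pre_min_sup * n_transactions
    kept = [item for item, c in counts.items() if c >= threshold]
    return sorted(kept, key=lambda item: counts[item], reverse=True)


PRE_MIN_SUP = Fraction("0.15")
db_counts = {"a": 2, "b": 6, "c": 5, "d": 3, "e": 4, "f": 7, "g": 1, "h": 1}
ud_counts = {"a": 2, "b": 9, "c": 5, "d": 7, "e": 6, "f": 8, "g": 1, "h": 2}
n_db = 13
n_ud = n_db + 5 - 3  # 5 transactions inserted, 3 deleted

db_order = kept_in_order(db_counts, n_db, PRE_MIN_SUP)
ud_order = kept_in_order(ud_counts, n_ud, PRE_MIN_SUP)
print(db_order)  # ['f', 'b', 'c', 'e', 'd', 'a']
print(ud_order)  # ['b', 'f', 'd', 'e', 'c']
```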
Removing the nodes of infrequent items:
Suppose item I is neither frequent nor pre-frequent in UD. The nodes representing I must then be removed from the FP-tree. Starting from the entry for I in the header table of the FP-tree and following the node-link of I, all nodes representing I can be reached one by one. For each node N with N.item-name = I, the children of N are made children of N's parent, and N is removed from the node-link of I. Finally, the entry for I is deleted from the header table. In the example, a is not a frequent or pre-frequent item in UD, so the nodes representing a are removed from the FP-tree.
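A minimal Python sketch of this removal is given below. The `Node` class, the list-based header table and the toy transactions are all assumptions made for illustration. One step goes beyond the text: when a re-attached child meets a sibling carrying the same item, the two are merged, which is added here only to keep the tree well-formed.

```python
from collections import defaultdict


class Node:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}  # item name -> child Node


def build_tree(transactions, order):
    """Build an FP-tree plus a header table mapping item -> list of its nodes."""
    rank = {item: i for i, item in enumerate(order)}
    root, header = Node(None), defaultdict(list)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header


def adopt(parent, child, header):
    """Hang child under parent, merging with a same-item sibling if one exists."""
    twin = parent.children.get(child.item)
    if twin is None:
        child.parent = parent
        parent.children[child.item] = child
    else:  # assumption added here: merge instead of keeping duplicate siblings
        twin.count += child.count
        header[child.item].remove(child)
        for grandchild in list(child.children.values()):
            adopt(twin, grandchild, header)


def remove_item(header, item):
    """Remove every node of `item` and drop its header entry."""
    for n in header.pop(item, []):
        del n.parent.children[item]
        for child in list(n.children.values()):
            adopt(n.parent, child, header)


def to_dict(node):
    return {c.item: (c.count, to_dict(c)) for c in node.children.values()}


# Hypothetical transactions, not taken from the source example.
root, header = build_tree([["f", "b", "c"], ["f", "c"], ["b", "c"], ["f", "b"]],
                          ["f", "b", "c"])
remove_item(header, "b")
print(to_dict(root))
```

In the source example the removed item a has the lowest support in DB, so its nodes are leaves and no merging is needed.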
Adjusting the order of nodes: the frequent or pre-frequent items of DB are listed in descending order of their support counts in DB, and the bubble sort algorithm is used to decide which adjacent items must be exchanged so that the list follows the descending order of support counts in UD. In the example, 4 exchanges are needed to turn the descending order of DB into that of UD. After the pairs of items to be exchanged have been decided, every path in the FP-tree of DB that contains such a pair must be adjusted. The adjustment method is as follows. Suppose the FP-tree of DB contains a path in which node Y is a child of node X, and the items represented by X and Y must be exchanged. Let node P be the parent of X. If X.count is greater than Y.count, perform steps 1 through 3. Otherwise, perform steps 2 and 3.
Step 1 (Insert): Insert a new child node X' of P that carries the same item as X, with X'.count set to X.count - Y.count. All children of X except Y are reassigned as children of X'. In addition, X.count is reset to Y.count.
Step 2 (Exchange): Exchange the parent links and child links of nodes X and Y, so that Y becomes the child of P and X becomes the child of Y.
Step 3 (Merge): Check whether P has another child node, denoted Z, with Z.item-name = Y.item-name. If Z exists, add Z.count to Y.count and delete Z.
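The three steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the `Node` class and helper names are made up, a tree walk stands in for following the node-link, and in the merge step the subtree of Z is moved under Y (the text only says that Z is deleted, but its children have to go somewhere).

```python
class Node:
    def __init__(self, item, count=0, parent=None):
        self.item, self.count, self.parent = item, count, parent
        self.children = {}  # item name -> child Node


def build_tree(transactions, order):
    """Build a bare FP-tree (no header table) with items sorted by `order`."""
    rank = {item: i for i, item in enumerate(order)}
    root = Node(None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, 0, node)
            node = node.children[item]
            node.count += 1
    return root


def attach(parent, child):
    child.parent = parent
    parent.children[child.item] = child


def merge_into(dst, src):
    """Fold src into dst (same item): add the count, then re-home src's children."""
    dst.count += src.count
    for child in list(src.children.values()):
        twin = dst.children.get(child.item)
        if twin is None:
            attach(dst, child)
        else:
            merge_into(twin, child)


def exchange(x, y):
    """Swap node x with its child y: insert, exchange, merge."""
    p = x.parent
    del p.children[x.item]
    del x.children[y.item]
    if x.count > y.count:
        # Step 1 (insert): X' takes the share of X that does not pass through Y.
        x_new = Node(x.item, x.count - y.count)
        for child in list(x.children.values()):
            attach(x_new, child)
        x.children = {}
        x.count = y.count
        attach(p, x_new)
    # Step 2 (exchange): the path becomes P -> Y -> X -> (former children of Y).
    # Assumes positive counts, so X has no other child when X.count == Y.count.
    for child in list(y.children.values()):
        attach(x, child)
    y.children = {}
    attach(y, x)
    # Step 3 (merge): fold a same-item sibling Z under P into Y, then link Y to P.
    z = p.children.get(y.item)
    if z is not None:
        merge_into(y, z)
    attach(p, y)


def swap_items(root, x_item, y_item):
    """Apply the exchange to every node of x_item that has a y_item child."""
    stack, targets = [root], []
    while stack:  # tree walk in place of the node-link traversal
        n = stack.pop()
        stack.extend(n.children.values())
        if n.item == x_item and y_item in n.children:
            targets.append(n)
    for x in targets:
        exchange(x, x.children[y_item])


def to_dict(node):
    return {c.item: (c.count, to_dict(c)) for c in node.children.values()}


# Hypothetical transactions, not the ones behind the source figures.
transactions = [["f", "b", "c"], ["f", "b"], ["f", "c"], ["b", "c"],
                ["f", "b", "c", "e"], ["f"], ["b"]]
tree = build_tree(transactions, ["f", "b", "c", "e"])
swap_items(tree, "f", "b")
rebuilt = build_tree(transactions, ["b", "f", "c", "e"])
print(to_dict(tree) == to_dict(rebuilt))  # True
```

Comparing the adjusted tree with one rebuilt from scratch in the new order is a convenient sanity check: the adjustment is only useful if both give the same tree.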
First adjustment iteration: exchange the nodes of items f and b
The node-link of item f is obtained from the header table of the FP-tree. The first node carrying item name f is taken as node X, and it has a child node Y carrying item name b. The adjustment method is therefore applied.
(1) Insert: A new child node X' is inserted under the parent of X, which in this case is the root. X' carries the item name f, and its count is set to 2, that is, 7 - 5. In addition, X.count is reset to Y.count, which is 5. Finally, X' is added to the node-link of f.
(2) Exchange: Nodes X and Y are exchanged.
(3) Merge: The root has another child node Z that carries the same item name as Y, namely b. The count of Z is therefore merged into Y, and Z is removed.
The next node carrying item name f is then fetched. This node has no child carrying b, and the end of the node-link of f has been reached, so this iteration is complete.
Second adjustment iteration: exchange the nodes of items c and e
As in the first iteration, the first node carrying item name c is taken as node X, and its child node carrying item name e is taken as node Y. The adjustment method is then applied. In this case X has a second child node, S, besides Y. S is therefore reassigned as a child of the newly inserted node X'.
Third adjustment iteration: exchange the nodes of items c and d
The first node carrying item name c and its child node carrying item name d are taken as nodes X and Y. In this case X.count equals Y.count, so only steps 2 and 3 of the adjustment method are applied. The adjustment method is then applied to the next pair of nodes carrying c and d, again taken as X and Y, and the entries for items f and b are exchanged in the header table.
Fourth adjustment iteration: exchange the nodes of items e and d.
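The four exchanges walked through above are exactly what a bubble sort produces when the DB order (after removing a) is re-sorted by the UD support counts. A small Python sketch, with a function name chosen here for illustration:

```python
def adjacent_swaps(old_order, new_counts):
    """Bubble-sort old_order by descending new_counts, recording each exchange."""
    order = list(old_order)
    swaps = []
    for end in range(len(order) - 1, 0, -1):
        for i in range(end):
            if new_counts[order[i]] < new_counts[order[i + 1]]:
                swaps.append((order[i], order[i + 1]))
                order[i], order[i + 1] = order[i + 1], order[i]
    return swaps, order


ud_counts = {"b": 9, "f": 8, "d": 7, "e": 6, "c": 5}
swaps, final_order = adjacent_swaps(["f", "b", "c", "e", "d"], ud_counts)
print(swaps)        # [('f', 'b'), ('c', 'e'), ('c', 'd'), ('e', 'd')]
print(final_order)  # ['b', 'f', 'd', 'e', 'c']
```

Each recorded pair corresponds to one adjustment iteration over the FP-tree, in the same order as above.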
4. Inserting or deleting transactions:
After the paths inside the FP-tree have been adjusted, every path in the resulting FP-tree follows the descending frequency order of the frequent or pre-frequent items in UD. For each transaction T in db+, the frequent or pre-frequent items of UD contained in T are selected and sorted in that descending order, and T is then inserted into the FP-tree in the same way as when an FP-tree is built. Similarly, each transaction T in db- is removed from the FP-tree.
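A minimal Python sketch of the two operations follows. The function names are hypothetical, the header table is left out for brevity, and the way deletion is carried out (decrementing the counts along the path and dropping nodes that reach zero) is an assumption about how "removed from the FP-tree" would be implemented.

```python
class Node:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}  # item name -> child Node


def project(transaction, rank):
    """Keep only the items present in rank, sorted in the UD order."""
    return sorted((i for i in transaction if i in rank), key=rank.get)


def insert_transaction(root, transaction, rank):
    node = root
    for item in project(transaction, rank):
        if item not in node.children:
            node.children[item] = Node(item, node)
        node = node.children[item]
        node.count += 1


def delete_transaction(root, transaction, rank):
    node = root
    for item in project(transaction, rank):
        child = node.children[item]  # KeyError if the path is not in the tree
        child.count -= 1
        if child.count == 0:
            del node.children[item]  # the rest of this path is detached with it
        node = child


def to_dict(node):
    return {c.item: (c.count, to_dict(c)) for c in node.children.values()}


# Item order of UD from the example in section 3; the transactions are made up.
rank = {item: i for i, item in enumerate(["b", "f", "d", "e", "c"])}
root = Node(None)
insert_transaction(root, ["c", "d", "b"], rank)
insert_transaction(root, ["f", "b", "a"], rank)  # a is not kept in UD
delete_transaction(root, ["c", "d", "b"], rank)
print(to_dict(root))  # {'b': (1, {'f': (1, {})})}
```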
5. The AFPIM algorithm:
Based on the strategy described above, the AFPIM algorithm adjusts the FP-tree structure in order to mine frequent itemsets incrementally. Its steps are as follows.
Step 1: Read the items of DB and their support counts in DB.
Step 2: Scan db+ and db- once, listing all items in db+ and db- together with their support counts. For each item X, compute the support count of X in UD by the formula:
count_UD(X) = count_DB(X) + count_db+(X) - count_db-(X)
Then collect the frequent or pre-frequent items of UD.
Step 3: Determine whether all frequent items of UD are covered by the FP-tree of DB.
Step 3.1: If some frequent item of UD is not in the FP-tree, the whole of UD must be scanned once to rebuild the FP-tree from the frequent or pre-frequent items of UD.
Step 3.2: Otherwise, read in the stored FP-tree of DB.
Step 3.2.1: For each item that is infrequent in UD, remove the corresponding nodes from the FP-tree.
Step 3.2.2: Apply the bubble sort algorithm to decide which items to exchange, based on the descending order of support counts in DB and in UD. Then apply the path adjustment method repeatedly to adjust the paths in the FP-tree.
Step 3.2.3: Scan db+ and db- a second time, calling the function Tree_insert_delete to insert the transactions of db+ and delete the transactions of db-. The result is the FP-tree of UD.
Step 4: Apply the FP-growth algorithm and the TD-FP-growth algorithm, respectively, to find the frequent itemsets of UD from the FP-tree of UD.
Pseudocode:
Association rule mining algorithm AFPIM
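The following Python sketch is not the authors' pseudocode. It only illustrates steps 1 to 3 above: computing the support counts in UD and deciding whether the stored FP-tree can be reused. The function name `plan_update` is hypothetical, and the transactions in `db_plus` and `db_minus` are made up; they were chosen only so that the totals match the counts quoted in section 3.

```python
from collections import Counter
from fractions import Fraction


def plan_update(db_counts, n_db, db_plus, db_minus, min_sup, tree_items):
    """Return the UD support counts and whether a full rescan of UD is needed."""
    ud_counts = Counter(db_counts)
    for t in db_plus:
        ud_counts.update(set(t))    # + count in db+
    for t in db_minus:
        ud_counts.subtract(set(t))  # - count in db-
    n_ud = n_db + len(db_plus) - len(db_minus)
    frequent_ud = {i for i, c in ud_counts.items() if c >= min_sup * n_ud}
    # Step 3: a rescan is needed if a frequent item of UD is missing from the tree.
    must_rescan = not frequent_ud <= set(tree_items)
    return dict(ud_counts), must_rescan


db_counts = {"a": 2, "b": 6, "c": 5, "d": 3, "e": 4, "f": 7, "g": 1, "h": 1}
# Hypothetical inserted and deleted transactions.
db_plus = [["b", "d", "e", "f"], ["b", "d", "e", "c"], ["b", "d", "e", "h"],
           ["b", "d", "f"], ["d"]]
db_minus = [["b", "c"], ["d", "f"], ["e"]]

ud_counts, must_rescan = plan_update(
    db_counts, 13, db_plus, db_minus, Fraction("0.2"),
    tree_items={"a", "b", "c", "d", "e", "f"},
)
print(ud_counts)
print(must_rescan)  # False
```

Here every frequent item of UD (b, f, d, e and c) is already in the stored tree, so the tree can be adjusted in place as in step 3.2 instead of being rebuilt.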