Summary of Association Rule Mining Algorithms

Abstract: This article introduces the basic concepts and classification of association rules, lists the main association rule mining algorithms, briefly analyzes several typical algorithms, and looks ahead to future research directions in association rule mining.

1 Introduction

Association rule mining finds interesting associations or correlations among itemsets in large amounts of data. It is an important topic in data mining and has been studied widely in recent years.

A typical example of association rule mining is market basket analysis. Association rule research helps discover relationships among different commodities (items) in a transaction database and reveal customer purchasing patterns, such as the effect of buying one commodity on the purchase of others. The results can be applied to shelf layout, inventory arrangement, and classifying users by purchasing pattern.

Agrawal first posed the problem of mining association rules between itemsets in customer transaction databases in 1993 [ais93b]. Since then, many researchers have studied the problem extensively. Their work includes optimizing the original algorithm, for example by introducing random sampling and parallelism to make rule mining more efficient, and extending the application of association rules.

Recently there has also been work independent of Agrawal's frequent-itemset approach [hpy00], which avoids some of its defects and explores new methods for mining association rules. Other work [kpr98] focuses on evaluating the value of the mined patterns, and its models suggest some research directions worth considering.

2 Basic Concepts

Let I = {I1, I2, ..., Im} be a set of items, where each Ik (k = 1, 2, ..., m) may be, for example, an item in a shopping basket or a customer of an insurance company. Let the task-relevant data D be a set of transactions, where each transaction T is an itemset such that T ⊆ I. Let A be an itemset with A ⊆ T.

An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. Association rules have the following two important measures:

Support: P(A ∪ B), the probability that a transaction in D contains both A and B.

Confidence: P(B | A), the conditional probability that a transaction in D containing A also contains B.

A rule that meets both the minimum support threshold and the minimum confidence threshold is called a strong rule. Given a transaction set D, the problem of mining association rules is to generate all rules whose support and confidence are no lower than the user-specified minimum support and minimum confidence, that is, to generate all strong rules.
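The two measures above can be computed directly from a transaction list. Below is a minimal Python sketch; the transactions and item names are invented for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(A, B, transactions):
    """P(B | A): among transactions containing A, the fraction that also contain B."""
    return support(A | B, transactions) / support(A, transactions)

# Invented example data: four market-basket transactions.
transactions = [{"bread", "milk"}, {"bread", "diapers", "beer"},
                {"milk", "diapers", "beer"}, {"bread", "milk", "diapers"}]

print(support({"bread", "milk"}, transactions))       # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # 0.666...
```

A rule such as {bread} ⇒ {milk} is strong here if, say, minimum support is 0.4 and minimum confidence is 0.6.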

3 Association Rule Types

1) Based on the type of variables handled, association rules can be divided into Boolean and quantitative rules.

Boolean association rules handle discrete, categorical values and express relationships among those variables.

Quantitative association rules handle numeric fields, either by partitioning the values dynamically or by processing the raw data directly, and can be combined with multi-dimensional or multi-level association rules; they may also contain categorical variables.

2) Based on the abstraction levels of the data involved, rules can be divided into single-level and multi-level association rules.

Single-level association rules ignore the fact that real data may have multiple levels of abstraction.

Multi-level association rules take the hierarchical structure of the data fully into account.

3) Based on the number of data dimensions involved in a rule, association rules can be divided into single-dimensional and multi-dimensional.

Single-dimensional association rules involve only one dimension of the data, such as the items purchased by users.

Multi-dimensional association rules involve multiple dimensions of the data.

4 Algorithm Overview

4.1 The Classic Frequent-Itemset Algorithm

In 1994 Agrawal proposed an important method, the Apriori algorithm, for mining association rules between itemsets in customer transaction databases [as94a, as94b]. Its core is a level-wise, iterative algorithm based on the two-phase frequent-itemset idea. In the classification above, the rules it produces are single-dimensional, single-level, Boolean association rules.

All itemsets whose support is no lower than the minimum support are called frequent itemsets.

4.1.1 Basic Idea of the Algorithm

First, find all frequent itemsets; the support of each must be at least the predefined minimum. Then, generate strong association rules from the frequent itemsets; these rules must satisfy both minimum support and minimum confidence.

The overall performance of mining association rules is determined by the first step; the second step is relatively easy to implement.

4.1.2 Analysis of the Apriori Core Algorithm

All frequent itemsets are generated iteratively. The core idea is briefly described as follows:

(1) L1 = {frequent 1-itemsets};
(2) for (k = 2; Lk-1 ≠ ∅; k++) do begin
(3)     Ck = apriori-gen(Lk-1); // generate new candidate sets
(4)     for all transactions t ∈ D do begin
(5)         Ct = subset(Ck, t); // candidates contained in transaction t
(6)         for all candidates c ∈ Ct do
(7)             c.count++;
(8)     end
(9)     Lk = {c ∈ Ck | c.count ≥ minsup}
(10) end
(11) Answer = ∪k Lk;

First the frequent 1-itemsets L1 are generated, then the frequent 2-itemsets L2, and so on, until some value of k makes Lk empty, at which point the algorithm stops. In the k-th iteration, a candidate set Ck of k-itemsets is generated; each itemset in Ck is produced by joining two frequent (k-1)-itemsets that differ in only one item. The itemsets in Ck are the candidates from which the frequent itemsets are drawn: the final Lk must be a subset of Ck, and every element of Ck must be verified against the transaction database to decide whether it enters Lk. This verification is the performance bottleneck of the algorithm: the large transaction database must be scanned once per level, so if the longest frequent itemset contains 10 items, the database is scanned 10 times, which imposes a heavy I/O load.
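The pseudocode and the level-wise scan described above can be sketched in Python as follows. This is an illustrative, unoptimized rendering of the Apriori loop (candidate pruning is omitted here), not the authors' implementation; the sample transactions are invented.

```python
def apriori(transactions, minsup):
    """Level-wise Apriori loop: `transactions` is a list of item sets,
    `minsup` is an absolute support count."""
    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= minsup}
    answer = set(Lk)
    k = 2
    while Lk:
        # Ck: join frequent (k-1)-itemsets that differ in only one item.
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        counts = {c: 0 for c in Ck}
        for t in transactions:          # one full database scan per level
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        Lk = {c for c, n in counts.items() if n >= minsup}
        answer |= Lk
        k += 1
    return answer

# Invented example data.
transactions = [{"bread", "milk"}, {"bread", "diapers", "beer"},
                {"milk", "diapers", "beer"}, {"bread", "milk", "diapers"}]
print(len(apriori(transactions, 2)))  # 4 frequent 1-itemsets + 4 frequent 2-itemsets
```

Note the inner loop: every candidate is checked against every transaction on every level, which is exactly the scanning cost discussed above.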

The possibly huge number of candidate sets and the repeated database scans are the two major disadvantages of the Apriori algorithm.

4.1.3 Algorithm Optimization

To improve efficiency, Mannila et al. introduced a pruning technique to reduce the size of the candidate set Ck [mtv94], which can significantly improve the performance of frequent-itemset generation. The pruning is based on the following property: an itemset is frequent only if all of its subsets are frequent. Therefore, if any (k-1)-subset of a candidate in Ck does not belong to Lk-1, that candidate can be pruned from consideration, reducing the cost of counting the support of all candidates.
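The pruning property can be illustrated with a small candidate-generation sketch; the function name apriori_gen mirrors the pseudocode above, and the example itemsets are invented.

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Join step plus the prune step: discard any candidate k-itemset
    that has a (k-1)-subset outside L_prev, since an itemset can be
    frequent only if all of its subsets are frequent."""
    joined = {a | b for a in L_prev for b in L_prev if len(a | b) == k}
    return {c for c in joined
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

# Invented frequent 2-itemsets: {a,b}, {a,c}, {b,c}, {b,d}.
L2 = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]}
print(apriori_gen(L2, 3))  # only {'a','b','c'} survives the prune
```

The join alone would also produce {a,b,d} and {b,c,d}, but each has a 2-subset ({a,d} and {c,d}) outside L2, so both are pruned before any support counting.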

4.2 Improved Frequent-Itemset Algorithms

4.2.1 Hashing

This algorithm was proposed by Park et al. in 1995 [pcy95b]. Experiments show that the dominant cost in finding frequent itemsets is generating the frequent 2-itemsets L2, so Park et al. introduced a hashing technique to improve that step.

The basic idea is as follows: while scanning the database to count the candidates in C1 and produce the frequent 1-itemsets L1, also generate all 2-itemsets of each transaction, hash them into the buckets of a hash table, and increment the corresponding bucket counts. Any 2-itemset whose bucket count is below the support threshold cannot be frequent and can be deleted from the candidate 2-itemsets, which greatly compresses the set of 2-itemsets that must be considered.
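The bucket-counting idea can be sketched as follows. This is an illustrative simplification (a single bucket array and Python's built-in hash), not the exact structure from the paper; the transactions are invented.

```python
from itertools import combinations

def pcy_candidate_pairs(transactions, minsup, n_buckets=101):
    """While counting items for L1, hash every 2-itemset of each
    transaction into a bucket. A pair can only be frequent if its
    bucket count reaches minsup, because bucket counts can only
    overestimate the true pair counts."""
    item_counts = {}
    buckets = [0] * n_buckets
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    L1 = {i for i, c in item_counts.items() if c >= minsup}
    # Candidate 2-itemsets: both items frequent AND the bucket survived.
    return {frozenset(p) for p in combinations(sorted(L1), 2)
            if buckets[hash(p) % n_buckets] >= minsup}

transactions = [{"bread", "milk"}, {"bread", "diapers", "beer"},
                {"milk", "diapers", "beer"}, {"bread", "milk", "diapers"}]
print(len(pcy_candidate_pairs(transactions, 2)))
```

Because a bucket's count is at least the count of every pair hashed into it, the filter can never discard a truly frequent pair; it only removes pairs that provably cannot be frequent.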

4.2.2 Transaction Compression

Agrawal et al. proposed methods that reduce the number of transactions scanned in later iterations [as94b, hf95]. Because a transaction that contains no frequent k-itemset cannot contain any frequent (k+1)-itemset, such transactions can be marked as deleted so that subsequent database scans skip them.
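A short sketch of this trimming step, assuming the frequent k-itemsets are already known (the example data is invented):

```python
def compress_transactions(transactions, Lk):
    """Drop transactions containing no frequent k-itemset: they cannot
    contain any (k+1)-itemset, so later scans can safely skip them."""
    return [t for t in transactions if any(s <= t for s in Lk)]

transactions = [{"bread", "milk"}, {"beer"}, {"bread", "milk", "diapers"}]
L2 = {frozenset({"bread", "milk"})}
print(len(compress_transactions(transactions, L2)))  # 2 transactions survive
```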

4.2.3 Hash-Based Candidate Filtering

Park et al. proposed a hash-based algorithm to generate frequent itemsets efficiently [pcy95a]. Experiments show that the dominant cost of frequent-itemset generation lies in producing the frequent 2-itemsets L2; Park et al. exploit this property by introducing hashing to improve the generation of frequent 2-itemsets.

4.2.4 Partitioning

Savasere et al. designed a partition-based algorithm [son95]. The database is first logically divided into several blocks; each block is considered in turn and all of its locally frequent itemsets are generated, the locally frequent itemsets are then combined into the set of all possible global candidates, and finally the true support of these candidates is counted. The block size is chosen so that each block fits in main memory, and each phase scans the database only once. Correctness is guaranteed because every globally frequent itemset must be locally frequent in at least one block.

The algorithm is highly parallelizable: each block can be assigned to a processor that generates its local frequent itemsets, and after each level the processors communicate to form the global candidate k-itemsets. Usually this communication is the main bottleneck of execution time; the time each processor needs to generate its local frequent itemsets is another. Other methods share a hash tree among processors to generate frequent itemsets. More parallel algorithms for generating frequent itemsets can be found in [as96].
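The two-pass structure can be sketched as follows. Here mine_block stands for any local frequent-itemset miner passed in as a parameter; both the parameterization and the toy miner are illustrative assumptions, not the paper's interface.

```python
def partition_mine(transactions, minsup_frac, n_blocks, mine_block):
    """Pass 1: mine each block for locally frequent itemsets and union
    them into a global candidate set. Pass 2: one full scan counts the
    true global support of every candidate. Correct because a globally
    frequent itemset is locally frequent in at least one block."""
    size = max(1, len(transactions) // n_blocks)
    blocks = [transactions[i:i + size]
              for i in range(0, len(transactions), size)]
    candidates = set()
    for block in blocks:                 # pass 1: one scan per block
        candidates |= mine_block(block, minsup_frac)
    counts = {c: 0 for c in candidates}
    for t in transactions:               # pass 2: global recount
        for c in candidates:
            if c <= t:
                counts[c] += 1
    threshold = minsup_frac * len(transactions)
    return {c for c, n in counts.items() if n >= threshold}

def frequent_singletons(block, minsup_frac):
    """Toy per-block miner used only for demonstration."""
    counts = {}
    for t in block:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    return {s for s, c in counts.items() if c >= minsup_frac * len(block)}
```

A candidate that is locally frequent in some block may still fail the global recount, which is exactly why the second pass is needed.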

4.2.5 Sampling

The basic idea is to mine a subset of the given data: careful analysis of the information obtained in an earlier pass can yield a cheaper algorithm. Mannila et al. first considered this point [mtv94], arguing that sampling is an effective way to discover rules. Toivonen later developed the idea further [toi96]: first use a sample drawn from the database to obtain rules that probably hold over the whole database, then verify the results against the rest of the database. Toivonen's algorithm is quite simple and significantly reduces I/O cost, but a major drawback is that the results can be inaccurate because of so-called data skew: data on the same page is often highly correlated and may not represent the pattern distribution of the whole database, so mining a 5% sample of the transactions may end up costing nearly as much as scanning the full database.
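The sample-then-verify idea can be sketched as follows. The 0.8 factor used to lower the threshold on the sample is an arbitrary illustrative safety margin, not the value from [toi96], and mine stands for any frequent-itemset miner passed in by the caller.

```python
import random

def sample_then_verify(transactions, minsup_frac, sample_frac, mine):
    """Mine a random sample with a lowered threshold, then verify the
    surviving candidates with a single scan of the full database."""
    n = max(1, int(len(transactions) * sample_frac))
    sample = random.sample(transactions, n)
    # Lower the threshold on the sample to reduce the chance of missing
    # itemsets that are frequent globally but rare in the sample.
    candidates = mine(sample, minsup_frac * 0.8)
    counts = {c: 0 for c in candidates}
    for t in transactions:               # single verification scan
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return {c for c, v in counts.items()
            if v >= minsup_frac * len(transactions)}
```

Even with the lowered sample threshold, an unlucky (skewed) sample can still miss a globally frequent itemset, which is the inaccuracy discussed above.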

4.2.6 Dynamic Itemset Counting

Brin et al. gave this algorithm in [bmut97]. Dynamic itemset counting divides the database into blocks marked by start points. Whereas Apriori determines new candidates only before each complete database scan, this variant can add new candidates at any start point. The technique dynamically evaluates the support of all itemsets counted so far; as soon as all subsets of an itemset are determined to be frequent, the itemset is added as a new candidate. The resulting algorithm requires fewer database scans than Apriori.

4.3 The FP-Tree Frequent-Itemset Algorithm

To address the inherent drawbacks of the Apriori algorithm, J. Han et al. proposed a method for mining frequent itemsets without candidate generation: the FP-tree algorithm, also known as FP-growth [hpy00]. It adopts a divide-and-conquer strategy: after the first scan, the frequent items of the database are compressed into a frequent-pattern tree (FP-tree) that retains the association information; the FP-tree is then divided into a set of conditional pattern bases, each associated with one frequent item, and these are mined separately. When the raw data volume is very large, this can be combined with partitioning so that each FP-tree fits in main memory. Experiments show that FP-growth adapts well to rules of different lengths and is far more efficient than Apriori.
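The first phase, building the FP-tree, can be sketched as follows. The mining phase over conditional pattern bases is omitted for brevity, and the class layout (no header table, alphabetical tie-breaking) is a simplifying assumption, not the paper's data structure.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, minsup):
    """Scan 1 counts items; scan 2 inserts each transaction's frequent
    items, in descending frequency order, into a shared prefix tree so
    that common prefixes are compressed into single paths."""
    freq = defaultdict(int)
    for t in transactions:                       # scan 1
        for item in t:
            freq[item] += 1
    order = {i: c for i, c in freq.items() if c >= minsup}
    root = FPNode(None, None)
    for t in transactions:                       # scan 2
        items = sorted((i for i in t if i in order),
                       key=lambda i: (-order[i], i))
        node = root
        for item in items:                       # shared prefixes merge
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root
```

Because frequent items are inserted in a fixed global order, transactions sharing a prefix share a single path, which is how the tree stays compact after only two scans.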

4.4 Multi-Level Association Rule Mining

For many applications, because the data is sparsely distributed, it is difficult to find strong association rules at the level of raw data detail. After concept hierarchies are introduced, mining can be performed at higher levels [hf95, sa95]. Although rules obtained at a higher level may express more general information, what is common knowledge to one user is not necessarily so to another, so data mining systems should support mining at multiple levels.

Classification of multi-level association rules: according to the levels involved, multi-level association rules can be divided into same-level association rules and cross-level association rules.

Multi-level association rule mining can largely follow the support-confidence framework, but some care is needed when setting the support thresholds.

Same-level association rules can adopt one of two support strategies:

1) Uniform minimum support. The same minimum support is used at every level. This is easier for users and for algorithm implementation, but the drawbacks are obvious.

2) Reduced minimum support. Each level has its own minimum support, with smaller values at lower levels. Information obtained at higher levels can also be used to filter the search.

For cross-level association rules, the minimum support should be determined according to the lower of the levels involved.

4.5 Multi-Dimensional Association Rule Mining

In addition to single-dimensional association rules, multi-dimensional databases also give rise to multi-dimensional association rules. For example:

age(X, "20..30") ∧ occupation(X, "student") ⇒ purchases(X, "laptop")

This rule involves three dimensions of the data: age, occupation, and purchase.

Based on whether the same dimension may repeat, these can be subdivided into inter-dimensional association rules (no repeated dimensions) and hybrid-dimensional association rules (the same dimension may appear on both the left and right sides of the rule).

age(X, "20..30") ∧ purchases(X, "laptop") ⇒ purchases(X, "printer")

This rule is a hybrid-dimensional association rule.

When mining inter-dimensional and hybrid-dimensional association rules, the different field types, categorical and numeric, must also be considered.

Categorical fields can be handled by the original algorithms, while numeric fields require preprocessing [khc97]. The following methods are used to handle numeric fields:

1) The numeric field is divided into predefined intervals specified by the user in advance. The resulting rules are called static quantitative association rules.

2) The numeric field is divided into Boolean fields according to the data distribution. Each Boolean field represents an interval of the value range and is 1 if the value falls within the interval and 0 otherwise. This partitioning is dynamic. The resulting rules are Boolean association rules.

3) The numeric field is divided into intervals that reflect its semantics, taking the distance between data values into account. The resulting rules are distance-based association rules.

4) The raw values of the numeric field are analyzed directly with statistical methods and combined with multi-level association rules to compare across levels and obtain useful rules. The resulting rules are called multi-level quantitative association rules.
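Method 1), static discretization into predefined intervals, can be sketched as follows; the interval encoding and the `field:lo..hi` item names are invented for illustration.

```python
def to_boolean_items(value, field, bins):
    """Map a numeric field value to Boolean items named after the
    user-defined intervals, so a standard Boolean miner can use them."""
    return [f"{field}:{lo}..{hi}" for lo, hi in bins if lo <= value < hi]

# User-chosen age intervals, fixed before mining.
age_bins = [(20, 30), (30, 40), (40, 50)]
print(to_boolean_items(25, "age", age_bins))  # ['age:20..30']
```

After this mapping, each transaction contains ordinary Boolean items, and any of the frequent-itemset algorithms above applies unchanged.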

5 Outlook

Regarding the development of association rule mining, the author believes in-depth research is warranted in the following areas: improving algorithm efficiency when processing very large amounts of data; mining algorithms for rapidly updated data; providing interaction with users during mining so that their domain knowledge can be incorporated; the handling of numeric fields in association rules; and visualization of the generated results.

References

[ais93b] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. ACM SIGMOD Conference on Management of Data, pp. 207-216, May 1993.

[as94a] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Technical Report RJ9839, IBM Almaden Research Center, San Jose, CA, Jun. 1994.

[as94b] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB '94), Sep. 1994.

[as96] R. Agrawal and J. Shafer. Parallel mining of association rules: design, implementation, and experience. IEEE Trans. Knowledge and Data Engineering, 8:962-969, 1996.

[bmut97] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 255-264, May 1997.

[hf95] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proc. 1995 Int. Conf. Very Large Data Bases (VLDB '95), pp. 402-431, Sep. 1995.

[hpy00] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. 2000 ACM SIGMOD Int. Conf. Management of Data (SIGMOD '00), pp. 1-12, May 2000.

[khc97] M. Kamber, J. Han, and J. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. In Proc. 1997 Int. Conf. Knowledge Discovery and Data Mining (KDD '97), pp. 207-210, Aug. 1997.

[kpr98] J. Kleinberg, C. Papadimitriou, and P. Raghavan. Segmentation problems. In Proc. 30th Annual ACM Symposium on Theory of Computing, 1998.

[mtv94] H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. In AAAI Workshop on Knowledge Discovery in Databases, pp. 181-192, Jul. 1994.

[pcy95a] J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 175-186, May 1995.

[pcy95b] J. S. Park, M. S. Chen, and P. S. Yu. Efficient parallel data mining for association rules. In Proc. 4th International Conference on Information and Knowledge Management, Baltimore, Maryland, Nov. 1995.

[sa95] R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. 21st International Conference on Very Large Data Bases, pp. 407-419, Sep. 1995.

[son95] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. 21st International Conference on Very Large Data Bases, pp. 432-443, Sep. 1995.

[toi96] H. Toivonen. Sampling large databases for association rules. In Proc. 22nd International Conference on Very Large Data Bases, Bombay, India, pp. 134-145, Sep. 1996.
