Among the many data mining algorithms, association rule mining is an important one, popularized above all by market basket analysis, and association rules are applied in many real businesses. This article gives a brief summary of association rule mining.

Like clustering, association rule mining is an unsupervised learning method: it describes patterns among the items that appear together within a single transaction. In real life, for example, customers' purchase records in a supermarket often imply many association rules. If 65% of the customers who buy ballpoint pens also buy notebooks, store staff can use such rules to plan product placement.

For convenience, let I = {i1, i2, ..., im} be a set of items and W a set of transactions, where every transaction T in W is a set of items with T ⊆ I. An association rule has the form A → B, where A and B are itemsets with A ⊆ I, B ⊆ I, and A ∩ B = ∅. Four key measures are commonly defined for association rules:
1. Confidence
Definition: If c% of the transactions in W that support itemset A also support itemset B, then c% is called the confidence of the association rule A → B.
General explanation: In short, confidence is the probability that itemset B also appears in a transaction in which itemset A appears.
Example: In the ballpoint pen and notebook example above, the confidence of the association rule answers the question: if a customer buys a ballpoint pen, how likely is he to also buy a notebook? In that example, 65% of the customers who bought a ballpoint pen also bought a notebook, so the confidence is 65%.
Probability notation: confidence(A => B) = P(B | A)
2. Support
Definition: If s% of the transactions in W support both itemset A and itemset B, then s% is called the support of the association rule A → B. Support describes the probability that itemsets A and B appear together in all transactions.
General explanation: In short, the support of A => B is the probability that itemset A and itemset B appear at the same time.
Example: Suppose 1000 customers came to the store one day, of whom 150 bought both a ballpoint pen and a notebook; then the support of the above association rule is 15%.
Probability notation: support(A => B) = P(A ∩ B)
3. Expected confidence
Definition: If e% of the transactions in W support itemset B, then e% is called the expected confidence of the association rule A → B.
General explanation: Expected confidence describes the probability that itemset B appears in all transactions, without any conditioning on A.
Example: If 1000 customers came to the store and 250 of them bought a notebook, then the expected confidence of the association rule is 25%.
Probability notation: the expected confidence of A => B is support(B) = P(B)
4. Lift
Definition: Lift is the ratio of confidence to expected confidence.
General explanation: Lift reflects how much the presence of itemset A changes the probability that itemset B appears.
Example: the lift of the above association rule = 65% / 25% = 2.6
Probability notation: lift(A => B) = confidence(A => B) / support(B) = P(B | A) / P(B)
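The four measures above can all be computed directly from transaction counts. A minimal sketch in Python (the toy transaction data and the function name are illustrative, not taken from the article):

```python
transactions = [
    {"pen", "notebook"},
    {"pen", "notebook", "eraser"},
    {"pen"},
    {"notebook"},
    {"eraser"},
]

def measures(transactions, A, B):
    """Compute support, confidence, expected confidence, and lift for A -> B."""
    n = len(transactions)
    n_A = sum(1 for t in transactions if A <= t)         # transactions containing A
    n_B = sum(1 for t in transactions if B <= t)         # transactions containing B
    n_AB = sum(1 for t in transactions if (A | B) <= t)  # containing both A and B
    support = n_AB / n                    # P(A ∩ B): both itemsets occur
    confidence = n_AB / n_A               # P(B | A)
    expected_confidence = n_B / n         # P(B)
    lift = confidence / expected_confidence
    return support, confidence, expected_confidence, lift

s, c, e, l = measures(transactions, {"pen"}, {"notebook"})
```

On this toy data the rule {pen} → {notebook} has support 0.4, confidence 2/3, expected confidence 0.6, and lift ≈ 1.11; a lift above 1 means buying a pen raises the chance of buying a notebook.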
In short, confidence measures the accuracy of an association rule, while support measures its importance. Support indicates how representative the rule is among all transactions: obviously, the higher the support, the more important the rule. Some rules have high confidence but low support, which means the rule rarely gets a chance to apply, so it is not important.
In association rule mining, an itemset whose support meets a given minimum support threshold is called a frequent itemset, and a rule that meets both the minimum support and the minimum confidence is called a strong rule. Association rule mining is essentially the process of searching for frequent itemsets.
Related algorithms for association rule mining
1. Apriori algorithm: using candidate itemsets to find frequent itemsets
The Apriori algorithm is the most influential algorithm for mining the frequent itemsets of Boolean association rules. Its core is a level-wise, iterative method based on the two-phase frequent-set idea; in terms of classification, the rules it finds are single-dimensional, single-level Boolean association rules. Here, all itemsets whose support is no less than the minimum support are called frequent itemsets.
The basic idea of the algorithm is as follows. First, find all frequent itemsets; their frequency must be at least the predefined minimum support. Then generate strong association rules from the frequent itemsets; these rules must satisfy both the minimum support and the minimum confidence. For each frequent itemset found in the first step, generate all rules that contain only items of that set, with exactly one item on the right-hand side of each rule. Once these rules are generated, only those whose confidence exceeds the user-specified minimum are kept. To find all frequent itemsets, a level-wise iterative method is used: the frequent (k-1)-itemsets are used to generate the candidate k-itemsets.
A large number of candidate sets may be generated, and the database may need to be scanned repeatedly. These are two major disadvantages of the Apriori algorithm.
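The candidate-generation idea described above can be sketched in a few lines of Python (an illustrative, unoptimized implementation, not the original paper's code):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frequent itemset: support} for all itemsets meeting min_support."""
    n = len(transactions)
    # frequent 1-itemsets
    items = {i for t in transactions for i in t}
    freq = {frozenset([i]) for i in items
            if sum(1 for t in transactions if i in t) / n >= min_support}
    all_freq = {}
    k = 1
    while freq:
        for s in freq:
            all_freq[s] = sum(1 for t in transactions if s <= t) / n
        k += 1
        # join step: union pairs of frequent (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in freq
                             for sub in combinations(c, k - 1))}
        # support counting requires another scan of the database
        freq = {c for c in candidates
                if sum(1 for t in transactions if c <= t) / n >= min_support}
    return all_freq
```

The repeated scans in the support-counting step are exactly the weakness noted above.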
2. Partition-based algorithms
Savasere et al. designed a partition-based algorithm. The algorithm first logically divides the database into several pairwise disjoint blocks. Each block is considered separately and all frequent itemsets local to that block are generated; these local frequent itemsets are then merged to form the set of all possible global frequent itemsets, and finally the actual support of these itemsets is computed. The block size is chosen so that each block fits into main memory, and each phase needs to scan each block only once. The correctness of the algorithm rests on the fact that every possibly frequent itemset is locally frequent in at least one block. The algorithm is highly parallelizable: each block can be assigned to a processor to generate frequent itemsets. After each round of frequent-set generation, the processors communicate to produce the global candidate k-itemsets. Usually this communication is the main bottleneck of the algorithm's execution time; on the other hand, the time each individual processor spends generating its frequent itemsets is also a bottleneck.
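The two phases can be sketched as follows (the function names and the brute-force local miner are illustrative only; a real implementation would use Apriori or similar within each block):

```python
from itertools import combinations

def local_frequent(block, min_support, max_len=3):
    """Brute-force the locally frequent itemsets of one block (illustrative)."""
    n = len(block)
    items = sorted({i for t in block for i in t})
    found = set()
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            if sum(1 for t in block if s <= t) / n >= min_support:
                found.add(s)
    return found

def partition_mine(transactions, min_support, num_blocks=2):
    n = len(transactions)
    size = -(-n // num_blocks)  # ceiling division
    blocks = [transactions[i:i + size] for i in range(0, n, size)]
    # phase 1: any globally frequent itemset is locally frequent in >= 1 block,
    # so the union of the local results is a complete global candidate set
    candidates = set()
    for block in blocks:
        candidates |= local_frequent(block, min_support)
    # phase 2: one final scan computes the exact global support of each candidate
    result = {}
    for c in candidates:
        support = sum(1 for t in transactions if c <= t) / n
        if support >= min_support:
            result[c] = support
    return result
```

In the parallel version, each block (and hence each call to the local miner) would run on its own processor.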
3. FP-tree frequent-set algorithm
To address the inherent drawbacks of the Apriori algorithm, J. Han et al. proposed a method for mining frequent itemsets without candidate generation: the FP-tree frequent-set algorithm (FP-growth). It adopts a divide-and-conquer strategy: after the first scan, the frequent items in the database are compressed into a frequent-pattern tree (FP-tree) while the association information is preserved; the FP-tree is then decomposed into several conditional databases, each associated with one frequent item of length 1, and these conditional databases are mined separately. When the raw data volume is very large, this can be combined with the partitioning method so that each FP-tree fits into main memory. Experiments show that FP-growth adapts well to rules of different lengths, and its efficiency is greatly improved compared with the Apriori algorithm.
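Construction of the tree itself can be sketched as follows (mining via conditional pattern bases is omitted; the class and function names are illustrative):

```python
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support_count):
    # first scan: count single items and keep only the frequent ones
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= min_support_count}
    # second scan: insert each transaction with its items ordered by
    # descending global frequency, so shared prefixes collapse into one path
    root = FPNode(None, None)
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for i in ordered:
            node = node.children.setdefault(i, FPNode(i, node))
            node.count += 1
    return root, frequent
```

Because frequent prefixes are shared, the tree is usually far smaller than the raw transaction list, which is what makes the subsequent mining efficient.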