Data Mining Algorithms: Association Rules (Market Basket Analysis)

Among the many data mining algorithms, association rule mining is an important one, popularized above all by market basket analysis, and association rules are applied in many real businesses. This article gives a short summary of association rule mining. First, like clustering algorithms, association rule mining is an unsupervised learning method: it describes patterns of items that appear together within a single transaction. In real life, supermarket purchase records often imply many association rules. For example, if 65% of the customers who buy ballpoint pens also buy notebooks, mall staff can use this knowledge to plan product placement. To make this precise, let R = {I1, I2, ..., Im} be a set of items and W a set of transactions, where every transaction T in W is a set of items with T ⊆ R. An association rule is an implication of the form A → B, where A and B are itemsets with A ⊆ R, B ⊆ R, and A ∩ B = ∅. Four key indicators are commonly used to evaluate association rules.

1. Confidence

Definition: Among the transactions in W that support itemset A, if c% also support itemset B, then c% is called the confidence of the association rule A → B.

General explanation: In short, confidence is the probability that itemset B also appears in a transaction given that the transaction contains itemset A.

Example: In the ballpoint pen and notebook example above, the confidence of the association rule answers the question: if a customer buys a ballpoint pen, how likely is he to also buy a notebook? In that example, 65% of the customers who buy a ballpoint pen also buy a notebook, so the confidence is 65%.

Probability description: confidence(A ⇒ B) = P(B | A)
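
To make the definition concrete, here is a minimal sketch in Python that computes the confidence of a rule over a small transaction list; the transactions and item names are invented purely for illustration.

# Minimal sketch: confidence(A => B) = P(B | A) on invented toy data.
transactions = [
    {"pen", "notebook", "eraser"},
    {"pen", "notebook"},
    {"pen", "ruler"},
    {"notebook", "ruler"},
    {"pen", "notebook", "ruler"},
]

def confidence(transactions, a, b):
    """Fraction of transactions containing itemset a that also contain itemset b."""
    with_a = [t for t in transactions if a <= t]  # a <= t: a is a subset of t
    if not with_a:
        return 0.0
    return sum(1 for t in with_a if b <= t) / len(with_a)

print(confidence(transactions, {"pen"}, {"notebook"}))  # 3 of 4 pen baskets -> 0.75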

2. Support

Definition: If s% of the transactions in W support both itemset A and itemset B, then s% is the support of the association rule A → B. Support describes the probability that the union of itemsets A and B appears across all transactions.

General explanation: Put simply, the support of A ⇒ B is the probability that itemsets A and B appear together in the same transaction.

Example: One day, 1,000 customers came to the mall to buy items, of whom 150 bought both ballpoint pens and notebooks, so the support of the above association rule is 15%.

Probability description: support(A ⇒ B) = P(A ∪ B), where A ∪ B denotes the itemset containing every item of both A and B.
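
As a matching sketch (same invented toy data as above), support is simply the fraction of all transactions that contain every item of both sets.

# Minimal sketch: support(A => B) = fraction of baskets containing A and B.
transactions = [
    {"pen", "notebook", "eraser"},
    {"pen", "notebook"},
    {"pen", "ruler"},
    {"notebook", "ruler"},
    {"pen", "notebook", "ruler"},
]

def support(transactions, a, b):
    """Fraction of transactions containing the itemset union of a and b."""
    both = a | b  # all items of A and of B
    return sum(1 for t in transactions if both <= t) / len(transactions)

print(support(transactions, {"pen"}, {"notebook"}))  # 3 of 5 baskets -> 0.6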

3. Expected confidence

Definition: If e% of the transactions in W support itemset B, then e% is the expected confidence of the association rule A → B.

General explanation: Expected confidence describes how likely itemset B is to appear across all transactions, without conditioning on anything.

Example: If 1,000 customers go to the mall to purchase items one day, and 250 of them purchase notebooks, the expected confidence of the above association rule is 25%.

Probability description: the expected confidence of A ⇒ B is support(B) = P(B)

4. Lift

Definition: Lift is the ratio of confidence to expected confidence.

General explanation: Lift reflects how much the presence of itemset A changes the probability that itemset B appears; a lift greater than 1 means buying A makes B more likely.

Example: The lift of the above association rule = 65% / 25% = 2.6.

Probability description: lift(A ⇒ B) = confidence(A ⇒ B) / support(B) = P(B | A) / P(B)
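
Putting the four indicators together, here is a minimal sketch on the same invented toy data; the last line reproduces the article's lift arithmetic.

# Minimal sketch: all four indicators on invented toy data.
transactions = [
    {"pen", "notebook", "eraser"},
    {"pen", "notebook"},
    {"pen", "ruler"},
    {"notebook", "ruler"},
    {"pen", "notebook", "ruler"},
]

def frac(pred):
    """Fraction of transactions satisfying a predicate."""
    return sum(1 for t in transactions if pred(t)) / len(transactions)

a, b = {"pen"}, {"notebook"}
support_ab = frac(lambda t: a <= t and b <= t)    # P(A and B together)
confidence = support_ab / frac(lambda t: a <= t)  # P(B | A)
expected_confidence = frac(lambda t: b <= t)      # P(B)
lift = confidence / expected_confidence           # P(B | A) / P(B)
print(support_ab, confidence, expected_confidence, lift)

# The article's example: confidence 65%, expected confidence 25%.
print(0.65 / 0.25)  # lift = 2.6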

In short, confidence measures the accuracy of an association rule, and support measures its importance. Support indicates how representative the rule is across all transactions: clearly, the higher the support, the more important the rule. Some rules have high confidence but very low support, which means the rule rarely gets a chance to apply, so it is not important.

In association rule mining, an itemset that meets a given minimum support is called a frequent itemset, and a rule that meets both the minimum support and the minimum confidence is called a strong association rule. Association rule mining is, at its core, a search for frequent itemsets.

Algorithms for association rule mining

1. Apriori algorithm: using candidate itemsets to find frequent itemsets

The Apriori algorithm is the most influential algorithm for mining frequent itemsets for Boolean association rules. Its core is a level-wise, iterative search based on the two-phase frequent-itemset idea; the rules it produces are, in classification terms, single-dimensional, single-level Boolean association rules. Here, every itemset whose support is no lower than the minimum support is called a frequent itemset.

The basic idea of the algorithm is: first, find all frequent itemsets, i.e., those whose support is at least the predefined minimum support, working level by level so that each pass builds candidates from the frequent itemsets of the previous pass. Then generate strong association rules from the frequent itemsets; such rules must satisfy both the minimum support and the minimum confidence. Concretely, the itemsets found in the first step are used to produce candidate rules containing only items from those sets, with a single item on the right-hand side of each rule, and only the rules whose confidence is no lower than the user-given minimum confidence are kept (see the sketch below).
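
The following is a compact Apriori sketch in Python illustrating level-wise candidate generation and support pruning; the toy transactions and the min_support threshold are invented, and a real implementation would add rule generation and many optimizations.

from itertools import combinations

# Minimal Apriori sketch: level-wise candidate generation plus support pruning.
transactions = [
    {"pen", "notebook", "eraser"},
    {"pen", "notebook"},
    {"pen", "ruler"},
    {"notebook", "ruler"},
    {"pen", "notebook", "ruler"},
]
min_support = 0.4  # an itemset must appear in at least 40% of transactions

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Pass 1: frequent 1-itemsets.
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Pass k: join frequent (k-1)-itemsets into k-item candidates, prune, count.
k = 2
while frequent[-1]:
    candidates = {x | y for x in frequent[-1] for y in frequent[-1] if len(x | y) == k}
    # Apriori property: every (k-1)-subset of a frequent k-itemset is frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent[-1] for s in combinations(c, k - 1))}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level in frequent:
    for itemset in sorted(level, key=sorted):
        print(set(itemset), support(itemset))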

A large number of candidate sets may be generated, and the database may need to be scanned repeatedly. These are two major disadvantages of the Apriori algorithm.

2. Partition-based algorithm

Savasere et al. designed a partition-based algorithm. It first logically divides the database into several pairwise disjoint blocks. Each block is considered separately and all the frequent itemsets local to that block are generated; these local frequent itemsets are then merged to form the set of all possible global candidates, and finally the actual support of these candidates is counted. The block size is chosen so that each block fits into main memory, and each phase scans the database only once. Correctness follows from the fact that every globally frequent itemset must be frequent in at least one block. The algorithm parallelizes well: each block can be assigned to a processor that generates its local frequent itemsets, and after each round of frequent-itemset generation the processors communicate to produce the global candidate k-itemsets. Usually this communication is the main bottleneck of the execution time; on the other hand, the time each independent processor needs to generate its local frequent itemsets is also a bottleneck.
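
A minimal sketch of the partition idea (data, block size, and threshold invented for illustration): mine each block locally, take the union of the local results as the global candidates, then verify them in one counting pass.

from itertools import combinations

# Minimal partition sketch: local mining per block, then one global count pass.
transactions = [
    {"pen", "notebook"}, {"pen", "ruler"}, {"pen", "notebook"},
    {"notebook", "ruler"}, {"pen", "notebook", "ruler"}, {"ruler"},
]
min_support = 0.5  # as a fraction, applied per block and globally
block_size = 3     # chosen so a block fits in main memory

def local_frequent(block):
    """Brute-force local mining; a real system would run Apriori per block."""
    items = sorted({i for t in block for i in t})
    found = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            c = frozenset(combo)
            if sum(1 for t in block if c <= t) / len(block) >= min_support:
                found.add(c)
    return found

# Phase 1: mine each block independently (parallelizable across processors).
blocks = [transactions[i:i + block_size] for i in range(0, len(transactions), block_size)]
candidates = set().union(*(local_frequent(b) for b in blocks))

# Phase 2: one full scan counts the true global support of every candidate.
n = len(transactions)
frequent = {c for c in candidates
            if sum(1 for t in transactions if c <= t) / n >= min_support}
print(sorted(map(set, frequent), key=sorted))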

3. FP-growth: a frequent-itemset algorithm based on the FP-tree

To address the inherent drawbacks of the Apriori algorithm, J. Han et al. proposed a method that mines frequent itemsets without candidate generation: the FP-growth algorithm. It adopts a divide-and-conquer strategy: after the first scan, the frequent items of the database are compressed into a frequent-pattern tree (FP-tree) that retains the association information; the FP-tree is then split into a set of conditional pattern bases, each associated with one frequent item of length 1, and each conditional pattern base is mined separately. When the raw data volume is very large, this can be combined with the partitioning method so that each FP-tree fits in main memory. Experiments show that FP-growth adapts well to rules of different lengths, and its efficiency is greatly improved over that of the Apriori algorithm.
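
As a usage sketch, FP-growth and the indicators above are available off the shelf in the third-party mlxtend library (pip install mlxtend); the data, column names, and thresholds below are illustrative.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Invented baskets for illustration.
transactions = [
    ["pen", "notebook", "eraser"],
    ["pen", "notebook"],
    ["pen", "ruler"],
    ["notebook", "ruler"],
    ["pen", "notebook", "ruler"],
]

# One-hot encode the baskets into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Mine frequent itemsets without candidate generation, then derive rules.
freq = fpgrowth(df, min_support=0.4, use_colnames=True)
rules = association_rules(freq, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])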
