Association Rule Mining Algorithms of Data Mining (1)---The Apriori Algorithm


Association rule mining is applied everywhere in daily life; you can see it on almost every e-commerce website.

To give a simple example:

On Dangdang, for instance, when you browse a book, the page shows bundle recommendations: book + related book 1 + related book 2 + ... + other items = a combined price in ¥.

These bundles are likely to suit your taste, and you may well end up buying the whole bundle because of the recommendation.


This is different from UserCF and ItemCF: those two recommend similar items, or a list of products you may like.

Association rule mining instead asks whether n products are often bought together. If they are, then whenever one of those n products is being browsed (a signal of intent to buy), isn't it appropriate for the system to recommend the other n-1 products to this user? After all, many other users who bought this product also bought the other n-1 products. Turning the n products into a bundle offer is a natural way to promote consumption.

The relationship among these n items (often purchased together by users) is an association rule.


Here we introduce a relatively simple association rule mining algorithm---Apriori.

First, let's introduce a few technical terms.

Mining dataset: the collection of data to be mined. This one is easy to understand.

Frequent patterns: patterns that occur frequently in the mining dataset, such as itemsets, sub-structures, sub-sequences, and so on. In short, a frequent pattern is a subset of the mining dataset that appears many times.

Association rules: for example, milk => eggs {support = 2%, confidence = 60%}. An association rule expresses the relationship between items A and B in terms of support and confidence (and of course not only between two items but also among n items), and the threshold values chosen for support and confidence affect the performance of the whole algorithm.

Support: in the example above, the support indicates the percentage of users who purchased milk and eggs together. Support has a predefined minimum value (2% in the example above); if the computed support is below this minimum, then {milk, eggs} cannot become a frequent pattern.

Confidence: in the example above, the confidence indicates the percentage of users who purchased milk and also bought eggs. As with support, confidence has a minimum value (60% in the above example, meaning 60% of the users who purchased milk also purchased eggs); if the computed confidence is below this minimum, the rule milk => eggs is discarded.

Support and confidence can also be expressed as absolute counts rather than percentages; the worked example below uses counts.
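
To make the two measures concrete, here is a minimal Python sketch; the five baskets are invented purely for illustration and are not part of the worked example below.

```python
# Minimal sketch of support and confidence; the baskets are invented for illustration.
baskets = [
    {"milk", "eggs", "bread"},
    {"milk", "eggs"},
    {"milk", "bread"},
    {"eggs", "butter"},
    {"milk"},
]

both = sum(1 for b in baskets if {"milk", "eggs"} <= b)  # baskets with milk AND eggs
milk = sum(1 for b in baskets if "milk" in b)            # baskets with milk

support = both / len(baskets)  # share of all baskets containing both items
confidence = both / milk       # share of milk buyers who also bought eggs

print(f"support = {support:.0%}")        # 40%
print(f"confidence = {confidence:.0%}")  # 50%
```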


The basic idea of the Apriori algorithm: every subset of a frequent pattern with n items must itself be a frequent pattern. Equivalently, a candidate containing any infrequent subset cannot be frequent and can be pruned.
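
This property is what makes pruning possible later on. A minimal sketch of the subset check (the function name has_infrequent_subset is my own choice):

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Apriori property: if any (k-1)-subset of a k-item candidate is not
    frequent, the candidate itself cannot be frequent and can be pruned."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(candidate, k - 1))

# Previewing step (6) below: {I3, I5} is not frequent, so {I1, I3, I5} is pruned.
L2_example = {frozenset(p) for p in [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
                                     ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]}
print(has_infrequent_subset(frozenset({"I1", "I3", "I5"}), L2_example))  # True
```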

Let's look at an example using shopping cart data. TID is the number of the shopping cart, each row lists the items in that cart, the products are I1 through I5, and D denotes the entire data table (this is the classic example from the reference book; all counts below are computed from it):

TID    Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3
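
For the sketches that follow, here is the same table as a Python structure; the names D and MIN_SUPPORT are mine:

```python
# The transaction table D above as Python sets; MIN_SUPPORT is the
# minimum support count (2) used throughout the example.
D = [
    {"I1", "I2", "I5"},        # T100
    {"I2", "I4"},              # T200
    {"I2", "I3"},              # T300
    {"I1", "I2", "I4"},        # T400
    {"I1", "I3"},              # T500
    {"I2", "I3"},              # T600
    {"I1", "I3"},              # T700
    {"I1", "I2", "I3", "I5"},  # T800
    {"I1", "I2", "I3"},        # T900
]
MIN_SUPPORT = 2
```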


The working process of the Apriori algorithm is as follows:


(1) First scan the entire data table D and compute the support (number of occurrences) of each product to obtain the candidate table C1. Here every individual product is treated as a candidate frequent pattern and its support is counted.

(2) Compare each product's support with the minimum support count (here, 2) and filter out any product whose count is below 2 to get L1. In this example every product has a support of at least 2, so all of them are retained.
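
Steps (1) and (2) as a sketch, reusing D and MIN_SUPPORT from above:

```python
from collections import Counter

# Step (1): one scan of D counts every single item -- this is C1.
C1 = Counter(item for transaction in D for item in transaction)

# Step (2): keep only items meeting the minimum support count -- this is L1.
L1 = {frozenset([item]): count for item, count in C1.items()
      if count >= MIN_SUPPORT}
# With the table above: I1:6, I2:7, I3:6, I4:2, I5:2 -- every item survives.
```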

(3) Perform the natural join of L1 with itself to obtain the candidate table C2. That is, compute L1 * L1: form all arrangements of the items in L1 and remove duplicate rows (such as {I1,I1}, {I2,I2}, and so on). Each row in C2 consists of two items.
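
For two-item candidates the self-join amounts to taking every unordered pair of frequent single items; a sketch, reusing L1 from above:

```python
from itertools import combinations

# Step (3): the self-join L1 * L1 for k=2 is every unordered pair of
# frequent single items; using frozensets removes duplicates like {I1, I1}.
items = sorted(item for itemset in L1 for item in itemset)
C2 = [frozenset(pair) for pair in combinations(items, 2)]
# C(5, 2) = 10 candidate pairs.
```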

(4) Scan the entire table D again to compute the support of each row in C2. Each row in C2 (a two-item set) is treated as a candidate frequent pattern when counting its support.

(5) Compare each support in C2 with the minimum support count 2, filter, and get L2.
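
Steps (4) and (5) in the same style, with one more scan of D:

```python
# Step (4): rescan D and count how many carts contain each candidate pair.
counts = {c: sum(1 for t in D if c <= t) for c in C2}

# Step (5): filter by the minimum support count to get L2.
L2 = {c: n for c, n in counts.items() if n >= MIN_SUPPORT}
# Expected: {I1,I2}:4, {I1,I3}:4, {I1,I5}:2, {I2,I3}:4, {I2,I4}:2, {I2,I5}:2
```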

(6) Perform the natural join of L2 with itself to obtain the candidate table C3. The raw result of L2 * L2 is: {I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}. For example, joining {I1,I2} and {I1,I3} yields {I1,I2,I3}: the first n-1 items must be identical (here, I1), and the result is those first n-1 items plus each row's nth item (I2 and I3). So why does C3 finally contain only {I1,I2,I3} and {I1,I2,I5}? Look back at the basic idea of the Apriori algorithm: if the third candidate {I1,I3,I5} were a frequent pattern, all of its subsets would also have to be frequent patterns, but {I3,I5} cannot be found in L2, so {I1,I3,I5} is not a frequent pattern and is filtered out (the other three candidates are pruned the same way). The end result is C3.
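
The join-then-prune step generalizes to any k. A sketch, using the name apriori_gen as in the reference book (the implementation details are my own):

```python
from itertools import combinations

def apriori_gen(Lk_prev, k):
    """Join (k-1)-itemsets that agree on their first k-2 items, then prune
    any candidate that has an infrequent (k-1)-subset (Apriori property)."""
    prev = sorted(tuple(sorted(s)) for s in Lk_prev)
    candidates = []
    for a, b in combinations(prev, 2):
        if a[:k - 2] == b[:k - 2]:  # join condition: first k-2 items identical
            cand = frozenset(a) | frozenset(b)
            if all(frozenset(sub) in Lk_prev
                   for sub in combinations(cand, k - 1)):  # prune step
                candidates.append(cand)
    return candidates

C3 = apriori_gen(set(L2), 3)
# Only {I1,I2,I3} and {I1,I2,I5} survive; the other four join results are pruned.
```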

(7) Scan the entire table D again to compute the support of each row in C3. Each row in C3 (a three-item set) is treated as a candidate frequent pattern when counting its support.

(8) Compare each support in C3 with the minimum support count 2, filter, and get L3.

Because the largest cart in table D has four items and that itemset occurs only once, no four-item candidate can be a frequent pattern, so the calculation can end with the three-item frequent patterns.

The output of the algorithm is the union of the L1, L2, and L3 collections, in which every entry is a frequent pattern.
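
Putting the steps together, a compact sketch of the full loop, reusing apriori_gen from above:

```python
from collections import Counter

def apriori(D, min_support):
    """Alternate candidate generation with a full scan of D until no
    candidates survive; returns every frequent itemset with its count."""
    C1 = Counter(item for t in D for item in t)
    Lk = {frozenset([i]): n for i, n in C1.items() if n >= min_support}
    frequent = dict(Lk)
    k = 2
    while Lk:
        candidates = apriori_gen(set(Lk), k)  # join + prune, defined above
        counts = {c: sum(1 for t in D if c <= t) for c in candidates}
        Lk = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(Lk)
        k += 1
    return frequent

frequent = apriori(D, MIN_SUPPORT)
# L3 = {I1,I2,I3}:2 and {I1,I2,I5}:2; the only 4-item join result
# {I1,I2,I3,I5} is pruned, so no L4 exists and the loop stops.
```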

For example, take the frequent pattern {I1,I2,I3} that we obtained: which association rules can be extracted from it?

{I1,I2} => I3 represents the percentage of users who purchased I1 and I2 and who also purchased I3. {I1,I2,I3} occurs 2 times and {I1,I2} occurs 4 times, so the confidence is 2/4 = 50%.

Similarly, we can calculate the following (the short sketch after the list checks these numbers against D):

{I1,I3} => I2, confidence = 50%

{I2,I3} => I1, confidence = 50%

I1 => {I2,I3}, confidence = 33%

I2 => {I1,I3}, confidence = 28%

I3 => {I1,I2}, confidence = 33%
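
These values can be checked directly against table D; a minimal sketch:

```python
from itertools import combinations

def support_count(itemset):
    """Number of carts in D containing every item of the itemset."""
    return sum(1 for t in D if itemset <= t)

pattern = frozenset({"I1", "I2", "I3"})
n = support_count(pattern)  # 2

for r in (1, 2):  # left-hand sides of size 1 and 2
    for lhs in combinations(sorted(pattern), r):
        lhs = frozenset(lhs)
        rhs = pattern - lhs
        conf = n / support_count(lhs)
        print(f"{sorted(lhs)} => {sorted(rhs)}, confidence = {conf:.1%}")
# I2 => {I1,I3} prints 28.6%, which the list above rounds down to 28%.
```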


That is, when a user buys I1 and I3, the system can recommend I2 to them as part of a bundle, since these three items are frequently purchased together.


However, the walkthrough also exposes the algorithm's weakness: even the simple example above required 3 full scans of the table, and the natural join of L1 with itself explodes when there are many distinct items in the shopping carts (say 100). In that case the join computation becomes enormous, and memory cannot hold such a large set of candidates.

So the Apriori algorithm is rarely used in practice now, but understanding it teaches us a great deal about association rule mining, and it can serve as a baseline for comparing other association rule algorithms, showing which algorithm performs better and where.


Reference book: "Data Mining: Concepts and Techniques"
