Frequent Pattern Mining: Apriori


Original address: http://blog.sina.com.cn/s/blog_6a17628d0100v83b.html


1. Mining Association Rules

1.1 What is an association rule

In a nutshell, an association rule is an implication of the form X → Y, meaning that Y can be inferred from the presence of X. X and Y are called, respectively, the antecedent (left-hand side, LHS) and the consequent (right-hand side, RHS) of the rule.

1.2 How to quantify association rules

A typical application of association rule mining is market basket analysis. Mining association rules from shopping-cart data reveals relationships between the different items customers buy and sheds light on their purchasing habits. Such rules help sellers understand which products are frequently purchased together and develop better marketing strategies. For example, goods that are often bought together can be placed side by side to further stimulate their joint sale, or placed far apart so that customers buying both are induced to pick up other items along the way.

In data mining, the concepts of support and confidence are usually used to quantify the association rules between items. They reflect, respectively, the usefulness and the certainty of a discovered rule. For example:

Computer => Antivirus_software, with support = 2% and confidence = 60%

This means that 2% of all transactions contain both a computer and antivirus software, and that 60% of the customers who buy a computer also buy antivirus software. In association rule mining, a minimum support threshold and a minimum confidence threshold are usually set; if a rule satisfies both thresholds, it is considered to bring interesting information to the user.

1.3 The association rule mining process

1) Basic concepts:

The support of an association rule A → B is support = P(AB), the probability that events A and B occur together.

The confidence is confidence = P(B|A) = P(AB)/P(A), the probability that event B occurs given that event A has occurred. (A short Java sketch of both measures follows these definitions.)

Rules that meet both the minimum support threshold and the minimum confidence threshold are called strong rules.

If an itemset A contains k items, A is called a k-itemset; a k-itemset that meets the minimum support threshold is called a frequent k-itemset.
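The following minimal Java sketch computes support and confidence for a rule A → B over an in-memory transaction list. It is not from the original post; the class name, transaction data, and item names are illustrative only.

import java.util.*;

public class RuleMetrics {
    public static void main(String[] args) {
        // Hypothetical transactions; item names are illustrative only.
        List<Set<String>> transactions = List.of(
                Set.of("computer", "antivirus"),
                Set.of("computer"),
                Set.of("computer", "antivirus", "mouse"),
                Set.of("mouse"),
                Set.of("computer", "mouse"));

        Set<String> a  = Set.of("computer");               // antecedent A
        Set<String> ab = Set.of("computer", "antivirus");  // A and B together

        long countA  = transactions.stream().filter(t -> t.containsAll(a)).count();
        long countAB = transactions.stream().filter(t -> t.containsAll(ab)).count();

        double support    = (double) countAB / transactions.size(); // P(AB)
        double confidence = (double) countAB / countA;              // P(B|A)
        System.out.printf("support=%.2f, confidence=%.2f%n", support, confidence);
    }
}

Run on these five transactions it prints support=0.40, confidence=0.50: two of the five transactions contain both items, and two of the four computer buyers also bought antivirus software.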

2) Mining Process:

First, find all the frequent itemsets;

Second, generate strong rules from the frequent itemsets.

2. What is Apriori

2.1 Apriori Introduction

The Apriori algorithm exploits prior knowledge of frequent itemsets and uses an iterative, level-wise search in which frequent k-itemsets are used to explore (k+1)-itemsets. First, the transaction records are scanned to find all frequent 1-itemsets, denoted L1; L1 is then used to find the set of frequent 2-itemsets L2, L2 is used to find L3, and so on, until no more frequent k-itemsets can be found. Finally, the strong rules are extracted from all the frequent itemsets, i.e., the association rules of interest to the user are generated.

The Apriori algorithm relies on the property that every non-empty subset of a frequent itemset must also be frequent. The reason: if P(I) is below the minimum support threshold, then for any item A added to I, the resulting itemset I ∪ {A} cannot occur more frequently than I does, so I ∪ {A} cannot be frequent either.

2.2 The join step and the prune step

Of the two steps in the mining process above, the first is usually the bottleneck of overall performance. The Apriori algorithm uses a join step and a prune step to find all the frequent itemsets.

1) Join step

To find Lk (the set of all frequent k-itemsets), a set of candidate k-itemsets, denoted Ck, is generated by joining Lk-1 (the set of all frequent (k-1)-itemsets) with itself. Let l1 and l2 be members of Lk-1, and let li[j] denote the j-th item of li. Apriori assumes that the items within a transaction or itemset are sorted in lexicographic order, i.e., for a (k-1)-itemset li, li[1] < li[2] < ... < li[k-1]. When Lk-1 is joined with itself, l1 and l2 are considered joinable if (l1[1] = l2[1]) && (l1[2] = l2[2]) && ... && (l1[k-2] = l2[k-2]) && (l1[k-1] < l2[k-1]), and the result of joining l1 and l2 is {l1[1], l1[2], ..., l1[k-1], l2[k-1]}.
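A minimal Java sketch of the join step, assuming itemsets are stored as lexicographically sorted lists of item-name strings (the class and method names are illustrative, not from the original post):

import java.util.*;

public class JoinStep {

    // Join Lk-1 with itself: two (k-1)-itemsets are joinable when their first
    // k-2 items agree and the last item of the first precedes the last item of
    // the second; their union then forms a candidate k-itemset.
    static List<List<String>> join(List<List<String>> lk1) {
        List<List<String>> candidates = new ArrayList<>();
        for (List<String> l1 : lk1) {
            for (List<String> l2 : lk1) {
                int n = l1.size();
                if (!l1.subList(0, n - 1).equals(l2.subList(0, n - 1))) continue;
                if (l1.get(n - 1).compareTo(l2.get(n - 1)) >= 0) continue;
                List<String> c = new ArrayList<>(l1);
                c.add(l2.get(n - 1));
                candidates.add(c);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<List<String>> l2 = List.of(
                List.of("I1", "I2"), List.of("I1", "I3"), List.of("I2", "I3"));
        System.out.println(join(l2)); // prints [[I1, I2, I3]]
    }
}

The sort-order condition ensures that each candidate is generated exactly once rather than from several different pairs.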

2) Prune step

Ck is a superset of Lk; that is, a member of Ck may or may not be frequent. By scanning all transactions, the count of each candidate in Ck is determined and compared against the minimum support count; candidates whose count is not below it are frequent. To compress Ck beforehand, the Apriori property can be used: every non-empty subset of a frequent itemset must also be frequent. Conversely, if any (k-1)-subset of a candidate is not frequent, the candidate itself cannot be frequent and can be removed from Ck.

(Tip: why compress Ck? In practice the transaction records are usually stored in external storage, such as a database or a file in some other format, and counting the candidates requires comparing each candidate against all transactions. Since access to external storage is relatively slow, Apriori adds the so-called prune step to filter the candidate set in advance and thus reduce the number of external-memory accesses.)
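A minimal Java sketch of the pruning test, assuming the frequent (k-1)-itemsets are kept in a hash set of sorted lists so that membership checks are constant-time (names are illustrative, not from the original post):

import java.util.*;

public class PruneStep {

    // Apriori property: if any (k-1)-subset of candidate c is not frequent,
    // c itself cannot be frequent and can be dropped before the counting scan.
    static boolean hasInfrequentSubset(List<String> c, Set<List<String>> lk1) {
        for (int i = 0; i < c.size(); i++) {
            List<String> s = new ArrayList<>(c);
            s.remove(i);                       // drop one item: a (k-1)-subset
            if (!lk1.contains(s)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<List<String>> l2 = Set.of(
                List.of("I1", "I2"), List.of("I1", "I3"), List.of("I2", "I3"));
        System.out.println(hasInfrequentSubset(List.of("I1", "I2", "I3"), l2)); // false
        System.out.println(hasInfrequentSubset(List.of("I1", "I2", "I4"), l2)); // true
    }
}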

2.3 Apriori Algorithm Example

Transaction ID    Product ID list
T100              I1, I2, I5
T200              I2, I4
T300              I2, I3
T400              I1, I2, I4
T500              I1, I3
T600              I2, I3
T700              I1, I3
T800              I1, I2, I3, I5
T900              I1, I2, I3

The table above records a store's transactions, 9 in all. Using the Apriori algorithm to find all the frequent itemsets proceeds level by level: with a minimum support count of 2, every single item is frequent, so L1 = {{I1}, {I2}, {I3}, {I4}, {I5}}, and counting pairs yields L2 = {{I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}}.


Consider the generation of the candidate 3-itemset set C3 in detail. From the join step, C3 = {{I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}} (C3 is generated by joining L2 with itself). By the Apriori property, all subsets of a frequent itemset must also be frequent, so four of the candidates, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5} and {I2,I4,I5}, cannot be frequent, because each has a 2-subset that is not in L2. They are therefore removed from C3. Note that because the Apriori algorithm uses a level-wise search, for a candidate k-itemset it is only necessary to check whether its (k-1)-subsets are frequent.
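This walkthrough can be reproduced with a short self-contained Java sketch that joins L2 with itself and then prunes; the helper logic mirrors the join and prune sketches above, and the class name is illustrative:

import java.util.*;

public class C3Example {
    public static void main(String[] args) {
        List<List<String>> l2 = List.of(
                List.of("I1", "I2"), List.of("I1", "I3"), List.of("I1", "I5"),
                List.of("I2", "I3"), List.of("I2", "I4"), List.of("I2", "I5"));
        Set<List<String>> l2set = new HashSet<>(l2);

        // Join step: first items agree, last item of a precedes last item of b.
        List<List<String>> c3 = new ArrayList<>();
        for (List<String> a : l2)
            for (List<String> b : l2)
                if (a.get(0).equals(b.get(0)) && a.get(1).compareTo(b.get(1)) < 0)
                    c3.add(List.of(a.get(0), a.get(1), b.get(1)));
        System.out.println("joined: " + c3); // the six candidates listed above

        // Prune step: drop any candidate with an infrequent 2-subset.
        c3.removeIf(c -> {
            for (int i = 0; i < c.size(); i++) {
                List<String> s = new ArrayList<>(c);
                s.remove(i);
                if (!l2set.contains(s)) return true;
            }
            return false;
        });
        System.out.println("pruned: " + c3); // [[I1, I2, I3], [I1, I2, I5]]
    }
}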

3. Apriori Pseudocode

Algorithm: Apriori

Input: D, a transaction database; min_sup, the minimum support count threshold

Output: L, the frequent itemsets in D

Method:

L1 = find_frequent_1-itemsets(D);          // find all frequent 1-itemsets
for (k = 2; Lk-1 != empty; k++) {
    Ck = apriori_gen(Lk-1);                // generate candidates and prune
    for each transaction t in D {          // scan D to count the candidates
        Ct = subset(Ck, t);                // the candidates contained in t
        for each candidate c in Ct
            c.count++;
    }
    Lk = { c in Ck | c.count >= min_sup }
}
return L = the union of all Lk;

procedure apriori_gen(Lk-1: frequent (k-1)-itemsets)
    for each itemset l1 in Lk-1
        for each itemset l2 in Lk-1
            if (l1[1] = l2[1]) && (l1[2] = l2[2]) && ... &&
               (l1[k-2] = l2[k-2]) && (l1[k-1] < l2[k-1]) then {
                c = l1 join l2;            // join step: generate candidate
                if has_infrequent_subset(c, Lk-1) then
                    delete c;              // prune step: remove infrequent candidate
                else
                    add c to Ck;
            }
    return Ck;

procedure has_infrequent_subset(c: candidate k-itemset; Lk-1: frequent (k-1)-itemsets)
    for each (k-1)-subset s of c
        if s not in Lk-1 then
            return true;
    return false;

4. Generating association rules from frequent itemsets

confidence(A → B) = P(B|A) = support_count(AB) / support_count(A)

The association rules are then generated as follows:

1) For each frequent itemset l, generate all of its non-empty proper subsets;

2) For each non-empty proper subset s, if support_count(l) / support_count(s) >= min_conf, output the rule s → (l - s), where min_conf is the minimum confidence threshold.

For example, which association rules can be generated from the frequent itemset {I1, I2, I5} in the example above? Its non-empty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2} and {I5}, and the corresponding confidences are as follows:

I1 && I2 -> I5, confidence = 2/4 = 50%

I1 && I5 -> I2, confidence = 2/2 = 100%

I2 && I5 -> I1, confidence = 2/2 = 100%

I1 -> I2 && I5, confidence = 2/6 = 33%

I2 -> I1 && I5, confidence = 2/7 = 29%

I5 -> I1 && I2, confidence = 2/2 = 100%

If min_conf = 70%, the strong rules are I1 && I5 -> I2, I2 && I5 -> I1, and I5 -> I1 && I2.
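A minimal Java sketch of this rule-generation step for the frequent itemset {I1, I2, I5}, assuming the support counts gathered during mining are available in a map (the class name and data layout are illustrative, not from the original post); run as-is it prints exactly the three strong rules listed above:

import java.util.*;

public class RuleGen {
    public static void main(String[] args) {
        // Support counts taken from the running example (9 transactions).
        Map<Set<String>, Integer> count = Map.of(
                Set.of("I1", "I2", "I5"), 2,
                Set.of("I1", "I2"), 4,
                Set.of("I1", "I5"), 2,
                Set.of("I2", "I5"), 2,
                Set.of("I1"), 6,
                Set.of("I2"), 7,
                Set.of("I5"), 2);

        List<String> l = List.of("I1", "I2", "I5"); // a frequent itemset
        double minConf = 0.7;                        // minimum confidence

        // Enumerate the non-empty proper subsets of l via bitmasks.
        for (int mask = 1; mask < (1 << l.size()) - 1; mask++) {
            Set<String> s = new HashSet<>();
            for (int i = 0; i < l.size(); i++)
                if ((mask & (1 << i)) != 0) s.add(l.get(i));

            double conf = (double) count.get(new HashSet<>(l)) / count.get(s);
            if (conf >= minConf) {
                Set<String> rhs = new HashSet<>(l);
                rhs.removeAll(s);
                System.out.printf("%s -> %s  (confidence %.0f%%)%n", s, rhs, conf * 100);
            }
        }
    }
}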

5. Apriori Java code

package com.apriori;
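import java.util.*;

// Only the package declaration of the original post's listing survives; what
// follows is a minimal self-contained sketch of the algorithm, not the
// author's original code. Assumptions: itemsets are lexicographically sorted
// lists of item names, the database is an in-memory list of item sets, and
// minSup is an absolute support count.
public class Apriori {

    // Returns every frequent itemset in db with support count >= minSup.
    public static List<List<String>> run(List<Set<String>> db, int minSup) {
        List<List<String>> result = new ArrayList<>();

        // L1: count single items (TreeMap keeps them in sorted order).
        Map<String, Integer> itemCounts = new TreeMap<>();
        for (Set<String> t : db)
            for (String item : t)
                itemCounts.merge(item, 1, Integer::sum);
        List<List<String>> lk = new ArrayList<>();
        for (Map.Entry<String, Integer> e : itemCounts.entrySet())
            if (e.getValue() >= minSup)
                lk.add(List.of(e.getKey()));

        // Level-wise search: use Lk-1 to build Ck, count, keep the frequent ones.
        while (!lk.isEmpty()) {
            result.addAll(lk);
            List<List<String>> ck = aprioriGen(lk);     // join + prune
            Map<List<String>, Integer> counts = new HashMap<>();
            for (Set<String> t : db)                     // one scan of D per level
                for (List<String> c : ck)
                    if (t.containsAll(c))
                        counts.merge(c, 1, Integer::sum);
            List<List<String>> next = new ArrayList<>();
            for (List<String> c : ck)
                if (counts.getOrDefault(c, 0) >= minSup)
                    next.add(c);
            lk = next;
        }
        return result;
    }

    // Join step plus prune step: candidate k-itemsets from frequent (k-1)-itemsets.
    static List<List<String>> aprioriGen(List<List<String>> lk1) {
        Set<List<String>> lookup = new HashSet<>(lk1);
        List<List<String>> ck = new ArrayList<>();
        for (List<String> l1 : lk1)
            for (List<String> l2 : lk1) {
                int n = l1.size();
                if (!l1.subList(0, n - 1).equals(l2.subList(0, n - 1))) continue;
                if (l1.get(n - 1).compareTo(l2.get(n - 1)) >= 0) continue;
                List<String> c = new ArrayList<>(l1);
                c.add(l2.get(n - 1));
                if (!hasInfrequentSubset(c, lookup))
                    ck.add(c);
            }
        return ck;
    }

    // Prune test: true if some (k-1)-subset of c is not frequent.
    static boolean hasInfrequentSubset(List<String> c, Set<List<String>> lk1) {
        for (int i = 0; i < c.size(); i++) {
            List<String> s = new ArrayList<>(c);
            s.remove(i);
            if (!lk1.contains(s)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // The nine transactions from the example in section 2.3, min_sup = 2
        // (consistent with the worked example above).
        List<Set<String>> db = List.of(
                Set.of("I1", "I2", "I5"), Set.of("I2", "I4"), Set.of("I2", "I3"),
                Set.of("I1", "I2", "I4"), Set.of("I1", "I3"), Set.of("I2", "I3"),
                Set.of("I1", "I3"), Set.of("I1", "I2", "I3", "I5"),
                Set.of("I1", "I2", "I3"));
        System.out.println(run(db, 2));
    }
}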
