Principle Analysis and Code Implementation of the Apriori Association Analysis Algorithm


Reposted from Mu Chen

Contents

    • Preface
    • Some concepts in the field of association analysis
    • Fundamentals of the Apriori algorithm
    • Frequent itemset retrieval: implementation idea and code
    • Association rule learning: implementation idea and code
    • Summary
Preface

Most people have probably heard the classic story from the field of data mining, the story of "beer and diapers."

So how was the relationship between beer and diapers dug out of a massive amount of sales data?

That is exactly the task that association analysis accomplishes.

This article explains the most classical algorithm in the field of association analysis, Apriori, and gives a concrete code implementation.

Some concepts in the field of association analysis

1. Frequent itemset: a collection of items that often appear together in a data set, such as {beer, diapers}.

2. Association rule: an implication that expresses a strong relationship between two itemsets. For example, "{diapers} -> {beer}" is an association rule.

3. Support: the proportion of records in the data set that contain the itemset (in some contexts it is also expressed as a raw count).

4. Confidence: defined with respect to an association rule; it is the ratio of the support of the combined itemset to the support of the rule's antecedent, for example confidence({diapers} -> {beer}) = support({beer, diapers}) / support({diapers}).

Using these terms to restate the beer-and-diapers story: {beer, diapers} is a frequent itemset; "{diapers} -> {beer}" is an association rule; and the likelihood that a customer who buys diapers also buys beer is support({beer, diapers}) / support({diapers}).
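To make these definitions concrete, here is a minimal sketch (my addition, not part of the original article) that computes support and confidence by brute force over a tiny, made-up list of transactions; the item names and numbers below are purely illustrative:

transactions = [{'beer', 'diapers', 'bread'},
                {'diapers', 'milk'},
                {'beer', 'diapers', 'milk'},
                {'bread', 'milk'}]

def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Confidence of the rule antecedent -> consequent.
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({'beer', 'diapers'}))       # 0.5 (2 of the 4 transactions)
print(confidence({'diapers'}, {'beer'}))  # about 0.67 (2 of the 3 diaper transactions)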

What if you want to find, in a huge data set, all the frequent itemsets whose support is greater than 0.8?

Counting everything by brute force is not realistic; the amount of computation is simply too large.

The value of the Apriori association analysis algorithm presented in this article is that it greatly reduces the computation needed to retrieve frequent itemsets, and then retrieves association rules from those frequent itemsets efficiently, greatly reducing the computation that association rule learning would otherwise consume.

Fundamentals of the Apriori algorithm

If {0, 1} is a frequent itemset, then {0} and {1} must also be frequent itemsets.

This proposition is clearly true.

Its contrapositive, "if {0} and {1} are not both frequent itemsets, then {0, 1} is not a frequent itemset," is therefore true as well. This is one of the core ideas of the Apriori algorithm.

Thus, once a set is found not to be a frequent itemset, none of its supersets can be frequent either, so there is no need to waste effort examining them.
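A small sketch of this pruning idea (again my addition, using the same four test transactions that appear later in the article and a minimum support of 0.5): once {4} turns out to be infrequent, every candidate that contains item 4 can be skipped without being counted at all.

dataSet = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
minSupport = 0.5

def support(itemset):
    return sum(itemset <= t for t in dataSet) / len(dataSet)

# Single items that fail the support threshold; here only {4} (support 0.25).
infrequent = {frozenset([i]) for t in dataSet for i in t
              if support({i}) < minSupport}

candidate = frozenset({1, 4})
if any(bad <= candidate for bad in infrequent):
    print('pruned without counting:', set(candidate))
else:
    print('support:', support(candidate))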

After the frequent itemsets have been retrieved, the next step is to retrieve all the required association rules.

If a rule does not meet the minimum confidence requirement, then none of its "sub-rules", the rules whose antecedents are subsets of its antecedent, can meet it either. This is the other part of the core idea of the Apriori algorithm. For example, if {0, 1, 2} -> {3} fails the minimum confidence requirement, then {0, 1} -> {2, 3} and {0} -> {1, 2, 3} must fail it as well.

PS: It is worth pausing to think carefully about how rules are split out of a frequent itemset and what exactly counts as a "subset" of a rule.

In this way, just as in the previous step, association rules can be retrieved from the frequent itemsets efficiently.

The concrete implementation is explained below in two parts: frequent itemset retrieval and association rule learning.

Frequent itemset retrieval: implementation idea and code

A classic implementation approach is the level-wise ("grading") method:

The algorithm alternates between two major steps, filtering and recombination, until all the cases have been examined.

Each round of filtering works on itemsets with a fixed number of elements; that is what "level-wise" means: the itemsets are processed level by level according to how many elements they contain.

After filtering comes recombination.

Filtering has two layers of meaning. The first is that an itemset must actually appear in the data set; that is the first layer of filtering. The second layer is support filtering: only itemsets that meet the minimum support requirement are kept.

Recombination then builds new candidates from the filtered set, with each new candidate containing one more element than the itemsets of the previous round.

Then filtering continues. This repeats until the final round of filtering is complete.

Pseudo-code implementation:

While the number of items in the itemsets is greater than 0:
    Build a list of candidate itemsets of k items
    Scan the data set to check whether each candidate itemset is frequent
    Keep the frequent itemsets, and build a list of candidate itemsets of k+1 items

Pseudo-code for the part that checks whether each candidate itemset is frequent:

For each transaction in the data set:
    For each candidate itemset:
        Check whether the candidate is a subset of the transaction; if so, increase its count
For each candidate itemset:
    If its support meets the minimum support, keep the itemset
Return the kept frequent itemsets (the filter set)

Python implementation and test code:

def loadDataSet():
    """Return the test data set."""
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

#=============================================
# Input:
#   dataSet: the data set
# Output:
#   list of frozensets: the candidate set C1
#=============================================
def createC1(dataSet):
    """Create the candidate set of single items."""
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if not [item] in C1:
                C1.append([item])
    C1.sort()
    # The returned elements are frozensets because they are later used as dict keys.
    return list(map(frozenset, C1))

#=============================================
# Input:
#   D: the data set (list of sets)
#   Ck: the candidate set
#   minSupport: minimum support
# Output:
#   retList: the filter set (frequent itemsets of this level)
#   supportData: support values (used later when mining association rules)
#=============================================
def scanD(D, Ck, minSupport):
    """Obtain the filter set from the candidate set."""
    # Count how many transactions contain each candidate.
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                if can not in ssCnt:
                    ssCnt[can] = 1
                else:
                    ssCnt[can] += 1

    # Build the filter set and the support data.
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItems
        if support >= minSupport:
            retList.insert(0, key)
            supportData[key] = support

    return retList, supportData

#=============================================
# Input:
#   Lk: the filter set of the previous level
#   k: number of elements in the itemsets of the current level
# Output:
#   retList: the candidate set of the current level
#=============================================
def aprioriGen(Lk, k):
    """Build the candidate set from the filter set."""
    # Recombine the filter set to obtain the new candidate set.
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):
            # Note the recombination trick: only merge sets whose first k-2 items match.
            L1 = list(Lk[i])[:k - 2]
            L2 = list(Lk[j])[:k - 2]
            L1.sort()
            L2.sort()
            if L1 == L2:
                retList.append(Lk[i] | Lk[j])

    return retList

#=============================================
# Input:
#   dataSet: the data set
#   minSupport: minimum support
# Output:
#   L: the frequent itemsets, grouped by level
#   supportData: support values (used when mining association rules)
#=============================================
def apriori(dataSet, minSupport=0.5):
    """Find the frequent itemsets and their support values."""
    C1 = createC1(dataSet)
    D = list(map(set, dataSet))
    L1, supportData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    while len(L[k - 2]) > 0:
        Ck = aprioriGen(L[k - 2], k)
        Lk, supK = scanD(D, Ck, minSupport)
        supportData.update(supK)
        L.append(Lk)
        k += 1
    return L, supportData

def main():
    """Apriori frequent itemset retrieval."""
    L, S = apriori(loadDataSet())
    print(L)
    print(S)

if __name__ == '__main__':
    main()

Test results:
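The original output screenshot is not reproduced here. Assuming the listing above is run as-is, main() with the default minimum support of 0.5 should print frequent itemsets equivalent to the following (element ordering within each level may differ between runs):

[[frozenset({1}), frozenset({3}), frozenset({2}), frozenset({5})],
 [frozenset({2, 3}), frozenset({3, 5}), frozenset({2, 5}), frozenset({1, 3})],
 [frozenset({2, 3, 5})],
 []]

The accompanying support dictionary maps each of these itemsets to its support: 0.75 for {2}, {3}, {5} and {2, 5}, and 0.5 for the remaining itemsets.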

Association rule learning: implementation idea and code

The previous part retrieved frequent itemsets from the data set; this part learns association rules from those frequent itemsets.

The previous part proceeded level by level according to the number of elements in the filtered itemsets; this part proceeds level by level according to the number of elements in the right-hand side (the consequent) of the rule.

Note also that rules are only ever generated within a single frequent itemset: the antecedent and consequent of each rule together make up one frequent itemset.
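As a quick illustration of grading by the size of the rule's right-hand side (this snippet is my addition and reuses aprioriGen from the listing above), the candidate consequents of the frequent itemset {2, 3, 5} grow from single items to pairs:

freqSet = frozenset([2, 3, 5])
H1 = [frozenset([item]) for item in freqSet]  # right-hand sides with 1 element
print(H1)   # e.g. [frozenset({2}), frozenset({3}), frozenset({5})]
H2 = aprioriGen(H1, 2)                        # right-hand sides with 2 elements
print(H2)   # e.g. [frozenset({2, 3}), frozenset({2, 5}), frozenset({3, 5})]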

Implementation code:

# This part builds on the previous listing; apriori() and aprioriGen() come from it.

#=============================================
# Input:
#   L: the frequent itemsets, grouped by level
#   supportData: support values
#   minConf: minimum confidence
# Output:
#   bigRuleList: the rule set
#=============================================
def generateRules(L, supportData, minConf=0.7):
    """Learn association rules from the frequent itemsets."""
    bigRuleList = []
    for i in range(1, len(L)):
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]
            if i > 1:
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList

#=============================================
# Input:
#   freqSet: one frequent itemset
#   H: candidate consequents (right-hand sides)
#   supportData: support values
#   brl: the rule list being built
#   minConf: minimum confidence
# Output:
#   prunedH: the consequents that passed the confidence filter
#=============================================
def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    """Confidence filtering."""
    prunedH = []
    for conseq in H:
        conf = supportData[freqSet] / supportData[freqSet - conseq]
        if conf >= minConf:
            brl.append((freqSet - conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH

def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    """Learn association rules from a single frequent itemset."""
    # The number of elements in the rule's right-hand side at this level.
    m = len(H[0])
    if len(freqSet) > (m + 1):
        # Recombine the right-hand sides to get consequents one item larger.
        Hmp1 = aprioriGen(H, m + 1)
        # Rule learning (confidence filtering).
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        if len(Hmp1) > 1:
            # Recurse to learn rules with still larger right-hand sides.
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)

def main():
    """Association rule learning."""
    L, S = apriori(loadDataSet())
    rules = generateRules(L, S)
    print(rules)

if __name__ == '__main__':
    main()

Test results:
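The output screenshot is likewise omitted. Assuming the listings above, main() with the default minimum confidence of 0.7 should report rules equivalent to the following (ordering may vary):

[(frozenset({1}), frozenset({3}), 1.0),
 (frozenset({5}), frozenset({2}), 1.0),
 (frozenset({2}), frozenset({5}), 1.0)]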


The test data are: [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]].

The result means that the rules 1 -> 3, 5 -> 2, and 2 -> 5 each hold with confidence 1; whenever the item on the left appears in a transaction, the item on the right appears as well.

This is clearly consistent with what one would expect from the data.

Summary

1. The Apriori association algorithm is widely used on online shopping sites; a product recommendation system can be built on top of it.

2. However, the Apriori algorithm also has a drawback: frequent itemset retrieval is not fast enough, because every level requires scanning the data set again against the candidate set (even though the candidate set keeps shrinking).

3. To address the problem in point 2, the next article will introduce a more powerful algorithm for discovering frequent itemsets, FP-growth. (PS: it cannot, however, be used to discover association rules.)

