Reposted from Mu Chen: Principle Analysis and Code Implementation of the Apriori Association Analysis Algorithm
Contents
- Preface
- Some concepts in the field of association analysis
- Fundamentals of the Apriori algorithm
- Frequent itemset retrieval: implementation idea and code
- Association rule learning: implementation idea and code
- Summary
Preface
Most people have heard the classic story from the field of data mining: the story of "beer and diapers."
So how was the relationship between beer and diapers dug out of a huge mass of sales records?
That is exactly the task that association analysis sets out to accomplish.
This article explains the most classical algorithm in the field of association analysis, the Apriori algorithm, and gives a concrete code implementation.
Some concepts in the field of association analysis
1. Frequent itemset: a set of items that often appear together in a data set, such as {beer, diaper}.
2. Association rule: a strong relationship between two itemsets. For example, "{beer} and {diaper}" form an association rule.
3. Support: the proportion of records in the data set that contain a given itemset (in some contexts it is also interpreted as a raw count).
4. Confidence: a concept defined with respect to an association rule. It is the ratio of the support of the combined itemset to the support of one of its constituent itemsets, e.g. support({beer, diaper}) / support({diaper}).
Using these terms, the beer-and-diapers story can be restated as follows: {beer, diaper} is a frequent itemset; "{diaper} → {beer}" is an association rule; and the likelihood that a customer who buys diapers also buys beer is the confidence, support({beer, diaper}) / support({diaper}).
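To make these definitions concrete, here is a minimal sketch (not part of the article's code; the helper name support is made up for illustration) that computes support and confidence on the small test data set the article uses later:

# Minimal sketch: support and confidence on the article's toy data set.
dataset = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

def support(itemset):
    '''Fraction of transactions that contain every item in itemset.'''
    hits = sum(1 for t in dataset if set(itemset) <= set(t))
    return hits / float(len(dataset))

print(support({2, 5}))                    # 0.75 -> {2, 5} is frequent at minimum support 0.5
print(support({2, 5}) / support({2}))     # 1.0  -> confidence of the rule {2} -> {5}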
Now suppose you want to find, in a huge data set, all frequent itemsets whose support is greater than 0.8.
Counting every possible itemset by brute force is unrealistic: with N distinct items there are 2^N - 1 candidate itemsets, so the amount of computation explodes.
The significance of the Apriori association analysis algorithm presented in this article is that it greatly reduces the computation needed to find frequent itemsets, and then efficiently derives association rules from those frequent itemsets, which in turn greatly reduces the cost of association rule learning.
Fundamentals of the Apriori algorithm
If {0, 1} is a frequent itemset, then {0} and {1} are also frequent itemsets.
This proposition is clearly true.
Its contrapositive, "if {0} and {1} are not both frequent itemsets, then {0, 1} is not a frequent itemset," is therefore also true. This is one of the core ideas of the Apriori algorithm.
Thus, once a set is found not to be frequent, none of its supersets can be frequent either, so there is no need to waste effort counting them.
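As a minimal sketch of this pruning idea (the prune_candidates helper below is hypothetical, not part of the article's code): on the test data used later, {4} appears in only 1 of 4 transactions (support 0.25), so every superset of {4} can be discarded without ever being counted.

# Minimal sketch of Apriori pruning: drop candidates that contain a known-infrequent itemset.
def prune_candidates(candidates, infrequent):
    '''Keep only candidates that contain no known-infrequent itemset.'''
    return [c for c in candidates if not any(bad <= c for bad in infrequent)]

infrequent = [frozenset([4])]                                  # support({4}) = 0.25 < 0.5
candidates = [frozenset([1, 4]), frozenset([3, 4]), frozenset([1, 3])]
print(prune_candidates(candidates, infrequent))                # only frozenset([1, 3]) survives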
After the frequent itemsets have been retrieved, the next step is to retrieve all the desired association rules.
If a rule does not meet the minimum confidence requirement, then no "subset" of that rule will meet it either. This is another part of the core idea of the Apriori algorithm.
PS: It is worth thinking about how the rules are split and what exactly a "subset" of a rule means. For example, for the frequent itemset {0, 1, 2, 3}, if the rule {0, 1, 2} → {3} fails the confidence threshold, then any rule whose right-hand side contains {3}, such as {0, 1} → {2, 3} or {0} → {1, 2, 3}, must also fail, because its antecedent is a subset of {0, 1, 2} and therefore has support at least as large, giving a confidence that is no higher.
In this way, just as in the previous step, association rules can be retrieved from the frequent itemsets efficiently.
The concrete implementation is explained in two parts: frequent itemset retrieval and association rule learning.
Frequent itemset retrieval: implementation idea and code
A classic implementation approach is the level-wise method:
The algorithm iterates between two big steps, candidate generation and filtering, until all cases have been analyzed.
Each round of candidate generation produces itemsets of a specified size; that is what "level-wise" means: at each level, only itemsets with a given number of elements are considered.
Candidate generation is followed by filtering.
Filtering has two layers of meaning. The first is that an itemset must actually occur in the data set; the second is support filtering: only itemsets that meet the minimum support requirement are kept.
After filtering, new candidates are generated from the filtered set, each containing one more element than the itemsets of the previous round.
Then filtering is applied again, and so on, until the final round of filtering is complete.
Pseudo-code implementation:
While the number of items in the itemsets is greater than 0:
    Build a list of candidate itemsets with k items
    Scan the data set to check whether each candidate itemset is frequent
    Keep the frequent itemsets and build a list of candidate itemsets with k+1 items
Pseudo-code for the part that checks whether each candidate itemset is frequent:
For each transaction in the data set:
    For each candidate itemset:
        Check whether the candidate is a subset of the transaction
        If so, increase the candidate's count
For each candidate itemset:
    If its support meets the minimum support requirement:
        Keep this itemset
Return the list of frequent itemsets (the filtered set)
Python implementation and test code (Python 2):
def loadDataSet():
    'Return the test data'
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

#===================================
# Input:
#     dataSet: data set
# Output:
#     map(frozenset, C1): candidate set
#===================================
def createC1(dataSet):
    'Create the initial candidate set'
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if not [item] in C1:
                C1.append([item])
    C1.sort()
    # The returned elements are of type frozenset, because they are later used as dictionary keys.
    return map(frozenset, C1)

#=============================================
# Input:
#     D: data set (list of sets)
#     Ck: candidate set
#     minSupport: minimum support
# Output:
#     retList: filtered set (frequent itemsets of size k)
#     supportData: support values (used later when mining association rules)
#=============================================
def scanD(D, Ck, minSupport):
    'Get the filtered set from the candidate set'
    # Count how often each candidate occurs in the data set
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                if can not in ssCnt: ssCnt[can] = 1
                else: ssCnt[can] += 1
    # Build the filtered set and the support data
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItems
        if support >= minSupport:
            retList.insert(0, key)
        supportData[key] = support
    return retList, supportData

#===================================
# Input:
#     Lk: filtered set
#     k: number of elements in the current itemsets
# Output:
#     retList: candidate set
#===================================
def aprioriGen(Lk, k):
    'Build the next candidate set from the filtered set'
    # Recombine the filtered set to obtain the new candidate set.
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):
            # Note the recombination trick: only merge itemsets whose first k-2 items agree.
            L1 = list(Lk[i])[:k-2]; L2 = list(Lk[j])[:k-2]
            L1.sort(); L2.sort()
            if L1 == L2:
                retList.append(Lk[i] | Lk[j])
    return retList

#=============================================
# Input:
#     dataSet: data set
#     minSupport: minimum support
# Output:
#     L: frequent itemsets, grouped by size
#     supportData: support values (used when mining association rules)
#=============================================
def apriori(dataSet, minSupport=0.5):
    'Find the frequent itemsets and their supports'
    C1 = createC1(dataSet)
    D = map(set, dataSet)
    L1, supportData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    while (len(L[k-2]) > 0):
        Ck = aprioriGen(L[k-2], k)
        Lk, supK = scanD(D, Ck, minSupport)
        supportData.update(supK)
        L.append(Lk)
        k += 1
    return L, supportData

def main():
    'Apriori frequent itemset retrieval'
    L, S = apriori(loadDataSet())
    print L
    print S

if __name__ == '__main__':
    main()
Test results:
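The result screenshot is not reproduced in this copy of the article. With the test data above and the default minimum support of 0.5, the printed frequent itemsets should look roughly like the following (the ordering of the itemsets may vary from run to run):

L: [[frozenset([1]), frozenset([3]), frozenset([2]), frozenset([5])],
    [frozenset([1, 3]), frozenset([2, 3]), frozenset([3, 5]), frozenset([2, 5])],
    [frozenset([2, 3, 5])],
    []]

S maps each counted candidate to its support, e.g. {2, 5} to 0.75 and {4} to 0.25 (so {4} and its supersets are excluded).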
Association rule learning: implementation idea and code
The previous part retrieved the frequent itemsets from the data set; this part learns association rules from those frequent itemsets.
Where the previous part proceeded level by level on the number of elements in the filtered itemsets, this part proceeds level by level on the number of elements in the right-hand side of the rules: for a frequent itemset such as {2, 3, 5}, rules with a one-item right-hand side (e.g. {2, 3} → {5}) form the first level, and rules with a two-item right-hand side (e.g. {2} → {3, 5}) form the next.
It is also important to note that only association rules within a single frequent itemset can be retrieved.
Implementation code:
#===================================
# Input:
#     L: frequent itemsets
#     supportData: support values
#     minConf: minimum confidence
# Output:
#     bigRuleList: rule list
#===================================
def generateRules(L, supportData, minConf=0.7):
    'Learn association rules from the frequent itemsets'
    bigRuleList = []
    for i in range(1, len(L)):
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]
            if (i > 1):
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList

#===================================
# Input:
#     freqSet: one frequent itemset
#     H: candidate right-hand sides of the rules
#     supportData: support values
#     brl: rule list being built
#     minConf: minimum confidence
# Output:
#     prunedH: right-hand sides that passed the confidence filter
#===================================
def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    'Confidence filtering'
    prunedH = []
    for conseq in H:
        conf = supportData[freqSet] / supportData[freqSet - conseq]
        if conf >= minConf:
            brl.append((freqSet - conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH

def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    'Learn association rules from one frequent itemset'
    # Number of elements in the right-hand side at this level
    m = len(H[0])
    if (len(freqSet) > (m + 1)):
        # Recombine the right-hand sides
        Hmp1 = aprioriGen(H, m + 1)
        # Rule learning (confidence filtering)
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        if (len(Hmp1) > 1):
            # Learn recursively with larger right-hand sides
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)

def main():
    'Association rule learning'
    L, S = apriori(loadDataSet())
    rules = generateRules(L, S)
    print rules

if __name__ == '__main__':
    main()
Test results:
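The result screenshot is not reproduced here. With the default minimum confidence of 0.7, the printed rule list should look roughly like this (ordering may vary):

[(frozenset([1]), frozenset([3]), 1.0), (frozenset([5]), frozenset([2]), 1.0), (frozenset([2]), frozenset([5]), 1.0)]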
The test data are [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]].
The result means that the rules 1 → 3, 5 → 2 and 2 → 5 each hold with confidence 1.
This is clearly consistent with what one would expect from the data.
Summary
1. The Apriori association algorithm is widely used on online shopping sites; product recommendation systems can be built on top of it.
2. However, the Apriori algorithm also has a drawback: frequent itemset retrieval is not fast enough, because every level requires another full scan of the data set to count the candidate set (even though the candidate set keeps shrinking).
3. To address the problem in point 2, the next article will introduce a more powerful algorithm for discovering frequent itemsets, FP-growth. (PS: it cannot, however, be used to discover association rules.)