Python-based Apriori algorithm
Apriori is a fundamental algorithm for association rule mining, proposed by Rakesh Agrawal and Ramakrishnan Srikant in 1994. Association rules are used to identify relationships between items in a dataset. The technique is also known as market basket analysis, because analyzing shopping baskets is a typical scenario to which the algorithm applies.
For more information about the algorithm, see the following link:
Detailed explanation of the Apriori algorithm
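To make the idea of a relationship between items concrete, here is a minimal sketch of two standard measures used in association rule mining, support and confidence. The function names support and confidence and the small baskets list are made up for illustration; they are not part of the implementation that follows.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


# Hypothetical shopping baskets, for illustration only.
baskets = [
    ['bread', 'milk'],
    ['bread', 'butter'],
    ['bread', 'milk', 'butter'],
    ['milk'],
]

print(support(['bread', 'milk'], baskets))       # 2 of the 4 baskets
print(confidence(['bread'], ['milk'], baskets))  # 2 of the 3 bread baskets
```

Apriori itself only needs the support measure: it searches for every itemset whose support reaches a chosen minimum.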
Next, I will show how to implement the Apriori algorithm in code. The steps are as follows:
1. Create an Apriori class
class Apriori:
    def __init__(self, min_sup=0.2, dataDic=None):
        # transaction dictionary, e.g. {'T800': ['I1', 'I2', 'I3', 'I5'], ...}
        self.data = dataDic if dataDic is not None else {}
        self.size = len(self.data)  # number of records
        self.min_sup = min_sup  # minimum support threshold (fraction of records)
        self.min_sup_val = min_sup * self.size  # minimum support count
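As a quick check of how the fractional threshold turns into a count, the sketch below repeats the constructor's arithmetic for the nine-record test data used later in this article. The helper name min_support_count is hypothetical.

```python
import math


def min_support_count(min_sup, n_records):
    """Same arithmetic as the constructor: threshold fraction times record count."""
    return min_sup * n_records


threshold = min_support_count(0.2, 9)  # about 1.8 for nine records
# Support counts are whole numbers, so an itemset must appear in at least
# ceil(threshold) records to reach the threshold.
needed = math.ceil(threshold)
print(threshold, needed)
```

With the default min_sup of 0.2 and nine records, an itemset therefore has to occur in at least two records to count as frequent.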
2. Filter out items below the minimum support threshold
    def find_frequent_1_itemsets(self):
        FreqDic = {}  # {item: support count}
        for event in self.data:  # event is a record ID, e.g. 'T800'
            for item in self.data[event]:  # item is I1, I2, I3, I4 or I5
                if item in FreqDic:
                    FreqDic[item] += 1
                else:
                    FreqDic[item] = 1
        L1 = []
        for itemset in FreqDic:
            # keep only the items that reach the minimum support count
            if FreqDic[itemset] >= self.min_sup_val:
                L1.append([itemset])
        return L1
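The same counting can be sketched more compactly with collections.Counter from the standard library. This is an alternative illustration rather than the code used in this article; the function name frequent_1_itemsets and the sample records are assumptions.

```python
from collections import Counter


def frequent_1_itemsets(records, min_count):
    """Return sorted single-item itemsets whose support count reaches min_count."""
    counts = Counter()
    for items in records.values():
        counts.update(set(items))  # set() counts an item at most once per record
    return sorted([item] for item, n in counts.items() if n >= min_count)


# Hypothetical records, for illustration only.
records = {
    'T1': ['I1', 'I2'],
    'T2': ['I2', 'I3'],
    'T3': ['I2'],
}
print(frequent_1_itemsets(records, 2))
```

Unlike the loop above, this variant counts an item once per record even if the record lists it twice, and it returns the itemsets in sorted order.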
3. Filter out candidates that contain an infrequent subset
    def has_infrequent_subset(self, c, L_last, k):
        # c is the candidate itemset, L_last is the list of frequent (k-1)-itemsets,
        # and k is the number of items in c.
        # Checks whether every (k-1)-subset of c is frequent.
        # Requires "import itertools" at the top of the module.
        subsets = list(itertools.combinations(c, k - 1))  # e.g. [1, 2, 3] -> [(1, 2), (1, 3), (2, 3)]
        for each in subsets:
            each = list(each)  # convert the tuple to a list
            if each not in L_last:
                return True  # at least one subset is not frequent
        return False  # all subsets are frequent itemsets
Note:
itertools is a standard-library module that provides permutation and combination helpers. For example, list(itertools.combinations([1, 2, 3], 2)) returns [(1, 2), (1, 3), (2, 3)].
For details on its usage, see: http://www.jb51.net/article/34921.htm
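The subset check can be tried on its own. The sketch below is a standalone variant of the idea, assuming the frequent (k-1)-itemsets are first copied into a set of tuples so that each lookup is fast; the sample list L2 is made up for illustration.

```python
import itertools


def has_infrequent_subset(candidate, frequent_prev):
    """True if some (k-1)-subset of candidate is missing from frequent_prev."""
    k = len(candidate)
    known = {tuple(itemset) for itemset in frequent_prev}
    return any(sub not in known
               for sub in itertools.combinations(candidate, k - 1))


# Hypothetical frequent 2-itemsets, for illustration only.
L2 = [['I1', 'I2'], ['I1', 'I3'], ['I2', 'I3'], ['I2', 'I5']]
print(list(itertools.combinations(['I1', 'I2', 'I3'], 2)))
print(has_infrequent_subset(['I1', 'I2', 'I3'], L2))  # all three pairs are in L2
print(has_infrequent_subset(['I1', 'I2', 'I5'], L2))  # ('I1', 'I5') is not in L2
```

Both versions rely on every itemset being kept in the same sorted order, because combinations preserves the order of its input.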
4. Join frequent itemsets to generate new candidates
    def apriori_gen(self, L_last):  # L_last is the list of frequent (k-1)-itemsets
        k = len(L_last[0]) + 1
        Ck = []
        for itemset1 in L_last:
            for itemset2 in L_last:
                # join step
                flag = 0
                for i in range(k - 2):
                    print(k - 2)
                    if itemset1[i] != itemset2[i]:
                        # the first k-2 items must all match,
                        # otherwise the two sets cannot be joined
                        flag = 1
                        break
                if flag == 1:
                    continue
                if itemset1[k - 2] < itemset2[k - 2]:
                    c = itemset1 + [itemset2[k - 2]]
                else:
                    continue
                # prune step
                if self.has_infrequent_subset(c, L_last, k):  # skip c if any (k-1)-subset is not frequent
                    continue
                else:
                    Ck.append(c)
        return Ck
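To see the join and prune steps at work, here is a compact standalone sketch of the same idea. The input L2 is assumed to be the set of frequent 2-itemsets that this article's test data yields with the default min_sup of 0.2 (worked out by hand, so treat it as an assumption). Slicing replaces the explicit flag loop.

```python
import itertools


def apriori_gen(L_last):
    """Join frequent (k-1)-itemsets sharing their first k-2 items, then prune."""
    k = len(L_last[0]) + 1
    known = {tuple(s) for s in L_last}
    candidates = []
    for a in L_last:
        for b in L_last:
            # Join step: same prefix, and a's last item sorts before b's.
            if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                c = a + [b[k - 2]]
                # Prune step: every (k-1)-subset of c must be frequent.
                if all(s in known for s in itertools.combinations(c, k - 1)):
                    candidates.append(c)
    return candidates


L2 = [['I1', 'I2'], ['I1', 'I3'], ['I1', 'I5'],
      ['I2', 'I3'], ['I2', 'I4'], ['I2', 'I5']]
print(apriori_gen(L2))
```

The join step produces six 3-item candidates from this input, and pruning keeps only ['I1', 'I2', 'I3'] and ['I1', 'I2', 'I5'], because each of the other four contains a pair that is not in L2.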
5. Iterate to build all frequent itemsets
    def do(self):
        L_last = self.find_frequent_1_itemsets()  # frequent 1-itemsets
        L = L_last
        while L_last != []:
            Ck = self.apriori_gen(L_last)  # join and prune to get the candidate k-itemsets
            FreqDic = {}
            for event in self.data:
                # count every candidate that this record supports
                for c in Ck:
                    if set(c) <= set(self.data[event]):  # is the candidate a subset of the record?
                        if tuple(c) in FreqDic:
                            FreqDic[tuple(c)] += 1
                        else:
                            FreqDic[tuple(c)] = 1
            print(FreqDic)
            Lk = []
            for c in FreqDic:
                print(c)
                print('------')
                # keep candidates that reach the minimum support count
                # (>= for consistency with step 2)
                if FreqDic[c] >= self.min_sup_val:
                    Lk.append(list(c))
            L_last = Lk
            L += Lk
        return L  # L holds all frequent itemsets found
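The counting pass inside the loop can be looked at in isolation. The following sketch assumes a helper named count_support (a hypothetical name) and three made-up records; it uses the same subset test, set(c) <= set(record), as the method above.

```python
def count_support(candidates, records):
    """Count, for each candidate, the records that contain all of its items."""
    counts = {}
    for items in records.values():
        basket = set(items)
        for c in candidates:
            if set(c) <= basket:  # subset test
                counts[tuple(c)] = counts.get(tuple(c), 0) + 1
    return counts


# Hypothetical records, for illustration only.
records = {
    'T1': ['I1', 'I2', 'I5'],
    'T2': ['I2', 'I4'],
    'T3': ['I1', 'I2', 'I3'],
}
print(count_support([['I1', 'I2'], ['I2', 'I4'], ['I3', 'I5']], records))
```

A candidate that no record supports, such as ['I3', 'I5'] here, never gets an entry in the dictionary, which is why the loop can simply iterate over the dictionary's keys afterwards.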
Test example
Data = {'T100':['I1','I2','I5'],
'T200':['I2','I4'],
'T300':['I2','I3'],
'T400':['I1','I2','I4'],
'T500':['I1','I3'],
'T600':['I2','I3'],
'T700':['I1','I3'],
'T800':['I1','I2','I3','I5'],
'T900':['I1','I2','I3']}
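As a sanity check on this test data, the brute-force sketch below enumerates every possible itemset and keeps those that reach the minimum support count. It is not part of the Apriori implementation; the function name brute_force_frequent is an assumption, and the approach is only practical for a handful of distinct items because the number of itemsets grows exponentially.

```python
import itertools

Data = {'T100': ['I1', 'I2', 'I5'],
        'T200': ['I2', 'I4'],
        'T300': ['I2', 'I3'],
        'T400': ['I1', 'I2', 'I4'],
        'T500': ['I1', 'I3'],
        'T600': ['I2', 'I3'],
        'T700': ['I1', 'I3'],
        'T800': ['I1', 'I2', 'I3', 'I5'],
        'T900': ['I1', 'I2', 'I3']}


def brute_force_frequent(records, min_sup=0.2):
    """Enumerate every itemset and keep those reaching the support count."""
    min_count = min_sup * len(records)
    baskets = [set(items) for items in records.values()]
    all_items = sorted(set().union(*baskets))
    frequent = []
    for size in range(1, len(all_items) + 1):
        for combo in itertools.combinations(all_items, size):
            n = sum(1 for b in baskets if set(combo) <= b)
            if n >= min_count:
                frequent.append(list(combo))
    return frequent


result = brute_force_frequent(Data)
print(len(result))
print([s for s in result if len(s) == 3])
```

For this data and a min_sup of 0.2 the check finds 13 frequent itemsets: five single items, six pairs, and the two triples ['I1', 'I2', 'I3'] and ['I1', 'I2', 'I5'].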
Complete code:
# -*- coding: UTF-8 -*-
import itertools


class Apriori:
    def __init__(self, min_sup=0.2, dataDic=None):
        # transaction dictionary, e.g. {'T800': ['I1', 'I2', 'I3', 'I5'], ...}
        self.data = dataDic if dataDic is not None else {}
        self.size = len(self.data)  # number of records
        self.min_sup = min_sup  # minimum support threshold (fraction of records)
        self.min_sup_val = min_sup * self.size  # minimum support count

    def find_frequent_1_itemsets(self):
        FreqDic = {}  # {item: support count}
        for event in self.data:  # event is a record ID, e.g. 'T800'
            for item in self.data[event]:  # item is I1, I2, I3, I4 or I5
                if item in FreqDic:
                    FreqDic[item] += 1
                else:
                    FreqDic[item] = 1
        L1 = []
        for itemset in FreqDic:
            # keep only the items that reach the minimum support count
            if FreqDic[itemset] >= self.min_sup_val:
                L1.append([itemset])
        return L1

    def has_infrequent_subset(self, c, L_last, k):
        # c is the candidate itemset, L_last is the list of frequent (k-1)-itemsets,
        # and k is the number of items in c.
        # Checks whether every (k-1)-subset of c is frequent.
        subsets = list(itertools.combinations(c, k - 1))  # e.g. [1, 2, 3] -> [(1, 2), (1, 3), (2, 3)]
        for each in subsets:
            each = list(each)  # convert the tuple to a list
            if each not in L_last:
                return True  # at least one subset is not frequent
        return False  # all subsets are frequent itemsets

    def apriori_gen(self, L_last):  # L_last is the list of frequent (k-1)-itemsets
        k = len(L_last[0]) + 1
        Ck = []
        for itemset1 in L_last:
            for itemset2 in L_last:
                # join step
                flag = 0
                for i in range(k - 2):
                    print(k - 2)
                    if itemset1[i] != itemset2[i]:
                        # the first k-2 items must all match,
                        # otherwise the two sets cannot be joined
                        flag = 1
                        break
                if flag == 1:
                    continue
                if itemset1[k - 2] < itemset2[k - 2]:
                    c = itemset1 + [itemset2[k - 2]]
                else:
                    continue
                # prune step
                if self.has_infrequent_subset(c, L_last, k):  # skip c if any (k-1)-subset is not frequent
                    continue
                else:
                    Ck.append(c)
        return Ck

    def do(self):
        L_last = self.find_frequent_1_itemsets()  # frequent 1-itemsets
        L = L_last
        while L_last != []:
            Ck = self.apriori_gen(L_last)  # join and prune to get the candidate k-itemsets
            FreqDic = {}
            for event in self.data:
                # count every candidate that this record supports
                for c in Ck:
                    if set(c) <= set(self.data[event]):  # is the candidate a subset of the record?
                        if tuple(c) in FreqDic:
                            FreqDic[tuple(c)] += 1
                        else:
                            FreqDic[tuple(c)] = 1
            print(FreqDic)
            Lk = []
            for c in FreqDic:
                print(c)
                print('------')
                # keep candidates that reach the minimum support count
                # (>= for consistency with find_frequent_1_itemsets)
                if FreqDic[c] >= self.min_sup_val:
                    Lk.append(list(c))
            L_last = Lk
            L += Lk
        return L  # L holds all frequent itemsets found


# ******* Test *******
Data = {'T100': ['I1', 'I2', 'I5'],
        'T200': ['I2', 'I4'],
        'T300': ['I2', 'I3'],
        'T400': ['I1', 'I2', 'I4'],
        'T500': ['I1', 'I3'],
        'T600': ['I2', 'I3'],
        'T700': ['I1', 'I3'],
        'T800': ['I1', 'I2', 'I3', 'I5'],
        'T900': ['I1', 'I2', 'I3']}
a = Apriori(dataDic=Data)
print(a.do())