The core property of the algorithm: every non-empty subset of a frequent itemset must also be frequent. The contrapositive also holds: if an itemset is not frequent, then none of its supersets is frequent.
I. Apriori algorithm introduction: The Apriori algorithm is an algorithm for mining the frequent itemsets behind association rules. Its core idea is to mine frequent itemsets in two stages: candidate-set generation and downward-closure testing. The Apriori ("a priori", i.e. prior knowledge) algorithm is widely used: in consumer-market analysis, to infer customers' consumption habits; in intrusion detection in the field of network security; in university administration, where mined rules can help administrative departments target assistance to students in financial difficulty; and in mobile communications, to guide operators' business decisions and those of auxiliary service providers.
II. Mining steps:
1. Find all frequent itemsets based on the support threshold (frequency)
2. Generate association rules from them based on the confidence threshold (strength)
III. Basic concepts
For a rule A -> B:
① Support: P(A ∩ B), the probability that A and B occur together.
② Confidence: P(B|A) = P(A ∩ B) / P(A), the probability that B occurs given that A has occurred. E.g., shopping-basket analysis: milk -> bread.
Example: [support: 3%, confidence: 40%]
Support 3%: 3% of customers buy milk and bread at the same time.
Confidence 40%: of the customers who buy milk, 40% also buy bread.
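As a quick illustration of the two formulas, here is a minimal sketch that computes support and confidence for the rule milk -> bread; the toy transaction list is made up for this example and is not part of the original post:

    transactions = [
        {"milk", "bread"}, {"milk"}, {"bread", "butter"}, {"milk", "bread"},
    ]
    n = len(transactions)
    # count transactions containing both items, and those containing milk
    both = sum(1 for t in transactions if {"milk", "bread"} <= t)
    milk = sum(1 for t in transactions if "milk" in t)
    support = both / n        # P(A and B) = 2/4 = 0.5
    confidence = both / milk  # P(B|A) = P(A and B) / P(A) = 2/3
    print(support, confidence)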
③ If an itemset A contains k items, A is called a k-itemset. A k-itemset that meets the minimum support threshold is called a frequent k-itemset.
④ Rules that meet both the minimum support threshold and the minimum confidence threshold are called strong rules.
IV. Implementation steps
Apriori is one of the most influential algorithms for mining the frequent itemsets of Boolean association rules. It uses an iterative approach known as level-wise search, in which frequent (k-1)-itemsets are used to explore frequent k-itemsets.
First, the set of frequent 1-itemsets L1 is found; L1 is used to find the set of frequent 2-itemsets L2; L2 is used to find L3; and so on, until no frequent k-itemset Lk can be found. Finding each Lk requires one full scan of the database.
The core of each iteration consists of a join step and a prune step. The join step self-joins L(k-1): two (k-1)-itemsets are joined only if their first k-2 items (in lexicographic order) are identical. The prune step relies on the property that every non-empty subset of a frequent itemset must also be frequent. Conversely, if any (k-1)-subset of a candidate is not frequent, the candidate itself cannot be frequent and can be removed from Ck. (A minimal sketch of this prune check follows.)
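The full listing later in this post implements only the join step (in aprioriGen); the prune check is a straightforward subset test. Below is a minimal sketch of it, using the same conventions as the listing (itemsets as frozensets, Lk_minus_1 a list of frequent (k-1)-itemsets); has_infrequent_subset is a hypothetical helper name, not part of the original code:

    from itertools import combinations

    def has_infrequent_subset(candidate, Lk_minus_1):
        # Prune step: a candidate k-itemset can be discarded if any of its
        # (k-1)-subsets is missing from the frequent (k-1)-itemsets.
        k = len(candidate)
        return any(frozenset(sub) not in Lk_minus_1
                   for sub in combinations(candidate, k - 1))

In aprioriGen below, the joined candidate Lk[i] | Lk[j] would then be appended only when has_infrequent_subset returns False.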
Simply speaking:
1. Discovering frequent itemsets. The process is: (1) scan; (2) count; (3) compare against the minimum support; (4) generate frequent itemsets; (5) repeat steps (1)-(4) until no larger frequent itemset can be generated.
2. Generating association rules. Following the definition of confidence above, the rules are produced as follows (a sketch follows this list):
(1) For each frequent itemset L, generate all non-empty subsets of L;
(2) For each non-empty subset S of L, output the rule S => (L - S) if
P(L) / P(S) >= min_conf
Note: L - S denotes the itemset obtained by removing the items of S from the itemset L.
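The listing below mines only the frequent itemsets; it does not emit rules. Here is a minimal sketch of the rule-generation stage, assuming the L and supportData structures returned by the apriori() function below (generateRules is a hypothetical helper name, not part of the original code):

    from itertools import combinations

    def generateRules(L, supportData, minConf=0.4):
        # For each frequent itemset l of size >= 2 and each non-empty proper
        # subset s, emit the rule s => (l - s) when
        # confidence = support(l) / support(s) >= minConf.
        rules = []
        for tier in L[1:]:                      # tiers of itemsets of size >= 2
            for l in tier:
                for r in range(1, len(l)):
                    for s in map(frozenset, combinations(l, r)):
                        conf = supportData[l] / supportData[s]
                        if conf >= minConf:
                            rules.append((s, l - s, conf))
        return rules

Every proper subset of a frequent itemset is itself frequent (the Apriori property), so its support is guaranteed to be in supportData and the lookup cannot fail. On the data set below, for example, support({2, 5}) = support({5}) = 0.75, so the rule {5} => {2} would be reported with confidence 1.0.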
The following code implements the Apriori algorithm on a simple data set:
def loadDataSet():
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

def createC1(dataSet):
    # Build C1, the list of candidate 1-itemsets
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if [item] not in C1:
                C1.append([item])
    C1.sort()
    return list(map(frozenset, C1))  # frozensets are immutable and hashable

def scanD(D, Ck, minSupport):
    # Count how many transactions contain each candidate, then keep the
    # candidates whose support meets the threshold
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                ssCnt[can] = ssCnt.get(can, 0) + 1
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItems
        if support >= minSupport:
            retList.insert(0, key)
        supportData[key] = support
    return retList, supportData  # frequent itemsets and a support map

def aprioriGen(Lk, k):
    # Join step: merge two frequent (k-1)-itemsets into a candidate
    # k-itemset when their first k-2 items agree
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):
            L1 = sorted(Lk[i])[:k - 2]
            L2 = sorted(Lk[j])[:k - 2]
            if L1 == L2:
                retList.append(Lk[i] | Lk[j])
    return retList

def apriori(dataSet, minSupport=0.5):
    C1 = createC1(dataSet)
    D = list(map(set, dataSet))
    L1, supportData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    while len(L[k - 2]) > 0:
        Ck = aprioriGen(L[k - 2], k)
        Lk, supK = scanD(D, Ck, minSupport)
        supportData.update(supK)
        L.append(Lk)
        k += 1
    return L, supportData

if __name__ == "__main__":
    # intermediate steps (C1, D, L1, supportData) can be printed for debugging
    dataSet = loadDataSet()
    L, supportData = apriori(dataSet)
    print(L)
Result output (the full list L of frequent itemsets, level by level; element order may vary between Python versions):
[[frozenset({1}), frozenset({3}), frozenset({2}), frozenset({5})], [frozenset({1, 3}), frozenset({2, 5}), frozenset({2, 3}), frozenset({3, 5})], [frozenset({2, 3, 5})], []]
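To check this output against the definitions: with minSupport = 0.5 and four transactions, an itemset must appear in at least two transactions. Item 4 occurs only once (support 0.25), so it is pruned at the first level and no itemset containing it survives. At level 2, {1, 3} appears in two transactions (support 0.5), {2, 5} in three (support 0.75), and {2, 3} and {3, 5} in two each (support 0.5). The only frequent 3-itemset is {2, 3, 5} (two transactions, support 0.5); no candidate 4-itemset can be joined from a single 3-itemset, so the loop terminates with an empty final list.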